n0b3l1a: March 2011

Tuesday, March 29, 2011

Stormo Lab Publications RSSVM with bibtex

http://ural.wustl.edu/pubs.html

Saturday, March 26, 2011

Common R Mistakes

http://www.springerlink.com/content/x730205j9241l531/

Friday, March 25, 2011

1. check physical write protect on usb drive

It is quite clear that your disk is 1) full and 2) damaged.

OPTION 1: Repair

* Restore the file system by running "dosfsck -a /dev/sdb1"

The -a option automatically repairs errors to make the filesystem consistent again. Lost clusters will be converted to files and will keep taking up space, though some space may also be freed. After running the disk check, run "df -h /dev/sdb1" again to check the free space; chances are it will already be more than 1%. (I added the -h switch, which means "human readable", to have sizes displayed in kB, MB.) After that, clean up the drive by deleting what isn't necessary.
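The free-space check can also be scripted. The following is a minimal sketch, assuming Python 3 and that the drive is mounted somewhere readable; unlike df, shutil.disk_usage takes a path on the mounted filesystem rather than a device node, and the mount point in the usage note is hypothetical.

```python
import shutil


def percent_free(total_bytes, free_bytes):
    """Return free space as a percentage of total capacity."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * free_bytes / total_bytes


def report(mount_point):
    """Print a df-like summary for the filesystem that contains mount_point."""
    usage = shutil.disk_usage(mount_point)  # named tuple: total, used, free
    pct = percent_free(usage.total, usage.free)
    print(f"{mount_point}: {usage.free / 2**20:.1f} MiB free ({pct:.1f}%)")
    return pct
```

For example, report("/media/usb") would print the free space of a drive mounted at that (assumed) location.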

OPTION 2: Reformat altogether

sudo umount /dev/sdb1
sudo mkfs -t vfat /dev/sdb1

Warning: You are probably aware that reformatting destroys all data on the disk!

Thursday, March 24, 2011

Dr. Seuss' Quotes

I meant what I said and I said what I meant.

I like nonsense; it wakes up the brain cells.

Be who you are and say what you feel because those who mind don't matter and those who matter don't mind.

Today was good. Today was fun. Tomorrow is another one.

You know you're in love when you can't fall asleep because reality is finally better than your dreams.

http://quotations.about.com/od/bookquotes/a/seuss1.htm

Trees and Strength - J. Willard Marriott

"Good timber does not grow with ease; the stronger the wind, the stronger the trees."

Tuesday, March 22, 2011

Episode Ratings

http://wiki.d-addicts.com/I_Believe_in_Love/Episode_Ratings

Flash Player Square for Ubuntu 64

http://labs.adobe.com/downloads/flashplayer10_square.html

Multiz Threaded-Block-Alignment

http://www.ncbi.nlm.nih.gov/pubmed/15060014
MULTIZ uses pairwise alignments produced by BLASTZ. These are filtered to select the best match in the genome, so that each base in human aligns to a single base in the other species. We tried two different methods to filter these alignments, axtBest (Schwartz et al. 2003a) and the “net” approach described in Kent et al. 2003. In general axtBest aligns slightly more human bases, and the net approach is somewhat better at identifying orthologs. We prefer the latter.

http://www.bx.psu.edu/miller_lab/
http://www.bx.psu.edu/miller_lab/dist/tba_howto.pdf

The starting material for running TBA includes:
1. A set(s) of sequences from different species
2. An evolutionary tree describing the relationship of these species
3. A parameter file describing how to run blastz for different species. For example,
you might want to treat alignments involving non-placental mammals differently
from placental mammals. This file is optional.

Overview of running TBA
There are typically three steps involved with generating a multiple alignment:
1. generating a series of pair-wise alignments to “seed” the multiple alignment process (this is performed with the program all_bz, which essentially executes a series of blastz commands and outputs "lav" format that is converted to MAF)
2. generating the multiple alignment (tba, calls multiz)
3. “projecting” the alignment onto a reference sequence

http://genome.csdb.cn/cgi-bin/hgTrackUi?hgsid=446052&c=chrX&g=mostConserved28way

Trails - George Bernard Shaw

"Do not follow where the path may lead. Go instead where there is no path and leave a trail."

George Bernard Shaw

Monday, March 21, 2011

Tools for managing and analyzing microarray data

http://bib.oxfordjournals.org.proxy.lib.sfu.ca/content/early/2011/03/20/bib.bbr010.full.pdf+html

Sunday, March 20, 2011

Top 100 Fantasy Movies

http://www.fantasybooksandmovies.com/best-fantasy-movies.html

Top 10 Best Cinematography Films

http://www.runningwithscissors.com/top-10-best-cinematography-films

http://moviemaniac14.blogspot.ca/2011/03/top-10-movies-with-best-cinematography.html

HOME (homeproject)
http://www.youtube.com/watch?v=jqxENMKaeCU&list=PLECFB679284982CA6&feature=plpp

Saturday, March 19, 2011

Repeats

http://rsat.ulb.ac.be/help.retrieve-ensembl-seq.html
The presence of repetitive elements hampers the detection of motifs, especially for vertebrate genomes, because these repetitive sequences have a very different composition from the rest of the genome.

http://genes.mit.edu/Repeats.html
The presence of certain types of repetitive elements in a sequence may sometimes distort the results of GENSCAN. In particular, L1 elements are often predicted as genes. To avoid this potential problem, you may wish to pre-screen for repetitive elements with a program like RepeatMasker or censor which replace sequence segments matching any of a set of elements common to your organism (e.g., Alu, L1, etc.) by the same number of asterisks or `N's.
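The masking step described above can be sketched as follows. This is only an illustration of replacing repeat segments with the same number of N's; it assumes the repeat coordinates are already known (in practice they would come from a tool such as RepeatMasker), and the function name and interval format are hypothetical.

```python
def mask_repeats(sequence, intervals, mask_char="N"):
    """Replace each half-open [start, end) interval with mask_char, keeping the length."""
    masked = list(sequence)
    for start, end in intervals:
        if not 0 <= start <= end <= len(sequence):
            raise ValueError(f"interval ({start}, {end}) is outside the sequence")
        masked[start:end] = mask_char * (end - start)
    return "".join(masked)


example = mask_repeats("ACGTACGTACGT", [(2, 5), (8, 10)])
print(example)  # ACNNNCGTNNGT
```

Because the masked sequence keeps its original length, coordinates of any genes predicted afterwards still refer to the unmasked input.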

Another option is to filter out repeats after running GENSCAN, e.g. to screen GENSCAN predicted peptides against a database of repeat sequences translated in all six frames.

Rogozin et al. Briefings in Bioinformatics. 2000
This model suggests that most SINEs in the mammalian genomes are pseudogenes, and not capable of producing copies.

SINE can affect the function and recombination of surrounding sequences

Thus, insertions of SINEs (as well as other types of genome mutations) can affect the long-term adaptability of the species in various ways [1,58].

The database of repetitive elements (Repbase) [66]

SINEs are almost always a signature of non-coding DNA

Incomplete SINEs or highly divergent sequences can create some problems for prediction.

Undiscovered repetitive elements, in poorly characterised genomes, can create serious problems for computational functional mapping of sequences.

The presence of repetitive elements can create serious problems for sequence analysis, especially in homology searches in nucleotide sequence databases. It is difficult to interpret database search results if the output is saturated by a number of highly scored matches with repetitive elements which are widely present in nucleotide sequence databases.

Wednesday, March 16, 2011

Open world assumption (OWA) vs Closed world assumption (CWA)

the closed world assumption implies that everything we don’t know is false, while the open world assumption states that everything we don’t know is undefined
- Stefano Mazzocchi, Closed World vs. Open World: the First Semantic Web Battle

Open world assumption (OWA) - e.g. OWL (Web Ontology Language)

Closed world assumption (CWA) - e.g. SQL (Structured Query Language)
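The difference between the two assumptions can be illustrated with a toy fact store. This is a minimal sketch, not how OWL reasoners or SQL engines are implemented; the triples and function names are made up for the example.

```python
# A tiny knowledge base of (subject, predicate, object) triples.
facts = {("Alice", "knows", "Bob")}


def holds_closed_world(triple):
    """CWA: anything that is not recorded is treated as false."""
    return triple in facts


def holds_open_world(triple, known_false=frozenset()):
    """OWA: True if asserted, False only if explicitly denied, otherwise None (unknown)."""
    if triple in facts:
        return True
    if triple in known_false:
        return False
    return None


query = ("Alice", "knows", "Carol")
print(holds_closed_world(query))  # False
print(holds_open_world(query))    # None
```

The same missing fact yields "false" under the closed world assumption and "unknown" under the open world assumption.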

http://www.w3.org/2007/OWL/wiki/Implementations

http://protege.stanford.edu/overview/protege-owl.html

Tuesday, March 15, 2011

BLASTZ, Whole Genome Alignments

http://genomewiki.ucsc.edu/index.php/Whole_genome_alignment_howto

Monday, March 14, 2011

Genome sizes

http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/G/GenomeSizes.html

Species                      Genome size (bp)   Gene count
Humans                       3.3 x 10⁹          ~21,000
Mouse                        3.4 x 10⁹          ~23,000
Zebrafish                    1.2 x 10⁹          15,761
Caenorhabditis elegans       100,258,171        21,733
E. coli K-12                 4,639,221          4,377
Mycobacterium tuberculosis   4,411,532          3,959

http://cbs.ym.edu.tw/cbs-01/index.php?option=com_content&view=article&id=184&Itemid=323

de novo assembly

Velvet

de Bruijn / Eulerian - converts the Hamiltonian path problem (NP-hard) into an Eulerian path problem, because an Eulerian path can be found efficiently (in time linear in the number of edges); takes a lot of memory

greedy

overlap, layout, consensus (OLC) - used for Sanger reads; can't handle large numbers of sequences, so not good for gigabytes of short reads, but theoretically better at assembling as it allows more parameters to be configured

hybrid approaches - use overlap-layout for Sanger reads as scaffolds and extend with de Bruijn
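The de Bruijn / Eulerian idea can be sketched on a toy example. This is a minimal illustration, assuming error-free reads and a graph that actually contains an Eulerian path; it is not how Velvet or ABySS are implemented, and real assemblers also handle errors, repeats and reverse complements.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it, one edge per distinct k-mer."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: walk every edge exactly once (assumes such a path exists)."""
    if not graph:
        return []
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(iter(graph))
    for node, successors in graph.items():
        if len(successors) - in_degree[node] == 1:
            start = node
            break
    remaining = {node: list(successors) for node, successors in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell out the sequence along the Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    if not path:
        return ""
    return path[0] + "".join(node[-1] for node in path[1:])


print(assemble(["ATGCG", "GCGTA", "GTAC"], k=3))  # ATGCGTAC
```

The example reads were chosen so that no (k-1)-mer repeats, which makes the path unique; with repeats, several Eulerian paths (and so several assemblies) can exist.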

ABySS (Assembly By Short Sequences) (http://genome.cshlp.org/content/19/6/1117.long, http://seqanswers.com/wiki/ABySS)
- parallelized
- Uniform coverage is key
Coverage can be of two types, expected / theoretical coverage and actual coverage
lowest coverage bias: 3rd gen sequencing (only single molecule, no PCR amplification needed) < Illumina ~ 454 < SOLiD < Sanger < highest coverage bias

Trans-ABySS
- http://www.nature.com/nmeth/journal/v7/n11/full/nmeth.1517.html
- http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
- from transcriptomes (RNA-seq), non-uniform coverage
- uses a range of k-values (26-50bp) (to handle variable transcript expression)
- k optimization by iteratively decreasing k, subtracting out matched reads at each step
number of unique k-mers thresholds at the length of the genome
- Assembly N50 values, the contig lengths for which 50% of the sequence in an assembly is in contigs of this size or larger, were highest for intermediate k values, with a maximum of 1,458 bp at k = 39 bp
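The N50 definition above can be computed directly from a list of contig lengths. This is a small sketch with made-up lengths, not data from the Trans-ABySS paper.

```python
def n50(contig_lengths):
    """Return the length L such that contigs of length >= L hold at least half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

In the example the total is 54 bases, and the three longest contigs (10 + 9 + 8 = 27) already cover half of it, so the N50 is 8.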

One of the challenges with most assemblers is figuring out which parameters to use, picking the right length k bp (k-mer / overlapping substring)

Immunotrends

http://immunotrends.blogspot.com/

Sunday, March 13, 2011

News

March 11, 2011 - Northern Japan 9.0 magnitude earthquake
http://www.cnn.com/2011/WORLD/asiapcf/03/13/japan.quake/index.html?hpt=T1

January 2011 - Queensland, Australia flood
http://www.bbc.co.uk/news/world-asia-pacific-12260724

Saturday, March 12, 2011

454 Illumina Solexa comparisons

Next-gen sequencing
http://seqanswers.com/forums/attachment.php?attachmentid=365&d=1274818435
http://www.nature.com.proxy.lib.sfu.ca/nrg/journal/v11/n1/abs/nrg2626.html
http://www.nature.com.proxy.lib.sfu.ca/nature/journal/v452/n7189/abs/nature06884.html
http://hmg.oxfordjournals.org.proxy.lib.sfu.ca/content/19/R2/R227.full (Third gen sequencing)
http://www.biostat.jhsph.edu/~hji/courses/genomics/Sequencing.ppt Excellent summary of Next-gen

emulsion PCR
- ABI SOLiD and 454 Pyro
- only one primer is used in a microbead

bridge PCR / cluster PCR / immobilized PCR
- Solexa / Illumina
- both primers / adapters attached on immobilized flow cell surface are used

Long-Range PCR - has proofreading
http://www.springerprotocols.com.proxy.lib.sfu.ca/Abstract/doi/10.1385/1-59259-273-2:051

Ligation-mediated PCR / Linker-mediated PCR
http://nar.oxfordjournals.org.proxy.lib.sfu.ca/content/24/8/1547.full

Wednesday, March 9, 2011

Puppy linux optimizations / speed up

http://www.murga-linux.com/puppy/viewtopic.php?p=319533
http://www.puppylinux.com/hard-puppy.htm
http://pupweb.org/wikka/BootParms?show_comments=1

syslinux.cfg and isolinux.cfg

default puppy
display /boot.msg
prompt 1
timeout 80

F1 /boot.msg
F2 /help.msg

label puppy
kernel /vmlinuz
append initrd=/initrd.gz pmedia=usbflash

# Use usbflash for SDHC cards

http://puppeee.com/web/documentation/p_3/

Choose Xvesa over Xorg.

Try the new Zdrv Cutter utility

Use Flwm instead of IceWM.

Gnumeric - lightweight calc / spreadsheet

http://projects.gnome.org/gnumeric/announcements/1.10/gnumeric-1.10.shtml

ABySS: A parallel assembler for short read sequence data

Jared T. Simpson, Kim Wong, Shaun D. Jackman, Jacqueline E. Schein, Steven J.M. Jones and İnanç Birol

Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia V5Z 4E6, Canada

Abstract

Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble vast amounts of data generated from large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, we developed ABySS (Assembly By Short Sequences), a parallelized sequence assembler. As a demonstration of the capability of our software, we assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. Approximately 2.76 million contigs ≥100 base pairs (bp) in length were created with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and to other primate genomes.

Prognosis vs Diagnosis

Whereas diagnostic models are usually used for classification and cause and effect, prognostic models incorporate the dimension of time, adding a stochastic element, e.g. "45% of patients with severe septic shock will die within 28 days".

Tuesday, March 8, 2011

Microarray

http://www.cost873.ch/_uploads/_files/Microarray_Course_statistics_for_microarrays_basics.pdf

Ergodic - any tree topology can be transformed into any other tree topology

SimulFold: Simultaneously Inferring RNA Structures Including Pseudoknots, Alignments, and Trees Using a Bayesian MCMC Framework

Irmtraud M. Meyer, Istvan Miklos

PLoS Computational Biology August 2007 | Volume 3 | Issue 8 | e149

To summarize, all of the existing RNA structure prediction programs face at least one of the following challenges: (1) the MFE structure rather than the evolutionarily conserved structure that is likely to correspond to the functional structure is predicted, (2) unstructured regions of the RNA are not explicitly modeled, (3) input alignments are fixed and cannot be altered and improved, (4) pseudoknotted structures are either completely ignored or computationally too expensive to predict, (5) only two evolutionarily related RNA sequences are used as input, or (6) the evolutionary relationship between the RNA sequences is not explicitly modeled.

The idea of co-estimating RNA secondary structures, multiple sequence alignments, and evolutionary trees was first suggested in a theory paper by David Sankoff in 1985 [50].

We introduce a joint distribution of RNA structures, alignments, and trees in a Bayesian framework. As it is not feasible to analytically calculate any interesting statistics in this model in reasonable computational time, we propose a Markov chain Monte Carlo (MCMC) method with which we can sample from the posterior distribution.
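The general idea of sampling from a distribution with MCMC can be shown on a toy problem. The sketch below is a plain random-walk Metropolis sampler for a one-dimensional standard normal target; it is not SimulFold's sampler, whose state space is structures, alignments and trees, and all names and parameter values here are illustrative assumptions.

```python
import math
import random


def metropolis(log_target, start, n_steps, step_size=1.0, seed=0):
    """Random-walk Metropolis: propose x + noise, accept with probability min(1, ratio)."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step_size)  # symmetric proposal
        log_ratio = log_target(proposal) - log_target(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal  # accept; otherwise the chain stays where it is
        samples.append(x)
    return samples


def log_standard_normal(x):
    """Log density of N(0, 1) up to an additive constant."""
    return -0.5 * x * x


samples = metropolis(log_standard_normal, start=0.0, n_steps=20000)
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(f"mean={mean:.2f} variance={variance:.2f}")
```

Only ratios of the target density are needed, which is why MCMC is useful when the normalising constant of a posterior cannot be calculated analytically.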

For changing the topology of the tree, we pick a tree node at random and swap this node and its aunt node to alter its topology (see Figure 2). These moves have been shown [68,71] to be ergodic, i.e., any tree topology can be transformed into any other tree topology using these moves.

Monday, March 7, 2011

Epigenetics, microarrays

http://www.biomedcentral.com/1471-2164/7/181

http://www.ncbi.nlm.nih.gov/epigenomics

http://www.ncbi.nlm.nih.gov/pubmed/21183072

Sunday, March 6, 2011

Mutations

- Chris Baldi, Soochin Cho, and Ronald E. Ellis, “Mutations in Two Independent Pathways Are Sufficient to Create Hermaphroditic Nematodes,” Science 326, no. 5955 (November 13, 2009): 1002-1005.

- Alan Charest et al., “Oncogenic targeting of an activated tyrosine kinase to the Golgi apparatus in a glioblastoma,” Proceedings of the National Academy of Sciences of the United States of America 100, no. 3 (February 4, 2003): 916-921.

- People with one sickle-cell allele are more resistant to malaria

Genome dictionary

http://www.theodora.com/genetics/#somaticcellgeneticmutation

Thursday, March 3, 2011

OOMPA: Object-Oriented Microarray and Proteomic Analysis

http://bioinformatics.mdanderson.org/Software/OOMPA/

source("http://bioinformatics.mdanderson.org/OOMPA/oompaLite.R")
oompaLite() #A package needed for plotting colored dendrograms

Package Class Discovery

Colored dendrogram

source("http://bioinformatics.mdanderson.org/OOMPA/oompaLite.R")
oompaLite()

oompainstall()

library(ClassDiscovery)

# simulate data from three different groups
d1 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
d2 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
d3 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
dd <- cbind(d1, d2, d3)

# perform hierarchical clustering using correlation
hc <- hclust(distanceMatrix(dd, 'pearson'), method='average')
cols <- rep(c('red', 'green', 'blue'), each=10)
labs <- paste('X', 1:30, sep='')

# plot the dendrogram with color-coded groups
plotColoredClusters(hc, labs=labs, cols=cols)

#cleanup
rm(d1, d2, d3, dd, hc, cols, labs)

Limma: Linear Models for Microarray Data

www.statsci.org/smyth/pubs/limma-biocbook-reprint.pdf

www.mas.ncl.ac.uk/~ngl9/topics/inotes/TutorialMicroarrayAnalysis.pdf

http://www2.warwick.ac.uk/fac/sci/moac/students/peter_cock/r/geo/

Wednesday, March 2, 2011

R color map

types <- as.factor(c("a","b","b","a"))
c("red","blue")[types]
[1] "red" "blue" "blue" "red"
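For comparison, the same category-to-color lookup can be written in Python. This is a small sketch; sorting the distinct values mimics the alphabetical level order of an R factor, and the variable names are arbitrary.

```python
types = ["a", "b", "b", "a"]

# Sorted distinct values play the role of R's factor levels: ['a', 'b'].
levels = sorted(set(types))
palette = dict(zip(levels, ["red", "blue"]))

colors = [palette[t] for t in types]
print(colors)  # ['red', 'blue', 'blue', 'red']
```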