Just a collection of random cool stuff. PS: almost 99% of the content here is not mine and I don't take credit for it; I reference and copy parts of the interesting sections.
Tuesday, March 29, 2011
Saturday, March 26, 2011
Friday, March 25, 2011
Can't write to usb drive
1. check physical write protect on usb drive
It is quite clear that your disk is 1) full and 2) damaged.
OPTION 1: Repair
* restore the file system by running "dosfsck -a /dev/sdb1" (with the drive unmounted)
The -a option automatically repairs errors to make the filesystem consistent again. Lost clusters will be converted to files and keep taking up space, though some space may also be freed. After the disk check, remount the drive and run "df -h /dev/sdb1" again to check the free space; chances are it will already be more than 1%. (The -h switch means "human readable", so sizes are displayed in kB, MB, etc.) After that, clean up the drive by deleting what isn't necessary.
OPTION 2: Reformat altogether
sudo umount /dev/sdb1
sudo mkfs -t vfat /dev/sdb1
Warning: reformatting destroys all data on the disk! Double-check the device name before running mkfs.
Thursday, March 24, 2011
Dr. Seuss' Quotes
I meant what I said and I said what I meant.
I like nonsense; it wakes up the brain cells.
Be who you are and say what you feel because those who mind don't matter and those who matter don't mind.
Today was good. Today was fun. Tomorrow is another one.
You know you're in love when you can't fall asleep because reality is finally better than your dreams.
http://quotations.about.com/od/bookquotes/a/seuss1.htm
Tuesday, March 22, 2011
Multiz Threaded-Block-Alignment
http://www.ncbi.nlm.nih.gov/pubmed/15060014
MULTIZ uses pairwise alignments produced by BLASTZ. These are filtered to select the best match in the genome, so that each base in human aligns to a single base in the other species. We tried two different methods to filter these alignments, axtBest (Schwartz et al. 2003a) and the “net” approach described in Kent et al. 2003. In general axtBest aligns slightly more human bases, and the net approach is somewhat better at identifying orthologs. We prefer the latter.
http://www.bx.psu.edu/miller_lab/
http://www.bx.psu.edu/miller_lab/dist/tba_howto.pdf
The starting material for running TBA includes:
1. A set(s) of sequences from different species
2. An evolutionary tree describing the relationship of these species
3. A parameter file describing how to run blastz for different species. For example, you might want to treat alignments involving non-placental mammals differently from placental mammals. This file is optional.
Overview of running TBA
There are typically three steps involved with generating a multiple alignment:
1. generating a series of pair-wise alignments to “seed” the multiple alignment process (performed with the program all_bz, which essentially executes a series of blastz commands; the output is in "lav" format, which is converted to MAF)
2. generating the multiple alignment (tba, calls multiz)
3. “projecting” the alignment onto a reference sequence
http://genome.csdb.cn/cgi-bin/hgTrackUi?hgsid=446052&c=chrX&g=mostConserved28way
Trails - George Bernard Shaw
"Do not follow where the path may lead. Go instead where there is no path and leave a trail."
George Bernard Shaw
Monday, March 21, 2011
Sunday, March 20, 2011
Saturday, March 19, 2011
Repeats
http://rsat.ulb.ac.be/help.retrieve-ensembl-seq.html
The presence of repetitive elements hampers the detection of motifs, especially for vertebrate genomes, because these repetitive sequences have a composition very distinct from the rest of the genome.
http://genes.mit.edu/Repeats.html
The presence of certain types of repetitive elements in a sequence may sometimes distort the results of GENSCAN. In particular, L1 elements are often predicted as genes. To avoid this potential problem, you may wish to pre-screen for repetitive elements with a program like RepeatMasker or CENSOR, which replace sequence segments matching any of a set of elements common to your organism (e.g., Alu, L1, etc.) with the same number of asterisks or 'N's.
Another option is to filter out repeats after running GENSCAN, e.g. to screen GENSCAN predicted peptides against a database of repeat sequences translated in all six frames.
Rogozin et al. Briefings in Bioinformatics. 2000
This model suggests that most SINEs in the mammalian genomes are pseudogenes, and not capable of producing copies.
SINEs can affect the function and recombination of surrounding sequences.
Thus, insertions of SINEs (as well as other types of genome mutations) can affect the long-term adaptability of the species in various ways [1,58].
The database of repetitive elements (Repbase) [66]
SINEs are almost always a signature of non-coding DNA.
Incomplete SINEs or highly divergent sequences can create some problems for prediction.
Undiscovered repetitive elements, in poorly characterised genomes, can create serious problems for computational functional mapping of sequences.
The presence of repetitive elements can create serious problems for sequence analysis, especially in homology searches in nucleotide sequence databases. It is difficult to interpret database search results if the output is saturated by a number of highly scored matches with repetitive elements which are widely present in nucleotide sequence databases.
Wednesday, March 16, 2011
Open world assumption (OWA) vs Closed world assumption (CWA)
"The closed world assumption implies that everything we don't know is false, while the open world assumption states that everything we don't know is undefined." - Stefano Mazzocchi, Closed World vs. Open World: the First Semantic Web Battle
Open world assumption (OWA) - e.g. OWL (Web Ontology Language)
Closed world assumption (CWA) - e.g. SQL (Structured Query Language)
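A small sketch of the difference, assuming a toy fact store. The names FACTS, ask_closed_world and ask_open_world are made up for illustration and do not come from OWL or SQL. The same missing fact is answered "false" under CWA and "unknown" under OWA.

```python
# Toy knowledge base: the only facts we have been told (hypothetical data).
FACTS = {("alice", "citizen_of", "canada")}

def ask_closed_world(fact):
    """CWA (SQL-like): anything not recorded is treated as false."""
    return fact in FACTS

def ask_open_world(fact):
    """OWA (OWL-like): anything not recorded is unknown, not false."""
    return True if fact in FACTS else None  # None stands for "unknown"

known = ("alice", "citizen_of", "canada")
missing = ("alice", "citizen_of", "france")

print(ask_closed_world(known), ask_closed_world(missing))
print(ask_open_world(known), ask_open_world(missing))
```

Under the open world reading, the absence of the second fact says nothing about whether it is true, so a reasoner cannot conclude that it is false.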
http://www.w3.org/2007/OWL/wiki/Implementations
http://protege.stanford.edu/overview/protege-owl.html
Tuesday, March 15, 2011
Monday, March 14, 2011
Genome sizes
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/G/GenomeSizes.html
Species                       Genome size (bp)   Gene count
Humans                        3.3 x 10^9         ~21,000
Mouse                         3.4 x 10^9         ~23,000
Zebrafish                     1.2 x 10^9         15,761
Caenorhabditis elegans        100,258,171        21,733
E. coli K-12                  4,639,221          4,377
Mycobacterium tuberculosis    4,411,532          3,959
http://cbs.ym.edu.tw/cbs-01/index.php?option=com_content&view=article&id=184&Itemid=323
de novo assembly
Velvet
de Bruijn / Eulerian - converts the Hamiltonian path problem into an Eulerian path problem, because an Eulerian path can be found efficiently whereas the Hamiltonian path problem is NP-complete; takes a lot of memory
greedy
overlap, layout, consensus (OLC) - used with Sanger reads; can't handle large numbers of sequences, so not good for gigabytes of short reads, but theoretically better at assembling as it allows for more parameters to configure
hybrid approaches - use overlap-layout-consensus on Sanger reads to build scaffolds and extend with de Bruijn
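The de Bruijn idea can be sketched in a few lines. This is a toy illustration under strong assumptions (error-free reads, no repeats, a single strand), not how Velvet or ABySS are implemented. The reads and function names are invented for the example. Each distinct k-mer becomes an edge between its two (k-1)-mers, and an Eulerian path through the edges spells out the contig.

```python
from collections import defaultdict

def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:          # one edge per distinct k-mer
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists."""
    edges = {node: list(targets) for node, targets in graph.items()}
    indegree = defaultdict(int)
    for targets in edges.values():
        for target in targets:
            indegree[target] += 1
    # Start at a node with more outgoing than incoming edges, else anywhere.
    start = next((n for n in edges if len(edges[n]) > indegree[n]),
                 next(iter(edges)))
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    return path[::-1]

# Hypothetical overlapping reads from the sequence ATGGCGTGCAA.
reads = ["ATGGC", "GGCGT", "CGTGC", "TGCAA"]
path = eulerian_path(de_bruijn_graph(reads, k=4))
contig = path[0] + "".join(node[-1] for node in path[1:])
print(contig)
```

With a smaller k (for example k=3) the same reads contain repeated (k-1)-mers, so more than one Eulerian path exists and the reconstruction becomes ambiguous. That is one reason the choice of k matters.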
ABySS (Assembly By Short Sequences) (http://genome.cshlp.org/content/19/6/1117.long, http://seqanswers.com/wiki/ABySS)
- parallelized
- uniform coverage is key
Coverage can be of two types: expected / theoretical coverage and actual coverage
coverage bias, from lowest to highest: 3rd-gen sequencing (single molecule only, no PCR amplification needed) < Illumina ~ 454 < SOLiD < Sanger
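To make the two kinds of coverage concrete, here is a minimal sketch. It uses the standard average-depth formula (number of reads x read length / genome length) for expected coverage, and a toy per-base count for actual coverage. The read positions and sizes are hypothetical.

```python
def expected_coverage(num_reads, read_length, genome_length):
    """Theoretical average depth: total sequenced bases / genome length."""
    return num_reads * read_length / genome_length

def actual_coverage(read_starts, read_length, genome_length):
    """Observed depth at each position, given where reads actually landed."""
    depth = [0] * genome_length
    for start in read_starts:
        for pos in range(start, min(start + read_length, genome_length)):
            depth[pos] += 1
    return depth

# Hypothetical toy numbers: 6 reads of 5 bp on a 20 bp genome.
starts = [0, 1, 2, 3, 10, 15]
print(expected_coverage(len(starts), 5, 20))
depth = actual_coverage(starts, 5, 20)
print(depth)
print(min(depth), max(depth))
```

The average depth matches the expected value, but individual positions range from uncovered to several reads deep, which is the kind of non-uniformity that coverage bias makes worse.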
Trans-ABySS
- http://www.nature.com/nmeth/journal/v7/n11/full/nmeth.1517.html
- http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
- assembles transcriptomes (RNA-seq), where coverage is non-uniform
- uses a range of k values (26-50 bp) to handle variable transcript expression
- k optimization by iteratively decreasing k, subtracting out matched reads at each step
the number of unique k-mers plateaus at about the length of the genome
- Assembly N50 values (the contig length for which 50% of the sequence in an assembly is in contigs of this size or larger) were highest for intermediate k values, with a maximum of 1,458 bp at k = 39 bp
One of the challenges with most assemblers is figuring out which parameters to use, in particular picking the right k-mer length k (the length in bp of the overlapping substrings)
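The N50 statistic mentioned above can be computed directly from a list of contig lengths. A minimal sketch, with made-up contig lengths:

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly

# Hypothetical assembly: total 1500 bp, so half is 750 bp.
print(n50([100, 200, 300, 400, 500]))
```

When comparing assemblies produced with different k values, the same function can be applied to each assembly's contig lengths and the results compared.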
Sunday, March 13, 2011
News
March 11, 2011 - Northern Japan 9.0 magnitude earthquake
http://www.cnn.com/2011/WORLD/asiapcf/03/13/japan.quake/index.html?hpt=T1
January 2011 - Queensland, Australia flood
http://www.bbc.co.uk/news/world-asia-pacific-12260724
Saturday, March 12, 2011
454 Illumina Solexa comparisons
Next-gen sequencing
http://seqanswers.com/forums/attachment.php?attachmentid=365&d=1274818435
http://www.nature.com.proxy.lib.sfu.ca/nrg/journal/v11/n1/abs/nrg2626.html
http://www.nature.com.proxy.lib.sfu.ca/nature/journal/v452/n7189/abs/nature06884.html
http://hmg.oxfordjournals.org.proxy.lib.sfu.ca/content/19/R2/R227.full (Third gen sequencing)
http://www.biostat.jhsph.edu/~hji/courses/genomics/Sequencing.ppt Excellent summary of Next-gen
emulsion PCR
- ABI SOLiD and 454 pyrosequencing
- only one of the primers is attached to the microbead
bridge PCR / cluster PCR / immobilized PCR
- Solexa / Illumina
- both primers / adapters are attached to the flow cell surface
Long-Range PCR - has proofreading
http://www.springerprotocols.com.proxy.lib.sfu.ca/Abstract/doi/10.1385/1-59259-273-2:051
Ligation-mediated PCR / Linker-mediated PCR
http://nar.oxfordjournals.org.proxy.lib.sfu.ca/content/24/8/1547.full
Wednesday, March 9, 2011
Puppy linux optimizations / speed up
http://www.murga-linux.com/puppy/viewtopic.php?p=319533
http://www.puppylinux.com/hard-puppy.htm
http://pupweb.org/wikka/BootParms?show_comments=1
syslinux.cfg and isolinux.cfg
default puppy
display /boot.msg
prompt 1
timeout 80
F1 /boot.msg
F2 /help.msg
label puppy
kernel /vmlinuz
append initrd=/initrd.gz pmedia=usbflash
# Use usbflash for SDHC cards
http://puppeee.com/web/documentation/p_3/
Choose Xvesa over Xorg.
Try the new Zdrv Cutter utility
Use Flwm instead of IceWM.
ABySS: A parallel assembler for short read sequence data
Jared T. Simpson, Kim Wong, Shaun D. Jackman, Jacqueline E. Schein, Steven J.M. Jones, and İnanç Birol
Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia V5Z 4E6, Canada
Abstract
Widespread adoption of massively parallel deoxyribonucleic acid (DNA) sequencing instruments has prompted the recent development of de novo short read assembly algorithms. A common shortcoming of the available tools is their inability to efficiently assemble vast amounts of data generated from large-scale sequencing projects, such as the sequencing of individual human genomes to catalog natural genetic variation. To address this limitation, we developed ABySS (Assembly By Short Sequences), a parallelized sequence assembler. As a demonstration of the capability of our software, we assembled 3.5 billion paired-end reads from the genome of an African male publicly released by Illumina, Inc. Approximately 2.76 million contigs ≥100 base pairs (bp) in length were created with an N50 size of 1499 bp, representing 68% of the reference human genome. Analysis of these contigs identified polymorphic and novel sequences not present in the human reference assembly, which were validated by alignment to alternate human assemblies and to other primate genomes.
Prognosis vs Diagnosis
Whereas diagnostic models are usually used for classification and for reasoning about cause and effect, prognostic models incorporate the dimension of time, adding a stochastic element, e.g. "45% of patients with severe septic shock will die within 28 days".
Tuesday, March 8, 2011
Ergodic - any tree topology can be transformed into any other tree topology
SimulFold: Simultaneously Inferring RNA Structures Including Pseudoknots, Alignments, and Trees Using a Bayesian MCMC Framework
Irmtraud M. Meyer, Istvan Miklos
PLoS Computational Biology, August 2007, Volume 3, Issue 8, e149
To summarize, all of the existing RNA structure prediction programs face at least one of the following challenges: (1) the MFE structure rather than the evolutionarily conserved structure that is likely to correspond to the functional structure is predicted, (2) unstructured regions of the RNA are not explicitly modeled, (3) input alignments are fixed and cannot be altered and improved, (4) pseudoknotted structures are either completely ignored or computationally too expensive to predict, (5) only two evolutionarily related RNA sequences are used as input, or (6) the evolutionary relationship between the RNA sequences is not explicitly modeled.
The idea of co-estimating RNA secondary structures, multiple sequence alignments, and evolutionary trees was first suggested in a theory paper by David Sankoff in 1985 [50].
We introduce a joint distribution of RNA structures, alignments, and trees in a Bayesian framework. As it is not feasible to analytically calculate any interesting statistics in this model in reasonable computational time, we propose a Markov chain Monte Carlo (MCMC) method with which we can sample from the posterior distribution.
For changing the topology of the tree, we pick a tree node at random and swap this node and its aunt node to alter its topology (see Figure 2). These moves have been shown [68,71] to be ergodic, i.e., any tree topology can be transformed into any other tree topology using these moves.
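The node/aunt swap described above can be sketched on a small rooted binary tree. This is only an illustration of the move itself, assuming a simple dictionary representation. The tree, node names and function name are hypothetical and are not taken from SimulFold.

```python
def swap_with_aunt(children, parent, node):
    """Swap `node` with its aunt (its parent's sibling), editing the tree in place."""
    p = parent[node]
    g = parent[p]                      # node must have a grandparent
    aunt = next(c for c in children[g] if c != p)
    # node moves up under the grandparent; the aunt moves down under the parent
    children[p][children[p].index(node)] = aunt
    children[g][children[g].index(aunt)] = node
    parent[node], parent[aunt] = g, p

# Hypothetical rooted binary tree: (A, (B, (C, D)))
children = {"root": ["A", "x"], "x": ["B", "y"], "y": ["C", "D"]}
parent = {c: p for p, kids in children.items() for c in kids}

swap_with_aunt(children, parent, "C")   # C's parent is y, its aunt is B
print(children)
```

After the swap the tree reads (A, (C, (B, D))): the set of leaves is unchanged and only the topology differs. In an MCMC sampler such a move would be proposed for a randomly chosen node that has a grandparent.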
Monday, March 7, 2011
Sunday, March 6, 2011
Mutations
- Chris Baldi, Soochin Cho, and Ronald E. Ellis, “Mutations in Two Independent Pathways Are Sufficient to Create Hermaphroditic Nematodes,” Science 326, no. 5955 (November 13, 2009): 1002-1005.
- Alan Charest et al., “Oncogenic targeting of an activated tyrosine kinase to the Golgi apparatus in a glioblastoma,” Proceedings of the National Academy of Sciences of the United States of America 100, no. 3 (February 4, 2003): 916-921.
- People with one sickle-cell allele are more resistant to malaria
Thursday, March 3, 2011
OOMPA: Object-Oriented Microarray and Proteomic Analysis
http://bioinformatics.mdanderson.org/Software/OOMPA/
Package ClassDiscovery
Colored dendrogram
# install OOMPA first (needed for plotting colored dendrograms)
source("http://bioinformatics.mdanderson.org/OOMPA/oompaLite.R")
oompaLite()
oompainstall()
library(ClassDiscovery)
# simulate data from three different groups
d1 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
d2 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
d3 <- matrix(rnorm(100*10, rnorm(100, 0.5)), nrow=100, ncol=10, byrow=FALSE)
dd <- cbind(d1, d2, d3)
# perform hierarchical clustering using correlation
hc <- hclust(distanceMatrix(dd, 'pearson'), method='average')
# one color and one label per column (10 columns per group)
cols <- rep(c('red', 'green', 'blue'), each=10)
labs <- paste('X', 1:30, sep='')
# plot the dendrogram with color-coded groups
plotColoredClusters(hc, labs=labs, cols=cols)
# cleanup
rm(d1, d2, d3, dd, hc, cols, labs)
Wednesday, March 2, 2011
R color map
types <- as.factor(c("a","b","b","a"))
# indexing by a factor uses its integer level codes (a = 1, b = 2)
c("red","blue")[types]
[1] "red" "blue" "blue" "red"
Tuesday, March 1, 2011