Monday, March 14, 2011

de novo assembly

Velvet

de Bruijn / Eulerian - recast assembly from a Hamiltonian path problem (NP-hard) as an Eulerian path problem, because an Eulerian path can be found exactly in time linear in the number of edges; the graph takes a lot of memory
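
A minimal sketch of the de Bruijn / Eulerian idea, assuming error-free reads, a single strand, and a graph that actually has an Eulerian path. The function names are made up for illustration, and Hierholzer's algorithm is used here as one standard way to find the path; real assemblers such as Velvet do far more (error correction, bubble removal, read pairing).

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Each distinct k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:          # keep one edge per distinct k-mer
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; assumes a non-empty graph with an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start at the node with one more outgoing than incoming edge, if there is one.
    start = next(iter(graph))
    for node, targets in graph.items():
        if len(targets) - in_deg[node] == 1:
            start = node
            break
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())   # follow an unused edge
        else:
            path.append(stack.pop())              # dead end: emit the node
    return path[::-1]

def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn_edges(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])

print(assemble(["ATGGCGT", "GGCGTGC", "GTGCA"], 4))
```

Every edge is visited once, so the walk is linear in the number of distinct k-mers; the memory cost comes from holding all of those k-mers at once.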

greedy

overlap, layout, consensus (OLC) - used for Sanger reads; can't handle a large number of sequences, so not good for gigabytes of short reads, but theoretically better at assembling because it allows more parameters to configure
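
To illustrate why the overlap step struggles with large read sets, here is a rough sketch of the all-against-all suffix/prefix comparison. It assumes exact matches only; `overlap`, `overlap_graph`, and the `min_len` cutoff are hypothetical names chosen for this example, not taken from any particular assembler.

```python
from itertools import permutations

def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if below min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def overlap_graph(reads, min_len=3):
    """Compare every ordered pair of reads: the number of pairs grows quadratically."""
    edges = {}
    for a, b in permutations(reads, 2):
        n = overlap(a, b, min_len)
        if n > 0:
            edges[(a, b)] = n
    return edges

reads = ["ATGGCG", "GCGTGC", "TGCAAT"]
for (a, b), n in overlap_graph(reads).items():
    print(a, "->", b, "overlap", n)
```

With N reads this makes N * (N - 1) comparisons, which is manageable for Sanger-scale projects but not for hundreds of millions of short reads without indexing tricks.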

hybrid approaches - use overlap-layout-consensus on Sanger reads to build scaffolds, then extend them with de Bruijn assembly of short reads

ABySS (Assembly By Short Sequences) (http://genome.cshlp.org/content/19/6/1117.long, http://seqanswers.com/wiki/ABySS)
- parallelized
- Uniform coverage is key
Coverage can be of two types: expected (theoretical) coverage and actual coverage
lowest coverage bias: 3rd gen sequencing (only single molecule, no PCR amplification needed) < Illumina ~ 454 < SOLiD < Sanger < highest coverage bias
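
The difference between expected and actual coverage can be sketched as follows. Expected coverage is taken here as the usual estimate (number of reads x read length / genome length); actual coverage is the per-base depth counted from where reads really land. The function names and the toy read positions are assumptions for illustration only.

```python
def expected_coverage(n_reads, read_len, genome_len):
    """Theoretical average depth: total sequenced bases divided by genome length."""
    return n_reads * read_len / genome_len

def actual_coverage(read_starts, read_len, genome_len):
    """Per-base depth, given the 0-based start position of each aligned read."""
    depth = [0] * genome_len
    for start in read_starts:
        for pos in range(start, min(start + read_len, genome_len)):
            depth[pos] += 1
    return depth

starts = [0, 0, 1, 5]                      # toy alignment of four 5 bp reads
print(expected_coverage(len(starts), 5, 10))
print(actual_coverage(starts, 5, 10))
```

The average of the per-base depths matches the expected value, but the individual positions do not, and that spread is what coverage bias makes worse.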

Trans-ABySS
- http://www.nature.com/nmeth/journal/v7/n11/full/nmeth.1517.html
- http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
- from transcriptomes (RNA-seq), non-uniform coverage
- uses a range of k-values (26-50bp) (to handle variable transcript expression)
- k optimization by iteratively decreasing k, subtracting out the matched reads at each step
the number of unique k-mers plateaus at roughly the length of the genome
- Assembly N50 values (the contig length such that 50% of the assembled sequence is in contigs of that length or longer) were highest for intermediate k values, with a maximum of 1,458 bp at k = 39 bp
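
The N50 definition above can be written out as a short sketch. The function name `n50` and the toy contig lengths are illustrative only.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0   # empty assembly

print(n50([2, 3, 4, 5, 6]))
```

Here the total is 20 bp; the two longest contigs (6 + 5 = 11 bp) already cover half of it, so the N50 is 5.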

One of the challenges with most assemblers is figuring out which parameters to use, in particular picking the right k (the k-mer length in bp, i.e. the length of the overlapping substrings)
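
One simple way to get a feel for the effect of k is to count distinct k-mers over a range of k values. This is only a sketch under the assumption of error-free reads; `distinct_kmers` and `sweep_k` are hypothetical helper names, not part of ABySS or Trans-ABySS.

```python
def distinct_kmers(reads, k):
    """Number of different k-mers found across all reads."""
    return len({read[i:i + k] for read in reads for i in range(len(read) - k + 1)})

def sweep_k(reads, k_values):
    """Distinct k-mer count for each candidate k."""
    return {k: distinct_kmers(reads, k) for k in k_values}

reads = ["ATGGCGT", "GGCGTGC", "GTGCA"]   # toy reads from a 10 bp sequence
for k, count in sweep_k(reads, [2, 3, 4]).items():
    print(k, count)
```

When k is too small, repeated k-mers collapse together and the count falls below what the sequence length would predict; once k is large enough to make every k-mer unique, the count tracks the sequence length (minus k, plus 1).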
