Wednesday, June 22, 2011

Fast interval intersection methodologies

http://biostar.stackexchange.com/questions/99/fast-interval-intersection-methodologies/101#101

Most genomic annotations are specified as intervals along the genome.
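
To make the idea of "intervals along the genome" concrete, here is a minimal sketch of an overlap test. It assumes BED-style coordinates (0-based start, exclusive end); the `Interval` type and the function names are hypothetical and are not part of BEDTools.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval in BED-style coordinates (hypothetical helper type)."""
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of bases shared by a and b (0 if they do not overlap)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))
```

With half-open coordinates, two features that merely touch (one ends at 20, the next starts at 20) share no bases, so `overlaps` returns False for them.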

http://code.google.com/p/bedtools/#Brief_example

Quinlan, AR and Hall, IM, 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 26, 6, pp. 841–842.


The BEDTools utilities allow one to address common genomics tasks such as finding feature overlaps and computing coverage. The utilities are largely based on four widely-used file formats: BED, GFF/GTF, VCF, and SAM/BAM. Using BEDTools, one can develop sophisticated pipelines that answer complicated research questions by "streaming" several BEDTools together. The following are examples of common questions that one can address with BEDTools.
  1. Intersecting two BED files in search of overlapping features.
  2. Culling/refining/computing coverage for BAM alignments based on genome features.
  3. Merging overlapping features.
  4. Screening for paired-end (PE) overlaps between PE sequences and existing genomic features.
  5. Calculating the depth and breadth of sequence coverage across defined "windows" in a genome.
  6. Screening for overlaps between "split" alignments and genomic features. 
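
As an illustration of item 3 (merging overlapping features), the following is a rough sketch of the usual sort-then-sweep approach. It is not the BEDTools implementation; `merge_intervals` is a hypothetical name, and treating features that merely touch as mergeable is a choice made for this sketch.

```python
def merge_intervals(intervals):
    """Merge overlapping (chrom, start, end) tuples.

    Sorting by chromosome and start means each feature only has to be
    compared against the most recently merged feature.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        # Sketch-level choice: start == previous end also merges.
        if last is not None and last[0] == chrom and start <= last[2]:
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]
```

The sort dominates the cost, so the whole operation is O(n log n) for n features.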
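Suppose you subsequently detect an artifact in which false positives are enriched among SNPs with coverage > 100. You can refine the query accordingly: first keep only SNPs whose coverage (column 5 of snps.bed) is below 100, then select the transitions (A/G or C/T) that fall in exons and do not overlap any entry in dbSnp130.bed.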
$ awk '$5 < 100' snps.bed | \
  egrep "A/G|C/T" | \
  intersectBed -a stdin -b exons.bed | \
  intersectBed -v -a stdin -b dbSnp130.bed \
  > bonafideNovelSnpsInExons.bed
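
For readers without BEDTools at hand, the logic of the pipeline above can be sketched in plain Python. This is only an approximation under stated assumptions: column 5 holds coverage (as the awk filter implies), the allele string such as A/G appears somewhere on the line (as the egrep filter implies), and each SNP is kept once if it overlaps any exon, whereas intersectBed by default reports the overlapping portion once per overlap. The function names and the toy records are invented for illustration.

```python
def overlaps_any(chrom, start, end, features):
    """True if [start, end) overlaps any (chrom, start, end) in features."""
    return any(c == chrom and start < e and s < end for c, s, e in features)


def novel_exonic_transitions(snp_lines, exons, known_snps, max_coverage=100):
    """Mimic: awk '$5 < 100' | egrep "A/G|C/T" | intersectBed | intersectBed -v."""
    kept = []
    for line in snp_lines:
        line = line.rstrip("\n")
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if not float(fields[4]) < max_coverage:       # awk '$5 < 100'
            continue
        if "A/G" not in line and "C/T" not in line:   # egrep "A/G|C/T"
            continue
        if not overlaps_any(chrom, start, end, exons):  # must hit an exon
            continue
        if overlaps_any(chrom, start, end, known_snps):  # -v: drop known SNPs
            continue
        kept.append(line)
    return kept


# Toy data, invented for illustration only.
snps = [
    "chr1\t100\t101\tA/G\t50",    # kept
    "chr1\t200\t201\tA/G\t150",   # coverage too high
    "chr1\t300\t301\tA/C\t20",    # not a transition
    "chr1\t400\t401\tC/T\t30",    # already in the known-SNP set
    "chr1\t5000\t5001\tC/T\t10",  # outside every exon
]
exons = [("chr1", 50, 450)]
known = [("chr1", 400, 401)]
novel = novel_exonic_transitions(snps, exons, known)
```

The linear scan in `overlaps_any` is fine for a toy example but is quadratic overall; for real data sets the point of tools like BEDTools is to avoid exactly that cost.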
