http://www.ncbi.nlm.nih.gov/sites/books/NBK158900/
./sratoolkit.2.4.2-ubuntu64/bin/fastq-dump --split-3 -L info --gzip -v SRR1131659
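# Notes added (not in the original post): --split-3 writes read 1 and read 2 of each
# pair to separate _1/_2 FASTQ files plus a third file for any unpaired mates;
# -L info sets the log level; --gzip compresses the output; -v is verbose.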
Just a collection of some random cool stuff. PS: Almost all of the content here is not mine and I don't take credit for it; I reference and copy parts of the interesting sections.
Tuesday, December 16, 2014
Friday, December 12, 2014
Oracle Health Sciences Translational Research Center: A Translational Medicine Platform to Address the Big Data Challenge
http://www.oracle.com/us/industries/healthcare/translational-medicine-platform-wp-1840042.pdf
From Oracle's white paper on the Oracle Health Sciences Translational Research Center, a scalable informatics solution for translational research: its back-end data components seamlessly integrate clinical and omics data from diverse clinical data sources as well as from vendor-specific and modality-specific omics data silos, providing standardized data readily available to the front-end application.
Tuesday, December 9, 2014
Finding overlap between two lists in Excel
Function
=COUNTIF(C:C,K4)
Where C:C contains one list and K4 is the value to look up in it. Simply drag this formula down for all the other values in column "K"; a non-zero count means the value appears in both lists.
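To get a yes/no flag instead of a raw count, a variation like this should work (same layout assumed):
=IF(COUNTIF(C:C,K4)>0,"in both","only in K")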
Monday, December 8, 2014
Eclipse misbehaving
If Eclipse is misbehaving for some reason (context menus not appearing, right-click doing nothing, ...):
backup and remove ~/workspace/.metadata/.plugins
Or
Switch to a new workspace
Friday, December 5, 2014
Compound heterozygote
Compound heterozygote: The presence of two different mutant alleles at a particular gene locus, one on each chromosome of a pair.
The human genome contains two copies of each gene, a paternal and a maternal allele. A mutation affecting only one allele is called heterozygous. A homozygous mutation is the presence of the identical mutation on both alleles of a specific gene. However, when both alleles of a gene harbor mutations but the mutations are different, they are called compound heterozygous. Also called a genetic compound.
An individual who has two different abnormal alleles at a particular locus, one on each chromosome of a pair; usually refers to individuals affected with an autosomal recessive disorder
or the two mutations can be at different loci within the same gene.
Syndromic vs non-syndromic (not associated with anything else, specific)
http://ghr.nlm.nih.gov/condition/nonsyndromic-deafness
Nonsyndromic deafness is hearing loss that is not associated with other signs and symptoms. In contrast, syndromic deafness involves hearing loss that occurs with abnormalities in other parts of the body. Different types of nonsyndromic deafness are named according to their inheritance patterns.
Tuesday, December 2, 2014
Javascript Closures -- callbacks in a loop
http://geekabyte.blogspot.ca/2013/04/callback-functions-in-loops-in.html
Solutions for callbacks in a loop (sketched below):
1. callbacks
2. recursive loop
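A minimal sketch (mine, in TypeScript, not taken from the linked post) of the classic problem and my reading of the two fixes listed above:

// The classic bug: `var` is function-scoped and the loop has already finished
// by the time the callbacks run, so every callback sees the final value of i.
function broken(): void {
  for (var i = 0; i < 3; i++) {
    setTimeout(() => console.log("broken:", i), 10); // prints 3, 3, 3
  }
}

// Fix 1 ("callbacks"): build the callback in a separate function so each
// iteration gets its own closure over the current value.
function makeLogger(value: number): () => void {
  return () => console.log("fixed:", value);
}
function fixedWithFactory(): void {
  for (var i = 0; i < 3; i++) {
    setTimeout(makeLogger(i), 10); // prints 0, 1, 2
  }
}

// Fix 2 ("recursive loop"): replace the for-loop with a function that only
// schedules the next iteration from inside the current callback.
function fixedWithRecursion(i: number = 0): void {
  if (i >= 3) return;
  setTimeout(() => {
    console.log("recursive:", i);
    fixedWithRecursion(i + 1);
  }, 10);
}

broken();
fixedWithFactory();
fixedWithRecursion();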
Friday, November 28, 2014
Continuous integration servers
Jenkins
http://jenkins-ci.org/
Bamboo
https://confluence.atlassian.com/display/BAMBOO/Bamboo+Documentation+Home
Wednesday, November 26, 2014
PLOS Computational Biology: Translational Bioinformatics
http://www.ploscollections.org/article/browseIssue.action?issue=info:doi/10.1371/issue.pcol.v03.i11
Let's Make Those Book Chapters Open Too!
Philip E. Bourne
PLOS Computational Biology: published 21 Feb 2013 | info:doi/10.1371/journal.pcbi.1002941
Education Articles
Introduction to Translational Bioinformatics Collection
Russ B. Altman
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002796
Chapter 1: Biomedical Knowledge Integration
Philip R. O. Payne
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002826
Chapter 2: Data-Driven View of Disease Biology
Casey S. Greene, Olga G. Troyanskaya
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002816
Chapter 3: Small Molecules and Disease
David S. Wishart
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002805
Chapter 4: Protein Interactions and Disease
Mileidy W. Gonzalez, Maricel G. Kann
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002819
Chapter 5: Network Biology Approach to Complex Diseases
Dong-Yeon Cho, Yoo-Ah Kim, Teresa M. Przytycka
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002820
Chapter 6: Structural Variation and Medical Genomics
Benjamin J. Raphael
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002821
Chapter 7: Pharmacogenomics
Konrad J. Karczewski, Roxana Daneshjou, Russ B. Altman
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002817
Chapter 8: Biological Knowledge Assembly and Interpretation
Ju Han Kim
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002858
Chapter 9: Analyses Using Disease Ontologies
Nigam H. Shah, Tyler Cole, Mark A. Musen
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002827
Chapter 10: Mining Genome-Wide Genetic Markers
Xiang Zhang, Shunping Huang, Zhaojun Zhang, Wei Wang
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002828
Chapter 11: Genome-Wide Association Studies
William S. Bush, Jason H. Moore
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002822
Chapter 12: Human Microbiome Analysis
Xochitl C. Morgan, Curtis Huttenhower
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002808
Chapter 13: Mining Electronic Health Records in the Genomics Era
Joshua C. Denny
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002823
Chapter 14: Cancer Genome Analysis
Miguel Vazquez, Victor de la Torre, Alfonso Valencia
PLOS Computational Biology: published 27 Dec 2012 | info:doi/10.1371/journal.pcbi.1002824
Chapter 15: Disease Gene Prioritization
Yana Bromberg
PLOS Computational Biology: published 25 Apr 2013 | info:doi/10.1371/journal.pcbi.1002902
Chapter 16: Text Mining for Translational Bioinformatics
K. Bretonnel Cohen, Lawrence E. Hunter
PLOS Computational Biology: published 25 Apr 2013 | info:doi/10.1371/journal.pcbi.1003044
Chapter 17: Bioimage Informatics for Systems Pharmacology
Fuhai Li, Zheng Yin, Guangxu Jin, Hong Zhao, Stephen T. C. Wong
PLOS Computational Biology: published 25 Apr 2013 | info:doi/10.1371/journal.pcbi.1003043
Thursday, November 20, 2014
threejs - makes WebGL - 3D in the browser - very easy
http://threejs.org/docs/index.html#Manual/Introduction/Creating_a_scene
Three.js is a library that makes WebGL (3D in the browser) very easy. While a simple cube in raw WebGL would take hundreds of lines of JavaScript and shader code, the Three.js equivalent is only a fraction of that.
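For a feel of how little code that is, here is roughly the "creating a scene" hello-world from the linked manual, written as TypeScript (the module import is my assumption; older examples simply load three.js with a script tag and use the global THREE):

import * as THREE from "three";

// Scene, camera and renderer are the three basic building blocks.
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(
  75, window.innerWidth / window.innerHeight, 0.1, 1000
);
const renderer = new THREE.WebGLRenderer();
renderer.setSize(window.innerWidth, window.innerHeight);
document.body.appendChild(renderer.domElement);

// A green cube: geometry + material combined into a mesh.
const geometry = new THREE.BoxGeometry(1, 1, 1);
const material = new THREE.MeshBasicMaterial({ color: 0x00ff00 });
const cube = new THREE.Mesh(geometry, material);
scene.add(cube);
camera.position.z = 5;

// Render loop: spin the cube a little on every animation frame.
function animate(): void {
  requestAnimationFrame(animate);
  cube.rotation.x += 0.01;
  cube.rotation.y += 0.01;
  renderer.render(scene, camera);
}
animate();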
Google Genomics
https://cloud.google.com/genomics/v1beta2/visualization
Google Genomics provides an API to store, process, explore, and share DNA sequence reads, reference-based alignments, and variant calls, using Google's cloud infrastructure.
Store alignments and variant calls for one genome or a million.
Process genomic data in batch by running principal component analysis or Hardy-Weinberg equilibrium, in minutes or hours, by using parallel computing frameworks like MapReduce.
Explore data by slicing alignments and variants by genomic range across one or multiple samples -- for your own algorithms or for visualization; or interactively process entire cohorts to find transition/transversion ratios, allelic frequency, genome-wide association and more using BigQuery.
Share genomic data with your research group, collaborators, the broader community, or the public. You decide.
Google Genomics is implementing the API defined by the Global Alliance for Genomics and Health for visualization, analysis and more. Compliant software can access Google Genomics, local servers, or any other implementation.
Cluster computing
https://spark.apache.org/downloads.html
Google BigQuery
https://cloud.google.com/bigquery/what-is-bigquery
Thursday, November 13, 2014
10 New Breakthrough Technologies 2014
http://www.technologyreview.com/lists/technologies/2014/
Agricultural Drones
Ultraprivate Smartphones
Brain Mapping
Neuromorphic Chips
Genome Editing
Microscale 3-D Printing
Mobile Collaboration
Oculus Rift
Agile Robots
Smart Wind and Solar Power
Monday, November 10, 2014
Alzheimer's drug sneaks through blood–brain barrier
Neurobiologist Ryan Watts and his colleagues at the biotechnology company Genentech in South San Francisco have sought to break through the barrier by exploiting transferrin, a protein that sits on the surface of blood vessels and carries iron into the brain. The team created an antibody with two ends. One end binds loosely to transferrin and uses the protein to transport itself into the brain. And once the antibody is inside, its other end targets an enzyme called β-secretase 1 (BACE1), which produces amyloid-β. Crucially, the antibody binds more tightly to BACE1 than to transferrin, and this pulls it off the blood vessel and into the brain. It locks BACE1 shut and prevents it from making amyloid-β.
http://www.nature.com/news/alzheimer-s-drug-sneaks-through-blood-brain-barrier-1.16291
Friday, October 10, 2014
R Graph Catalog
http://lsi.ubc.ca/resources/omics-phenotyping-portal/#R_Graph_Catalog
Statistics Training and Online Resources
R Graph Catalog
Recommended by Stefanie Butland, LSI
The R Graph Catalog is a visual index of over 100 graphs from the excellent book "Creating More Effective Graphs" by Naomi Robbins. Click on a graph thumbnail and you'll see the figure AND all the code necessary to reproduce the figure exactly with ggplot2, the R package written by Hadley Wickham. This is a resource for people who want to make a good graph and kind of know what it should look like … but they could really use an example to get started!
You can get the code for ALL figures and the infrastructure that makes the app from this repository on GitHub: https://github.com/jennybc/r-graph-catalog
The R Graph Catalog is maintained by Dr Jennifer Bryan, UBC Department of Statistics, and the initial work was facilitated by an NSERC Undergraduate Student Research Award to Joanna Zhao.
8 Realities of the Sequencing GWAS
http://massgenomics.org/2014/03/gwas-sequencing-realities.html
For several years, the genome-wide association study (GWAS) has served as the flagship discovery tool for genetic research, especially in the arena of common diseases. The wide availability and low cost of high-density SNP arrays made it possible to genotype 500,000 or so informative SNPs in thousands of samples. These studies spurred development of tools and pipelines for managing large-scale GWAS, and thus far they’ve revealed hundreds of new genetic associations.
As we all know, the cost of DNA sequencing has plummeted. Now it’s possible to do targeted, exome, or even whole-genome sequencing in cohorts large enough to power GWAS analyses. While we can leverage many of the same tools and approaches developed for SNP array-based GWAS, the sequencing data comes with some very important differences.
Wednesday, October 8, 2014
Tuesday, October 7, 2014
Monday, October 6, 2014
Friday, September 26, 2014
MySQL Spatial Queries
http://howto-use-mysql-spatial-ext.blogspot.ca/
https://docs.jboss.org/hibernate/orm/3.6/javadocs/org/hibernate/dialect/MySQLDialect.html
drop table test;
create table test (
id INT NOT NULL PRIMARY KEY,
location LINESTRING NOT NULL,
SPATIAL KEY sx_location (location)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
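-- Note (added): ENGINE=MyISAM because SPATIAL indexes were MyISAM-only in MySQL 5.x
-- at the time; InnoDB gained spatial index support later (MySQL 5.7).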
set @a = linestring(point(1,1),point(5,1));
set @b = linestring(point(1,1),point(10,1));
insert into test values(1, @a);
insert into test values(2, @b);
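-- Note (added): the MBR* functions below compare the minimum bounding rectangles
-- of the two geometries, not their exact shapes.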
select * from test where MBRtouches(location, linestring(point(1,1),point(5,1)) );
select * from test where MBRtouches(location, linestring(point(2,1),point(3,1)) );
select * from test where MBRtouches(location, linestring(point(1,1),point(2,1)) );
select * from test where MBRtouches(location, linestring(point(1,1),point(7,1)) );
select * from test where MBRtouches(location, linestring(point(-1,1),point(0,1)) );
select * from test where MBRtouches(location, linestring(point(0,1),point(1,1)) );
select * from test where MBRtouches(location, linestring(point(7,1),point(12,1)) );
select * from test where MBRtouches(location, linestring(point(12,1),point(20,1)) );
select MBRcontains(@b,@a);
select MBRcontains(@a,@b);
select MBRoverlaps(@a,@b);
select MBRoverlaps(@b,@a);
select MBRtouches(@a,@b);
org.hibernate.dialect.MySQLDialect (known subclasses: MySQL5Dialect, MySQLInnoDBDialect, MySQLMyISAMDialect)
Hibernate SQL statement debug
log4j.logger.org.hibernate.SQL=TRACE
log4j.logger.org.hibernate.type.descriptor.sql.BasicBinder=TRACE
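# Note (added): org.hibernate.SQL at DEBUG/TRACE logs the generated SQL statements;
# BasicBinder at TRACE logs the actual values bound to the ? placeholders.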
Wednesday, September 10, 2014
Microarray vs RNA-seq, Normalization
John Storey provides his take on the importance of new statistical methods for high-throughput sequencing.
http://www.nature.com/nbt/journal/v29/n4/full/nbt.1831.html
Honing our reading skills
http://www.nature.com/nbt/journal/v32/n9/full/nbt.3021.html?WT.ec_id=NBT-201409
Nature Biotechnology Contents: Volume 32 pp 700 - 960
http://mabsj2.blogspot.ca/2014/09/nature-biotechnology-contents-volume-32.html
The devil in the details of RNA-seq
http://www.nature.com/nbt/journal/v32/n9/full/nbt.3015.html#affil-auth
RNA-seq is clearly superior to microarrays in its ability to discover and detect genes de novo, especially those with low expression levels. The detection of alternative splicing patterns is possible, but attention needs to be paid to the underlying gene annotation, and parameters such as mapping and error rates become more important than sequencing depth.
Detecting and correcting systematic variation in large-scale RNA sequencing data
Sheng Li, Paweł P Łabaj, Paul Zumbo, Peter Sykacek, Wei Shi, Leming Shi, John Phan, Po-Yen Wu, May Wang, Charles Wang, Danielle Thierry-Mieg, Jean Thierry-Mieg, David P Kreil & Christopher E Mason
Nature Biotechnology 32, 888–895 (2014) doi:10.1038/nbt.3000
High-throughput RNA sequencing (RNA-seq) enables comprehensive scans of entire transcriptomes, but best practices for analyzing RNA-seq data have not been fully defined, particularly for data collected with multiple sequencing platforms or at multiple sites. Here we used standardized RNA samples with built-in controls to examine sources of error in large-scale RNA-seq studies and their impact on the detection of differentially expressed genes (DEGs). Analysis of variations in guanine-cytosine content, gene coverage, sequencing error rate and insert size allowed identification of decreased reproducibility across sites. Moreover, commonly used methods for normalization (cqn, EDASeq, RUV2, sva, PEER) varied in their ability to remove these systematic biases, depending on sample complexity and initial data quality. Normalization methods that combine data from genes across sites are strongly recommended to identify and remove site-specific effects and can substantially improve RNA-seq studies.
http://www.nature.com/nbt/journal/v32/n9/full/nbt.3000.html?WT.ec_id=NBT-201409
http://www.nature.com/nbt/journal/v32/n9/full/nbt.2931.html?WT.ec_id=NBT-201409
http://www.nature.com/nbt/journal/v32/n9/full/nbt.2957.html?WT.ec_id=NBT-201409
http://www.nature.com/nbt/journal/v32/n9/full/nbt.2972.html?WT.ec_id=NBT-201409
http://www.nature.com/nbt/journal/v32/n9/full/nbt.3001.html?WT.ec_id=NBT-201409
Friday, August 29, 2014
Eclipse hotkeys
Shift+hover over a function shows its code
Alt+Left, Alt+Right: go back/forward to the previous view
F3: open declaration
Ctrl+F3: outline of the element at the cursor
Ctrl+hover: open declaration or implementation
Monday, August 25, 2014
RESTful web api for getting gene information
NCBI's E-Utils
BRCA1
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=672
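A quick sketch (mine) of calling that esummary endpoint from TypeScript; adding retmode=json asks E-utils for JSON instead of XML, and the field names pulled out at the end are assumptions based on typical gene summaries:

// Fetch the Entrez Gene summary for BRCA1 (Gene ID 672) as JSON.
// Assumes a runtime with a global fetch (a modern browser or Node 18+).
const esummaryUrl =
  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi" +
  "?db=gene&id=672&retmode=json";

async function printGeneSummary(): Promise<void> {
  const response = await fetch(esummaryUrl);
  const data = await response.json();
  const gene = data.result?.["672"]; // esummary keys results by the requested ID
  console.log(gene?.name, "-", gene?.description);
}

printGeneSummary().catch(console.error);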
BioMart
http://central.biomart.org/martwizard/#!/Search_by_database_name?mart=Hugo+Gene+Nomenclature+(HGNC)+(EBI%2C+UK)&step=1&datasets=hgnc
<!DOCTYPE Query><Query client="true" processor="TSV" limit="-1" header="1"><Dataset name="hgnc" config="hgnc_config_1"><Filter name="gd_status" value="Approved" filter_list=""/><Filter name="gd_app_sym" value="BRCA1" filter_list=""/><Attribute name="gd_aliases"/></Dataset></Query>
HGNC
http://www.genenames.org/cgi-bin/hgnc_downloads.cgi?title=HGNC+output+data&hgnc_dbtag=on&col=gd_app_sym&col=gd_aliases&status=Approved&status=Entry+Withdrawn&status_opt=2&where=&order_by=gd_app_sym_sort&format=text&limit=&submit=submit&.cgifields=&.cgifields=chr&.cgifields=status&.cgifields=hgnc_dbtag
Monday, August 18, 2014
One Codex Wants To Be The Google For Genomic Data
http://techcrunch.com/2014/08/15/one-codex-wants-to-be-the-google-for-genomic-data/
As hospitals and public health organizations switch to using genomic data for testing, searching through genomic data can still take some time. Y Combinator-backed startup, One Codex, wants to help researchers, clinicians and public health officials, who have sequenced more than 100,000 genomes and created petabytes of data, to search this data.
For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights
http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0
Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.
ClearStory Data, a start-up in Palo Alto, Calif., makes software that recognizes many data sources, pulls them together and presents the results visually as charts, graphics or data-filled maps. Its goal is to reach a wider market of business users beyond data masters.
Trifacta makes a tool for data professionals. Its software employs machine-learning technology to find, present and suggest types of data that might be useful for a data scientist to see and explore, depending on the task at hand.
Friday, August 1, 2014
Human Longevity Project
http://www.genomeweb.com/blog/deeper-bench
With the launch of his new company, Human Longevity, this year, Venter aims to not only sequence tens of thousands of people, but also collect physiological data such as how much blood their heart can pump and brain size. So far, he tells Tech Review that his company has sequenced 500 people who are now beginning to undergo those additional tests.
"Google Translate started as a slow algorithm that took hours or days to run and was not very accurate. But Franz [Och] built a machine-learning version that could go out on the Web and find every article translated from German to English or vice versa, and learn from those," Venter says. "And then it was optimized, so it works in milliseconds."
Thursday, July 31, 2014
Cancer biomarkers: Written in blood
http://www.nature.com/news/cancer-biomarkers-written-in-blood-1.15624
DNA circulating in the bloodstream could guide cancer treatment — if researchers can work out how best to use it.
But researchers have found ways to get a richer view of a patient's cancer, and even track it over time. When cancer cells rupture and die, they release their contents, including circulating tumour DNA (ctDNA): genome fragments that float freely through the bloodstream. Debris from normal cells is normally mopped up and destroyed by 'cleaning cells' such as macrophages, but tumours are so large and their cells multiply so quickly that the cleaners cannot cope completely.
The first practical use of circulating DNA came in another field. Dennis Lo, a chemical pathologist now at the Chinese University of Hong Kong, reasoned that if tumours could flood the blood with DNA, surely fetuses could, too. In 1997, he successfully showed that pregnant women carrying male babies had fetal Y chromosomes in their blood [6]. That discovery allowed doctors to check a baby's sex early in gestation without disturbing the fetus, and ultimately to screen for developmental disorders such as Down's syndrome without resorting to invasive testing. It has revolutionized the field of prenatal diagnostics (see Nature 507, 19; 2014).
Despite its promise, ctDNA is not yet ready for a starring role in the clinic. For one thing, the most sensitive techniques for detecting it, such as BEAMing, rely on some knowledge of which mutations to look for. This knowledge can be provided by taking a biopsy, sequencing its mutations, designing patient-specific molecular probes that target them, and using those probes to analyse later blood samples — a laborious approach that must be repeated for each patient. The alternative is to use exome sequencing, as Rosenfeld's team did. This requires no previous knowledge about the cancer, but it is prohibitively expensive to sequence and analyse every sample at the depth required to detect rare mutant fragments.
Monday, July 28, 2014
Experts question Google’s new ‘moonshot’ project: mapping human genome biomarkers
Canadian experts have concerns about a report that Google Inc. is planning to create a map of the biomarkers in the “healthy” human genome.
Researchers have told The Globe and Mail they would welcome the search giant to this area of study, which they say is underfunded, but they questioned how useful Google’s project would be based on the relatively small number of people it would involve (just 175 initially).
Friday, July 25, 2014
Frequentists vs Bayesian
http://oikosjournal.wordpress.com/2011/10/11/frequentist-vs-bayesian-statistics-resources-to-help-you-choose/
Most ecologists use the frequentist approach. This approach focuses on P(D|H), the probability of the data, given the hypothesis. That is, this approach treats data as random (if you repeated the study, the data might come out differently), and hypotheses as fixed (the hypothesis is either true or false, and so has a probability of either 1 or 0, you just don’t know for sure which it is). This approach is called frequentist because it’s concerned with the frequency with which one expects to observe the data, given some hypothesis about the world. The P values you see in the “Results” sections of most empirical ecology papers are values of P(D|H), where H is usually some “null” hypothesis.
Bayesian statistical approaches are increasingly common in ecology. Bayesian statistics focuses on P(H|D), the probability of the hypothesis, given the data. That is, this approach treats the data as fixed (these are the only data you have) and hypotheses as random (the hypothesis might be true or false, with some probability between 0 and 1). This approach is called Bayesian because you need to use Bayes’ Theorem to calculate P(H|D).
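For reference, Bayes' theorem is what connects the two quantities: P(H|D) = P(D|H) * P(H) / P(D), where P(D) expands to P(D|H) * P(H) + P(D|not H) * P(not H).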
I guess I lean more towards Bayesian statistics! There's probably life on Mars =)
Thursday, July 24, 2014
Biological insights from 108 schizophrenia-associated genetic loci
Abstract: Schizophrenia is a highly heritable disorder. Genetic risk is conferred by a large number of alleles, including common alleles of small effect that might be detected by genome-wide association studies. Here we report a multi-stage schizophrenia genome-wide association study of up to 36,989 cases and 113,075 controls. We identify 128 independent associations spanning 108 conservatively defined loci that meet genome-wide significance, 83 of which have not been previously reported. Associations were enriched among genes expressed in brain, providing biological plausibility for the findings. Many findings have the potential to provide entirely new insights into aetiology, but associations at DRD2 and several genes involved in glutamatergic neurotransmission highlight molecules of known and potential therapeutic relevance to schizophrenia, and are consistent with leading pathophysiological hypotheses. Independent of genes expressed in brain, associations were enriched among genes expressed in tissues that have important roles in immunity, providing support for the speculated link between the immune system and schizophrenia.
Subject terms: Genome-wide association studies
http://www.nature.com/nature/journal/v511/n7510/full/nature13595.html
Genomic inflation factors under polygenic inheritance
Abstract: Population structure, including population stratification and cryptic relatedness, can cause spurious associations in genome-wide association studies (GWAS). Usually, the scaled median or mean test statistic for association calculated from multiple single-nucleotide-polymorphisms across the genome is used to assess such effects, and ‘genomic control' can be applied subsequently to adjust test statistics at individual loci by a genomic inflation factor. Published GWAS have clearly shown that there are many loci underlying genetic variation for a wide range of complex diseases and traits, implying that a substantial proportion of the genome should show inflation of the test statistic. Here, we show by theory, simulation and analysis of data that in the absence of population structure and other technical artefacts, but in the presence of polygenic inheritance, substantial genomic inflation is expected. Its magnitude depends on sample size, heritability, linkage disequilibrium structure and the number of causal variants. Our predictions are consistent with empirical observations on height in independent samples of ~4000 and ~133 000 individuals.
Keywords: genome-wide association study, genomic inflation factor, polygenic inheritance
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137506/
In population genetics, linkage disequilibrium is the non-random association of alleles at two or more loci that descend from single, ancestral chromosomes.
Linkage equilibrium (D = 0) is when PAB = PA * PB, i.e. the frequency of observing A and B together equals the product of their individual frequencies.
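The disequilibrium coefficient itself is just the deviation from that product: D = PAB - PA * PB, so D = 0 recovers linkage equilibrium.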
http://en.wikipedia.org/wiki/Linkage_disequilibrium
Linkage disequilibrium makes tightly linked variants strongly correlated, producing cost savings for association studies.
http://www.sph.umich.edu/csg/abecasis/class/666.03.pdf
Population stratification is the presence of a systematic difference in allele frequencies between subpopulations in a population possibly due to different ancestry, especially in the context of association studies.
The two most widely used approaches to this problem include genomic control, which is a relatively nonparametric method for controlling the inflation of test statistics,[2] and structured association methods,[3] which use genetic information to estimate and control for population structure.
Genomic Control works by using markers that are not linked with the trait in question to correct for any inflation of the statistic caused by population stratification
http://en.wikipedia.org/wiki/Population_stratification
The Hardy–Weinberg principle states that allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences.
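In its usual two-allele form, with allele frequencies p and q (p + q = 1), the expected genotype frequencies are p^2 + 2pq + q^2 = 1 for the AA, Aa and aa genotypes.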