Tuesday, December 16, 2014

Convert SRA to FASTQ

http://www.ncbi.nlm.nih.gov/sites/books/NBK158900/

./sratoolkit.2.4.2-ubuntu64/bin/fastq-dump --split-3 -L info --gzip -v SRR1131659
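
To script the same call for several accessions, the command can be assembled from Python. This is a minimal sketch: `fastq_dump_command` is a hypothetical helper name, the toolkit path is simply the one used above, and the flags are copied from the command unchanged.

```python
import shlex

def fastq_dump_command(accession, toolkit_bin="./sratoolkit.2.4.2-ubuntu64/bin"):
    """Build the fastq-dump argument list from the note above (hypothetical helper)."""
    return [
        f"{toolkit_bin}/fastq-dump",
        "--split-3",
        "-L", "info",
        "--gzip",
        "-v",
        accession,
    ]

cmd = fastq_dump_command("SRR1131659")
print(shlex.join(cmd))
# To actually run the conversion (requires the SRA Toolkit to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```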

Friday, December 12, 2014

Oracle Health Sciences Translational Research Center: A Translational Medicine Platform to Address the Big Data Challenge

http://www.oracle.com/us/industries/healthcare/translational-medicine-platform-wp-1840042.pdf

Oracle describes the Oracle Health Sciences Translational Research Center, a scalable informatics solution for translational research. Its back-end data components integrate clinical and omics data from diverse clinical data sources as well as from vendor-specific and modality-specific omics data silos, providing standardized data that is readily available to the front-end application.

Tuesday, December 9, 2014

Finding overlap between two lists in Excel

Function

=COUNTIF(C:C,K4)

Where C:C contains one list and K4 is the value to look up. Fill the formula down for every other value in column K; a result greater than 0 means that value also appears in column C.
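
The same overlap check can be sketched outside Excel. This is a minimal Python version; `countif` and `overlap` are hypothetical names and the gene symbols are made-up sample data. Unlike Excel's COUNTIF, the comparison here is case-sensitive.

```python
from collections import Counter

def countif(column, value):
    """Count how many times value occurs in column (like =COUNTIF(C:C,K4))."""
    return Counter(column)[value]

def overlap(list_c, list_k):
    """Return the values of list_k that also occur in list_c, keeping list_k order."""
    counts = Counter(list_c)
    return [value for value in list_k if counts[value] > 0]

column_c = ["BRCA1", "TP53", "TP53", "KRAS"]   # sample data, not from the post
column_k = ["TP53", "EGFR", "KRAS"]

print(countif(column_c, "TP53"))
print(overlap(column_c, column_k))
```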

Monday, December 8, 2014

Eclipse misbehaving

If Eclipse is misbehaving for no obvious reason (context menu not appearing, right-click doing nothing, ...):

Back up and remove ~/workspace/.metadata/.plugins

Or

Switch to a new workspace

Friday, December 5, 2014

Compound heterozygote

Compound heterozygote: The presence of two different mutant alleles at a particular gene locus, one on each chromosome of a pair.
The human genome contains two copies of each gene, a paternal and a maternal allele. A mutation affecting only one allele is called heterozygous. A homozygous mutation is the presence of the identical mutation on both alleles of a specific gene. However, when both alleles of a gene harbor mutations, but the mutations are different, these mutations are called compound heterozygous. Also called a genetic compound.

An individual who has two different abnormal alleles at a particular locus, one on each chromosome of a pair; usually refers to individuals affected with an autosomal recessive disorder

Alternatively, the two mutations can lie at different positions within the same gene.
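
The definitions above can be turned into a small decision rule. This is a simplified sketch, assuming the mutations on each copy of the gene are already known (phased); `classify_genotype` and the mutation labels are hypothetical.

```python
def classify_genotype(maternal, paternal):
    """Classify one gene from the sets of mutations found on each copy.

    maternal / paternal: iterables of mutation identifiers on each allele.
    Simplification: any two non-empty, non-identical sets are reported as
    compound heterozygous.
    """
    maternal, paternal = set(maternal), set(paternal)
    if not maternal and not paternal:
        return "no mutation"
    if not maternal or not paternal:
        return "heterozygous"           # only one allele is affected
    if maternal == paternal:
        return "homozygous"             # identical mutation on both alleles
    return "compound heterozygous"      # both alleles mutated, differently

print(classify_genotype({"mutA"}, set()))
print(classify_genotype({"mutA"}, {"mutA"}))
print(classify_genotype({"mutA"}, {"mutB"}))
```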

Syndromic vs non-syndromic (not associated with anything else, specific)

http://ghr.nlm.nih.gov/condition/nonsyndromic-deafness

Nonsyndromic deafness is hearing loss that is not associated with other signs and symptoms. In contrast, syndromic deafness involves hearing loss that occurs with abnormalities in other parts of the body. Different types of nonsyndromic deafness are named according to their inheritance patterns.

Tuesday, December 2, 2014

Friday, November 28, 2014

Continuous integration servers


Jenkins
http://jenkins-ci.org/

Bamboo
https://confluence.atlassian.com/display/BAMBOO/Bamboo+Documentation+Home

Wednesday, November 26, 2014

PLOS Computational Biology: Translational Bioinformatics

http://www.ploscollections.org/article/browseIssue.action?issue=info:doi/10.1371/issue.pcol.v03.i11

Education Articles

Chapter 2: Data-Driven View of Disease Biology

Casey S. Greene, Olga G. Troyanskaya

Chapter 4: Protein Interactions and Disease

Mileidy W. Gonzalez, Maricel G. Kann

Chapter 5: Network Biology Approach to Complex Diseases

Dong-Yeon Cho, Yoo-Ah Kim, Teresa M. Przytycka

Chapter 7: Pharmacogenomics

Konrad J. Karczewski, Roxana Daneshjou, Russ B. Altman

Chapter 9: Analyses Using Disease Ontologies

Nigam H. Shah, Tyler Cole, Mark A. Musen

Chapter 10: Mining Genome-Wide Genetic Markers

Xiang Zhang, Shunping Huang, Zhaojun Zhang, Wei Wang

Chapter 11: Genome-Wide Association Studies

William S. Bush, Jason H. Moore

Chapter 12: Human Microbiome Analysis

Xochitl C. Morgan, Curtis Huttenhower

Chapter 14: Cancer Genome Analysis

Miguel Vazquez, Victor de la Torre, Alfonso Valencia

Chapter 16: Text Mining for Translational Bioinformatics

K. Bretonnel Cohen, Lawrence E. Hunter

Chapter 17: Bioimage Informatics for Systems Pharmacology

Fuhai Li, Zheng Yin, Guangxu Jin, Hong Zhao, Stephen T. C. Wong

Thursday, November 20, 2014

threejs - makes WebGL - 3D in the browser - very easy

http://threejs.org/docs/index.html#Manual/Introduction/Creating_a_scene

Three.js is a library that makes WebGL - 3D in the browser - very easy. While a simple cube in raw WebGL would turn out hundreds of lines of Javascript and shader code, a Three.js equivalent is only a fraction of that.

Google Genomics

https://cloud.google.com/genomics/v1beta2/visualization

Google Genomics provides an API to store, process, explore, and share DNA sequence reads, reference-based alignments, and variant calls, using Google's cloud infrastructure.

    Store alignments and variant calls for one genome or a million.
    Process genomic data in batch by running principal component analysis or Hardy-Weinberg equilibrium, in minutes or hours, by using parallel computing frameworks like MapReduce.
    Explore data by slicing alignments and variants by genomic range across one or multiple samples -- for your own algorithms or for visualization; or interactively process entire cohorts to find transition/transversion ratios, allelic frequency, genome-wide association and more using BigQuery.
    Share genomic data with your research group, collaborators, the broader community, or the public. You decide.
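
As an illustration of one of the statistics mentioned above, the transition/transversion ratio can be computed from a list of single-nucleotide variants. This is a plain Python sketch, not the Google Genomics or BigQuery API; `titv_ratio` is a hypothetical name and the input is assumed to be uppercase (ref, alt) base pairs.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """Transitions swap purine<->purine (A/G) or pyrimidine<->pyrimidine (C/T)."""
    pair = {ref, alt}
    return ref != alt and (pair <= PURINES or pair <= PYRIMIDINES)

def titv_ratio(snvs):
    """Return transitions / transversions for an iterable of (ref, alt) pairs."""
    snvs = [(r, a) for r, a in snvs if r != a]
    transitions = sum(1 for r, a in snvs if is_transition(r, a))
    transversions = len(snvs) - transitions
    if transversions == 0:
        raise ValueError("no transversions: ratio is undefined")
    return transitions / transversions

sample = [("A", "G"), ("C", "T"), ("T", "C"), ("A", "C"), ("G", "T")]  # made-up data
print(titv_ratio(sample))
```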

Google Genomics is implementing the API defined by the Global Alliance for Genomics and Health for visualization, analysis and more. Compliant software can access Google Genomics, local servers, or any other implementation.

Cluster computing
https://spark.apache.org/downloads.html

Google BigQuery
https://cloud.google.com/bigquery/what-is-bigquery

Thursday, November 13, 2014

10 New Breakthrough Technologies 2014

http://www.technologyreview.com/lists/technologies/2014/


Agricultural Drones
Ultraprivate Smartphones
Brain Mapping
Neuromorphic Chips
Genome Editing
Microscale 3-D Printing
Mobile Collaboration
Oculus Rift
Agile Robots
Smart Wind and Solar Power

Monday, November 10, 2014

Alzheimer's drug sneaks through blood–brain barrier

Neurobiologist Ryan Watts and his colleagues at the biotechnology company Genentech in South San Francisco have sought to break through the barrier by exploiting transferrin, a protein that sits on the surface of blood vessels and carries iron into the brain. The team created an antibody with two ends. One end binds loosely to transferrin and uses the protein to transport itself into the brain. And once the antibody is inside, its other end targets an enzyme called β-secretase 1 (BACE1), which produces amyloid-β. Crucially, the antibody binds more tightly to BACE1 than to transferrin, and this pulls it off the blood vessel and into the brain. It locks BACE1 shut and prevents it from making amyloid-β.

http://www.nature.com/news/alzheimer-s-drug-sneaks-through-blood-brain-barrier-1.16291

Friday, October 10, 2014

R Graph Catalog

http://lsi.ubc.ca/resources/omics-phenotyping-portal/#R_Graph_Catalog

Statistics Training and Online Resources

R Graph Catalog

Recommended by Stefanie Butland, LSI
The R Graph Catalog is a visual index of over 100 graphs from the excellent book "Creating More Effective Graphs" by Naomi Robbins. Click on a graph thumbnail and you'll see the figure AND all the code necessary to reproduce the figure exactly with ggplot2, the R package written by Hadley Wickham. This is a resource for people who want to make a good graph and kind of know what it should look like … but they could really use an example to get started!
You can get the code for ALL figures and the infrastructure that makes the app from this repository on GitHub: https://github.com/jennybc/r-graph-catalog
The R Graph Catalog is maintained by Dr Jennifer Bryan, UBC Department of Statistics, and the initial work was facilitated by an NSERC Undergraduate Student Research Award to Joanna Zhao.

8 Realities of the Sequencing GWAS

http://massgenomics.org/2014/03/gwas-sequencing-realities.html

For several years, the genome-wide association study (GWAS) has served as the flagship discovery tool for genetic research, especially in the arena of common diseases. The wide availability and low cost of high-density SNP arrays made it possible to genotype 500,000 or so informative SNPs in thousands of samples. These studies spurred development of tools and pipelines for managing large-scale GWAS, and thus far they’ve revealed hundreds of new genetic associations.
As we all know, the cost of DNA sequencing has plummeted. Now it’s possible to do targeted, exome, or even whole-genome sequencing in cohorts large enough to power GWAS analyses. While we can leverage many of the same tools and approaches developed for SNP array-based GWAS, the sequencing data comes with some very important differences.

Friday, September 26, 2014

Git pretty log

git log --graph --decorate --oneline --all

git diff c3b37f0

MySQL Spatial Queries

http://howto-use-mysql-spatial-ext.blogspot.ca/

org.hibernate.dialect.MySQLDialect
MySQL5Dialect, MySQLInnoDBDialect, MySQLMyISAMDialect
https://docs.jboss.org/hibernate/orm/3.6/javadocs/org/hibernate/dialect/MySQLDialect.html

drop table if exists test;

create table test (
    id INT NOT NULL PRIMARY KEY,
    location LINESTRING NOT NULL,
    SPATIAL KEY sx_location (location)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

set @a = linestring(point(1,1),point(5,1));
set @b = linestring(point(1,1),point(10,1));

insert into test values(1, @a);
insert into test values(2, @b);

select * from test where MBRtouches(location, linestring(point(1,1),point(5,1)) );
select * from test where MBRtouches(location, linestring(point(2,1),point(3,1)) );
select * from test where MBRtouches(location, linestring(point(1,1),point(2,1)) );
select * from test where MBRtouches(location, linestring(point(1,1),point(7,1)) );
select * from test where MBRtouches(location, linestring(point(-1,1),point(0,1)) );
select * from test where MBRtouches(location, linestring(point(0,1),point(1,1)) );
select * from test where MBRtouches(location, linestring(point(7,1),point(12,1)) );
select * from test where MBRtouches(location, linestring(point(12,1),point(20,1)) );

select MBRcontains(@b,@a);
select MBRcontains(@a,@b);
select MBRoverlaps(@a,@b);
select MBRoverlaps(@b,@a);
select MBRtouches(@a,@b);
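
The MBR* functions compare minimum bounding rectangles rather than the exact geometries. The sketch below shows how such a rectangle is derived from a linestring, with a simple closed-rectangle inclusion test. `mbr` and `mbr_covers` are hypothetical helpers, and the inclusion test is a simplification that is not guaranteed to match MySQL's result for every edge case (the linestrings above have zero-height rectangles).

```python
def mbr(points):
    """Minimum bounding rectangle of a point list as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))

def mbr_covers(outer, inner):
    """True if rectangle inner lies entirely inside rectangle outer (edges included)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

a = mbr([(1, 1), (5, 1)])    # same points as @a
b = mbr([(1, 1), (10, 1)])   # same points as @b

print(a, b)
print(mbr_covers(b, a), mbr_covers(a, b))
```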

Hibernate SQL statement debug

log4j.logger.org.hibernate.SQL=TRACE
log4j.logger.org.hibernate.type.descriptor.sql.BasicBinder=TRACE

Wednesday, September 10, 2014

Microarray vs RNA-seq, Normalization

John Storey provides his take on the importance of new statistical methods for high-throughput sequencing.

http://www.nature.com/nbt/journal/v29/n4/full/nbt.1831.html

Honing our reading skills

http://www.nature.com/nbt/journal/v32/n9/full/nbt.3021.html?WT.ec_id=NBT-201409

Nature Biotechnology Contents: Volume 32 pp 700 - 960
http://mabsj2.blogspot.ca/2014/09/nature-biotechnology-contents-volume-32.html

The devil in the details of RNA-seq
http://www.nature.com/nbt/journal/v32/n9/full/nbt.3015.html#affil-auth
RNA-seq is clearly superior to microarrays for its ability for de novo discovery and detection of genes, especially those with low expression levels. The detection of alternative splicing patterns is possible, but attention needs to be paid to the underlying gene annotation, and parameters such as mapping and error rates become more important than sequencing depth.

Detecting and correcting systematic variation in large-scale RNA sequencing data
Sheng Li, Paweł P Łabaj, Paul Zumbo, Peter Sykacek, Wei Shi, Leming Shi, John Phan, Po-Yen Wu, May Wang, Charles Wang, Danielle Thierry-Mieg, Jean Thierry-Mieg, David P Kreil & Christopher E Mason
Nature Biotechnology 32, 888–895 (2014) doi:10.1038/nbt.3000

High-throughput RNA sequencing (RNA-seq) enables comprehensive scans of entire transcriptomes, but best practices for analyzing RNA-seq data have not been fully defined, particularly for data collected with multiple sequencing platforms or at multiple sites. Here we used standardized RNA samples with built-in controls to examine sources of error in large-scale RNA-seq studies and their impact on the detection of differentially expressed genes (DEGs). Analysis of variations in guanine-cytosine content, gene coverage, sequencing error rate and insert size allowed identification of decreased reproducibility across sites. Moreover, commonly used methods for normalization (cqn, EDASeq, RUV2, sva, PEER) varied in their ability to remove these systematic biases, depending on sample complexity and initial data quality. Normalization methods that combine data from genes across sites are strongly recommended to identify and remove site-specific effects and can substantially improve RNA-seq studies.

http://www.nature.com/nbt/journal/v32/n9/full/nbt.3000.html?WT.ec_id=NBT-201409

http://www.nature.com/nbt/journal/v32/n9/full/nbt.2931.html?WT.ec_id=NBT-201409

http://www.nature.com/nbt/journal/v32/n9/full/nbt.2957.html?WT.ec_id=NBT-201409

http://www.nature.com/nbt/journal/v32/n9/full/nbt.2972.html?WT.ec_id=NBT-201409

http://www.nature.com/nbt/journal/v32/n9/full/nbt.3001.html?WT.ec_id=NBT-201409

Friday, August 29, 2014

Eclipse hotkeys

Shift+hover over a function: shows its code
Alt+Left, Alt+Right: back / forward to the previous view
F3: open declaration
Ctrl+F3: outline
Ctrl+hover: open declaration, implementation

Monday, August 25, 2014

RESTful web api for getting gene information

NCBI's E-Utils

BRCA1
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=672
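
The ESummary URL above can be built programmatically for any gene ID. This is a minimal sketch that only constructs the URL; `esummary_url` is a hypothetical helper, and the base URL and the `db` / `id` parameters are taken from the link above.

```python
from urllib.parse import urlencode

EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esummary_url(db, uid):
    """Build an E-utils ESummary URL for one record (hypothetical helper)."""
    query = urlencode({"db": db, "id": uid})
    return f"{EUTILS_BASE}/esummary.fcgi?{query}"

url = esummary_url("gene", 672)   # 672 is the BRCA1 gene ID used above
print(url)
# To fetch the record (needs network access):
#   from urllib.request import urlopen; body = urlopen(url).read()
```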

BioMart
http://central.biomart.org/martwizard/#!/Search_by_database_name?mart=Hugo+Gene+Nomenclature+(HGNC)+(EBI%2C+UK)&step=1&datasets=hgnc
<!DOCTYPE Query><Query client="true" processor="TSV" limit="-1" header="1"><Dataset name="hgnc" config="hgnc_config_1"><Filter name="gd_status" value="Approved" filter_list=""/><Filter name="gd_app_sym" value="BRCA1" filter_list=""/><Attribute name="gd_aliases"/></Dataset></Query>

HGNC
http://www.genenames.org/cgi-bin/hgnc_downloads.cgi?title=HGNC+output+data&hgnc_dbtag=on&col=gd_app_sym&col=gd_aliases&status=Approved&status=Entry+Withdrawn&status_opt=2&where=&order_by=gd_app_sym_sort&format=text&limit=&submit=submit&.cgifields=&.cgifields=chr&.cgifields=status&.cgifields=hgnc_dbtag

Monday, August 18, 2014

One Codex Wants To Be The Google For Genomic Data

http://techcrunch.com/2014/08/15/one-codex-wants-to-be-the-google-for-genomic-data/

As hospitals and public health organizations switch to using genomic data for testing, searching through genomic data can still take some time. Y Combinator-backed startup, One Codex, wants to help researchers, clinicians and public health officials, who have sequenced more than 100,000 genomes and created petabytes of data, to search this data.

For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights

http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html?_r=0
 
Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

ClearStory Data, a start-up in Palo Alto, Calif., makes software that recognizes many data sources, pulls them together and presents the results visually as charts, graphics or data-filled maps. Its goal is to reach a wider market of business users beyond data masters.

Trifacta makes a tool for data professionals. Its software employs machine-learning technology to find, present and suggest types of data that might be useful for a data scientist to see and explore, depending on the task at hand.

Friday, August 1, 2014

Human Longevity Project

http://www.genomeweb.com/blog/deeper-bench
 
With the launch of his new company, Human Longevity, this year, Venter aims to not only sequence tens of thousands of people, but also collect physiological data such as how much blood their heart can pump and brain size. So far, he tells Tech Review that his company has sequenced 500 people who are now beginning to undergo those additional tests.

"Google Translate started as a slow algorithm that took hours or days to run and was not very accurate. But Franz [Och] built a machine-learning version that could go out on the Web and find every article translated from German to English or vice versa, and learn from those," Venter says. "And then it was optimized, so it works in milliseconds."

Thursday, July 31, 2014

Cancer biomarkers: Written in blood

http://www.nature.com/news/cancer-biomarkers-written-in-blood-1.15624

DNA circulating in the bloodstream could guide cancer treatment — if researchers can work out how best to use it.

But researchers have found ways to get a richer view of a patient's cancer, and even track it over time. When cancer cells rupture and die, they release their contents, including circulating tumour DNA (ctDNA): genome fragments that float freely through the bloodstream. Debris from normal cells is normally mopped up and destroyed by 'cleaning cells' such as macrophages, but tumours are so large and their cells multiply so quickly that the cleaners cannot cope completely.

The first practical use of circulating DNA came in another field. Dennis Lo, a chemical pathologist now at the Chinese University of Hong Kong, reasoned that if tumours could flood the blood with DNA, surely fetuses could, too. In 1997, he successfully showed that pregnant women carrying male babies had fetal Y chromosomes in their blood. That discovery allowed doctors to check a baby's sex early in gestation without disturbing the fetus, and ultimately to screen for developmental disorders such as Down's syndrome without resorting to invasive testing. It has revolutionized the field of prenatal diagnostics (see Nature 507, 19; 2014).

Despite its promise, ctDNA is not yet ready for a starring role in the clinic. For one thing, the most sensitive techniques for detecting it, such as BEAMing, rely on some knowledge of which mutations to look for. This knowledge can be provided by taking a biopsy, sequencing its mutations, designing patient-specific molecular probes that target them, and using those probes to analyse later blood samples — a laborious approach that must be repeated for each patient. The alternative is to use exome sequencing, as Rosenfeld's team did. This requires no previous knowledge about the cancer, but it is prohibitively expensive to sequence and analyse every sample at the depth required to detect rare mutant fragments.

Monday, July 28, 2014

Experts question Google’s new ‘moonshot’ project: mapping human genome biomarkers

Canadian experts have concerns about a report that Google Inc. is planning to create a map of the biomarkers in the “healthy” human genome.

Researchers have told The Globe and Mail they would welcome the search giant to this area of study, which they say is underfunded, but they questioned how useful Google’s project would be based on the relatively small number of people it would involve (just 175 initially).

Friday, July 25, 2014

Frequentists vs Bayesian

http://oikosjournal.wordpress.com/2011/10/11/frequentist-vs-bayesian-statistics-resources-to-help-you-choose/

Most ecologists use the frequentist approach. This approach focuses on P(D|H), the probability of the data, given the hypothesis. That is, this approach treats data as random (if you repeated the study, the data might come out differently), and hypotheses as fixed (the hypothesis is either true or false, and so has a probability of either 1 or 0, you just don’t know for sure which it is). This approach is called frequentist because it’s concerned with the frequency with which one expects to observe the data, given some hypothesis about the world. The P values you see in the “Results” sections of most empirical ecology papers are values of P(D|H), where H is usually some “null” hypothesis.

Bayesian statistical approaches are increasingly common in ecology. Bayesian statistics focuses on P(H|D), the probability of the hypothesis, given the data. That is, this approach treats the data as fixed (these are the only data you have) and hypotheses as random (the hypothesis might be true or false, with some probability between 0 and 1). This approach is called Bayesian because you need to use Bayes’ Theorem to calculate P(H|D).
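
To make the P(D|H) versus P(H|D) distinction concrete, here is a small worked example of Bayes' Theorem for a single hypothesis H and its complement. The numbers are invented for illustration and `posterior` is a hypothetical function name.

```python
def posterior(prior_h, p_data_given_h, p_data_given_not_h):
    """Bayes' Theorem: P(H|D) = P(D|H) * P(H) / P(D).

    P(D) is expanded over H and not-H:
    P(D) = P(D|H) * P(H) + P(D|not H) * (1 - P(H)).
    """
    evidence = p_data_given_h * prior_h + p_data_given_not_h * (1 - prior_h)
    return p_data_given_h * prior_h / evidence

# Data that are four times as likely under H as under not-H, with a 50/50 prior:
print(posterior(prior_h=0.5, p_data_given_h=0.8, p_data_given_not_h=0.2))
# The same data with a sceptical prior give a much smaller posterior:
print(posterior(prior_h=0.01, p_data_given_h=0.8, p_data_given_not_h=0.2))
```

The likelihood P(D|H) is the same in both calls; only the prior changes, which is exactly the ingredient the frequentist P value does not use.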


I guess I lean more towards Bayesian statistics! There's probably life on Mars =)


Thursday, July 24, 2014

Biological insights from 108 schizophrenia-associated genetic loci

Abstract: Schizophrenia is a highly heritable disorder. Genetic risk is conferred by a large number of alleles, including common alleles of small effect that might be detected by genome-wide association studies. Here we report a multi-stage schizophrenia genome-wide association study of up to 36,989 cases and 113,075 controls. We identify 128 independent associations spanning 108 conservatively defined loci that meet genome-wide significance, 83 of which have not been previously reported. Associations were enriched among genes expressed in brain, providing biological plausibility for the findings. Many findings have the potential to provide entirely new insights into aetiology, but associations at DRD2 and several genes involved in glutamatergic neurotransmission highlight molecules of known and potential therapeutic relevance to schizophrenia, and are consistent with leading pathophysiological hypotheses. Independent of genes expressed in brain, associations were enriched among genes expressed in tissues that have important roles in immunity, providing support for the speculated link between the immune system and schizophrenia.

Subject terms: Genome-wide association studies
http://www.nature.com/nature/journal/v511/n7510/full/nature13595.html

Genomic inflation factors under polygenic inheritance 
Abstract:  Population structure, including population stratification and cryptic relatedness, can cause spurious associations in genome-wide association studies (GWAS). Usually, the scaled median or mean test statistic for association calculated from multiple single-nucleotide-polymorphisms across the genome is used to assess such effects, and ‘genomic control' can be applied subsequently to adjust test statistics at individual loci by a genomic inflation factor. Published GWAS have clearly shown that there are many loci underlying genetic variation for a wide range of complex diseases and traits, implying that a substantial proportion of the genome should show inflation of the test statistic. Here, we show by theory, simulation and analysis of data that in the absence of population structure and other technical artefacts, but in the presence of polygenic inheritance, substantial genomic inflation is expected. Its magnitude depends on sample size, heritability, linkage disequilibrium structure and the number of causal variants. Our predictions are consistent with empirical observations on height in independent samples of ~4000 and ~133 000 individuals.

Keywords: genome-wide association study, genomic inflation factor, polygenic inheritance
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3137506/

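In population genetics, linkage disequilibrium (LD) is the non-random association of alleles at two or more loci that descend from a single ancestral chromosome.
The LD coefficient is D = P(AB) - P(A) * P(B). Linkage equilibrium is D = 0, i.e. P(AB) = P(A) * P(B): the frequency of haplotypes carrying both A and B equals the product of the two allele frequencies.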
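
A short numerical check of the D statistic, using the standard definition D = P(AB) - P(A) * P(B) and the related r^2 measure. The frequencies are made-up values and the function names are hypothetical.

```python
def ld_d(p_ab, p_a, p_b):
    """Linkage disequilibrium coefficient D = P(AB) - P(A) * P(B)."""
    return p_ab - p_a * p_b

def ld_r2(p_ab, p_a, p_b):
    """Squared correlation between the two loci: D^2 / (pA(1-pA) pB(1-pB))."""
    d = ld_d(p_ab, p_a, p_b)
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))

# Linkage equilibrium: the AB haplotype occurs as often as chance predicts.
print(ld_d(p_ab=0.25, p_a=0.5, p_b=0.5))
# Complete LD: A and B always travel together on the same haplotype.
print(ld_d(p_ab=0.5, p_a=0.5, p_b=0.5), ld_r2(p_ab=0.5, p_a=0.5, p_b=0.5))
```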

http://en.wikipedia.org/wiki/Linkage_disequilibrium

Linkage disequilibrium makes tightly linked variants strongly correlated, producing cost savings for association studies.

http://www.sph.umich.edu/csg/abecasis/class/666.03.pdf

Population stratification is the presence of a systematic difference in allele frequencies between subpopulations in a population possibly due to different ancestry, especially in the context of association studies.

The two most widely used approaches to this problem include genomic control, which is a relatively nonparametric method for controlling the inflation of test statistics, and structured association methods, which use genetic information to estimate and control for population structure.

Genomic Control works by using markers that are not linked with the trait in question to correct for any inflation of the statistic caused by population stratification.
http://en.wikipedia.org/wiki/Population_stratification 
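
A minimal sketch of the median-based genomic control correction, assuming 1-df chi-square association statistics from markers unlinked to the trait. The function names and the sample statistics are hypothetical, and leaving statistics unchanged when lambda is below 1 is a common convention rather than something stated in the sources above.

```python
from statistics import median

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364

def genomic_inflation_factor(chi2_stats):
    """lambda = observed median statistic / expected median under the null."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN

def genomic_control(chi2_stats):
    """Divide every statistic by lambda (never inflating when lambda < 1)."""
    lam = max(genomic_inflation_factor(chi2_stats), 1.0)
    return [stat / lam for stat in chi2_stats]

null_marker_stats = [0.2, 0.9098728, 3.0]   # made-up values, median is 2 x 0.4549364
lam = genomic_inflation_factor(null_marker_stats)
print(lam)
print(genomic_control(null_marker_stats))
```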

The Hardy–Weinberg principle states that allele and genotype frequencies in a population will remain constant from generation to generation in the absence of other evolutionary influences.
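
A small numerical illustration of the principle for one locus with two alleles, assuming random mating and no other evolutionary influences: genotype frequencies are p^2, 2pq and q^2, and the allele frequency recovered from them is unchanged. `hardy_weinberg` is a hypothetical function name.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies for allele A at frequency p (a at q = 1 - p)."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}

genotypes = hardy_weinberg(0.6)
print(genotypes)

# Allele frequency of A in the next generation: all of AA plus half of Aa.
next_p = genotypes["AA"] + genotypes["Aa"] / 2
print(next_p)
```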