Monday, February 28, 2011

QSPR / QSAR

opentox (http://www.opentox.org/) or other open initiatives like OCHEM (http://ochem.eu/) and chembench (http://chembench.mml.unc.edu/)


Accelrys QSAR / QSPR workbench


VLife qsar


Dragon molecular descriptors
http://www.talete.mi.it/products/dragon_description.htm

eHITS docking - SimBioSys Inc.

http://www.simbiosys.ca/ehits/index.html

eHITS - Electronic High Throughput Screening (small-molecule docking)

ROSETTA is a software suite for predicting and designing protein structures, protein folding mechanisms, and protein-protein interactions. ROSETTA has been consistently successful in CASP and CAPRI competitions.

http://www.rosettacommons.org/software/

RosettaAbinitio Performs de novo protein structure prediction.
RosettaDesign Identifies low free energy sequences for target protein backbones.
RosettaDesign pymol plugin A user-friendly interface for submitting Protein Design simulations using RosettaDesign.
RosettaDock Predicts the structure of a protein-protein complex from the individual structures of the monomer components.
RosettaAntibody Predicts antibody Fv region structures and performs antibody-antigen docking.
RosettaFragments Generates fragment libraries for use by Rosetta ab initio in building protein structures.
RosettaNMR Incorporates NMR data into the basic Rosetta protocol to accelerate the process of NMR structure prediction
RosettaDNA For the design of proteins that interact with specified DNA sequences.
RosettaRNA Fragment assembly of RNA.
RosettaLigand For small molecule - protein docking
RosettaSymmetry For enforcing symmetry in Rosetta
RosettaEnzdes For enzyme design
RosettaMembrane For membrane protein ab initio modeling
RosettaDDG For estimating the impact of sequence changes on protein stability
RosettaScripts An xml-based scripting language for controlling interface design, docking, and interface statistics
RosettaSnugDock Enables docking an antibody Fv region to an antigen and allows backbone flexibility in the paratope.

Docking

http://dasher.wustl.edu/bio5476/lectures/lecture-18.pdf

R Microarray Analysis


http://www.mas.ncl.ac.uk/~ngl9/topics/inotes/TutorialMicroarrayAnalysis.pdf
http://www.bioconductor.org/help/course-materials/
http://www.bioconductor.org/help/course-materials/2007/seattle_bioc_intro_nov_07/
http://www.bioconductor.org/help/course-materials/2002/ShortCourse012302/

a tutorial on PCA / SVD

"Clustering of spatial gene expression patterns in the mouse brain
and comparison with classical neuroanatomy"

http://grass.osgeo.org/wiki/Principal_Components_Analysis


The SVD is a decomposition of any p x q matrix M into a product M = USVt, where U and V have orthonormal columns (UtU = VtV = I) and S is a diagonal matrix with real, non-negative entries. Here, U is a p x q matrix, and S and V are q x q matrices. The columns of U and V are known as the left and right singular vectors, respectively, and the entries along the diagonal of S are known as singular values. Note that when M is centered (row and column means are zero), the left singular vectors are eigenvectors of the covariance matrix MMt, the right singular vectors are eigenvectors of the covariance matrix MtM, and the square of a singular value is proportional to the variance along the corresponding eigenvector. Therefore, a projection of the data matrix M onto the d-dimensional subspace with the largest variance may be obtained by using MV = US, retaining only the d largest singular values and corresponding singular vectors.

http://public.lanl.gov/mewall/kluwer2002.html

http://genome-www.stanford.edu/SVD/

pca.narod.ru/pcaclustclass.pdf


General about principal components
– linear combinations of the original variables
– uncorrelated with each other

Summary
• Dimension reduction important to visualize data
– Principal Component Analysis
– Clustering
• Hierarchical
• Partitioning (K-means)
(distance measure important)
• Classification
– Reduction of dimension often necessary (t-test, PCA)
– Several classification methods available
– Validation




Linear Algebra
http://pillowlab.cps.utexas.edu/teaching/CompNeuro10/schedule.html


Data matrix A, rows=data points, columns = variables (attributes,
parameters).
1. Center the data by subtracting the mean of each column.
2. Compute the SVD of the centered matrix Â (or the k first singular
values and vectors):
Â = U S V^T.
3. The principal components are the columns of V, the coordinates of the
data in the basis defined by the principal components are U S.


%Data matrix A, columns:variables, rows: data points
%matlab function for computing the first k principal components of A.
function [pc,score]=pca(A,k);
[rows,cols]=size(A);
Ameans=repmat(mean(A,1),rows,1); %matrix, rows=means of columns
A=A-Ameans; %centering data
[U,S,V]=svds(A,k); %k is the number of pc:s desired
pc=V;
score=U*S; %now A is approximately score*pc'+Ameans (exact when k=rank(A))


The variance in the direction of the kth principal component is given
(up to a factor of 1/(n-1), n = number of data points) by the square of
the corresponding singular value, σ_k^2.
Singular values can be used to estimate how many principal components
to keep.
Rule of thumb: keep enough to explain 85% of the variation:
http://www.uta.edu/faculty/rcli/Teaching/math5392/NotesByHyvonen/lecture5.pdf
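
The 85% rule of thumb can be sketched in a few lines of Python. This is only an illustration: the function name `components_to_keep` and the example singular values are made up for this note, and it assumes the variance along each component is proportional to the squared singular value.

```python
def components_to_keep(singular_values, threshold=0.85):
    """Smallest k such that the k largest components explain `threshold` of the variance."""
    # Variance along each component is proportional to the squared singular value.
    variances = sorted((s * s for s in singular_values), reverse=True)
    total = sum(variances)
    running = 0.0
    for k, v in enumerate(variances, start=1):
        running += v
        if running / total >= threshold:
            return k
    return len(variances)


# Hypothetical singular values: squared, they are 16, 4, 1 and 0.25.
example = [4.0, 2.0, 1.0, 0.5]
print(components_to_keep(example))  # first component explains ~75%, first two ~94%
```

The threshold is a parameter, so the same helper works for a stricter cut-off such as 95%.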


http://www.ncbi.nlm.nih.gov.proxy.lib.sfu.ca/pubmed/10963673

http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=8768BD97E00D5306E8437C70EB103959?doi=10.1.1.115.3503&rep=rep1&type=pdf

Jonathon Shlens. A Tutorial on Principal Component Analysis. Systems Neurobiology Laboratory, Salk Institute for Biological Studies.

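PCA (via eigendecomposition) = works on the square covariance matrix of the data
SVD = more general; works directly on any rectangular data matrix and gives the same principal components when the data are centered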

 PCA can fail if the data is very “non-Gaussian” – It assumes that the interesting directions are along lines, and are orthogonal.

PCA is non-parametric; the most important components are the ones with the largest variance

prcomp(dat) – Centers the data and calls svd(); gives you sdev (square roots of the eigenvalues of the covariance matrix) and rotation (columns are the eigenvectors), a.k.a. loadings

http://genetics.agrsci.dk/statistics/courses/Rcourse-DJF2006/day3/PCA-computing.pdf 

biplot(prcomp(USArrests, scale = TRUE))


library(limma)
mm <- model.matrix(~PC1, pData(esetr))
fit <- lmFit(esetr, mm) #Fit linear model for each gene given a series of arrays
fit <- eBayes(fit) #Given a series of related parameter estimates and standard errors, compute moderated t-statistics, moderated F-statistic, and log-odds of differential expression by empirical Bayes shrinkage of the standard errors towards a common value.
topTable(fit) #Extract a table of the top-ranked genes from a linear model fit.


PCA for correcting batch effects 
In ideal circumstances, with very consistent data, we expect all data 
points to form a single, cohesive grouping in this type of plot. We also
 expect that any observed clustering will not be related to the primary 
phenotype. If there is any clustering of cases and controls, this is 
usually indicative of batch effects or other systematic differences in 
the generation of the data, and it may cause problems in association 
testing.
http://chemtree.com/SNP_Variation/tutorials/cnv-quality-control/pca.html
 
http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html?start=2

 http://www.miislita.com/information-retrieval-tutorial/reduced-svd.gif
 
http://spinner.cofc.edu/~langvillea/DISSECTION-LAB/Emmie%27sLSI-SVDModule/p4module.html 
 
http://www.cbs.dtu.dk/chipcourse/Exercises/Ex_Stat/NormStatEx.html 
 
T(V) = V transpose
 
X = U S T(V) = s1 u1 T(v1) + s2 u2 T(v2) + ... + sr ur T(vr),

where U = (u1, u2, ..., ur), V = (v1, v2, ..., vr), and S = diag{s1, s2, ..., sr} with
s1 ≥ s2 ≥ ... ≥ sr > 0. The singular columns {ui} form an orthonormal basis for the
column space spanned by {cj}, and the singular rows {vj} form an orthonormal basis for
the row space spanned by {ri}. The vectors {ui} and {vi} are called singular columns and
singular rows, respectively (Gabriel and Odoroff 1984); the scalars {si} are called singular values; and the matrices {si ui T(vi)} (i = 1, ..., r) are referred to as SVD components.



 
Image Compression
http://www.johnmyleswhite.com/notebook/2009/12/17/image-compression-with-the-svd-in-r/ 

http://n0b3l1a.blogspot.ca/2010/09/pca-principal-component-analysis.html

RStudio

RStudio™ is a new integrated development environment (IDE) for R. RStudio combines an intuitive user interface with powerful coding tools to help you get the most out of R.

http://www.rstudio.org/download/desktop

Useful shortcuts

  • Move cursor to console  Ctrl+2
  • Move cursor to source  Ctrl+1
  • Run current line/selection  Ctrl+Enter

Sunday, February 27, 2011

OECD - Organisation for Economic Co-operation and Development

http://www.oecd.org/pages/0,3417,en_36734052_36734103_1_1_1_1_1,00.html

Welcome to the OECD, an international organisation helping governments tackle the economic, social and governance challenges of a globalised economy.

The mission of the Organisation for Economic Co-operation and Development (OECD) is to promote policies that will improve the economic and social well-being of people around the world.

Headquartered in Paris, France.

Friday, February 25, 2011

The Gene Expression Barcode 2.0

http://rafalab.jhsph.edu/barcode/

The barcode algorithm is designed to estimate which
genes are expressed and which are unexpressed in a
given microarray hybridization. The output of our
algorithm is a vector of ones and zeros denoting
which genes are estimated to be expressed (ones)
and unexpressed (zeros). We call this a gene
expression barcode.

McCall et al., "The Gene Expression Barcode: leveraging public data repositories to begin cataloging the human and murine transcriptomes."
Nucleic Acids Research. 2011 Jan;39(suppl 1):D1011-D1015

Thursday, February 24, 2011

Partial Correlation - correlate X and Y while holding Z constant

Partial correlation is a procedure that allows us to measure the region of three-way overlap precisely, and then to remove it from the picture in order to determine what the correlation between any two of the variables would be (hypothetically) if they were not each correlated with the third variable. Alternatively, you can say that partial correlation allows us to determine what the correlation between any two of the variables would be (hypothetically) if the third variable were held constant. The partial correlation of X and Y, with the effects of Z removed (or held constant), would be given by the formula

r_XY.Z = (r_XY - r_XZ * r_YZ) / sqrt((1 - r_XZ^2) * (1 - r_YZ^2))

http://faculty.vassar.edu/lowry/ch3a.html
http://tolstoy.newcastle.edu.au/R/help/00a/0512.html
http://en.wikipedia.org/wiki/Partial_correlation
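
A small Python sketch of the first-order partial correlation, computed from the three pairwise Pearson correlations. The helper names `pearson` and `partial_correlation` and the toy data are invented for this note. The data are built so that X and Y are related only through Z.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y with z held constant (first-order partial correlation)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Toy data: x and y are each z plus noise, and the two noise terms are
# uncorrelated with z and with each other.
z = [1.0, 2.0, 3.0, 4.0]
x = [2.0, 1.0, 2.0, 5.0]   # z + (1, -1, -1, 1)
y = [2.0, -1.0, 6.0, 3.0]  # z + (1, -3, 3, -1)

print(pearson(x, y))                 # positive: both follow z
print(partial_correlation(x, y, z))  # essentially zero once z is held constant
```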

qstat - show status of pbs batch jobs

qmgr query_other_jobs parameter (allow non-admin users to see other users' jobs)
qstat -an Give me a listing of ALL jobs (-a) and the nodes allocated
to each one (-n)
qstat -q list all queues on system
qstat -a list all jobs on system
qstat -r list all running jobs
xpbs Graphical User Interface to PBS commands

http://www.clusterresources.com/torquedocs21/commands/qstat.shtml
http://www.uic.edu/depts/accc/hardware/argo/qstat.html
http://doesciencegrid.org/public/pbs/

"I do" and "I am"

George Carlin Sayings: "'I am' is reportedly the shortest sentence in the English language. Could it be that 'I do' is the longest sentence?"

Tuesday, February 22, 2011

R matlines / matplot - plot multiple lines

Plot multiple lines


> exprs(esetr)[1:5,]
             Sample_22 Sample_23 Sample_20 Sample_21 Sample_16 Sample_17
1428537_at      11.560    11.430    11.730    11.580    10.150    11.480
1426543_x_at     6.043     6.086     6.098     6.296     6.233     6.320
1439128_at       8.088     7.994     8.040     7.894     8.547     8.031

> plot(...)

> matlines(t(exprs(esetr)[1:5,]))

or

> matplot(t(exprs(esetr)[1:5,]))

R Lab

http://www-stat.stanford.edu/~susan/courses/s141/Rlab5/

No one can make you feel inferior without your permission.

Quotes by Eleanor Roosevelt

"No one can make you feel inferior without your permission."

Monday, February 21, 2011

Pipeline Pilot - commercial lims

http://accelrys.com/solutions/industry/academic/student-edition.html

Spearman vs Pearson correlation

http://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient

In Situ Hybridization (ISH)

In Situ Hybridization (ISH) is a technique that allows for precise localization of a specific segment of nucleic acid within a histologic section. The underlying basis of ISH is that nucleic acids, if preserved adequately within a histologic specimen, can be detected through the application of a complementary strand of nucleic acid to which a reporter molecule is attached.

Visualization of the reporter molecule allows DNA or RNA sequences to be localized in heterogeneous cell populations, including tissue samples and environmental samples. Riboprobes also make it possible to localize gene expression and assess its level. The technique is particularly useful in neuroscience.

http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/TechISH.shtml

Sunday, February 20, 2011

Gene network inference


"The mRNA levels sensitively reflect the state of the cell, perhaps uniquely defining cell types, stages, and responses. To decipher the logic of gene regulation, we should aim to be able to monitor the expression level of all genes simultaneously ... " [Lander]

In a recent comparison of selected mRNA and protein abundances in human liver, a correlation of only 0.48 was observed between the two. Clearly, protein levels form an important part of the internal state of a cell.

In addition to mRNA and protein levels, one could imagine measuring a number of other parameters, including cell volume, growth rate, methylation states of DNA, phosphorylation state of proteins, localization of proteins and mRNA within the cell, ion levels, etc. One class of data which could be very useful is metabolite and nutrient levels.

For example, constraining each gene to be regulated by no more than 7 other genes will drastically reduce the number of regulatory interactions we need to consider.

Constraining the model by using a priori information about what is biologically known or plausible is probably the most important weapon we have to fight the Curse of Dimensionality! How precisely to include this information into the inference process is the true art of modeling.

http://www.cs.unm.edu/~patrik/networks/datareq.html

Friday, February 18, 2011

Ontologies

Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet 25, 25–29 (2000).

Whetzel, P. L. et al. The MGED Ontology: a resource for semantics-based description of microarray experiments. Bioinformatics 22, 866–873 (2006).

Sequence Ontology (SO; http://song.sourceforge.net),

ontologies are used to describe the semantics (that is, the types of data that exist and the relationships between them)

Modelling data across labs, genomes, space and time

http://www.nature.com.proxy.lib.sfu.ca/ncb/journal/v8/n11/full/ncb1496.html


Jason R. Swedlow, Suzanna E. Lewis & Ilya G. Goldberg
Nature Cell Biology 8, 1190 - 1194 (2006)
doi:10.1038/ncb1496


Each element of the workflow (acquisition > illumination correction > deconvolution > segmentation > tracking) requires a defined model of the inputs and outputs for each step.

Gene interactions

Gene interactions can result in the alteration or suppression of a phenotype. This can occur when an organism inherits two different dominant genes, for example, resulting in incomplete dominance. This is commonly seen in flowers, where breeding two flowers that pass down dominant genes can result in a flower of an unusual color caused by incomplete dominance. If red and white are dominant, for example, the offspring might be pinkish or striped in color as the result of a gene interaction.


The fruit fly is famously well studied in genetics, and much of the understanding of how gene interactions work comes from working with the fruit fly in lab environments. In organisms like humans, where genetic experimentation is viewed as unethical, geneticists are forced to rely on data from the existing population to learn about dominant and recessive traits and to see how groups of genetic traits can interact. A gene interaction is the result of inheriting genes that conflict in some way, making it impossible for all of them to express as coded, or of inheriting a set of interrelated genes that interact with each other to express a phenotype.


http://www.wisegeek.com/what-is-a-gene-interaction.htm

R lowess best fit line

require(graphics)

plot(cars, main = "lowess(cars)")
lines(lowess(cars), col = 2)

R generate factor levels - gl

> gl(2, 8, labels = c("Control", "Treat"))
[1] Control Control Control Control Control Control Control Control Treat
[10] Treat Treat Treat Treat Treat Treat Treat


> c('red','green','blue','brown','yellow')[brain[colSel, 'major_struct']]
[1] "brown" "red" "brown" "yellow" "yellow" "red" "red" "green"
[9] "yellow" "red" "red" "blue" "brown" "brown" "yellow" "green"
[17] "green" "red" "yellow" "brown" "blue" "green" "yellow" "red"
[25] "blue" "blue" "green" "blue" "red" "green" "yellow" "green"
[33] "yellow" "blue" "green" "blue" "yellow" "yellow" "yellow" "red"
[41] "red" "red" "yellow" "red" "green" "blue" "red" "green"
[49] "brown" "green" "blue" "blue" "blue" "yellow" "blue" "red"

Tuesday, February 15, 2011

Phylogeny, evolution

HKY (Hasegawa, Kishino and Yano 1985)
- Rate distance matrix
- substitution matrix
- distinguishes between the rate of transitions and transversions { rate matrix, Q, }
- allows unequal base frequencies { π, an equilibrium vector }
The diagonals of the Q matrix are chosen so that the rows sum to zero:
http://en.wikipedia.org/wiki/Models_of_DNA_evolution
http://en.wikipedia.org/wiki/Substitution_model
www.bioportal.uio.no/onlinemat/phylcourse/MaximumLikelihood.pdf
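
A minimal Python sketch of an HKY-style rate matrix, to make the "rows sum to zero" point concrete. The function names and the example values of π and κ are made up for this note. The matrix is left unscaled (it is not normalized to one expected substitution per unit time), which is an assumption of this sketch.

```python
BASES = "ACGT"
PURINES = {"A", "G"}


def is_transition(a, b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes (a != b assumed)."""
    return (a in PURINES) == (b in PURINES)


def hky_rate_matrix(pi, kappa):
    """Unscaled HKY rate matrix Q for base frequencies pi (order A, C, G, T).

    Off-diagonal entry q[i][j] is kappa * pi[j] for transitions and pi[j] for
    transversions; each diagonal entry is set so that its row sums to zero.
    """
    q = [[0.0] * 4 for _ in range(4)]
    for i, a in enumerate(BASES):
        for j, b in enumerate(BASES):
            if i != j:
                q[i][j] = (kappa if is_transition(a, b) else 1.0) * pi[j]
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this sums the off-diagonals
    return q


# Hypothetical equilibrium frequencies and transition/transversion rate ratio.
pi = [0.1, 0.2, 0.3, 0.4]
Q = hky_rate_matrix(pi, kappa=2.0)
for row in Q:
    print(["%.2f" % v for v in row])
```

Because the exchange rates are symmetric and each column is weighted by π, the vector π is the equilibrium distribution of this Q (π Q = 0).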

BLOSUM62
- Amino acid log-odds substitution model

Neighbor Joining
- hierarchical pairwise clustering tree-building
- uses a pairwise distance matrix (eg. distances estimated under a substitution model) as input
- produce unrooted tree
- based on the Minimum Evolution criterion for phylogenetic trees, i.e. the topology that gives the least total branch length is preferred at each step of the algorithm.
- greedy so fast
- does not assume that all lineages evolve at the same rate ( molecular clock hypothesis) (assumed by UPGMA)
http://www.economicexpert.com/a/Neighbor:joining.htm

Likelihood (a function of the parameter, given the observed data) vs Probability (of the data, given a known parameter, eg. we know the coin is fair, p=0.5)
- P(HH | ph=0.5) = 0.25 - what's the probability of seeing two heads if the probability of getting heads is 0.5
- Likelihood L(ph=0.5 | HH) = 0.25 - how well ph = 0.5 (parameter) explains the two heads we saw (observed data); same number, but now read as a function of ph
- HH = Observation
http://stats.stackexchange.com/questions/2641/what-is-the-difference-between-likelihood-and-probability
http://en.wikipedia.org/wiki/Likelihood_function

Maximum Likelihood Estimate
- assume observations are iid (independent and identically distributed) (x1, x2, ..., xn)
- L(theta | x1, x2, ..., xn)
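
A toy Python sketch of the coin example: the likelihood of ph given iid flips, and the maximum likelihood estimate. The function names are invented for this note, and observations are assumed to be a string of "H" and "T".

```python
def likelihood(p_heads, observations):
    """L(p_heads | observations) for iid coin flips given as a string of 'H'/'T'."""
    result = 1.0
    for obs in observations:
        result *= p_heads if obs == "H" else 1.0 - p_heads
    return result


def mle_p_heads(observations):
    """Closed-form maximum likelihood estimate: the observed fraction of heads."""
    return observations.count("H") / len(observations)


print(likelihood(0.5, "HH"))  # 0.25, same number as P(HH | ph = 0.5)
print(mle_p_heads("HHTH"))    # 0.75

# The MLE should beat every other candidate value of ph on a coarse grid.
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda p: likelihood(p, "HHTH"))
print(best)
```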

Felsenstein
- Compute the likelihood for a given tree
- finding maximum likelihood estimates for evolutionary trees from nucleic acid sequence data
- The key to the pruning algorithm is that once the four numbers are computed, they don't need to be recomputed again (using dynamic programming). The algorithm is a recursion that computes at each node of the tree from the same quantities in the immediate descendent node. The algorithm is applied starting at the node which has all of its immediate descendents being tips. Then it is applied successively further down the tree, not applying it to any node until all of its descendents have been processed.
http://www.stat.berkeley.edu/users/terry/Classes/s260.1998/Week13b/week13b/node8.html
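
A rough Python sketch of the pruning recursion for a single alignment site. Several things here are assumptions made for illustration and are not taken from the linked notes: the Jukes-Cantor substitution model, uniform base frequencies at the root, and a tree encoded as nested tuples `(left, t_left, right, t_right)` with tips given as a base letter. The "four numbers" at each node are the conditional likelihoods for A, C, G and T.

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a changing to base b along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def conditional_likelihoods(node):
    """Four conditional likelihoods (one per base) for the subtree rooted at node."""
    if isinstance(node, str):  # tip: the observed base has likelihood 1, the others 0
        return [1.0 if base == node else 0.0 for base in BASES]
    left, t_left, right, t_right = node
    l_vals = conditional_likelihoods(left)    # children are processed first,
    r_vals = conditional_likelihoods(right)   # each exactly once
    result = []
    for parent in BASES:
        from_left = sum(jc_prob(parent, c, t_left) * v for c, v in zip(BASES, l_vals))
        from_right = sum(jc_prob(parent, c, t_right) * v for c, v in zip(BASES, r_vals))
        result.append(from_left * from_right)
    return result


def site_likelihood(tree):
    """Likelihood of one site, with uniform base frequencies assumed at the root."""
    return sum(0.25 * v for v in conditional_likelihoods(tree))


# Hypothetical three-taxon tree: ((A:0.1, A:0.1):0.2, C:0.3)
tree = (("A", 0.1, "A", 0.1), 0.2, "C", 0.3)
print(site_likelihood(tree))
```

For a full alignment the per-site likelihoods would be multiplied together (or their logs summed), under the usual assumption that sites evolve independently.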

Resampling - Statistics

Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping)


Bootstrap - with replacement
Jackknife - without replacement


http://en.wikipedia.org/wiki/Resampling_(statistics)
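
A short Python sketch of both resampling schemes, used here to estimate the standard error of a statistic. The function names, the toy data, the number of bootstrap resamples and the fixed seed are all arbitrary choices made for this note.

```python
import math
import random
import statistics


def bootstrap_se(data, statistic, n_resamples=2000, seed=0):
    """Standard error from resamples drawn WITH replacement, same size as the data."""
    rng = random.Random(seed)
    estimates = [statistic(rng.choices(data, k=len(data))) for _ in range(n_resamples)]
    return statistics.stdev(estimates)


def jackknife_se(data, statistic):
    """Standard error from the n leave-one-out subsets (no replacement)."""
    n = len(data)
    leave_one_out = [statistic(data[:i] + data[i + 1:]) for i in range(n)]
    center = sum(leave_one_out) / n
    return math.sqrt((n - 1) / n * sum((e - center) ** 2 for e in leave_one_out))


data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
print(bootstrap_se(data, statistics.mean))
print(jackknife_se(data, statistics.mean))
print(statistics.stdev(data) / math.sqrt(len(data)))  # textbook standard error of the mean
```

For the mean, the jackknife estimate coincides with the textbook s / sqrt(n); the bootstrap estimate is close but depends on the random resamples.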

Mutual Information

Mutual Information between i and j
- measure correlation
- covariation of nucleotides
- 0 means i and j are independent, i.e. p(i,j) = p(i)p(j)
- useful for Covariance Models (CM) in RNAFolding (eg. RNAalifold - Hofacker et al. JMB 2002, 319:1059-1066)
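
A minimal Python sketch of mutual information between two alignment columns, in bits. The function name and the toy columns are invented for this note, and frequencies are plain observed counts with no pseudocounts or gap handling.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two equal-length alignment columns."""
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in count_ij.items():
        joint = count / n
        mi += joint * math.log2(joint / ((count_i[a] / n) * (count_j[b] / n)))
    return mi


# Perfect covariation: A always pairs with U, C always pairs with G.
print(mutual_information("AACC", "UUGG"))
# Independent columns: every combination is equally frequent.
print(mutual_information("AACC", "UGUG"))
```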

http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html#mifeatsel2

http://mendel.informatics.indiana.edu/~yye/lab/teaching/get.php?course=fall2010-I519&name=RNAfold.pdf

Bioinformatics Lecture Notes

http://mendel.informatics.indiana.edu/~yye/lab/teaching.php

Monday, February 14, 2011

All time favorite music - Happy Blue Valentine 'Mostly Ballads'


2NE1 Park Bom - You And I
Air Supply - Lost in Love
Autumn Tale - Prayer
Avril Lavigne - Innocence
Big Bang - Last Farewell
Big Bang - Let Me Hear Your Voice
BoA - Every Heart
BoA - Key of Heart
BoA - Love Letter
BoA - Smile Again
Bon Jovi - (You Want To) Make A Memory
Bryan Adams & Rod Stewart & Sting - All for Love
Celine Dion - My Heart Will Go On (Titanic)
Chayanne - Yo Te Amo
Chris Daughtry - Over You
Coldrain - 8AM (Hajime No Ippo)
David Archuleta - Crush
DBSK - Doushite kimi wo suki ni natte shimattandarou (why did I fall in love)
Donna Lewis - Richard Marx - At the Beginning (Anastasia)
Dream - My Will (Inuyasha)
East of Eden - Track 6
FIR - 1139055
FIR wo men de ai
Guang Liang - Tong Hua (Fairy Tale)
Hanah - Aitai Kimochi
Hoobastank - The Reason
HowL & J - Perhaps Love
Ikimono Gakari - Yell
Janice Vidal (Wei) - Big Brother (Da ge) 大哥
Janice Vidal (Wei) - Doesn't Matter (Wu Sou Wei) "無所謂"
정일영 (Jeong Il-yeong) - Reason 가을동화 (Ga-eul-dong-hwa)
Jewel - Foolish Games
Jewel - Hands
JJ Lin 殺手 (Sha Shou) (Killer)
Johnny Mathis - Chances Are
Jordan Hill - Remember Me This Way (Casper)
JUJU with JAY'ED-明日がくるなら
Kelly Clarkson - Catch My Breath
Ken Hirai (平井堅) - 僕は君に恋をする
Kim Bum Soo - A Sad story than Sadness
Kim Bum Soo - Bo Go Ship Da (I miss you) (Stairway to Heaven)
KISS - Because I'm A Girl
Last Alliance - Hekireki (Thunder) (Hajime No Ippo)
Leehom 王力宏 - Kiss Goodbye
Liu Geng Hong - Cai Hong Tian Tang (彩虹天堂) (Rainbow Paradise / Heaven)
Maroon5 - Won't Go Home Without You
Mecano - Hijo De La Luna
Michelle Williams - We Break the Dawn
Ne-Yo - Because Of You
Nickelback - Far Away
Remioromen - Konayuki (1 litre of tears)
Sayonara Solitaire - Chrono Crusade
S.H.E - Zui Jin Hai Hao Ma (How Are You Lately?)
Shin Seung Hoon (신승훈) - Love of Iris
Shin Seung Hun (신승훈) - I Believe (My Sassy Girl)
Shocking Lemon - Under Star (Hajime No Ippo)
Simple Plan - Save You
Snow Patrol - Chasing Cars
SPITZ - Tsugumi
Stanley Huang - wu shen lun (The Atheist)
Stanley Huang 黃立行 - 黑夜盡頭
State of Shock - Best I Ever Had
State of Shock - Money honey
Super Junior - It's You
Taeyang - Wedding Dress
Tanya Chua - Beautiful Love
Tanya Chua - Da Er Wen (Darwin)
Tanya Chua - Hu Xi (Breathing)
Tanya Chua - Kong Bai Ge (Blank Grid)
Taylor Swift - Love Story
Taylor Swift - You Belong With Me
Thelma Aoyama feat.SoulJa - Soba ni Irune
Thelma Aoyama feat.SoulJa - Koko ni Iruyo
Toni Braxton - Unbreak My Heart
TVXQ!(동방신기)  -  HUG (포옹)
Utada Hikaru - Colors
Utada Hikaru - Flavor of Life
Utada Hikaru - HEART STATION
Valentine - Kina Grannis
Winter Sonata - From the Beginning
Winter Sonata - My Memory


http://www.jpopasia.com/charts/

Kim Bum Soo
http://wn.com/Kim_Bum_Soo

Dinucleotide (NpN) - two nucleotides linked by phosphodiester bonds

a "dinucleotide" is a single piece of DNA or RNA that is two nucleotides long.

The example you give of a thymidine dinucleotide would be two thymidine nucleotides attached by a phosphate bridge - in this case, the 5'-phosphate of one thymidine bonds to the 3'-hydroxyl group of the other thymidine, just as the bonding occurs in complete DNA and RNA molecules.

http://www.madsci.org/posts/archives/2001-02/982619379.Bc.r.html 

Sunday, February 13, 2011

RDF - Resource Description Framework

http://www.w3.org/TR/REC-rdf-syntax/

The Resource Description Framework (RDF) is a general-purpose language for representing information in the Web.

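RDF is based on the idea of identifying things using Web identifiers (called Uniform Resource Identifiers, or URIs), and describing resources in terms of simple properties and property values. This enables RDF to represent simple statements about resources as a graph of nodes and arcs representing the resources, and their properties and values. To make this discussion somewhat more concrete as soon as possible, the group of statements "there is a Person identified by http://www.w3.org/People/EM/contact#me, whose name is Eric Miller, whose email address is em@w3.org, and whose title is Dr." can be written in RDF/XML as:

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:contact="http://www.w3.org/2000/10/swap/pim/contact#">

  <contact:Person rdf:about="http://www.w3.org/People/EM/contact#me">
    <contact:fullName>Eric Miller</contact:fullName>
    <contact:mailbox rdf:resource="mailto:em@w3.org"/>
    <contact:personalTitle>Dr.</contact:personalTitle>
  </contact:Person>

</rdf:RDF>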

Genotype vs Haplotype

Genotype
- single locus
- eg. AA, Aa, aa

Haplotype
- an ordered list of alleles of multiple linked loci on a single chromosome
- A way of denoting the collective genotype of a number of closely linked loci on a chromosome.
- multiple locus
- eg. ABC, ABc

Haplotype Phasing
www.pnas.org/content/108/1/12.abstract
- Determining which parent (mom or dad) each haplotype came from
- HapMap obtained this info from 30 mother-father-child trios
- Other ways include SNP array profiling from chromosome microdissection
- To determine the haplotypes from genotypes containing tightly linked SNPs from a set of n individuals
- Determining Which SNPs are Inherited Together http://www.chromosomechronicles.com/2009/09/08/phasing-determining-which-snps-are-inherited-together/


www.sph.umich.edu/~qin/lect1.ppt
http://www.bio.net/mm/gen-link/1998-October/001554.html
http://www.ornl.gov/sci/techresources/Human_Genome/glossary/glossary.shtml#H

Linkage disequilibrium

Linkage disequilibrium
- strong linkage disequilibrium (D' > 0.5) means the presence of one particular variant at one site is indicative of the presence of another variant at a second site.
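
A small Python sketch of how D' can be computed for two biallelic sites from haplotype and allele frequencies. The function name and the example frequencies are made up for this note. It returns the absolute value |D'|, so the result always lies between 0 and 1.

```python
def d_prime(p_ab, p_a, p_b):
    """|D'| for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A at site 1 and B at site 2
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return 0.0 if d_max == 0 else abs(d) / d_max


print(d_prime(0.50, 0.5, 0.5))  # A and B always occur together: complete LD
print(d_prime(0.25, 0.5, 0.5))  # haplotype frequency equals p_a * p_b: no LD
print(d_prime(0.40, 0.5, 0.5))  # in between, above the 0.5 cut-off used here
```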

http://www.patentlens.net/patentlens/patents.html?patnums=US_2010_0035251_A1&language=en&query=%28US_20100035251%20in%20publication_number%29&stemming=true&returnTo=patentnumber.html%3Fquery%3D%26stemming%3Dtrue%26patentNumber%3D20100035251%26collections%3DUS_B%2CEP_B%2CAU_B%2CUS_A%2CWO_A%2CAU_A%26language%3Den#claim

Saturday, February 12, 2011

eventbrite.com - Free reservation, ticketing system

http://www.eventbrite.com/

Parkinson's Disease vs Alzheimer's

http://www.nejm.org/doi/full/10.1056/NEJM2003ra020003

Alzheimer's
- Long-term memory loss
- Most prevalent form of dementia
- No cure
- believed to be caused by beta amyloids
- accumulates tau protein in the form of neurofibrillary tangles

Parkinson's
- resting tremor, bradykinesia, rigidity, and postural instability
- Second most prevalent
- some cases are linked to mutations in genes such as LRRK2
- accumulation of alpha-synuclein protein in the brain in the form of Lewy bodies  http://www.pdonlineresearch.org/pdguide/park14-sncaalpha-synuclein
http://www.pdonlineresearch.org/
- dopaminergic (DA) neuron degeneration
http://en.wikipedia.org/wiki/Parkinson%27s_disease

Parkinson's Disease: Why Dopamine Replacement Therapy Has a Paradoxical Effect on Cognition
Symptoms can also affect cognition and mood and may even lead to depression. According to Health Canada, it is estimated that 1 in 100 Canadians over age 60 are diagnosed with this condition. The direct and indirect costs associated with Parkinson's disease exceed $450 million a year.
http://www.sciencedaily.com/releases/2011/06/110615015057.htm

http://www.cleveland.com/healthfit/index.ssf/2011/05/the_parkinsons_mystery.html

Peter A LeWitt et al. AAV2-GAD gene therapy for advanced Parkinson's disease: a double-blind, sham-surgery controlled, randomised trial. The Lancet Neurology, 17 March 2011 DOI: 10.1016/S1474-4422(11)70039-4
http://www.sciencedaily.com/releases/2011/03/110316222026.htm

Substantia nigra
http://en.wikipedia.org/wiki/Substantia_nigra
A brain structure located in the mesencephalon (midbrain) that plays an important role in reward, addiction, and movement. Substantia nigra is Latin for "black substance", as parts of the substantia nigra appear darker than neighboring areas due to high levels of melanin in dopaminergic neurons. Parkinson's disease is caused by the death of dopaminergic neurons in the substantia nigra pars compacta.

http://www.montgomeryadvertiser.com/article/20110621/LIFESTYLE/106210313/Column-Parkinson-s-disease-has-many-causes

There are many causes for Parkinson's disease. The most common is aging. There is also a strong genetic correlation, as approximately one-fourth of all people with Parkinson's have a relative with this disease. Unfortunately, as with most diseases in which there is a strong genetic basis, there are some relatives who die before the disease manifests itself.

Parkinson's disease can also be the result of a neurological injury from exposure to an environmental agent, which can also cause neurodegeneration in the brain. Some of these culprits include exposure to pesticides and herbicides such as exposure to Agent Orange during the Vietnam War.

Deep brain stimulation has been in use for over 20 years and has helped more than 80,000 people worldwide. The method has been used widely in Europe, but was only approved in the U. S. in 2002.

http://www.webmd.com/parkinsons-disease/news/20110624/new-genetic-clues-to-cause-of-parkinsons


Park LOCI (PARK1-PARK16)
Summary of "PARK" Loci and of linkage regions implied by Genome-wide linkage analysis
http://www.pdgene.org/linkage.asp

Friday, February 11, 2011

Magic Number 42 - The Hitchhiker's Guide to the Galaxy, as the "Answer to the Ultimate Question of Life, the Universe, and Everything".

http://en.wikipedia.org/wiki/42_%28number%29

Correlations

a<-rep(1,10)
a[10] <- 1.1  # avoid 0 sd
b<-rep(0,10)
b[10] <- 0.1
c<-c(1,0,1,0,1,0,1,0,1,0)
d<-c(0,1,0,1,0,1,0,1,0,1)
dat<-data.frame(a,b,c,d)

> dat
     a   b c d
1  1.0 0.0 1 0
2  1.0 0.0 0 1
3  1.0 0.0 1 0
4  1.0 0.0 0 1
5  1.0 0.0 1 0
6  1.0 0.0 0 1
7  1.0 0.0 1 0
8  1.0 0.0 0 1
9  1.0 0.0 1 0
10 1.1 0.1 0 1
> cor(dat)
           a          b          c          d
a  1.0000000  1.0000000 -0.3333333  0.3333333
b  1.0000000  1.0000000 -0.3333333  0.3333333
c -0.3333333 -0.3333333  1.0000000 -1.0000000
d  0.3333333  0.3333333 -1.0000000  1.0000000

> apply(t(dat), 1, sd)
         a          b          c          d
0.03162278 0.03162278 0.52704628 0.52704628

Wednesday, February 9, 2011

A-G and T-C mutations (transitions) are more frequent than transversions

Transitions (A–G and T–C mutations in DNA) are more frequent than transversions (the rest) (e.g. Gojobori et al., 1982).
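
A tiny Python sketch that classifies substitutions between two aligned DNA sequences. The function names and the example sequences are invented for this note, and the sequences are assumed to be the same length, upper case, and free of gaps.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Return 'identical', 'transition' or 'transversion' for a pair of bases."""
    if a == b:
        return "identical"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"  # A<->G or C<->T
    return "transversion"    # any purine<->pyrimidine change


def ts_tv_counts(seq1, seq2):
    """Count (transitions, transversions) between two aligned sequences."""
    kinds = [classify_substitution(a, b) for a, b in zip(seq1, seq2)]
    return kinds.count("transition"), kinds.count("transversion")


# A->G and G->A are transitions, C->C is unchanged, T->A is a transversion.
print(ts_tv_counts("AGCT", "GACA"))
```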

Maximum Likelihood - MLE, parameter estimation

In general, for a fixed set of data and underlying probability model, the method of maximum likelihood selects values of the model parameters (eg. mean, variance) that produce the distribution most likely to have resulted in the observed data (i.e. the parameters that maximize the likelihood function). Maximum likelihood estimation gives a unified approach to estimation, which is well-defined in the case of the normal distribution and many other problems.

Commonly assumes data are
- iid (independent and identically distributed)
- drawn from a specified parametric family (the normal distribution is a common choice, not a requirement)

http://en.wikipedia.org/wiki/Maximum_likelihood

Felsenstein - PHYLIP, phylogenetic inference

Joseph "Joe" Felsenstein
- phylogenetic inference
- PHYLIP
- Felsenstein's tree peeling algorithm provides a computationally feasible scheme for finding maximum likelihood estimates for evolutionary trees from nucleic acid sequence data.

http://en.wikipedia.org/wiki/Joseph_Felsenstein
http://en.wikipedia.org/wiki/Felsenstein%27s_tree_peeling_algorithm

Training parameters for HMM

Manually adjusting the parameters of an HMM in order to get a high prediction accuracy can be a very time-consuming task which is also not guaranteed to improve the performance accuracy. A variety of training algorithms have therefore been devised in order to address this challenge. These training algorithms require as input and starting point a so-called training set of (typically partly annotated) data. Starting with a set of (typically user-chosen) initial parameter values, the training algorithm employs an iterative procedure which subsequently derives new, more refined parameter values. The iterations are stopped when a termination criterion is met, e.g. when a maximum number of iterations has been completed or when the change of the log-likelihood from one iteration to the next becomes sufficiently small. The model with the final set of parameters is then used to test if the performance accuracy has been improved. This is typically done by analyzing a test set of annotated data which has no overlap with the training set by comparing the predicted to the known annotation.

Another well-known training algorithm for HMMs is Baum-Welch training [21] which is an expectation maximization (EM) algorithm [22]. In each iteration, a new set of parameter values is derived from the estimated number of counts of emissions and transitions by considering all possible state paths (rather than only a single Viterbi path) for every training sequence. The iterations are typically stopped after a fixed number of iterations or as soon as the change in the log-likelihood is sufficiently small. For Baum-Welch training, the likelihood P(X|ϕ) [13] of the training sequences X given the parameters ϕ can be shown to converge (under some conditions) to a stationary point which is either a local optimum or a saddle point. Baum-Welch training using the traditional combination of forward and backward algorithm [13] is, for example, implemented in the prokaryotic gene prediction method EASYGENE [23] and the HMM-compiler HMMoC [15]. As for Viterbi training, the outcome of Baum-Welch training may strongly depend on the chosen set of initial parameter values. As Jensen [24] and Khreich et al. [25] describe, computationally more efficient algorithms for Baum-Welch training which render the memory requirement independent of the sequence length have been proposed, first in the communication field by [26-28] and later, independently, in bioinformatics by Miklós and Meyer [29], see also [30]. The advantage of this linear-memory algorithm is that it is comparatively easy to implement as it requires only a one- rather than a two-step procedure and as it scans the sequence in a uni- rather than bi-directional way. This algorithm was employed by Hobolth and Jensen [31] for comparative gene prediction and has also been implemented, albeit in a modified version, by Churbanov and Winters-Hilt [30] who also compare it to other implementations of Viterbi and Baum-Welch training including checkpointing implementations.

Stochastic expectation maximization (EM) training or Monte Carlo EM training [32] is another iterative procedure for training the parameters of HMMs. Instead of considering only a single Viterbi state path for a given training sequence as in Viterbi training or all state paths as in Baum-Welch training, stochastic EM training considers a fixed number K of state paths Π which are sampled from the posterior distribution P(Π|X) for every training sequence X in every iteration. Sampled state paths have already been used in several bioinformatics applications for sequence decoding, see e.g. [2,33] where sampled state paths are used in the context of gene prediction to detect alternative splice variants.
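
To make the sampling step concrete, here is a minimal sketch of drawing K state paths from P(Π|X) with a forward pass followed by backward sampling. This is the textbook forward-filtering / backward-sampling scheme, not necessarily the memory-efficient variant from the paper, and the toy two-state HMM and all names are assumptions for illustration. Probabilities are left unscaled, so it is only suitable for short sequences.

```python
import random

def forward_table(obs, init, trans, emit):
    """f[t][k] = P(x_1..x_t, state_t = k), unscaled."""
    n_states = len(init)
    f = [[init[k] * emit[k][obs[0]] for k in range(n_states)]]
    for x in obs[1:]:
        prev = f[-1]
        f.append([emit[k][x] * sum(prev[j] * trans[j][k] for j in range(n_states))
                  for k in range(n_states)])
    return f

def sample_state_path(obs, init, trans, emit, rng):
    """Draw one state path from the posterior P(path | obs)."""
    f = forward_table(obs, init, trans, emit)
    states = range(len(init))
    # The last state is drawn in proportion to the final forward column.
    path = rng.choices(states, weights=f[-1])
    for t in range(len(obs) - 2, -1, -1):
        nxt = path[-1]
        # P(state_t = j | state_{t+1} = nxt, x_1..x_t) is proportional to f[t][j] * trans[j][nxt]
        weights = [f[t][j] * trans[j][nxt] for j in states]
        path.append(rng.choices(states, weights=weights)[0])
    path.reverse()
    return path

# Hypothetical two-state HMM with two emission symbols (0 and 1)
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 0, 1, 1, 1]
rng = random.Random(0)
K = 3
sampled_paths = [sample_state_path(obs, init, trans, emit, rng) for _ in range(K)]
```

In stochastic EM the emission and transition counts would then be tallied from these K sampled paths instead of from one Viterbi path or from all paths.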

http://www.ncbi.nlm.nih.gov.proxy.lib.sfu.ca/pmc/articles/PMC3019189/?tool=pubmed

Tuesday, February 8, 2011

Briefings in Bioinformatics

Bioinformatics training: a review of challenges, actions and support requirements.

http://www.binf.umn.edu/about/whatsbinf.php

Schneider MV, Watson J, Attwood T, Rother K, Budd A, McDowall J, Via A, Fernandes P, Nyronen T, Blicher T, Jones P, Blatter MC, De Las Rivas J, Phillip Judge D, van der Gool W, Brooksbank C.
Bioinformatics training: a review of challenges, actions and support requirements.
Brief Bioinform. 2010 Jun 18.

Wright VA, Vaughan BW, Laurent T, Lopez R, Brooksbank C, Schneider MV.
Bioinformatics training: selecting an appropriate learning content management system--an example from the European Bioinformatics Institute.
Brief Bioinform. 2010 Jul 2.

Yamashita G, Miller H, Goddard A, Norton C.
A model for Bioinformatics training: the marine biological laboratory.
Brief Bioinform. 2010 Jul 25.

Cummings MP, Temple GG.
Broader incorporation of bioinformatics in education: opportunities and challenges.
Brief Bioinform. 2010 Aug 26.

Buttigieg PL.
Perspectives on presentation and pedagogy in aid of bioinformatics education.
Brief Bioinform. 2010 Aug 19.

Kulkarni-Kale U, Sawant S, Chavan V.
Bioinformatics education in India.
Brief Bioinform. 2010 Aug 12.

Jungck JR, Donovan SS, Weisstein AE, Khiripet N, Everse SJ.
Bioinformatics education dissemination with an evolutionary problem solving perspective.
Brief Bioinform. 2010 Nov;11(6):570-81.

Human Genome glossary of terms

http://www.ornl.gov/sci/techresources/Human_Genome/glossary/glossary.shtml

SCFG - Stochastic Context Free Grammar, CYK - finds Viterbi Path, HKY -

http://en.wikipedia.org/wiki/Stochastic_context-free_grammar
A stochastic context-free grammar (SCFG; also probabilistic context-free grammar, PCFG) is a context-free grammar in which each production is augmented with a probability. The probability of a derivation (parse) is then the product of the probabilities of the productions used in that derivation; thus some derivations are more consistent with the stochastic grammar than others.

A variant of the CYK algorithm finds the Viterbi parse of a sequence for a given SCFG. The Viterbi parse is the most likely derivation (parse) of the sequence by the given SCFG.

http://en.wikipedia.org/wiki/CYK_algorithm
The Cocke–Younger–Kasami (CYK) algorithm (alternatively called CKY) determines whether a string can be generated by a given context-free grammar and, if so, how it can be generated. This is known as parsing the string. The algorithm employs bottom-up parsing and dynamic programming.
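
A minimal sketch of the CYK recognizer, assuming the grammar is already in Chomsky normal form and ignoring the S -> ε rule. The dictionary encoding of the rules and the toy grammar for the language a^n b^n are my own choices for illustration.

```python
def cyk_recognize(words, unary, binary, start="S"):
    """Return True if the CNF grammar can generate the sequence of words."""
    n = len(words)
    if n == 0:
        return False  # the S -> ε case is not handled in this sketch
    # table[i][j] holds the nonterminals that derive words[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][0] = set(unary.get(word, ()))
    # Bottom-up: fill in longer spans from all ways of splitting them in two.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            for split in range(1, span):
                for b in table[i][split - 1]:
                    for c in table[i + split][span - split - 1]:
                        table[i][span - 1] |= binary.get((b, c), set())
    return start in table[0][n - 1]

# Toy grammar for a^n b^n:  S -> A B | A T,  T -> S B,  A -> a,  B -> b
toy_unary = {"a": {"A"}, "b": {"B"}}
toy_binary = {("A", "B"): {"S"}, ("A", "T"): {"S"}, ("S", "B"): {"T"}}

accepts_aabb = cyk_recognize(list("aabb"), toy_unary, toy_binary)
accepts_abab = cyk_recognize(list("abab"), toy_unary, toy_binary)
```

The three nested loops over span, start and split point give the usual cubic running time in the length of the string.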

Chomsky Normal Form
http://en.wikipedia.org/wiki/Chomsky_normal_form
In computer science, a context-free grammar is said to be in Chomsky normal form if all of its production rules are of the form:

A -> BC or
A -> α or
S -> ε

where A, B and C are nonterminal symbols, α is a terminal symbol (a symbol that represents a constant value), S is the start symbol, and ε is the empty string. Also, neither B nor C may be the start symbol.

Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be transformed into an equivalent one which is in Chomsky normal form.

http://en.wikipedia.org/wiki/Weighted_context-free_grammar
A weighted context-free grammar (WCFG) is a context-free grammar where each production has a numeric weight associated with it. The weight of a parse tree in a WCFG is the weight of the rule used to produce the top node, plus the weights of its children. A special case of WCFGs are stochastic context-free grammars, where the weights are (logarithms of) probabilities.
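
With log-probabilities as weights, the Viterbi parse mentioned earlier is the highest-weight tree, and the CYK table can store the best weight per nonterminal instead of a plain set. A sketch under the same CNF assumption; the rule encoding and the toy ambiguous grammar are hypothetical.

```python
import math

def viterbi_parse_weight(words, unary, binary, start="S"):
    """Log-probability of the most likely parse, or -inf if there is none."""
    n = len(words)
    if n == 0:
        return -math.inf
    # best[i][j] maps nonterminal -> best log-weight for words[i : i + j + 1]
    best = [[{} for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        for head, prob in unary.get(word, {}).items():
            best[i][0][head] = math.log(prob)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            cell = best[i][span - 1]
            for split in range(1, span):
                left = best[i][split - 1]
                right = best[i + split][span - split - 1]
                for (b, c), heads in binary.items():
                    if b in left and c in right:
                        for head, prob in heads.items():
                            # tree weight = rule weight plus the weights of its children
                            score = math.log(prob) + left[b] + right[c]
                            if score > cell.get(head, -math.inf):
                                cell[head] = score
    return best[0][n - 1].get(start, -math.inf)

# Toy SCFG:  S -> S S (0.4) | a (0.6)
scfg_unary = {"a": {"S": 0.6}}
scfg_binary = {("S", "S"): {"S": 0.4}}
best_log_prob = viterbi_parse_weight(["a", "a", "a"], scfg_unary, scfg_binary)
```

Replacing the max with a sum over probabilities would give the inside algorithm, i.e. the total probability of the string over all parses.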

Sunday, February 6, 2011

Geany - Programming editor for Ubuntu

http://ubuntuforums.org/archive/index.php/t-880066.html

Batch effects (lab effects) eg. variation due to different lab groups or processing time

http://rafalab.jhsph.edu/batch/

Batch effects are sub-groups of measurements that have qualitatively different behaviour across conditions and are unrelated to the biological or scientific variables in a study. For example, batch effects may occur if a subset of experiments was run on Monday and another set on Tuesday, if two technicians were responsible for different subsets of the experiments or if two different lots of reagents, chips or instruments were used.

http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html

Combat
http://jlab.byu.edu/ComBat/Abstract.html
Reference: Johnson, WE, Rabinovic, A, and Li, C (2007). Adjusting batch effects in microarray expression data using Empirical Bayes methods. Biostatistics 8(1):118-127. [Abstract]

Confounded
An extraneous variable (for example, processing date) is said to be confounded with the outcome of interest (for example, disease state) when it correlates both with the outcome and with an independent variable of interest (for example, gene expression).

Patent Databases

Free
* PatentLens (by Cambia)
o http://www.patentlens.net/
o 10,702,297 patent documents
o updated Dec 21, 2010 (No CA, only US, Europe, Australia, WIPO/PCT)
o Sequence Search Facility
o RSS feeds
o limitations with alternate spellings eg. harbor / harbour, names eg. J. Smith vs John Smith, OCR processed
o great documentation
o nice UI
o PatentFamily - Easily find patent numbers filed to other countries, current status of patent
o sequence DB last updated May 2, 2010
o Full text documents include:
+ 1976 onwards - All US granted patents
+ mid-1998 onwards - All Australian granted patents
+ 1980 onwards - EP-B granted patents
+ 1978 onwards - WO-A/PCT patents
* USPTO - US Patent and Trademark Office, old interface, lots of scrolling down
o Publication Site for Issued and Published Sequences (PSIPS) http://seqdata.uspto.gov/
o Shopping cart?
* GenBank/Entrez
o cystic fibrosis[All Fields] AND (all[filter] AND gbdiv_pat[PROP])
o http://www.ncbi.nlm.nih.gov/nuccore/
* DNA Patent Database http://dnapatents.georgetown.edu/
+ DPD updated February 2, 2011
+ U.S. Granted Patents = 57,176
+ U.S. Patent Applications = 87,438
* Google Patents (better interface but can only read / download PDF)
* JPO / IPDL - Industrial Property Digital Library (Japanese)
* EPO (European Patent Office)
* CIPO (Canada)
* World / European? http://ep.espacenet.com
* WIPO (PatentScope) http://www.wipo.int/pctdb/en/
* Blogs, newsletters, magazines - http://www.ipfrontline.com/, mailing lists



Commercial
* Thomson Reuters / Derwent -
o DWPI (Derwent World Patents Index)
o GENESEQ (manually and professionally curated, continually growing with ~600 unique patents every two weeks)
o http://thomsonreuters.com/products_services/science/science_products/a-z/geneseq/
o patent titles and abstracts written in English — using clear, consistent, industry-specific terms.
o Delphion
+ $4.00 each time you perform a search
+ $6.00 for each full Derwent Record viewed
+ http://www.delphion.com
o MicroPatent, WestLaw
* CAS (a division of the American Chemical Society) CAPlus
o 1,500 key scientific journals
o Available within 2 days of patent’s issuance (JPO, USPTO, CIPO, etc.)
+ Fully indexed by scientists within 27 days
o http://www.cas.org/expertise/cascontent/caplus/index.html
o Use STN
+ STN - operated jointly by CAS, FIZ Karlsruhe and JAICI
o 1907 to the present, plus many records from earlier years
o Books, Dissertations, Reviews, etc.
* LexisNexis
o Legal Store - Bookstore
o Law firms, Corporate, Academic, and Government solutions
o provides customers with access to billions of searchable documents and records from more than 45,000 legal, news and business sources.
o http://www.lexisnexis.com
* Cooler looking websites, fast, excellent tutorial, documentation, analytical tools, mobile apps, blogs, up-to-date info, powerful search and indexing



Notes
* Commercial websites have more user-friendly UIs and offer higher quality annotations, better support and training, and analysis tools (?) compared with the free ones, but obviously cost more (negotiable?)
* Maybe give a couple of commercial products a test drive first
* Sequence BLAST search available for commercial (Thomson, CAS) and public (GenBank, PatentLens)
* Fairly good coverage dating back from 1970s and fairly fast
* Full text search obtained through OCR scanning of documents
* Main difficulty comes from the search terms entered: alternate spellings, author name formatting, finding the right word to search (this is what DWPI is trying to address)



References
* Jon R. Cavicchi, Intellectual Property Research Tools and Strategies: Lexis vs. Westlaw for Research--Better, Different, or Same and the QWERTY Effect?, 47 IDEA 363 (2007).
o LexisNexis vs Westlaw - Qwerty effect - history matters, also personal feel (not so much on function)
* G. Gann Xu, Amie Webster, and Ellen Doran, “Patent sequence databases,” World Patent Information 24, no. 2 (June 2002): 95-101.
* http://www.ornl.gov/sci/techresources/Human_Genome/elsi/patents.shtml

Friday, February 4, 2011

How not to collaborate with a biostatistician

http://www.xtranormal.com/watch/6878253/

Mouse brain resources and gene expression

To illustrate the potential of the INCF Digital Atlasing framework, we integrated three major community resources into this developing infrastructure as atlas hubs: the ABA and associated tools such as the Anatomic Gene Expression Atlas (AGEA), EMAP/EMAGE for developmental mouse brain data, and the WBC, which integrates the UCSD/BIRN Smart Atlas (Spatial Mark-Up and Rendering Tool) and the Cell Centered Database (CCDB, http://www.ccdb.ucsd.edu/), including the Paxinos and Watson mouse brain atlas. Each of these atlases represents an important community resource in the rodent brain research community. To make these atlases interoperable, we registered the atlases to WHS and made their data accessible via the standards and Web services indicated above.

http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1001065

Thursday, February 3, 2011

Master's Defense

15-20 min presentation
round of questions 3x15 min
< 2 hrs in total
breadth questions: what is BLAST, 3 domains of GO (biological process, cellular component, molecular function)

* cellular component, the parts of a cell or its extracellular environment;
* molecular function, the elemental activities of a gene product at the molecular level, such as binding or catalysis;
* biological process, operations or sets of molecular events with a defined beginning and end, pertinent to the functioning of integrated living units: cells, tissues, organs, and organisms.


motif - zinc fingers bind zinc ions

RDF

weeding out outliers

how will it affect biologists

reproducible research

results sound and complete?

greedy vs dynamic programming

http://www.algorito.com/algorithm/counting-coins-algorithm-greedy-vs-dynamic-programming

# The greedy method computes its solution by making its choices in a serial forward fashion, never looking back or revising previous choices. (can get stuck at local minima) (FASTER than dynamic programming)

Nevertheless, they are useful because they are quick to think up and often give good approximations to the optimum

# Dynamic programming computes its solution bottom up by synthesizing them from smaller subsolutions, and by trying many possibilities and choices before it arrives at the optimal set of choices.

In other words, a greedy algorithm never reconsiders its choices. This is the main difference from dynamic programming, which is exhaustive and is guaranteed to find the solution. After every stage, dynamic programming makes decisions based on all the decisions made in the previous stage, and may reconsider the previous stage's algorithmic path to solution.
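
The coin counting problem from the link above shows the difference well. A minimal sketch (function names are mine): with coins 1, 3 and 4 and a target of 6, greedy takes 4 + 1 + 1, while dynamic programming finds 3 + 3.

```python
def greedy_coins(amount, coins):
    """Take the largest coin that fits, never revising a choice."""
    count = 0
    for coin in sorted(coins, reverse=True):
        take, amount = divmod(amount, coin)
        count += take
    return count if amount == 0 else None

def dp_coins(amount, coins):
    """Bottom-up DP: fewest coins for every sub-amount from 0 up to amount."""
    INF = float("inf")
    fewest = [0] + [INF] * amount
    for sub in range(1, amount + 1):
        for coin in coins:
            if coin <= sub and fewest[sub - coin] + 1 < fewest[sub]:
                fewest[sub] = fewest[sub - coin] + 1
    return fewest[amount] if fewest[amount] != INF else None

greedy_result = greedy_coins(6, [1, 3, 4])  # 4 + 1 + 1
dp_result = dp_coins(6, [1, 3, 4])          # 3 + 3
```

For some coin systems, such as 1, 5, 10, 25, the two agree, which is why greedy is often good enough in practice.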

semantic web

http://dbpedia.org/About

http://en.wikipedia.org/wiki/Semantic_Web

The Semantic Web is a group of methods and technologies to allow machines to understand the meaning – or "semantics" – of information on the World Wide Web.

Ontology, semantic web
- Resource Description Framework (RDF), OWL, XML, SPARQL
- make machine-readable

Cartesian product (cross product) of Character Vectors - interaction()

> y <- c("1","2","3")
> x <- c("aaa","bbb","ccc")
> interaction(x,y)
[1] aaa.1 bbb.2 ccc.3
Levels: aaa.1 bbb.1 ccc.1 aaa.2 bbb.2 ccc.2 aaa.3 bbb.3 ccc.3
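
Note that the values returned are the element-wise pairs, and the full Cartesian product only shows up in the Levels line. For comparison, a rough Python equivalent of both (not R, and the variable names are mine):

```python
from itertools import product

x = ["aaa", "bbb", "ccc"]
y = ["1", "2", "3"]

# Element-wise pairing, like the values printed by interaction(x, y).
paired = [f"{a}.{b}" for a, b in zip(x, y)]

# Full Cartesian product, like the Levels line. Putting y in the outer
# position makes x vary fastest, which matches the order R prints.
levels = [f"{a}.{b}" for b, a in product(y, x)]
```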

https://stat.ethz.ch/pipermail/r-help/2008-December/181756.html

Wednesday, February 2, 2011

Baum-Welch Algorithm, expectation maximization (EM), HMM

Baum-Welch Algorithm
Intuition
To solve Problem 3 we need a method of adjusting the lambda parameters to maximize the likelihood of the training set.

Suppose that the outputs (observations) are in a 1-1 correspondence with the states so that N = M, varphi(q_i) = v_i and b_i(j) = 1 for j = i and 0 for j != i. Now the Markov process is not hidden at all and the HMM is just a Markov chain. To estimate the lambda parameters for this Markov chain it is enough just to calculate the appropriate frequencies from the observed sequence of outputs. These frequencies constitute sufficient statistics for the underlying distributions.

In the more general case, we can't observe the states directly so we can't calculate the required frequencies. In the hidden case, we use expectation maximization (EM) as described in [Dempster et al., 1977].

Instead of calculating the required frequencies directly from the observed outputs, we iteratively estimate the parameters. We start by choosing arbitrary values for the parameters (just make sure that the values satisfy the requirements for probability distributions).

We then compute the expected frequencies given the model and the observations. The expected frequencies are obtained by weighting the observed transitions by the probabilities specified in the current model. The expected frequencies so obtained are then substituted for the old parameters and we iterate until there is no improvement. On each iteration we improve the probability of O being observed from the model until some limiting probability is reached. This iterative procedure is guaranteed to converge on a local maximum of the cross entropy (Kullback-Leibler) performance measure.
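
The re-estimation loop described above can be sketched as follows for a single observation sequence. This is a bare-bones version with unscaled probabilities (fine for short sequences only); the names pi, A, B for the initial, transition and emission parameters follow common HMM notation, and the toy numbers are arbitrary starting values, not taken from the tutorial.

```python
def forward(obs, pi, A, B):
    """alpha[t][i] = P(o_1..o_t, state_t = i)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[j][o] * sum(prev[i] * A[i][j] for i in range(N)) for j in range(N)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(o_{t+1}..o_T | state_t = i)."""
    N = len(A)
    beta = [[1.0] * N]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N)) for i in range(N)])
    return beta

def baum_welch_step(obs, pi, A, B):
    """One EM iteration: expected counts under the current model, then re-normalize."""
    N, M, T = len(pi), len(B[0]), len(obs)
    alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
    likelihood = sum(alpha[-1])
    # gamma[t][i] = P(state_t = i | obs)
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(N)] for t in range(T)]
    # xi[i][j] = expected number of i -> j transitions
    xi = [[sum(alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
               for t in range(T - 1)) / likelihood
           for j in range(N)] for i in range(N)]
    new_pi = gamma[0][:]
    new_A = [[xi[i][j] / sum(xi[i]) for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k)
              / sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_pi, new_A, new_B, likelihood

obs = [0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1]
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.8, 0.2], [0.3, 0.7]]
likelihoods = []
for _ in range(10):
    pi, A, B, lik = baum_welch_step(obs, pi, A, B)
    likelihoods.append(lik)  # likelihood of obs under the model before this update
```

The recorded likelihoods should never decrease from one iteration to the next, which is the "improve until some limiting probability is reached" behaviour described above. A real implementation would work with scaled or log probabilities and stop on a convergence threshold.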

http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/gen_patterns/s1_pg1.html
http://vimeo.com/7303679

Forward and Backward
- Goal is to calculate the probability of state theta at a specific time t, given all the emissions
- Split the Markov Chain, reduces total chain rule of joint probability to a bunch of recursion rules
- X = emission
- theta = state
- P(theta_t, X) = P(theta_t, X_{1..t}) * P(X_{t+1..n} | theta_t)
- So use forward algorithm to calculate the left
- And backward algorithm to calculate the right (why do this, more computationally efficient)
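
To check the split numerically, here is a sketch that compares the forward times backward product against a brute-force sum over every state path on a tiny HMM. The toy parameters and names are assumptions for illustration. The brute-force version costs N^n path evaluations, while forward-backward needs on the order of n * N^2 operations, which is the efficiency gain.

```python
from itertools import product

def forward(obs, pi, A, B):
    """alpha[t][i] = P(X_1..t, theta_t = i)."""
    N = len(pi)
    alpha = [[pi[i] * B[i][obs[0]] for i in range(N)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([B[j][o] * sum(prev[i] * A[i][j] for i in range(N)) for j in range(N)])
    return alpha

def backward(obs, A, B):
    """beta[t][i] = P(X_{t+1}..n | theta_t = i)."""
    N = len(A)
    beta = [[1.0] * N]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][o] * nxt[j] for j in range(N)) for i in range(N)])
    return beta

def joint_by_enumeration(obs, pi, A, B, t, state):
    """P(theta_t = state, X) by summing over every state path (exponential cost)."""
    total = 0.0
    for path in product(range(len(pi)), repeat=len(obs)):
        if path[t] != state:
            continue
        p = pi[path[0]] * B[path[0]][obs[0]]
        for k in range(1, len(obs)):
            p *= A[path[k - 1]][path[k]] * B[path[k]][obs[k]]
        total += p
    return total

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.8, 0.2], [0.3, 0.7]]
obs = [0, 1, 1, 0]
alpha, beta = forward(obs, pi, A, B), backward(obs, A, B)
evidence = sum(alpha[-1])  # P(X)
# Posterior P(theta_t = i | X) for every position t
posterior = [[alpha[t][i] * beta[t][i] / evidence for i in range(2)] for t in range(len(obs))]
```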

Entropy (Information Theory)

http://en.wikipedia.org/wiki/Entropy_(information_theory)

In information theory, entropy is a measure of the uncertainty associated with a random variable. In this context, the term usually refers to the Shannon entropy, which quantifies the expected value of the information contained in a message, usually in units such as bits.

Entropy is a measure of disorder, or more precisely unpredictability.

The extreme case is that of a double-headed coin which never comes up tails. Then there is no uncertainty. The entropy is zero: each toss of the coin delivers no information.
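
A quick sketch of Shannon entropy in bits for a discrete distribution, H = -sum(p * log2(p)), with the coin examples (the function name is mine):

```python
import math

def shannon_entropy(probs):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    # Outcomes with p = 0 contribute nothing, by the usual convention 0 * log(0) = 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

double_headed = shannon_entropy([1.0, 0.0])  # no uncertainty
fair_coin = shannon_entropy([0.5, 0.5])      # one full bit per toss
biased_coin = shannon_entropy([0.9, 0.1])    # somewhere in between
```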