Thursday, March 27, 2014

CLARITY challenge

http://www.childrenshospital.org/research-and-innovation/research-initiatives/clarity-challenge



An international effort towards developing standards for best practices in analysis, interpretation and reporting of clinical genome sequencing results in the CLARITY Challenge

The CLARITY Challenge provides a comprehensive assessment of current practices for using genome sequencing to diagnose and report genetic diseases. There is remarkable convergence in bioinformatic techniques, but medical interpretation and reporting are areas that require further development by many groups.

The CLARITY Challenge (Children’s Leadership Award for the Reliable Interpretation and appropriate Transmission of Your genomic information) is a contest initiated by Boston Children’s Hospital. Its goal is to identify best methods and practices for the analysis, interpretation and reporting of individuals’ DNA sequence data, to provide the most meaningful results to clinicians, patients and families.
 

Tuesday, March 18, 2014

Multivariate analysis

http://little-book-of-r-for-multivariate-analysis.readthedocs.org/en/latest/src/multivariateanalysis.html

Multivariate Analysis

This booklet tells you how to use the R statistical software to carry out some simple multivariate analyses, with a focus on principal components analysis (PCA) and linear discriminant analysis (LDA).

PCA How To
http://psych.colorado.edu/wiki/lib/exe/fetch.php?media=labs:learnr:emily_-_principal_components_analysis_in_r:pca_how_to.pdf 

So that’s it. To do PCA, all you have to do is follow these steps:
1. Get X in the proper form. This will probably mean subtracting off the means of each row. If the variances are significantly different in your data, you may also wish to scale each row by dividing by its standard deviation to give the rows a uniform variance of 1 (the subtleties of how this affects your analysis are beyond the scope of this paper, but in general, if you have significantly different scales in your data it’s probably a good idea).
2. Calculate A = XX^T.
3. Find the eigenvectors of A and stack them as rows to make the matrix P.
4. Your new data is PX; the new variables (a.k.a. principal components) are the rows of P.
5. The variance of each principal component can be read off the diagonal of the covariance matrix of the new data.

# Obtain data in a matrix (variables in rows, observations in columns)
Xoriginal = t(as.matrix(recorded.data))
# Center the data so that the mean of each row is 0
rm = rowMeans(Xoriginal)
X = Xoriginal - matrix(rep(rm, dim(Xoriginal)[2]), nrow=dim(Xoriginal)[1])
# Calculate P, whose rows are the principal components
A = X %*% t(X)
E = eigen(A, TRUE)
P = t(E$vectors)
# Find the new data and standard deviations of the principal components
newdata = P %*% X
sdev = sqrt(diag((1/(dim(X)[2]-1)) * P %*% A %*% t(P)))

Going back to the derivation of PCA, we have N = PX, where N is our new data. Since we know that P^-1 = P^T, it is easy to see that X = P^T N. Thus, if we know P and N, we can easily recover X. This is useful because, if we choose to throw away some of the smaller components (which hopefully are just noise anyway), N is a smaller dataset than X, but we can still reconstruct data that is almost the same as X.
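This recovery is easy to check on a small simulated matrix (the data below is made up for illustration; recorded.data from the tutorial works the same way):

```r
# Simulated data: 3 variables (rows) x 50 observations (columns), made up for illustration.
set.seed(1)
Xoriginal <- matrix(rnorm(150), nrow = 3)
rm <- rowMeans(Xoriginal)
X <- Xoriginal - matrix(rep(rm, ncol(Xoriginal)), nrow = nrow(Xoriginal))

A <- X %*% t(X)
P <- t(eigen(A, TRUE)$vectors)   # rows of P are the principal components
N <- P %*% X                     # the new data

# Because P is orthogonal, P^-1 = P^T, so X is recovered exactly:
Xrec <- t(P) %*% N
max(abs(Xrec - X))               # ~ 0, up to floating-point error

# Keeping only the first component gives an approximate reconstruction of X:
Xapprox <- t(P[1, , drop = FALSE]) %*% N[1, , drop = FALSE]
```

The approximate reconstruction has the same dimensions as X, but was built from a third of the numbers.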

Run prcomp() again, but this time include the option tol=0.1. What is returned will be
any principal components whose standard deviation is greater than 10% of the standard
deviation of the first principal component. In this case, the first two components are
returned.

pr=prcomp(recorded.data)
pr
plot(pr)
barplot(pr$sdev/pr$sdev[1])
pr2=prcomp(recorded.data, tol=.1)
plot.ts(pr2$x)
quartz(); plot.ts(intensities)
quartz(); plot.ts(recorded.data)
quartz(); plot.ts(cbind(-1*pr2$x[,1],pr2$x[,2]))

Because prcomp() works with variables in columns instead of rows as in the derivation above, the required transformation is X = N P^T, or in R syntax, X = pr$x %*% t(pr$rotation). Run the following code:

od = pr$x %*% t(pr$rotation)
od2 = pr2$x %*% t(pr2$rotation)
quartz(); plot.ts(recorded.data)
quartz(); plot.ts(od)
quartz(); plot.ts(od2)

You can see that od, the reconstruction of X when no principal components were discarded, is identical to recorded.data. od2 is the reconstruction of X from only two principal components.

Monday, March 10, 2014

Unity Ubuntu Launcher for STS (Spring Tool Suite)

me@home:~/.local/share/applications$ cat sts.desktop
[Desktop Entry]
Name=STS
GenericName=STS
Comment=Spring Source Tool Suite
Exec=/home/me/springsource/sts-3.4.0.RELEASE/STS
Icon=/home/me/springsource/sts-3.4.0.RELEASE/icon.xpm
Terminal=false
Type=Application
Categories=Development;Programming;
X-SuSE-translate=false

Then search for "sts" in the Unity application launcher and drag the icon to the sidebar.

Saturday, March 8, 2014

Friendship and procrastination.

Nick does what many of us don’t do because we feel that we must say more, make it a bigger deal. But as Burkeman suggests, here’s what happens, "the crucial work of nurturing friendships falls into a familiar procrastinatory black hole: precisely because it matters, you postpone it until you can give it the attention it deserves, which often means never."

http://us5.campaign-archive1.com/?u=24914cf8afc0c718f3289c278&id=c971894fa6&e=1a1f4d5833

Thursday, March 6, 2014

ObjectAid UML Explorer

http://www.objectaid.com/

The ObjectAid UML Explorer is an agile and lightweight code visualization tool for the Eclipse IDE. It shows your Java source code and libraries in live UML class and sequence diagrams that automatically update as your code changes.

Scientific method: Statistical errors

http://www.nature.com/news/scientific-method-statistical-errors-1.14700?WT.mc_id=PIN_NatureNews

P values, the 'gold standard' of statistical validity, are not as reliable as many scientists assume.

Statisticians have pointed to a number of measures that might help. To avoid the trap of thinking about results as significant or not significant, for example, Cumming thinks that researchers should always report effect sizes and confidence intervals. These convey what a P value does not: the magnitude and relative importance of an effect.
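Both of Cumming's suggestions are one-liners in R; the two "groups" below are simulated purely for illustration:

```r
# Made-up example data: two groups of 30 measurements each.
set.seed(42)
a <- rnorm(30, mean = 10.0, sd = 2)
b <- rnorm(30, mean = 11.2, sd = 2)

tt <- t.test(b, a)           # Welch two-sample t-test
tt$p.value                   # the p-value alone says nothing about magnitude
tt$conf.int                  # 95% confidence interval for the difference in means
mean(b) - mean(a)            # effect size on the original measurement scale

# Standardized effect size (Cohen's d, using the pooled standard deviation):
sp <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
           (length(a) + length(b) - 2))
d <- (mean(b) - mean(a)) / sp
```

Reporting the interval and d alongside (or instead of) the p-value tells the reader how big the effect is, not just whether it cleared an arbitrary threshold.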

Many statisticians also advocate replacing the P value with methods that take advantage of Bayes' rule: an eighteenth-century theorem that describes how to think about probability as the plausibility of an outcome, rather than as the potential frequency of that outcome. This entails a certain subjectivity — something that the statistical pioneers were trying to avoid. But the Bayesian framework makes it comparatively easy for observers to incorporate what they know about the world into their conclusions, and to calculate how probabilities change as new evidence arises.
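As a toy illustration of the Bayesian point (the numbers below are assumptions for the sake of the example, not from the article): suppose only 10% of tested hypotheses are true, studies have 80% power, and the false-positive rate is 5%. Bayes' rule then gives the probability that a "significant" result reflects a real effect:

```r
# Bayes' rule applied to significance testing (illustrative numbers).
prior <- 0.10   # prior probability that the tested effect is real
power <- 0.80   # P(p < 0.05 | effect is real)
alpha <- 0.05   # P(p < 0.05 | no effect), the false-positive rate

posterior <- (power * prior) / (power * prior + alpha * (1 - prior))
posterior   # 0.64: a "significant" result still leaves a 36% chance of no effect
```

New evidence simply updates the prior: feeding this posterior back in as the prior for a replication shows how probabilities change as results accumulate.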

A related idea that is garnering attention is two-stage analysis, or 'preregistered replication', says political scientist and statistician Andrew Gelman of Columbia University in New York City. In this approach, exploratory and confirmatory analyses are approached differently and clearly labelled. Instead of doing four separate small studies and reporting the results in one paper, for instance, researchers would first do two small exploratory studies and gather potentially interesting findings without worrying too much about false alarms. Then, on the basis of these results, the authors would decide exactly how they planned to confirm the findings, and would publicly preregister their intentions in a database such as the Open Science Framework (https://osf.io). They would then conduct the replication studies and publish the results alongside those of the exploratory studies. This approach allows for freedom and flexibility in analyses, says Gelman, while providing enough rigour to reduce the number of false alarms being published.

On the scalability of statistical procedures: why the p-value bashers just don't get it.

Of particular interest, given the NIH Director's recent comments on reproducibility, is our course on Reproducible Research. There are also many more specialized resources, very good and widely available, that will build on the base we created with the data science specialization.
  1. For scientific software engineering/reproducibility: Software Carpentry.
  2. For data analysis in genomics: Rafa's Data Analysis for Genomics Class.
  3. For Python and computing: The Fundamentals of Computing Specialization.