Thursday, March 12, 2015

Points of Significance in Biology - Nature column

http://www.nature.com/nmeth/journal/v10/n9/full/nmeth.2613.html

http://www.nature.com/collections/qghhqm

Since September 2013, Nature Methods has been publishing a monthly column on statistics aimed at providing researchers in biology with a basic introduction to core statistical concepts and methods, including experimental design. Although targeted at biologists, the articles are useful guides for researchers in other disciplines as well. A continuously updated list of these articles is provided below.

Importance of being uncertain - How samples are used to estimate population statistics and what this means in terms of uncertainty.

Error Bars - The use of error bars to represent uncertainty and advice on how to interpret them.

Significance, P values and t-tests - Introduction to the concept of statistical significance and the one-sample t-test.

Power and sample size - Using statistical power to optimize study design and sample numbers (a worked R example follows this list).

Visualizing samples with box plots - Introduction to box plots and their use for illustrating the spread of samples and the differences between them. See also: Kick the bar chart habit and BoxPlotR: a web tool for generation of box plots

Comparing samples—part I - How to use the two-sample t-test to compare either uncorrelated or correlated samples.

Comparing samples—part II - Adjustment and reinterpretation of P values when large numbers of tests are performed.

Nonparametric tests - Use of nonparametric tests to robustly compare skewed or ranked data.

Designing comparative experiments - The first of a series of columns that tackle experimental design shows how a paired design achieves sensitivity and specificity requirements despite biological and technical variability.

Analysis of variance and blocking - Introduction to ANOVA and the importance of blocking in good experimental design to mitigate experimental error and the impact of factors not under study.

Replication - Technical replication reveals technical variation while biological replication is required for biological inference.

Nested designs - Use ANOVA to estimate the relative noise contribution of each layer in a nested experimental design and to allocate experimental resources optimally.

Two-factor designs - It is common in biological systems for multiple experimental factors to produce interacting effects. A study design that accounts for these interactions can increase sensitivity.

Sources of variation - To generalize experimental conclusions to a population, it is critical to sample its variation while using experimental control, randomization, blocking and replication to collect replicable and meaningful results.
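The trade-offs described in the "Power and sample size" and "Comparing samples" columns can be explored directly in base R. The snippet below is my own sketch, not code from the columns; the effect size of one standard deviation and the 80% power target are illustrative choices.

  # Sketch: power / sample-size trade-offs for a two-sample t-test (base R)
  power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.80,
               type = "two.sample")          # solves for n per group (about 17)
  power.t.test(n = 3, delta = 1, sd = 1, sig.level = 0.05,
               type = "two.sample")$power    # power actually achieved with n = 3 per group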

Matrix factorization - R-bloggers

http://www.r-bloggers.com/testing-recommender-systems-in-r/

http://www.r-bloggers.com/matrix-factorization/

Notes from the post: latent-factor matrix factorization, singular value decomposition (SVD), and R's optim() function for the optimization.
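The idea, as I read it: represent each user and each item by a short vector of latent factors and fit those vectors by minimizing squared reconstruction error over the observed entries. The sketch below is mine, not the post's code; the ratings matrix, the rank k and the penalty lambda are made up for illustration, and a general-purpose optimizer (optim) stands in for the alternating or gradient updates a production recommender would use.

  # Minimal latent-factor matrix factorization via optim() (illustrative only)
  set.seed(1)
  R <- matrix(c(5, 3, NA, 1,
                4, NA, NA, 1,
                1, 1, NA, 5,
                1, NA, NA, 4,
                NA, 1, 5, 4), nrow = 5, byrow = TRUE)   # made-up ratings, NA = unobserved
  k <- 2                                   # number of latent factors (assumed)
  n <- nrow(R); m <- ncol(R)
  obs <- !is.na(R)                         # mask of observed ratings
  lambda <- 0.1                            # L2 penalty (assumed)
  loss <- function(par) {
    U <- matrix(par[1:(n * k)], n, k)      # user factors
    V <- matrix(par[-(1:(n * k))], m, k)   # item factors
    pred <- U %*% t(V)
    sum((R[obs] - pred[obs])^2) + lambda * sum(par^2)
  }
  fit <- optim(runif(n * k + m * k, -0.1, 0.1), loss, method = "BFGS",
               control = list(maxit = 500))
  U <- matrix(fit$par[1:(n * k)], n, k)
  V <- matrix(fit$par[-(1:(n * k))], m, k)
  round(U %*% t(V), 2)                     # reconstructed / predicted ratings

For a fully observed matrix, base R's svd() gives the decomposition directly; the optimization route is what lets you skip over the missing entries.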

In-depth introduction to machine learning in 15 hours of expert videos
http://www.r-bloggers.com/in-depth-introduction-to-machine-learning-in-15-hours-of-expert-videos/

Wednesday, March 11, 2015

The Elements of Programming Style

http://en.wikipedia.org/wiki/The_Elements_of_Programming_Style

  1. Write clearly -- don't be too clever.
  2. Say what you mean, simply and directly.
  3. Use library functions whenever feasible.
  4. Avoid too many temporary variables.
  5. Write clearly -- don't sacrifice clarity for efficiency.
  6. Let the machine do the dirty work.
  7. Replace repetitive expressions by calls to common functions.
  8. Parenthesize to avoid ambiguity.
  9. Choose variable names that won't be confused.
  10. Avoid unnecessary branches.
  11. If a logical expression is hard to understand, try transforming it.
  12. Choose a data representation that makes the program simple.
  13. Write first in easy-to-understand pseudo language; then translate into whatever language you have to use.
  14. Modularize. Use procedures and functions.
  15. Avoid gotos completely if you can keep the program readable.
  16. Don't patch bad code -- rewrite it.
  17. Write and test a big program in small pieces.
  18. Use recursive procedures for recursively-defined data structures.
  19. Test input for plausibility and validity.
  20. Make sure input doesn't violate the limits of the program.
  21. Terminate input by end-of-file marker, not by count.
  22. Identify bad input; recover if possible.
  23. Make input easy to prepare and output self-explanatory.
  24. Use uniform input formats.
  25. Make input easy to proofread.
  26. Use self-identifying input. Allow defaults. Echo both on output.
  27. Make sure all variables are initialized before use.
  28. Don't stop at one bug.
  29. Use debugging compilers.
  30. Watch out for off-by-one errors.
  31. Take care to branch the right way on equality.
  32. Be careful if a loop exits to the same place from the middle and the bottom.
  33. Make sure your code does "nothing" gracefully.
  34. Test programs at their boundary values.
  35. Check some answers by hand.
  36. 10.0 times 0.1 is hardly ever 1.0.
  37. 7/8 is zero while 7.0/8.0 is not zero.
  38. Don't compare floating point numbers solely for equality. (Rules 36-38 are illustrated in a short R example after this list.)
  39. Make it right before you make it faster.
  40. Make it fail-safe before you make it faster.
  41. Make it clear before you make it faster.
  42. Don't sacrifice clarity for small gains in efficiency.
  43. Let your compiler do the simple optimizations.
  44. Don't strain to re-use code; reorganize instead.
  45. Make sure special cases are truly special.
  46. Keep it simple to make it faster.
  47. Don't diddle code to make it faster -- find a better algorithm.
  48. Instrument your programs. Measure before making efficiency changes.
  49. Make sure comments and code agree.
  50. Don't just echo the code with comments -- make every comment count.
  51. Don't comment bad code -- rewrite it.
  52. Use variable names that mean something.
  53. Use statement labels that mean something.
  54. Format a program to help the reader understand it.
  55. Document your data layouts.
  56. Don't over-comment.
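Rules 36-38 concern floating-point arithmetic and are easy to demonstrate. The lines below are my own R illustration (note that R's / is always floating-point, so the integer-division trap of rule 37 shows up via %/%):

  x <- 0
  for (i in 1:10) x <- x + 0.1        # add 0.1 ten times (rule 36)
  x == 1                              # FALSE: accumulated rounding error
  0.1 + 0.2 == 0.3                    # FALSE for the same reason
  isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE: compare with a tolerance instead (rule 38)
  7L %/% 8L                           # 0: integer division truncates (rule 37)
  7 / 8                               # 0.875: floating-point division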

The Elements of Style

Vigorous writing is concise. A sentence should contain no unnecessary words, a paragraph no unnecessary sentences, for the same reason that a drawing should have no unnecessary lines and a machine no unnecessary parts. This requires not that the writer make all his sentences short, or that he avoid all detail and treat his subjects only in outline, but that he make every word tell.
—"Elementary Principles of Composition", The Elements of Style[10]

Research methods: Know when your numbers are significant

http://www.nature.com/nature/journal/v492/n7428/fig_tab/492180a_T1.html  
When N is only 2 or 3, it is more transparent simply to plot the independent data points and let readers interpret the data for themselves, rather than to show possibly misleading P values or error bars and draw statistical inferences.
All experimental biologists and all those who review their papers should know what sort of sampling errors are to be expected in common experiments, such as determining the percentages of live and dead cells or counting the number of colonies on a plate or cells in a microscope field. Otherwise, they will not be able to judge their own data critically, or anyone else's.

Table 1: Statistics glossary. Some common statistical concepts and their uses in analysing experimental results. (N, number of independent samples; t, the t-statistic; p, probability.)

Standard deviation (s.d.)
Meaning: The typical difference between each value and the mean value.
Common uses: Describing how broadly the sample values are distributed.
s.d. = √(∑(x − mean)²/(N − 1))

Standard error of the mean (s.e.m.)
Meaning: An estimate of how variable the means will be if the experiment is repeated multiple times.
Common uses: Inferring where the population mean is likely to lie, or whether sets of samples are likely to come from the same population.
s.e.m. = s.d./√N

Confidence interval (CI; 95%)
Meaning: With 95% confidence, the population mean will lie in this interval.
Common uses: To infer where the population mean lies, and to compare two populations.
CI = mean ± s.e.m. × t(N−1)

Independent data
Meaning: Values from separate experiments of the same type that are not linked.
Common uses: Testing hypotheses about the population.

Replicate data
Meaning: Values from experiments where everything is linked as much as possible.
Common uses: Serves as an internal check on the performance of an experiment.

Sampling error
Meaning: Variation caused by sampling part of a population rather than measuring the whole population.
Common uses: Can reveal bias in the data (if it is too small) or problems with the conduct of the experiment (if it is too big). In binomial distributions (such as live and dead cell counts) the expected s.d. is √(N × p × (1 − p)); in Poisson distributions (such as colony counts) the expected s.d. is √N.
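The glossary formulas are straightforward to check in R. Everything below is invented for illustration (five made-up measurements and a hypothetical cell-counting experiment):

  x <- c(4.1, 5.3, 4.8, 5.0, 4.6)                 # five made-up measurements
  N <- length(x)
  s <- sqrt(sum((x - mean(x))^2) / (N - 1))       # s.d. by the formula above; equals sd(x)
  sem <- s / sqrt(N)                              # s.e.m.
  ci <- mean(x) + c(-1, 1) * sem * qt(0.975, df = N - 1)   # 95% CI, t with N - 1 d.f.
  c(mean = mean(x), sd = s, sem = sem, CI.lo = ci[1], CI.hi = ci[2])

  # Expected sampling error for counts (binomial and Poisson cases from the last row)
  Ncells <- 100; p <- 0.8                         # hypothetical: 100 cells scored, 80% alive
  sqrt(Ncells * p * (1 - p))                      # expected s.d. of the live-cell count (4)
  sqrt(50)                                        # expected s.d. for ~50 colonies per plate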

Monday, March 9, 2015

Points of significance: Power and sample size

Martin Krzywinski & Naomi Altman. Nature Methods 10, 1139–1140 (2013). doi:10.1038/nmeth.2738

http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2738.html

Figure 3: Decreasing specificity increases power.
Inference errors and statistical power.

(a) Observations are assumed to be from the null distribution (H0) with mean μ0. We reject H0 for values larger than x* with an error rate α (red area). (b) The alternative hypothesis (HA) is the competing scenario with a different mean μA. Values sampled from HA smaller than x* do not trigger rejection of H0 and occur at a rate β. Power (sensitivity) is 1 − β (blue area). (c) Relationship of inference errors to x*. The color key is the same as in Figure 1.

Figure 4: Impact of sample (n) and effect size (d) on power.
 
H0 and HA are assumed normal with σ = 1. (a) Increasing n decreases the spread of the distribution of sample averages in proportion to 1/√n. Shown are scenarios at n = 1, 3 and 7 for d = 1 and α = 0.05. Right, power as a function of n at four different α values for d = 1. The circles correspond to the three scenarios. (b) Power increases with d, making it easier to detect larger effects. The distributions show effect sizes d = 1, 1.5 and 2 for n = 3 and α = 0.05. Right, power as a function of d at four different α values for n = 3.
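A rough numerical companion to these figures (my own sketch, assuming the column's setup of a one-sided test with known σ = 1 and μ0 taken as 0, so the sample mean is normal with s.d. 1/√n):

  power_z <- function(n, d, alpha = 0.05) {
    x_star <- qnorm(1 - alpha, mean = 0, sd = 1 / sqrt(n))  # rejection cutoff x* under H0
    1 - pnorm(x_star, mean = d, sd = 1 / sqrt(n))           # area of HA beyond x*, i.e. power
  }
  power_z(n = c(1, 3, 7), d = 1)      # power grows with n (cf. Fig. 4a)
  power_z(n = 3, d = c(1, 1.5, 2))    # and with effect size d (cf. Fig. 4b)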

Monday, January 12, 2015

Anaconda / Miniconda - Python package managers

http://docs.continuum.io/anaconda/index.html
http://conda.pydata.org/miniconda.html

Anaconda is a free collection of powerful packages for Python that enables large-scale data management, analysis, and visualization for Business Intelligence, Scientific Analysis, Engineering, Machine Learning, and more.