Monday, February 28, 2011

a tutorial on PCA / SVD

"Clustering of spatial gene expression patterns in the mouse brain
and comparison with classical neuroanatomy"

http://grass.osgeo.org/wiki/Principal_Components_Analysis


The SVD is a decomposition of any p x q matrix M into a product M = USVt, where U and V have orthonormal columns (UtU = VtV = I) and S is a diagonal matrix with non-negative real entries. Here, U is a p x q matrix, and S and V are q x q matrices. The columns of U and V are known as the left and right singular vectors, respectively, and entries along the diagonal of S are known as singular values. Note that when M is centered (column means are zero), the right singular vectors are eigenvectors of MtM, the left singular vectors are eigenvectors of MMt, and the square of a singular value is proportional to the variance along the corresponding eigenvector. Therefore, a projection of the data matrix M onto the d-dimensional subspace with the largest variance may be obtained using MV = US, retaining only the d largest singular values and the corresponding singular vectors.
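A quick numeric check of these relationships in R (a sketch on random data; variable names are illustrative):

# verify the SVD/eigendecomposition relationships described above
set.seed(1)
M <- scale(matrix(rnorm(200), nrow = 20), center = TRUE, scale = FALSE) #centered 20 x 10 matrix
s <- svd(M) #M = U S T(V): s$u, s$d, s$v
eV <- eigen(t(M) %*% M) #right singular vectors = eigenvectors of MtM
eU <- eigen(M %*% t(M)) #left singular vectors = eigenvectors of MMt
all.equal(s$d^2, eV$values) #squared singular values = eigenvalues
all.equal(abs(s$v), abs(eV$vectors)) #TRUE up to sign
d <- 2 #project onto the d directions of largest variance
scores <- M %*% s$v[, 1:d] #same as s$u[, 1:d] %*% diag(s$d[1:d])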

http://public.lanl.gov/mewall/kluwer2002.html

http://genome-www.stanford.edu/SVD/

pca.narod.ru/pcaclustclass.pdf


General properties of principal components
– linear combinations of the original variables
– uncorrelated with each other

Summary
• Dimension reduction important to visualize data
– Principal Component Analysis
– Clustering
• Hierarchical
• Partitioning (K-means)
(distance measure important; see the sketch below)
• Classification
– Reduction of dimension often necessary (t-test, PCA)
– Several classification methods available
– Validation
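A minimal R sketch combining these steps (PCA for visualization, then k-means on the scores; the iris data is just a stand-in):

# dimension reduction with PCA, then partitioning (k-means) in the reduced space
dat <- scale(iris[, 1:4]) #center and scale the variables
pc <- prcomp(dat)
scores <- pc$x[, 1:2] #coordinates on the first two principal components
set.seed(42)
km <- kmeans(scores, centers = 3) #Euclidean distance; the distance measure matters
plot(scores, col = km$cluster, pch = 19, main = "k-means clusters in PCA space")
hc <- hclust(dist(scores)) #hierarchical alternative on the same scores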




Linear Algebra
http://pillowlab.cps.utexas.edu/teaching/CompNeuro10/schedule.html


Data matrix A: rows = data points, columns = variables (attributes, parameters).
1. Center the data by subtracting the mean of each column.
2. Compute the SVD of the centered matrix Â (or just the first k singular values and vectors): Â = U S T(V).
3. The principal components are the columns of V; the coordinates of the data in the basis defined by the principal components are U S.


%Data matrix A, columns: variables, rows: data points.
%MATLAB function for computing the first k principal components of A.
function [pc,score]=pca(A,k)
[rows,cols]=size(A);
Ameans=repmat(mean(A,1),rows,1); %each row holds the column means
A=A-Ameans;                      %center the data
[U,S,V]=svds(A,k);               %k is the number of PCs desired
pc=V;
score=U*S;                       %now A = score*pc' + Ameans


The variance in the direction of the kth principal component is given by the square of the corresponding singular value, s_k^2.
Singular values can thus be used to estimate how many principal components to keep.
Rule of thumb: keep enough to explain 85% of the variation:
http://www.uta.edu/faculty/rcli/Teaching/math5392/NotesByHyvonen/lecture5.pdf
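In R this rule of thumb might look like the following (a sketch using prcomp on an arbitrary built-in dataset):

# keep enough principal components to explain ~85% of the variance
pc <- prcomp(USArrests, scale. = TRUE)
var_explained <- pc$sdev^2 / sum(pc$sdev^2) #sdev^2 are the eigenvalues
cum_var <- cumsum(var_explained)
k <- which(cum_var >= 0.85)[1] #smallest k reaching 85%
cum_var; k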


http://www.ncbi.nlm.nih.gov/pubmed/10963673

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.115.3503&rep=rep1&type=pdf

A Tutorial on Principal Component Analysis. Jonathon Shlens, Systems Neurobiology Laboratory, Salk Institute for Biological Studies.

PCA = eigendecomposition of a square matrix (the covariance matrix)
SVD = works on any rectangular matrix; a more general route to PCA

PCA can fail if the data are very "non-Gaussian" – it assumes that the interesting directions are along straight lines, and that they are orthogonal to each other.

PCA is non-parametric; the most important components are the ones with the largest variance.

prcomp(dat) – calls svd() on the centered data; gives you sdev (square roots of the eigenvalues) and rotation (columns are the eigenvectors, a.k.a. loadings)

http://genetics.agrsci.dk/statistics/courses/Rcourse-DJF2006/day3/PCA-computing.pdf 

biplot(prcomp(USArrests, scale = TRUE))
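To see the prcomp/svd correspondence claimed above, one can compare the two directly (a sketch; signs of the vectors may differ):

# prcomp is essentially an SVD of the centered, scaled data
X <- scale(USArrests) #center and scale, as with scale = TRUE
pc <- prcomp(X)
s <- svd(X)
all.equal(abs(pc$rotation), abs(s$v), check.attributes = FALSE) #loadings
all.equal(pc$sdev, s$d / sqrt(nrow(X) - 1)) #sdev from singular values
all.equal(abs(pc$x), abs(s$u %*% diag(s$d)), check.attributes = FALSE) #scores = U S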


library(limma)
# Test each gene for association with the first principal component.
# PC1 is assumed to be a column of pData(esetr) holding the samples' PC1 scores.
mm <- model.matrix(~PC1, pData(esetr))
fit <- lmFit(esetr, mm) #fit a linear model for each gene given a series of arrays
fit <- eBayes(fit) #moderated t-statistics and log-odds of differential expression, via empirical Bayes shrinkage of the standard errors towards a common value
topTable(fit) #extract a table of the top-ranked genes from the linear model fit


PCA for correcting batch effects
In ideal circumstances, with very consistent data, we expect all data points to form a single, cohesive grouping in this type of plot. We also expect that any observed clustering will not be related to the primary phenotype. If there is any clustering of cases and controls, this is usually indicative of batch effects or other systematic differences in the generation of the data, and it may cause problems in association testing.
http://chemtree.com/SNP_Variation/tutorials/cnv-quality-control/pca.html
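A generic sketch of such a diagnostic plot in R (expr and status are placeholders for your own expression matrix and case/control factor):

# PCA diagnostic: samples as points, colored by the primary phenotype;
# clustering of cases vs. controls hints at batch effects
pc <- prcomp(t(expr), scale. = TRUE) #expr assumed genes x samples, so transpose
plot(pc$x[, 1], pc$x[, 2], col = as.integer(status), pch = 19,
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(status),
       col = seq_along(levels(status)), pch = 19)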
 
http://www.puffinwarellc.com/index.php/news-and-articles/articles/30-singular-value-decomposition-tutorial.html?start=2

 http://www.miislita.com/information-retrieval-tutorial/reduced-svd.gif
 
http://spinner.cofc.edu/~langvillea/DISSECTION-LAB/Emmie%27sLSI-SVDModule/p4module.html 
 
http://www.cbs.dtu.dk/chipcourse/Exercises/Ex_Stat/NormStatEx.html 
 
T(V) = V transpose
 
X = U S T(V) = s1 u1 T(v1) + s2 u2 T(v2) + ∙∙∙ + sr ur T(vr),

where U = (u1, u2, ..., ur), V = (v1, v2, ..., vr), and S = diag{s1, s2, ..., sr} with s1 ≥ s2 ≥ ∙∙∙ ≥ sr > 0. The singular columns {ui} form an orthonormal basis for the column space of X (spanned by its columns {cj}), and the singular rows {vj} form an orthonormal basis for the row space (spanned by its rows {ri}). The vectors {ui} and {vi} are called singular columns and singular rows, respectively (Gabriel and Odoroff 1984); the scalars {si} are called singular values; and the matrices si ui T(vi) (i = 1, ..., r) are referred to as SVD components.
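A small numeric check of this expansion (arbitrary random matrix):

# X equals the sum of its SVD components s_i u_i T(v_i)
set.seed(7)
X <- matrix(rnorm(12), nrow = 4)
s <- svd(X)
components <- lapply(seq_along(s$d), function(i) s$d[i] * s$u[, i] %*% t(s$v[, i]))
all.equal(X, Reduce(`+`, components)) #TRUE up to floating point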



 
Image Compression
http://www.johnmyleswhite.com/notebook/2009/12/17/image-compression-with-the-svd-in-r/ 
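In the same spirit as that post, a self-contained sketch using R's built-in volcano matrix as a stand-in for a grayscale image:

# low-rank "compression": keep only the top k SVD components
img <- volcano #87 x 61 matrix of elevations
s <- svd(img)
k <- 5
approx_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])
par(mfrow = c(1, 2))
image(img, col = gray.colors(64), main = "original")
image(approx_k, col = gray.colors(64), main = paste("rank", k))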

http://n0b3l1a.blogspot.ca/2010/09/pca-principal-component-analysis.html
