Gary
Posts: 73
Registered: 9/6/07

Re: SVD for PCA: The right most rotation matrix
Posted: Jan 4, 2013 4:19 PM


On Monday, 29 October 2012 02:28:37 UTC+2, Paul wrote:
> My apologies if this appears twice. The posting of this message seems to have been held up.
>
> I am trying to understand SVD in the context of PCA. I have looked at Leskovec (http://www.cs.cmu.edu/~guestrin/Class/10701S06/Handouts/recitations/recitationpca_svd.ppt) and Shlens (http://www.snl.salk.edu/~shlens/pca.pdf) for intuition.
>
> The scenario I use is a lab experiment in which m sensors synchronously sample data at n points in time, yielding a data matrix X with m rows and n columns. Each row contains the readings from a single sensor/instrument, and each column contains the readings from an instant in time. I suppose that the rows could also be key words in a data-mining exercise, and the columns could be documents in which we try to find these key words (as per Leskovec above), but that scenario is a bit foggier for me because it deals with "concepts", the number of which matches neither m nor n. So as a first step, stick with the lab sensor/instrument scenario. Also, consider only real data, so the data covariance matrices are diagonalizable, with orthonormal eigenvectors corresponding to simple rotations of the data in m-space.
>
> http://en.wikipedia.org/wiki/Principal_component_analysis#Details diagonalizes the data set X by factoring it into X = (W)(Sigma)(Vt), where:
>
> * For W, the columns of this (m)x(m) matrix are the orthonormal eigenvectors of the covariance matrix (X)(Xt) {Xt is the transpose of X}.
>
> * Specifically, (X)(Xt) contains the covariances from pairing the m sensors/instruments rather than from pairing the n samples of m measurements. The former is of interest to us, while for the life of me, I can't see the relevance of the latter.
>
> * Xt is the transpose of X.
>
> * Vt is the transpose of the (n)x(n) matrix V. The columns of V are the orthonormal eigenvectors of the covariance matrix (Xt)(X), specifically, the covariances from pairing the n samples of m measurements. The relevance of this matrix is what I can't see (intuitively).
>
> * Sigma is the diagonal matrix of square roots of the eigenvalues of (X)(Xt), which are the same as those of (Xt)(X).
>
> I am trying to eke out some intuition from X = (W)(Sigma)(Vt). I find it curious and interesting that the covariances (X)(Xt) are viewed as a linear transformation, and the eigenvectors in W become the orthogonal directions in which the scalings differ. Hence, they form the basis vectors that are aligned with the principal components. Then it becomes obvious that Sigma is simply the anisotropic axial scaling.
>
> If X is viewed as some kind of linear transformation (and I'm not sure if I'm actually supposed to do that), then Vt can be seen as a rotation so that the 1st principal component aligns with the 1st axis, the 2nd principal component aligns with the 2nd, etc., prior to the scaling by Sigma. Finally, I would expect W to rotate the data back to its original orientation, thus yielding X on the LHS.
>
> Following Shlens's tutorial, I find the above picture easier to see if we rewrite the SVD formula as (Wt)(X) = (Sigma)(Vt), where the /rows/ of Wt are the eigenvectors of the covariance (X)(Xt) between sensors/instruments. Treating them as basis vectors, then multiplying them by the columns of X simply projects the m-value samples from each measurement instance onto the principal components, which yields the rotation of the data points so that the principal components align with the axes. Conversely, X = (W)[(Sigma)(Vt)] takes the data points in the rotated state (principal components aligned with the axes) and unrotates them so that they match the orientation of the measured data points.
>
> One of the most disturbing things I haven't been able to figure out is what V (or Vt) corresponds to in the real world. I mean, if X were a transformation, then Vt would simply be a rotation in n-space. But X *isn't* a transformation. And n-space is meaningless, because we would never treat the vector of data from a single sensor as a data point (i.e., each measurement instance in time as a dimension) and plot it in n-dimensional space. So even though V or Vt somehow corresponds to a geometric rotation of sorts, it is in a space that is nonsensical and has no bearing on the real world.
>
> I realize that Leskovec describes SVD differently, as documents versus search terms, with concepts as an intermediate thing that is determined by the SVD. The left and right singular vectors then represent the correlation of documents versus concepts and of search terms versus concepts. However, he doesn't really delve into why the math corresponds to that. Also, I'm much more interested in the lab sensor/instrument scenario, where the size of the diagonal matrix corresponds to the size of the data set (at least before dimensional reduction).
>
> So when I look at the mockingly simple SVD formula, I find I have developed a phobia of the mysterious rotation matrix at the tail end. It has defied my endless attempts (no joke) to understand it intuitively. Thank you, anyone, for imparting some clear intuition to this.
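Before the reference, one remark on the Vt question: the relationships are easy to check numerically, which is often where the intuition finally clicks. Below is a minimal sketch (my own illustration, assuming NumPy and a made-up 3-sensor by 5-sample, mean-centred data matrix X; it is not taken from any of the references above). It verifies that the squared singular values are the eigenvalues of the sensor covariance (X)(Xt), that (Wt)(X) = (Sigma)(Vt) is the data re-expressed in principal-component coordinates, and that each row of Vt is therefore just the time course of one principal component rescaled to unit length.

import numpy as np

rng = np.random.default_rng(0)
m, n = 3, 5                            # m sensors, n time samples (made-up sizes)
X = rng.standard_normal((m, n))
X -= X.mean(axis=1, keepdims=True)     # centre each sensor's readings

# SVD: X = W @ diag(s) @ Vt  (W comes out m x m here because m <= n)
W, s, Vt = np.linalg.svd(X, full_matrices=False)

# 1) Squared singular values = eigenvalues of the sensor covariance (X)(Xt).
evals = np.linalg.eigvalsh(X @ X.T)
print(np.allclose(np.sort(s**2), np.sort(evals)))          # True

# 2) (Wt)(X) projects each time sample (a point in m-space) onto the
#    principal axes, and that projection equals (Sigma)(Vt).
scores = W.T @ X
print(np.allclose(scores, np.diag(s) @ Vt))                # True

# 3) So row i of Vt is the scores along principal component i divided by
#    s[i]: the normalised "time course" of that component.
print(np.allclose(Vt, scores / s[:, None]))                # True

Read that way, V does not rotate anything physical in n-space; each row of Vt is simply the normalised time series of one principal component (how strongly each measurement instant loads on that component), while Sigma carries the amplitudes.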
You have been given a lot of references, but I didn't see the one below, so I will mention it here:
Stanley Mulaik, "Foundations of Factor Analysis" (second edition), Chapman & Hall/CRC (Taylor & Francis Group), 2010.
Lance

