Paul


SVD for PCA: The rightmost rotation matrix
Posted:
Oct 28, 2012 8:28 PM


My apologies if this appears twice. The posting of this message seems to have been held up.
I am trying to understand SVD in the context of PCA. I have looked at Leskovec (http://www.cs.cmu.edu/~guestrin/Class/10701S06/Handouts/recitations/recitationpca_svd.ppt) and Shlens (http://www.snl.salk.edu/~shlens/pca.pdf) for intuition.
The scenario I use is a lab experiment in which m sensors synchronously sample data at n points in time, yielding a data matrix X with m rows and n columns. Each row contains the readings from a single sensor/instrument, and each column contains the readings from one instant in time. I suppose that the rows could also be key words in a data mining exercise, and the columns could be documents in which we try to find these key words (as per Leskovec above), but that scenario is a bit foggier for me because it deals with "concepts", the number of which matches neither m nor n. So as a first step, I'll stick with the lab sensor/instrument scenario. Also, consider only real data, so the data covariance matrices are symmetric and hence diagonalizable with orthonormal eigenvectors, corresponding to simple rotations of the data in m-space.
http://en.wikipedia.org/wiki/Principal_component_analysis#Details diagonalizes the data set X by factoring it into X=(W)(Sigma)(Vt) where:
* For W, the columns of this (m)x(m) matrix are the orthonormal eigenvectors of the covariance matrix (X)(Xt). Strictly, (X)(Xt) is the covariance only up to a factor of 1/(n-1), and only if each row of X has had its mean removed.
* Specifically, (X)(Xt) contains the covariances from pairing the m sensors/instruments rather than from pairing the n samples of m measurements. The former is of interest to us, while for the life of me I can't see the relevance of the latter.
* Xt is the transpose of X.
* Vt is the transpose of the (n)x(n) matrix V. The columns of V are the orthonormal eigenvectors of the covariance matrix (Xt)(X), specifically the covariances from pairing the n samples of m measurements. The relevance of this matrix is what I can't see (intuitively).
* Sigma is the (m)x(n) diagonal matrix of square roots of the eigenvalues of (X)(Xt), whose nonzero values are the same as those of (Xt)(X).
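To make the bullet points above concrete, here is a small numerical check, offered only as a sketch. It assumes NumPy, a made-up random data matrix with m = 3 sensors and n = 8 time samples, m <= n, and rows with their means removed. None of these choices come from the references; they are just for illustration.

```python
import numpy as np

# Hypothetical data: m = 3 sensors, n = 8 time samples (assumed values).
rng = np.random.default_rng(0)
m, n = 3, 8
X = rng.standard_normal((m, n))
X = X - X.mean(axis=1, keepdims=True)   # remove each sensor's mean

# Full SVD: W is (m, m), s holds the m singular values, Vt is (n, n).
W, s, Vt = np.linalg.svd(X, full_matrices=True)
V = Vt.T

# Sigma is the (m, n) rectangular "diagonal" matrix (assumes m <= n).
Sigma = np.zeros((m, n))
Sigma[:m, :m] = np.diag(s)
X_rebuilt = W @ Sigma @ Vt

# Eigenvalues of both products, sorted in descending order.
evals_left = np.sort(np.linalg.eigvalsh(X @ X.T))[::-1]    # m values
evals_right = np.sort(np.linalg.eigvalsh(X.T @ X))[::-1]   # n values

# Eigenvalues of (Xt)(X): the squared singular values padded with zeros.
lam_right = np.concatenate([s**2, np.zeros(n - m)])

print(np.allclose(X_rebuilt, X))                 # X = W Sigma Vt
print(np.allclose(X @ X.T @ W, W * s**2))        # columns of W: eigenvectors of X Xt
print(np.allclose(X.T @ X @ V, V * lam_right))   # columns of V: eigenvectors of Xt X
```

The two products share their nonzero eigenvalues (the squared singular values). The n x n product simply has n - m extra eigenvalues that are zero up to rounding.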
I am trying to eke out some intuition from X=(W)(Sigma)(Vt). I find it curious and interesting that the covariances (X)(Xt) are viewed as a linear transformation, and the eigenvectors in W become the orthogonal directions in which the scalings differ. Hence, they form the basis vectors that are aligned with the principal components. Then it becomes obvious that Sigma is simply the anisotropic axial scaling.
If X is viewed as some kind of linear transformation (and I'm not sure if I'm actually supposed to do that), then Vt can be seen as a rotation so that the 1st principal component aligns with the 1st axis, the 2nd principal component aligns with the 2nd, etc., prior to the scaling by Sigma. Finally, I would expect W to rotate the data back to its original orientation, thus yielding X on the LHS.
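The rotate, scale, rotate reading of X as a linear map can be checked numerically. The following is only a sketch under assumed values (NumPy, a random 2 x 5 matrix X, and an arbitrary test vector c in n-space, all hypothetical). Note that the orthogonal factors returned by an SVD routine may include reflections, so "rotation" is meant loosely here.

```python
import numpy as np

# Hypothetical sizes: X maps vectors from n-space (n = 5) to m-space (m = 2).
rng = np.random.default_rng(1)
m, n = 2, 5
X = rng.standard_normal((m, n))
W, s, Vt = np.linalg.svd(X, full_matrices=True)

c = rng.standard_normal(n)   # an arbitrary vector in n-space

step1 = Vt @ c               # re-express c along the right singular vectors
step2 = s * step1[:m]        # scale the first m coordinates, drop the other n - m
step3 = W @ step2            # rotate the result into the original m-space axes

print(np.allclose(step3, X @ c))   # the three steps reproduce X applied to c

# Each right singular vector is sent to a scaled left singular vector.
maps_match = all(np.allclose(X @ Vt[i], s[i] * W[:, i]) for i in range(m))
print(maps_match)
```

In this picture Vt acts on the inputs of X (vectors with one entry per time sample), while W acts on the outputs (vectors with one entry per sensor).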
Following Shlens's tutorial, I find the above picture is easier to see if we rewrite the SVD formula as (Wt)(X)=(Sigma)(Vt), where the /rows/ of Wt are the eigenvectors of the covariance (X)(Xt) between sensors/instruments. Treating them as basis vectors, multiplying them by the columns of X simply projects the m-valued sample from each measurement instant onto the principal components, which yields the rotation of the data points so that the principal components align with the axes. Conversely, X=(W)[(Sigma)(Vt)] takes the data points in the rotated state (principal components aligned with axes) and unrotates them so that they match the orientation of the measured data points.
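The rewritten formula (Wt)(X)=(Sigma)(Vt) can be tried on numbers. This is a sketch only, assuming NumPy and hypothetical correlated readings from m = 3 sensors over n = 200 instants; the mixing matrix is made up purely to create correlation between sensors.

```python
import numpy as np

# Hypothetical correlated sensor readings (the mixing matrix is an assumption).
rng = np.random.default_rng(2)
m, n = 3, 200
mixing = np.array([[2.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.5, 0.5, 0.2]])
X = mixing @ rng.standard_normal((m, n))
X = X - X.mean(axis=1, keepdims=True)   # remove each sensor's mean

# Thin SVD: W is (m, m), s has m entries, Vt is (m, n).
W, s, Vt = np.linalg.svd(X, full_matrices=False)

# Project every column of X (one measurement instant) onto the rows of Wt.
Y = W.T @ X

# Covariance of the rotated data: diagonal, with entries s**2 / (n - 1).
cov_Y = Y @ Y.T / (n - 1)

print(np.allclose(Y, s[:, None] * Vt))                 # (Wt)(X) = (Sigma)(Vt)
print(np.allclose(cov_Y, np.diag(s**2 / (n - 1))))     # rotated data is decorrelated
print(np.allclose(W @ Y, X))                           # W unrotates back to X
print(np.allclose(Vt, Y / s[:, None]))                 # rows of Vt: rescaled rows of Y
```

Read this way, row i of Vt is the time course of the data along the i-th principal direction, divided by the i-th singular value so that it has unit length. This follows directly from the formula in the paragraph above and is offered as one possible reading, not as the interpretation given in either reference.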
One of the most disturbing things I haven't been able to figure out is what V (or Vt) corresponds to in the real world. I mean, if X were a transformation, then Vt would simply be a rotation in n-space. But X *isn't* a transformation. And n-space is meaningless because we would never treat the vector of data from a single sensor as a data point (i.e., each measurement instant in time as a dimension) and plot it in n-dimensional space. So even though V or Vt somehow corresponds to a geometric rotation of sorts, it's in a space that is nonsensical and has no bearing on the real world.
I realize that Leskovec describes SVD differently, as documents versus search terms, with concepts as an intermediate thing that is determined by the SVD. The left and right singular vectors then represent the correlation of documents versus concepts and search terms versus concepts. However, he doesn't really delve into why the math corresponds to that. Also, I'm much more interested in the lab sensor/instrument scenario, where the size of the diagonal matrix corresponds to the size of the data set (at least before dimensional reduction).
So when I look at the mockingly simple SVD formula, I have developed a phobia of the mysterious rotation matrix at the tail end. It has defied my endless attempts (no joke) to understand it intuitively. Thank you to anyone who can impart some clear intuition about this.

