Topic: SVD for PCA: The right most rotation matrix
Replies: 22   Last Post: Jan 4, 2013 4:19 PM

 Paul Posts: 493 Registered: 2/23/10
Posted: Oct 28, 2012 8:28 PM

My apologies if this appears twice. The posting of this message seems
to have been held up.

I am trying to understand SVD in the context of PCA. I have looked at
Leskovec (http://search.yahoo.com/r/
_ylt=A0oG7t31r41QSHsAFA9XNyoA;_ylu=X3oDMTE0YmlrMDI5BHNlYwNzcgRwb3MDMQRjb2xvA2FjMgR2dGlkA01BUDAwNl83MQ--/
SIG=13fl10gvd/EXP=1351491701/**http%3a//www.cs.cmu.edu/~guestrin/Class/
10701-S06/Handouts/recitations/recitation-pca_svd.ppt) and Shlens
(http://search.yahoo.com/r/
_ylt=A0oG7t0dsI1Qj3oAG1ZXNyoA;_ylu=X3oDMTE0YmlrMDI5BHNlYwNzcgRwb3MDMQRjb2xvA2FjMgR2dGlkA01BUDAwNl83MQ--/
SIG=11r2sjgrs/EXP=1351491741/**http%3a//www.snl.salk.edu/~shlens/
pca.pdf) for intuition.

The scenario I use is a lab experiment in which m sensors
synchronously sample data at n points in time, yielding a data matrix
X with m rows and n columns. Each row contains the readings from a
single sensor/instrument, and each column contains the readings from
an instant in time. I suppose that the rows could also be key words
in a data mining exercise, and the columns could be the documents in
which we look for those key words (as per Leskovec above), but that
scenario is a bit foggier for me because it deals with "concepts", the
number of which matches neither m nor n. So as a first step, I will
stick with the lab sensor/instrument scenario. Also, consider only
real data, so that the covariance matrices are symmetric and therefore
diagonalizable with orthonormal eigenvectors, corresponding to simple
rotations of the data in m-space.

http://en.wikipedia.org/wiki/Principal_component_analysis#Details
diagonalizes the data set X by factoring it into X=(W)(Sigma)(Vt)
where:

* For W, the columns of this (m)x(m) matrix are the orthonormal
eigenvectors of covariance matrix (X)(Xt) {Xt is the transpose of X}.

* Specifically, (X)(Xt) contains the covariances from pairing the m
sensors/instruments rather than from pairing the n samples of m
measurements. The former is of interest to us, while for the life of
me, I can't see the relevance of the latter.

* Xt is the transpose of X.

* Vt is the transpose of the (n)x(n) matrix V. The columns of V are the
orthonormal eigenvectors of the covariance matrix (Xt)(X) --
specifically, the covariances from pairing the n samples of m
measurements. The relevance of this matrix is what I can't see
(intuitively).

* Sigma is the (m)x(n) diagonal matrix of square roots of the
eigenvalues of (X)(Xt), whose nonzero eigenvalues are the same as
those of (Xt)(X).
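
The properties listed above can be checked numerically. What follows is
only a minimal sketch, assuming NumPy and a small random matrix standing
in for the sensor data (the sizes, the seed and the variable names are
illustrative and not from the sources cited). Note that (X)(Xt) is a
covariance matrix only after each row's mean is removed and only up to a
factor of 1/(n-1); that factor rescales the eigenvalues but does not
change the eigenvectors.

```python
import numpy as np

# Hypothetical data: m = 3 sensors sampled at n = 8 instants (random values).
rng = np.random.default_rng(0)
m, n = 3, 8
X = rng.standard_normal((m, n))
X = X - X.mean(axis=1, keepdims=True)  # assumption: each sensor's mean is removed

# Full SVD: W is m x m, Vt is n x n, s holds the min(m, n) singular values.
W, s, Vt = np.linalg.svd(X, full_matrices=True)
V = Vt.T

# Sigma is m x n with the singular values on its main diagonal.
k = min(m, n)
Sigma = np.zeros((m, n))
Sigma[:k, :k] = np.diag(s)

# Eigenvalues of (Xt)(X): the k squared singular values, padded with zeros.
lam_V = np.concatenate([s**2, np.zeros(n - k)])

# X = (W)(Sigma)(Vt)
reconstructs = np.allclose(W @ Sigma @ Vt, X)
# Columns of W are eigenvectors of (X)(Xt) with eigenvalues s**2 (uses m <= n).
W_is_eigvecs = np.allclose(X @ X.T @ W, W * s**2)
# Columns of V are eigenvectors of (Xt)(X) with eigenvalues lam_V.
V_is_eigvecs = np.allclose(X.T @ X @ V, V * lam_V)

print(reconstructs, W_is_eigvecs, V_is_eigvecs)
```

Here `W * s**2` scales column j of W by the j-th squared singular value,
which is what (X)(Xt) should do to its j-th eigenvector.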

I am trying to eke out some intuition from X=(W)(Sigma)(Vt). I find
it curious and interesting that the covariances (X)(Xt) are viewed as
a linear transformation, and the eigenvectors in W become the
orthogonal directions in which the scalings differ. Hence, they form
the basis vectors that are aligned with the principal components.
Then it becomes obvious that Sigma is simply the anisotropic axial
scaling.
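
The link between the factorization and the two eigen-decompositions can
be written out by substituting X=(W)(Sigma)(Vt) into each product and
using only the fact that W and V are orthogonal. This is a worked
restatement of the definitions above, not an additional result:

```latex
\begin{aligned}
X X^{T} &= (W \Sigma V^{T})(W \Sigma V^{T})^{T}
         = W \Sigma V^{T} V \Sigma^{T} W^{T}
         = W (\Sigma \Sigma^{T}) W^{T}, \\
X^{T} X &= (W \Sigma V^{T})^{T}(W \Sigma V^{T})
         = V \Sigma^{T} W^{T} W \Sigma V^{T}
         = V (\Sigma^{T} \Sigma) V^{T}.
\end{aligned}
```

Both right-hand sides have the form (orthogonal)(diagonal)(orthogonal
transpose), so the columns of W and of V are the eigenvectors, and the
squared diagonal entries of Sigma are the eigenvalues.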

If X is viewed as some kind of linear transformation (and I'm not sure
if I'm actually supposed to do that), then Vt can be seen as a rotation
so that the 1st principal component aligns with the 1st axis, the 2nd
principal component aligns with the 2nd, etc., prior to the scaling by
Sigma. Finally, I would expect W to rotate the data back to its
original orientation, thus yielding X on the LHS.

Following Shlens's tutorial, I find the above picture is easier to see
if we rewrite the SVD formula as (Wt)(X)=(Sigma)(Vt), where the /rows/
of Wt are the eigenvectors of the covariance (X)(Xt) between sensors/
instruments. Treating them as basis vectors, multiplying them by
the columns of X simply projects the m-value sample from each
measurement instance onto the principal components, which yields the
rotation of the data points so that the principal components align
with the axes. Conversely, X=(W)[(Sigma)(Vt)] takes the data points
in the rotated state (principal components aligned with axes) and
unrotates them so that they match the orientation of the measured
data points.
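
The rotate / unrotate picture can be tried out on made-up numbers. This
is only a sketch, assuming NumPy; the mixing matrix A, the sizes and the
seed are hypothetical and serve only to produce correlated "sensor"
rows. It uses the thin SVD, so Vt here is (m)x(n) rather than (n)x(n).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 200

# Hypothetical correlated readings: mix independent sources with a random matrix A.
A = rng.standard_normal((m, m))
X = A @ rng.standard_normal((m, n))
X = X - X.mean(axis=1, keepdims=True)  # assumption: each sensor's mean is removed

# Thin SVD: W is m x m, s has m entries, Vt is m x n.
W, s, Vt = np.linalg.svd(X, full_matrices=False)

# Re-express every column of X (one instant in time) in the basis given by W.
Y = W.T @ X

# (Wt)(X) equals (Sigma)(Vt).
same = np.allclose(Y, np.diag(s) @ Vt)

# In the new basis the rows are uncorrelated: (Y)(Yt) is diagonal with entries s**2.
diag_ok = np.allclose(Y @ Y.T, np.diag(s**2))

# W is orthogonal (a rotation, possibly combined with a reflection),
# so the length of each data point is unchanged...
lengths_kept = np.allclose(np.linalg.norm(Y, axis=0), np.linalg.norm(X, axis=0))

# ...and applying W undoes the change of basis, recovering X.
back = np.allclose(W @ Y, X)

print(same, diag_ok, lengths_kept, back)
```

The diagonal form of (Y)(Yt) is what "principal components aligned with
the axes" means numerically: no covariance is left between the rotated
coordinates.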

One of the most disturbing things I haven't been able to figure out is
what V (or Vt) corresponds to in the real world. I mean, if X was a
transformation, then Vt is simply a rotation in n-space. But X
*isn't* a transformation. And n-space is meaningless because we would
never treat the vector of data from a single sensor as a data point
(i.e., each measurement instance in time as a dimension) and plot it
in n-dimensional space. So even though V or Vt somehow corresponds to
a geometric rotation of sorts, it's in a space that is nonsensical
and has no bearing on the real world.
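
One identity that follows directly from the factorization may help
here. It is offered only as a possible reading, not as the
interpretation given in either source. The symbols w_i, v_i (the i-th
columns of W and V), sigma_i (the i-th diagonal entry of Sigma) and x_j
(the j-th column of X, i.e. the m readings at instant j) are notation
introduced for this sketch, and sigma_i is assumed nonzero:

```latex
W^{T} X = \Sigma V^{T}
\quad\Longrightarrow\quad
v_i^{T} = \frac{1}{\sigma_i}\, w_i^{T} X ,
\qquad\text{i.e.}\qquad
(v_i)_j = \frac{1}{\sigma_i}\, w_i^{T} x_j .
```

Read this way, the j-th entry of v_i is the projection of the readings
at instant j onto the i-th principal direction, divided by sigma_i, so
each column of V is a unit-length time course of one principal
component across the n instants.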

I realize that Leskovec describes SVD differently, as documents versus
search terms, with concepts as an intermediate thing that is
determined by the SVD. The left and right singular vectors then
represent the correlation of documents versus concepts and search
terms versus concepts. However, he doesn't really delve into why the
math corresponds to that. Also, I'm much more interested in the lab
sensor/instrument scenario, where the size of the diagonal matrix
corresponds to the size of the data set (at least before dimensional
reduction).

So when I look at the mockingly simple SVD formula, I have developed a
phobia of the mysterious rotation matrix at the tail end. It has
defied my endless attempts (no joke) to understand it
intuitively. Thank you to anyone who can impart some clear intuition
on this.
