Re: How to determine which variables from the dataset are important and which can be discarded using PCA.
Posted: Jan 5, 2013 4:05 AM

"Maureen " <maureen_510@hotmail.com> wrote in message <kc87ar$il3$1@newscl01ah.mathworks.com>...
> Hi,
> I have 28 variables and by performing
> [COEFF,SCORE,latent,tsquare] = princomp(X)
> I understand that for coeff, the columns are in order of decreasing component variance. So how do I know which column represent the respective 28 variables, since they are now reordered?
> I performed
>
> cumsum(latent)./sum(latent)
>
> and it shows that the first 9 variables are probably more important and so I would like to remove those variables that are least important to avoid overfit issue. But how do I remove the least important variables when I do not know what they are?
> I am pretty new in this and would appreciate any help here. Thanks in advance!
What exactly is your threshold for keeping principal components? A cumulative variance fraction, cumsum(latent)./sum(latent), of 0.95, 0.99, or ...?
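As a rough sketch of how such a threshold could be applied: the data matrix below is a random placeholder, and the 0.95 cutoff and the variable names explained, threshold and k are assumptions for illustration, not values from the original post.

```matlab
% Placeholder data only: 200 observations of 28 correlated variables
rng(0);
X = randn(200, 28) * randn(28, 28);

% Same call as in the question (newer MATLAB releases provide pca instead)
[COEFF, SCORE, latent] = princomp(X);

explained = cumsum(latent) ./ sum(latent);     % cumulative fraction of variance
threshold = 0.95;                              % assumed cutoff, pick your own
k = find(explained >= threshold, 1, 'first');  % fewest PCs that reach the cutoff

fprintf('%d PCs explain %.1f%% of the variance\n', k, 100 * explained(k));
```

Raising the cutoff to 0.99 will generally keep more components, so the choice should be settled before deciding how many to retain.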
If you have a regression problem, this shows that 9 principal components are sufficient. HOWEVER, IT DOES NOT NECESSARILY MEAN THAT 9 ORIGINAL VARIABLES ARE EQUIVALENT.
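In general, you transform from the original variables to principal components and then use the 9 dominant principal components, not 9 original variables. Each PC is a linear combination of all 28 original variables: row i of COEFF holds the weights of original variable i, and only the columns (the PCs) are ordered by variance. The overwhelming advantage of using the PCs is that they are orthogonal.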
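A minimal sketch of that transformation, assuming placeholder random data and k = 9 as in this thread. The names Z, Xc, maxDiff and offDiag are made up for the example.

```matlab
% Placeholder data only: 200 observations of 28 correlated variables
rng(0);
X = randn(200, 28) * randn(28, 28);
[COEFF, SCORE, latent] = princomp(X);

k = 9;               % number of dominant PCs discussed in the thread
Z = SCORE(:, 1:k);   % n-by-k matrix of PC scores to use as the new predictors

% SCORE is the mean-centered data projected onto the columns of COEFF
Xc = bsxfun(@minus, X, mean(X, 1));
maxDiff = max(max(abs(Xc * COEFF(:, 1:k) - Z)));   % should be near zero

% The score columns are uncorrelated: off-diagonal covariances are near zero
C = cov(Z);
offDiag = max(max(abs(C - diag(diag(C)))));

% Row i of COEFF holds the weights of original variable i on each PC
weightsOfVar3 = COEFF(3, 1:k);

fprintf('max projection error %.2e, max off-diagonal cov %.2e\n', maxDiff, offDiag);
```

The rows of COEFF stay in the original variable order, so nothing about the 28 variables is reordered. Only the columns (the components) are sorted.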
However, if you want to use the nonorthogonal original variables, it seems to me that the most reasonable approach is to choose the 9 original variables that have the largest projections into the 9-dimensional dominant PC subspace and are linearly independent.
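One possible reading of this selection rule, sketched on placeholder data. The greedy loop and the rank test on the projected rows of COEFF are assumptions about how "largest projection" and "linearly independent" might be checked, not a method given in the post. The names projLen, order and selected are hypothetical.

```matlab
% Placeholder data only: 200 observations of 28 correlated variables
rng(0);
X = randn(200, 28) * randn(28, 28);
[COEFF, SCORE, latent] = princomp(X);
k = 9;

% Length of each original variable's unit axis after projection into the
% subspace spanned by the first k PCs (values lie between 0 and 1)
projLen = sqrt(sum(COEFF(:, 1:k).^2, 2));
[~, order] = sort(projLen, 'descend');

% Greedy pass: take variables in order of decreasing projection, and keep
% one only if its projection is linearly independent of those already kept
selected = [];
for idx = order'
    candidate = [selected, idx];
    if rank(COEFF(candidate, 1:k)) == numel(candidate)
        selected = candidate;
    end
    if numel(selected) == k
        break;
    end
end

disp(selected);            % indices of the chosen original variables
disp(projLen(selected)');  % their projection lengths
```

The indices in selected refer to columns of X in their original order, so X(:, selected) would be the reduced set of original variables.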
If you have a classification problem, use PLS instead of PCA.
help plsregress
doc plsregress
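As a hedged illustration of how plsregress might be used for a two-class problem: the data, the 0/1 coding of the classes, the 9 components and the 0.5 decision threshold are all assumptions for the example, not recommendations from the post.

```matlab
% Placeholder data only: 200 observations, 28 predictors, synthetic 0/1 labels
rng(0);
n = 200;
X = randn(n, 28);
labels = double(X(:, 1) + 0.5 * X(:, 2) + 0.3 * randn(n, 1) > 0);

ncomp = 9;   % assumed number of PLS components
[XL, YL, XS, YS, BETA, PCTVAR] = plsregress(X, labels, ncomp);

% BETA has the intercept in its first row, so prepend a column of ones
yfit = [ones(n, 1), X] * BETA;
predicted = double(yfit > 0.5);        % illustrative threshold on the fitted value
trainAcc = mean(predicted == labels);  % training accuracy only, not a validation estimate

fprintf('training accuracy: %.3f\n', trainAcc);
disp(cumsum(PCTVAR(2, :)));            % cumulative fraction of response variance explained
```

Unlike PCA, PLS uses the response when building its components, which is why it is suggested here for supervised problems. A held-out set or cross-validation would be needed to judge the fit honestly.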
Hope this helps.
Greg



