Topic: How to determine which variables from the dataset are important and which can be discarded using PCA.
Replies: 4   Last Post: Jan 7, 2013 11:40 PM

Greg Heath

Posts: 5,919
Registered: 12/7/04
Re: How to determine which variables from the dataset are important and which can be discarded using PCA.
Posted: Jan 5, 2013 4:05 AM

"Maureen " <maureen_510@hotmail.com> wrote in message <kc87ar$il3$1@newscl01ah.mathworks.com>...
> Hi,
> I have 28 variables and by performing
> [COEFF,SCORE,latent,tsquare] = princomp(X)
> I understand that for coeff, the columns are in order of decreasing component variance. So how do I know which column represent the respective 28 variables, since they are now re-ordered?
> I performed
>
> cumsum(latent)./sum(latent)
>
> and it shows that the first 9 variables are probably more important and so I would like to remove those variables that are least important to avoid overfit issue. But how do I remove the least important variables when I do not know what they are?
> I am pretty new in this and would appreciate any help here. Thanks in advance!


What exactly is your threshold for keeping principal components? A cumulative
variance fraction, cumsum(latent)./sum(latent), of 0.95, 0.99, or ...?
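
To make the threshold question concrete, here is a minimal sketch of picking the number of components from the cumulative variance fraction. The data are synthetic (28 variables driven by 9 latent factors, an assumption chosen only to resemble the poster's situation), the 0.95 cutoff is likewise assumed, and the PCA is done with a plain SVD of the centered data so that the snippet does not depend on princomp. The resulting latent should match what princomp returns.

```matlab
% Synthetic stand-in for the poster's data: 200 observations, 28 variables
% driven by 9 latent factors plus a little noise (assumed, for illustration).
rng(0);
n = 200; p = 28; r = 9;
X = randn(n, r) * randn(r, p) + 0.05 * randn(n, p);

% PCA via SVD of the column-centered data.
Xc = X - repmat(mean(X, 1), n, 1);
[~, S, ~] = svd(Xc, 'econ');
latent = diag(S).^2 / (n - 1);           % component variances, in descending order

% Cumulative fraction of total variance explained by the first 1, 2, ... PCs.
explained = cumsum(latent) / sum(latent);

threshold = 0.95;                        % assumed cutoff; 0.99 is the stricter option
k = find(explained >= threshold, 1);     % smallest number of PCs reaching the cutoff
fprintf('%d components reach %.0f%% of the variance\n', k, 100 * threshold);
```

Raising the threshold can only keep the same number of components or more, so it is worth checking how sensitive k is to the choice before fixing it.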

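If you have a regression problem, this suggests that 9 principal components are
sufficient to represent the inputs. HOWEVER, IT DOES NOT NECESSARILY MEAN THAT
ANY 9 ORIGINAL VARIABLES ARE EQUIVALENT TO THOSE 9 COMPONENTS. The columns of
COEFF are not a reordering of your variables: each principal component is a
linear combination of all 28 original variables, and row i of COEFF holds the
weights of variable i.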

In general, you transform from the original variables to principal components and then use the 9 dominant principal components, not 9 original variables. The overwhelming advantage of using the PCs is that they are orthogonal (mutually uncorrelated).
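
A minimal sketch of that transformation, again on synthetic data and with an SVD standing in for princomp (both are assumptions made for illustration; V here plays the role of COEFF, up to column signs). It keeps the 9 dominant components and checks that the resulting scores are uncorrelated.

```matlab
% Synthetic stand-in data: 200 observations of 28 variables (assumed).
rng(0);
n = 200; p = 28; k = 9;
X = randn(n, k) * randn(k, p) + 0.05 * randn(n, p);

mu = mean(X, 1);
Xc = X - repmat(mu, n, 1);
[~, ~, V] = svd(Xc, 'econ');     % columns of V play the role of COEFF

coeffK = V(:, 1:k);              % p-by-k weights of the dominant PCs
scoreK = Xc * coeffK;            % n-by-k reduced inputs for the regression

% The retained scores are mutually uncorrelated: their covariance is diagonal.
C = cov(scoreK);
offDiag = max(max(abs(C - diag(diag(C)))));
fprintf('largest off-diagonal covariance: %.2e\n', offDiag);

% New observations must be centered with the training mean before projecting.
xNew = X(1, :);                  % a training row reused as a stand-in
zNew = (xNew - mu) * coeffK;
```

The regression is then fitted on scoreK rather than on X, and any later input goes through the same centering and projection.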

However, if you want to use the nonorthogonal original variables, it seems to me that the
most reasonable approach is to choose the 9 original variables that have the largest projections onto the 9-dimensional dominant PC subspace and are linearly independent.
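
One possible reading of that suggestion, sketched under stated assumptions: the "projection" of variable j is taken to be the length of its coordinate axis after projection onto the span of the first 9 columns of COEFF, which equals the norm of row j of COEFF(:,1:9). Variables are then accepted greedily in order of decreasing projection, skipping any that are linearly dependent on those already chosen. The data are synthetic, and column 2 is deliberately made a multiple of column 1 so that the independence check has something to reject.

```matlab
% Synthetic stand-in data with one planted, exactly dependent column (assumed).
rng(0);
n = 200; p = 28; k = 9;
X = randn(n, k) * randn(k, p) + 0.05 * randn(n, p);
X(:, 2) = 3 * X(:, 1);

Xc = X - repmat(mean(X, 1), n, 1);
[~, ~, V] = svd(Xc, 'econ');             % columns of V play the role of COEFF

% Length of each variable's axis vector after projection onto the
% k-dimensional dominant PC subspace (a value between 0 and 1).
projLen = sqrt(sum(V(:, 1:k).^2, 2));
[~, order] = sort(projLen, 'descend');

% Greedy pass: accept a variable only if it keeps the chosen set full rank.
selected = [];
for j = order'
    candidate = [selected, j];
    if rank(Xc(:, candidate)) == numel(candidate)
        selected = candidate;
    end
    if numel(selected) == k
        break;
    end
end
disp(selected);                          % indices of the retained original variables
```

The rank test only catches exact (numerical) dependence. Strongly correlated but not identical variables would pass it, so a stricter tolerance or a correlation check may be needed on real data.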

If you have a classification problem, use PLS instead of PCA. PLS uses the target values when choosing its directions, whereas PCA ignores them.

help plsregress
doc plsregress
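
As a rough illustration of how plsregress might be used for classification, the sketch below codes two classes as a 0/1 indicator matrix and assigns each observation to the class with the largest fitted value. The data, the labels, the choice of 9 components, and the indicator-matrix approach are all assumptions made for the example. It requires the Statistics Toolbox.

```matlab
% Hypothetical two-class problem on 28 synthetic variables (assumed).
rng(0);
n = 200; p = 28; ncomp = 9;              % ncomp is an assumed choice
X = randn(n, p);
labels = 1 + (X(:, 1) - X(:, 2) + 0.3 * randn(n, 1) > 0);   % class 1 or 2

% Code the classes as a 0/1 indicator matrix, one column per class.
Y = zeros(n, 2);
Y(sub2ind(size(Y), (1:n)', labels)) = 1;

% PLS picks directions in X using Y, unlike PCA, which looks at X alone.
[XL, YL, XS, YS, BETA, PCTVAR] = plsregress(X, Y, ncomp);

% BETA includes an intercept row, so prepend a column of ones to predict.
Yfit = [ones(n, 1), X] * BETA;
[~, predicted] = max(Yfit, [], 2);
fprintf('training accuracy: %.2f\n', mean(predicted == labels));
```

Training accuracy is optimistic; the number of components is better chosen by cross-validation, which plsregress supports through its 'CV' option.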

Hope this helps.

Greg



