Views expressed in these public forums are not endorsed by NCTM or The Math Forum.



Re: How to determine which variables from the dataset are important and which can be discarded using PCA.
Posted: Jan 7, 2013 11:40 PM


"Greg Heath" <heath@alumni.brown.edu> wrote in message <kcg770$4g6$1@newscl01ah.mathworks.com>...
> "Maureen " <maureen_510@hotmail.com> wrote in message <kcdam8$5s9$1@newscl01ah.mathworks.com>...
> > "Greg Heath" <heath@alumni.brown.edu> wrote in message <kc8qc2$kd9$1@newscl01ah.mathworks.com>...
> > > "Maureen " <maureen_510@hotmail.com> wrote in message <kc87ar$il3$1@newscl01ah.mathworks.com>...
> > > > Hi,
> > > > I have 28 variables, and by performing
> > > >
> > > > [COEFF,SCORE,latent,tsquare] = princomp(X)
> > > >
> > > > I understand that the columns of COEFF are in order of decreasing component variance. So how do I know which columns represent the respective 28 variables, since they are now reordered? I performed
> > > >
> > > > cumsum(latent)./sum(latent)
> > > >
> > > > and it shows that the first 9 variables are probably more important, and so I would like to remove the variables that are least important to avoid overfitting. But how do I remove the least important variables when I do not know what they are? I am pretty new at this and would appreciate any help. Thanks in advance!
> > >
> > > What, exactly, is your threshold for keeping principal components? cumsum/sum < 0.95, 0.99, or ...?
> > >
> > > If you have a regression problem, this shows that 9 principal components are sufficient. HOWEVER, IT DOES NOT NECESSARILY MEAN THAT 9 ORIGINAL VARIABLES ARE EQUIVALENT.
> > >
> > > In general, you transform from the original variables to principal components and then use the 9 dominant principal components, not 9 original variables. The overwhelming advantage of using the PCs is that they are orthogonal.
> > >
> > > However, if you want to use the nonorthogonal original variables, it seems to me that the most reasonable approach is to choose the 9 original variables that have the largest projection onto the 9-dimensional dominant PC subspace and are linearly independent.
> > >
> > > If you have a classification problem, use PLS instead of PCA.
> > >
> > > help plsregress
> > > doc plsregress
> > >
> > > Hope this helps.
> > >
> > > Greg
> >
> > Hi Greg,
> >
> > Thank you so much for the clarification! But I am a little confused. How does using orthogonal vs. nonorthogonal original values affect my result? Or, I should say, under what circumstances should I decide to use orthogonal or nonorthogonal values?
>
> Your original question was how to use PCA to determine the relative importance of the original, not necessarily orthogonal, variables. The phrase "use PCA" led to the answer I gave.
>
> The first thing to realize is that PCA does not take into account dependent variables or your ultimate goal. PCA lets you model the input space with an ordered orthogonal basis that maximizes the spread of the input data in each ordered subspace.
>
> That does not mean that an ordered PCA subspace is the best subspace to use w.r.t. a goal that depends on a set of dependent variables.
>
> A case in point is the classification of two parallel cigar-shaped distributions in 2-space. The least important PCA variable is the one perpendicular to the long axes of the cigars. However, if that variable is ignored, the remaining variable cannot provide any information that will help determine to which distribution an arbitrary point belongs. In this case, the variable that PCA ranks as most important, the one with the largest variance, is the least useful for classification.
>
> If you change your original question to "How to determine the relative importance of the original, not necessarily orthogonal, variables," then PCA may not be the best solution. For linear models, STEPWISEFIT may be the way to go; for more general models, it may be SEQUENTIALFS. You can read their documentation using the commands help and doc.
>
> > My intention is to plot out the scores in a 2-dimensional plot based on the 28 variables I have. But I am worried about overfitting, so I intend to reduce the number of variables while still ensuring that the plot is fairly spread out. So I was thinking of removing variables that do not contribute as much to the plot.
>
> You haven't defined your goal. What type of problem are you trying to solve? Regression? Classification?
>
> What quantity are you trying to minimize? MSE? PCTerr (percent classification error)?
>
> Do you want to explain the results in terms of the original variables, or would a set of transformed variables be sufficient?
>
> Are you restricted to linear models, or can nonlinear models be considered?
>
> > Saying this, another question popped into my head. What if I presume variable A has significant importance to my result, but it appears to have a small projection, which means it does not contribute much to my plot? Does that mean it can be discarded, since it does not do much to separate the scores on the plot? Is my concept correct, and am I missing anything here? Appreciate any help. Thanks in advance!
>
> Hard to answer until I know what, exactly, your problem is and what constraints there are on your method of solution. What is the size of your input data matrix [ I N ] = [ 28 N ]; N = ? What is the size of your output data matrix [ O N ]; O = ?
Hope this helps.

Greg
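[Editor's note] For readers following along outside MATLAB, Maureen's princomp call and the cumsum(latent)./sum(latent) check can be sketched in NumPy. This is a rough analogue built on the SVD of the centered data, not MathWorks' implementation, and the 95% cutoff is just one of the thresholds Greg mentions:

```python
import numpy as np

def princomp(X):
    """Rough NumPy analogue of MATLAB's princomp(X).

    Rows of X are observations, columns are variables. Returns
    (coeff, score, latent): loadings, PC scores, and component
    variances in decreasing order.
    """
    Xc = X - X.mean(axis=0)               # princomp centers each column
    # SVD of the centered data gives the principal axes directly
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    latent = s**2 / (X.shape[0] - 1)      # component variances, decreasing
    coeff = Vt.T                          # columns = PC directions
    score = Xc @ coeff                    # data in PC coordinates
    return coeff, score, latent

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 28))            # toy data with 28 variables
coeff, score, latent = princomp(X)

explained = np.cumsum(latent) / np.sum(latent)
k = int(np.searchsorted(explained, 0.95)) + 1   # keep 95% of the variance
print(k, explained[-1])
```

One point this makes concrete: only the columns of coeff (the components) are reordered by variance. Row j of coeff still refers to original variable j, so the variable ordering is never lost.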

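[Editor's note] Greg's cigar example is easy to verify numerically. The sketch below uses a toy geometry of my own choosing (two parallel Gaussian "cigars" offset only along their short axis): thresholding the largest-variance PC classifies at chance level, while the smallest-variance PC separates the classes almost perfectly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two parallel "cigars" in 2-space: long axis x has large variance,
# and the classes are offset only along the short axis y.
n = 500
x = rng.normal(0.0, 10.0, size=2 * n)             # shared long axis
y = np.concatenate([rng.normal(-1.0, 0.1, n),     # class 0
                    rng.normal(+1.0, 0.1, n)])    # class 1
X = np.column_stack([x, y])
labels = np.repeat([0, 1], n)

# PCA via the eigendecomposition of the covariance matrix
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc.T))       # eigenvalues ascending
pc_small = Xc @ evecs[:, 0]                       # smallest-variance PC
pc_large = Xc @ evecs[:, 1]                       # largest-variance PC

def threshold_accuracy(scores, labels):
    # Best accuracy of a zero threshold (eigenvector signs are arbitrary)
    m = np.mean((scores > 0) == (labels == 1))
    return max(m, 1.0 - m)

acc_small = threshold_accuracy(pc_small, labels)
acc_large = threshold_accuracy(pc_large, labels)
print(acc_small, acc_large)
```

The component a variance cutoff would discard first is the only one that matters for this classification task, which is exactly Greg's point.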

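[Editor's note] Greg's suggestion of picking the original variables with the largest projection onto the dominant PC subspace admits more than one reading. One plausible interpretation (mine, not a toolbox function) scores variable j by the norm of row j of COEFF restricted to the first k columns, which measures how much of the unit vector e_j lies in the dominant subspace:

```python
import numpy as np

def rank_by_pc_projection(X, k):
    """Rank original variables by the size of their projection onto
    the k-dimensional dominant principal subspace. Returns variable
    indices sorted from largest to smallest projection. A heuristic
    reading of the thread's suggestion, not a standard routine.
    """
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    coeff = Vt.T                        # columns = PCs, rows = variables
    # Row j of coeff[:, :k] holds variable j's loadings on the top-k
    # PCs; its norm is the length of e_j's projection into that subspace.
    proj = np.linalg.norm(coeff[:, :k], axis=1)
    return np.argsort(proj)[::-1]

# Toy data: variables 0 and 1 carry almost all the variance.
rng = np.random.default_rng(2)
base = rng.normal(size=(300, 2)) * [10.0, 5.0]
noise = rng.normal(scale=0.1, size=(300, 4))
X = np.hstack([base, noise])

order = rank_by_pc_projection(X, k=2)
print(order[:2])
```

Linear independence of the chosen variables, which Greg also requires, would still need to be checked separately, e.g. via the rank of the selected columns.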

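[Editor's note] For the STEPWISEFIT / SEQUENTIALFS route Greg points to, the core idea is greedy sequential selection. The sketch below is a bare-bones forward pass using plain least-squares training error as the criterion; that criterion is my simplification, since the toolbox functions use entry/exit statistics and cross-validation, respectively.

```python
import numpy as np

def forward_select(X, y, k):
    """Greedy forward selection: repeatedly add the variable that most
    reduces least-squares training error. A bare-bones sketch of the
    idea behind STEPWISEFIT / SEQUENTIALFS, not their actual algorithms.
    """
    n, p = X.shape
    chosen = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(p):
            if j in chosen:
                continue
            cols = chosen + [j]
            A = np.column_stack([np.ones(n), X[:, cols]])  # intercept + vars
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            err = np.mean((y - A @ beta) ** 2)
            if err < best_err:
                best_j, best_err = j, err
        chosen.append(best_j)
    return chosen

# Toy regression: y depends only on variables 3 and 0.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 6))
y = 2.0 * X[:, 3] - 1.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

chosen = forward_select(X, y, k=2)
print(chosen)   # variable 3 (largest effect) enters first, then 0
```

In practice you would score candidates on held-out data, as sequentialfs does, rather than on the training fit.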
