PLEASE DO NOT TOP-POST: 'IT IS CONSIDERED A HEINOUS BREACH OF GOOGLE GROUP ETIQUETTE TO POST REPLIES ABOVE A PREVIOUS POST.'
"Maureen " <email@example.com> wrote in message <firstname.lastname@example.org>...
> Initially, I am interested in dimension reduction as I wanted to reduce the plot down from 27 to either a 2 or 3 dimensional plot, that was why I decided to use PCA.
PCA ranks orthogonal linear combinations of the variables (the principal components) according to spread, i.e., variance. However, if you are not using techniques that depend on spread ranking, you cannot expect the display to tell you more than the practical dimensionality of the data and which combinations of variables are nearly linearly dependent (those with negligibly small singular values).
If you are more interested in correlations among variables, then standardize the original variables to have zero-mean and unit-variance. The resulting covariance matrix is then the correlation coefficient matrix. In addition to providing the correlation information, projections onto the new PC planes may yield useful info.
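A minimal sketch of the standardization described above, assuming the data sit in an N-by-27 matrix X with one observation per row. The random X is only a placeholder for the real measurements, and the variable names are illustrative.

```matlab
% Placeholder data: replace with the real N-by-27 matrix.
N = 200;
X = randn(N, 27);

% Standardize each column to zero mean and unit variance.
mu    = mean(X, 1);
sigma = std(X, 0, 1);
Z     = bsxfun(@rdivide, bsxfun(@minus, X, mu), sigma);

% The covariance matrix of Z is the correlation coefficient matrix of X.
R       = cov(Z);                          % 27-by-27
maxdiff = max(max(abs(R - corrcoef(X))));  % should be round-off only

% Principal components of the standardized data via the SVD.
[U, S, V] = svd(Z, 'econ');
pcvar  = diag(S).^2 / (N - 1);   % variance along each PC, largest first
scores = Z * V(:, 1:2);          % projection onto the first PC plane

plot(scores(:,1), scores(:,2), '.');
xlabel('PC 1'); ylabel('PC 2');
```

Because every standardized variable has unit variance, the entries of pcvar sum to 27, so pcvar/27 is the fraction of total variance carried by each PC.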
> I also understand that in PCA the orthogonal transformed input gives the most spread and I thought it could be helpful in visualizing my data with maximum spread on the input variables. I am not doing any form of classification, just to clarify, I do not have classes in which I hope my data will sit into.
Then use unsupervised clustering. It can tell you if classes of data appear to be present.
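One possible way to try this, sketched with KMEANS from the Statistics Toolbox (an assumption on my part; no particular clustering method has been specified). The two-blob X and the upper limit kmax = 5 are made up for illustration only.

```matlab
% Placeholder data: two well-separated groups in 27 dimensions.
X = [randn(100, 27); randn(100, 27) + 8];

kmax = 5;                 % assumed upper limit on the number of clusters
W = zeros(kmax, 1);
for k = 1:kmax
    % Several replicates guard against a poor random start.
    [idx, C, sumd] = kmeans(X, k, 'Replicates', 5);
    W(k) = sum(sumd);     % total within-cluster sum of squared distances
end
disp([(1:kmax)' W]);      % look for a sharp drop followed by a plateau
```

A large drop in W from k = 1 to k = 2 followed by little further change suggests two groups. With real data the picture is rarely this clean, and standardizing the variables first changes the answer.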
> But after plotting, I realised some overfit issue and I thought maybe I used too many input variables.
You don't have to guess:

help cond
doc cond
help rank
doc rank
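A short sketch of how COND and RANK can be applied to the data matrix. The X below is a placeholder in which column 27 is deliberately made a linear combination of two others, just to show what redundancy looks like.

```matlab
% Placeholder data with one deliberately redundant column.
N = 200;
X = randn(N, 27);
X(:, 27) = X(:, 1) + X(:, 2);          % illustration only

Xc = bsxfun(@minus, X, mean(X, 1));    % center the columns
s  = svd(Xc);                          % singular values, largest first
r  = rank(Xc);                         % singular values above the default tolerance
c  = cond(Xc);                         % largest / smallest singular value

fprintf('rank = %d of %d, cond = %g\n', r, size(Xc, 2), c);
```

A rank below 27, or a very large condition number, indicates that some variables are (nearly) linear combinations of the others.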
> Thus, I decided to remove some of the variables, but I do not know which constitute more and which are less significant in which I can remove. I tried by removing the variables that produce smaller projection on the plot and the result did not seem to improve, instead worsen.
What result? If you can quantify a result goal or bound then maybe we can help.
> Hence, I thought maybe I should find out which variables contribute most to the first 2 PCs for a 2D plot.
>
> Is my line of thoughts right?
I don't know what you are looking for. If you want to rank variables according to spread, just sort var(X) where 27 = size(X,2) or use the diagonal of the covariance matrix.
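For example, assuming X is the N-by-27 data matrix (the X below is a placeholder whose column j has a standard deviation of about j):

```matlab
% Placeholder data: replace with the real N-by-27 matrix.
X = bsxfun(@times, randn(200, 27), 1:27);

v = var(X);                              % 1-by-27 vector of column variances
[vsorted, order] = sort(v, 'descend');   % order(1) is the widest-spread variable

vdiag = diag(cov(X))';                   % same numbers from the covariance matrix

for j = 1:numel(order)
    fprintf('variable %2d   variance %10.4f\n', order(j), vsorted(j));
end
```

Note that this ranking depends on the units of each variable, so it is only meaningful when the variables are on comparable scales.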
> So if that is right, will STEPWISEFIT as mentioned earlier in the discussion, help in finding the variable importance? Or would some other method be more effective?
STEPWISEFIT is for linear regression. If you have no specified output variables or classes, it is of no use.
It might help if you

1. Sort the variables w.r.t. variance
2. Explain what each variable does in real life
3. Post the 27 sorted variances and the resulting correlation coefficient matrix in a form suitable for cutting and pasting.