On Monday, September 9, 2013 4:32:48 PM UTC-4, Rich Ulrich wrote:
> On Sat, 7 Sep 2013 09:52:19 -0700 (PDT), Greg Heath
> <firstname.lastname@example.org> wrote:
>
> >http://en.wikipedia.org/wiki/Adjusted_R-squared
>
> >I have been using the estimation degrees of freedom expression
>
> >Ndof = Ntrneq - Nw
>
> >for mitigating the bias in the MSE estimate of nonlinear neural network
> >regression models when the training data is used for the estimate.
>
> >Ntrn - Number of input/target training example vector pairs
> >O - Dimensionality of the target/output vectors
> >Ntrneq - Number of scalar training equations: Ntrneq = Ntrn*O
> >Nw - Number of unknown weights that have to be estimated
> >SSEtrn - Sum-squared-error
> >MSEtrn - Biased mean-squared-error estimate: MSEtrn = SSEtrn/Ntrneq
> >MSEtrna - Adjusted mean-squared-error estimate: MSEtrna = SSEtrn/Ndof
For the Naive Constant Output Model, the minimum MSE is just the mean target variance because the minimizing output is the mean of the target values.
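The definitions above can be sketched in a few lines. This is only an illustration in Python (not my MATLAB code); the function name and the dictionary keys are made up for the example. The adjusted naive-model reference MSE00a = SSE00/(Ntrneq - O) is an assumption on my part: it charges the constant model one estimated mean per output, which reduces to the usual Ntrn - 1 denominator of adjusted R-squared when O = 1.

```python
def adjusted_mse_stats(targets, outputs, n_weights):
    """targets, outputs: lists of Ntrn rows, each row a list of O scalars.

    Returns the biased and adjusted training MSE together with the
    corresponding R-squared values relative to the naive constant model.
    """
    n_trn = len(targets)                 # Ntrn
    n_out = len(targets[0])              # O
    n_trneq = n_trn * n_out              # Ntrneq = Ntrn*O
    n_dof = n_trneq - n_weights          # Ndof = Ntrneq - Nw
    if n_dof <= 0:
        raise ValueError("Ndof must be positive for the adjusted estimate")

    sse = sum((t - y) ** 2
              for t_row, y_row in zip(targets, outputs)
              for t, y in zip(t_row, y_row))
    mse = sse / n_trneq                  # MSEtrn (biased)
    mse_a = sse / n_dof                  # MSEtrna (adjusted)

    # Naive constant-output model: each output is the mean of its target.
    means = [sum(row[j] for row in targets) / n_trn for j in range(n_out)]
    sse00 = sum((row[j] - means[j]) ** 2
                for row in targets for j in range(n_out))
    mse00 = sse00 / n_trneq              # mean biased target variance
    mse00_a = sse00 / (n_trneq - n_out)  # ASSUMPTION: O estimated means

    return {
        "MSEtrn": mse,
        "MSEtrna": mse_a,
        "R2trn": 1.0 - mse / mse00,
        "R2trna": 1.0 - mse_a / mse00_a,  # ASSUMPTION: adjusted reference
    }
```

With Ntrn = 5, O = 1 and Nw = 2, for example, the adjusted MSE divides SSEtrn by 3 instead of 5, so R2trna is always a little smaller than R2trn.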
> >My question is: Is there a similar adjustment for classifiers?
>
> >PctErr = 100*Ntrnerr/Ntrneq
> >PctErra = 100*Ntrnerr/Ndof ????
Also, is there a similar adjustment for estimating the rates of each class?
> I have no idea what people do with neural networks, but I've never seen any
> version of correcting for d.f. for the 2x2 classification table in discriminant
> function. In fact, for discriminant function, that table gets rather little
> attention because its interpretation is so confounded by extreme/ not
> extreme marginal frequencies. And anything bigger than a 2x2 table gets
> worse.
The difference from the discriminant approach is that neural network nonlinear classifiers use the same algorithms that are used for nonlinear regression. The only difference is that the target matrix for a mutually exclusive c-class classifier consists of columns from the c-dimensional unit matrix. The row index of the "1" in each target column is the class index of the corresponding input vector.
NOTE: MATLAB NN data matrices are transposed from those commonly used.
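As a small illustration of that target coding, here is a sketch in Python (not MATLAB). The function name is hypothetical, and class indices are 0-based here, whereas MATLAB indices start at 1. The matrix is stored as c rows by N columns, so each column is one example, following the convention in the note above.

```python
def one_hot_targets(class_indices, n_classes):
    """Build a c x N target matrix whose columns are unit vectors.

    class_indices: list of N class indices in 0..n_classes-1 (0-based here).
    Returns a list of c rows; column n has its single 1 in row class_indices[n].
    """
    n_examples = len(class_indices)
    target = [[0] * n_examples for _ in range(n_classes)]
    for col, k in enumerate(class_indices):
        if not 0 <= k < n_classes:
            raise ValueError("class index out of range: %r" % (k,))
        target[k][col] = 1
    return target


# Four examples from three classes; every column sums to 1.
targets = one_hot_targets([0, 2, 1, 2], 3)
```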
The output of the trained model is interpreted as a consistent estimator of the input-conditional posterior probability vector. Accordingly, the input is assigned to the class corresponding to the row index of the maximum component in the output vector. The results can then be summarized in a (c+1) x (c+1) Count Confusion Matrix and a corresponding PerCent Confusion Matrix. The extra row and column hold the marginal sums: the extra row holds the column sums and the extra column holds the row sums.
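A minimal sketch of that bookkeeping, in Python rather than MATLAB. The function names are hypothetical, class indices are 0-based, and the orientation (rows = assigned class, columns = true class) is an assumption made for the example; the transposed layout works equally well.

```python
def assign_classes(outputs):
    """outputs: c x N matrix (list of c rows), one column per input vector.

    Returns the row index of the maximum component in each column.
    Ties go to the lowest class index.
    """
    n_classes = len(outputs)
    n_examples = len(outputs[0])
    return [max(range(n_classes), key=lambda k: outputs[k][col])
            for col in range(n_examples)]


def count_confusion(true_idx, pred_idx, n_classes):
    """(c+1) x (c+1) count matrix with marginal sums.

    ASSUMPTION: rows index the assigned class, columns the true class.
    """
    c = n_classes
    m = [[0] * (c + 1) for _ in range(c + 1)]
    for t, p in zip(true_idx, pred_idx):
        m[p][t] += 1
    for i in range(c):
        m[i][c] = sum(m[i][:c])                   # extra column: row sums
    for j in range(c + 1):
        m[c][j] = sum(m[i][j] for i in range(c))  # extra row: column sums
    return m


def percent_confusion(count):
    """Express every entry as a percentage of the grand total."""
    total = count[-1][-1]
    return [[100.0 * v / total for v in row] for row in count]


def percent_error(count):
    """Overall PctErr: off-diagonal counts as a percentage of all cases."""
    c = len(count) - 1
    correct = sum(count[k][k] for k in range(c))
    return 100.0 * (count[c][c] - correct) / count[c][c]
```

The per-class error rates can be read off the same matrix by comparing each diagonal count with the corresponding column sum in the extra row.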
> Here's my thought, though, about what I consider a starting point
> -- Translate the "R-squared" and "adjusted R-square" from regression to
> 2x2 tables, and see how much difference those tables have in prediction.
I don't have the slightest idea how to do this. I have never found a reliable transformation from mean-square-error of posterior probability estimates to classification error rates.
> Look at the fraction of loss in above-chance prediction, when you move
> from the unadjusted to the adjusted R^2. That's using the so-called
> Adjusted R-squared, and not what you point to as the adjusted MSE.
> The former depends both on number of predictors and on sample size.
So does the latter. See the above linear relationship formula.
> If you have the same number of predictors to start with as there are cases,
> the *conservative* approach says that a step-wise approach should use
> the whole count for reducing the d.f., so you are left with 0 d.f.
I don't see the point of that statement.
Anyway, in that case wouldn't the minimum achievable SSE be zero (even with noise and measurement error)?
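To illustrate what I mean, here is a toy sketch in Python. It is not a neural network: an interpolating polynomial stands in for any model with as many free parameters as scalar training equations, and the data values are arbitrary numbers invented for the example. With Nw = Ntrneq the fit passes through every target, noise included, so SSEtrn vanishes and Ndof = 0.

```python
def lagrange_predict(xs, ys, x):
    """Evaluate the degree N-1 polynomial through the N points (xs, ys) at x.

    The polynomial has N coefficients, i.e. Nw = Ntrneq for scalar targets.
    """
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


# Arbitrary "noisy" targets (made-up values, distinct inputs required).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.3, -1.2, 2.5, 0.1, 1.9]

n_weights = len(xs)            # one coefficient per training point
n_dof = len(xs) - n_weights    # Ndof = Ntrneq - Nw = 0
sse_trn = sum((lagrange_predict(xs, ys, x) - y) ** 2 for x, y in zip(xs, ys))
```

Here sse_trn is zero up to rounding, so SSEtrn/Ndof is 0/0 and the training data no longer carries any information about the error on new data.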
> There is a literature in "stepwise" that offers some Monte Carlo estimates
> that are more generous. That approach might tell you something about
> the case where the Case/NonCase fraction is near 50%.
>
> I don't know how readily it applies to extreme divisions.
You lost me.
The reason for my request is that, for multivariate nonlinear regression, I try to minimize the number of necessary neural network weights by looking at the performance of multiple designs on training data. Plots of R2trn, R2trna, R2val, and R2tst (the last two from non-training validation and test data) indicate that the comparison of R2trn and R2trna is usually sufficient for choosing a good model. Therefore I can use all of the data for training (i.e., no validation and test subsets) and get more reliable weight estimates.
However, I have found that posterior probability R^2 is not a good measure for choosing the best of multiple classifier designs, even if the R^2 is from non-training data! I am then left with estimating PctErr of BOTH training and non-training data.
Obviously, I would love to have some training bias mitigation for PctErr so that I can use all of my data for training.