On Wed, 30 Jan 2013 23:27:30 -0800 (PST), Darek <email@example.com> wrote:
>Hi all!
>
>I would like to ask about R^2 in linearized regression where Y value
>is transformed e.g.:
>http://en.wikipedia.org/wiki/Nonlinear_regression#Linearization
>If we apply power function (Y=a*b^X) for regression in Excel or SPSS
>the R^2 (sum of squares etc.) is calculated using linearized function
>i.e.: ln(Y)=a+ln(X)
I'm not sure about your starting point here.
Yes, if you "linearize" an equation, you do get a different sum of squares, etc. But SPSS offers nonlinear regression, which neither requires nor uses linearization (and I imagine that there are various add-on packages, with various provisions, for Excel).
>
>I think that comparison of R^2 for the same dataset for various
>regression functions (e.g.between linear and power function) where Y
>is transformed is not proper method of selection of best regression
>model.
Right. What regression minimizes is the sum of squares of residuals, and you cannot merely compare those sums when they are measured by different metrics, raw versus log versus [whatever].
What metric is used for those residuals? What metric do you *want* for those residuals? The "best model" is the one that provides the "smallest residuals" in whatever is the "most sensible" metric. Differences measured in log units will not match differences measured in raw units.
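To make the metric point concrete, here is a minimal Python sketch. The data are made up for illustration, and the helper names `ols` and `r_squared` are hypothetical, not anything from Excel or SPSS. It assumes the curve Y=a*b^X, whose linearized form is ln(Y) = ln(a) + X*ln(b), and reports R^2 once in the log metric (what a linearized fit reports) and once in the raw metric, using the same fitted curve back-transformed.

```python
import math

def ols(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(y, fitted):
    """1 - SS_residual / SS_total, in whatever metric y and fitted share."""
    my = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Made-up illustrative data, roughly exponential in x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.9, 4.1, 7.2, 9.8, 17.5, 24.0, 41.0, 52.0]

# Linearized fit: regress ln(y) on x.
log_y = [math.log(v) for v in y]
c0, c1 = ols(x, log_y)
fitted_log = [c0 + c1 * xi for xi in x]
fitted_raw = [math.exp(f) for f in fitted_log]  # same curve, back-transformed

r2_log_metric = r_squared(log_y, fitted_log)  # residuals in log units
r2_raw_metric = r_squared(y, fitted_raw)      # residuals in raw units

# Straight line fitted directly to raw y, for comparison in the raw metric.
d0, d1 = ols(x, y)
r2_line = r_squared(y, [d0 + d1 * xi for xi in x])

print("curve, R^2 in log metric:", round(r2_log_metric, 4))
print("curve, R^2 in raw metric:", round(r2_raw_metric, 4))
print("line,  R^2 in raw metric:", round(r2_line, 4))
```

The first two numbers describe the same fitted curve and still differ, because the residuals are measured in different units. Only the last two share a metric, and even then the comparison is fair only if raw units are the metric you actually care about.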
Now, Tukey describes using power transformations in a precise form that incorporates constants (from the derivatives of the transformations) so that the SS remains approximately constant. I think that was in his book on regression. If I recall properly, the residual SS is better preserved than the overall SS, so that regressions in different metrics can be approximately compared by the size of their residual SS, and not by the R^2.
I always considered that to be "of academic interest" because I was happier looking at the scatterplots, and judging which one has the characteristic of "equal interval" in the measurements, so that errors across the range of the scale seem to have the same "clinical" meaning. For my data, that was (almost) always the plot where the two variables were closer to Normal distributions.
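As a rough illustration of the scaled-transformation idea, the sketch below uses one standard construction: multiply ln(Y) by the geometric mean of Y, which is the constant that the derivative of the log contributes. This is the normalized form usually associated with the Box-Cox family. That it matches the form recalled above is an assumption. The data are made up, and `ols_rss` is a hypothetical helper name.

```python
import math

def ols_rss(x, y):
    """Residual sum of squares from the least-squares line of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# Made-up illustrative data, roughly exponential in x.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.9, 4.1, 7.2, 9.8, 17.5, 24.0, 41.0, 52.0]

log_y = [math.log(v) for v in y]

# Geometric mean of y: the scaling constant for the log transformation.
gm = math.exp(sum(log_y) / len(log_y))
scaled_log_y = [gm * v for v in log_y]

rss_raw = ols_rss(x, y)                    # raw units squared
rss_log = ols_rss(x, log_y)                # log units squared, not comparable
rss_scaled_log = ols_rss(x, scaled_log_y)  # rescaled to approximately raw units

print("residual SS, raw Y:       ", round(rss_raw, 3))
print("residual SS, ln(Y):       ", round(rss_log, 5))
print("residual SS, gm * ln(Y):  ", round(rss_scaled_log, 3))
```

The unscaled log residual SS is tiny only because log units are small. After scaling, it can be set beside the raw-metric residual SS as an approximate comparison. The scaling leaves R^2 within the log metric unchanged, which is why the comparison is made on residual SS.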
>I think that in the case described above if we would like to compare
>various functions of regression, R^2 should be calculated using
>function Y=a*b^X not function after linearization ln(Y)=a+ln(X).
>
>Could you give your opinion on this matter?
Which residuals do you like better? What is the natural metric for the variable? If you find it natural to say, "This score is twice the size of that one," then your language suggests that the log metric is the natural one. That's true for a whole lot of biological variables. And a lot of others.
My opinion is: you can't take any scores out of context and say that the raw values deserve to be raw, or deserve to be logged. It depends on which metric is expected to give a linear relation between the two variables, with homogeneous errors (equal variance across the range).