Re: Response to your last
Posted: Dec 8, 2012 6:32 PM
On Dec 7, 6:07 pm, djh <halitsk...@att.net> wrote:
> [...]
>
> II. You wrote:
>
> 2. Something's wrong somewhere. Those p's are too similar to one
> another, and are too large to be consistent with the other results
> you've been reporting.
>
> No, it's just that the "good" and "great" p's for u^2 are very
> length-specific, as is shown by the following table for u^2 in
> regression c on (e,u,u*e,u^2) for Len | subset = S, method = N,
> fold = a1, set = 1. (Note that this table is sorted by increasing p.)
>
> So the question posed by the following table is the same basic
> question I actually asked several posts ago, namely: for (S, N, a1,
> 1), do we have ENOUGH "good" and "great" p's to claim that the model c
> on (e,u,u*e,u^2) "works" in a sufficient number of cases to "keep" it,
> at least for the factor combination (S, N, a1, 1)?
>
> Also, please note that similar tables exist for all of the factor
> combinations equivalent to (S, N, a1, 1), so is it possible we should
> actually be comparing the distributions of p for u^2 from all these
> different factor combinations ... to see which distributions of p are
> "left" of others and "right" of others in the horizontal sense (i.e.
> with p as the x-axis)?
>
> u^2 (t, df, p) Table: t, df, and p values for u^2 in regression c on
> (e,u,u*e,u^2) for Len | subset=S, method=N, fold=a1, set=1
>
>  Len      t   df        p
>
>   71  3.930   24  0.00063
>   26  3.434   44  0.00131
>  122  3.565   16  0.00258
>   24  3.162   47  0.00274
>   27  3.101   58  0.00297
>  110  3.396   16  0.00369
>  101  3.179   19  0.00494
>   35  2.870   59  0.00569
>   84  2.460   27  0.02058
>  109  2.462   25  0.02108
>   25  2.343   66  0.02216
>   73  2.185   31  0.03654
>   69  1.989   34  0.05474
>   62  1.988   24  0.05828
>   49  1.922   39  0.06193
>   55  1.929   31  0.06294
>   44  1.733   35  0.09186
>   37  1.667   68  0.10004
>   28  1.635   64  0.10697
>   54  1.639   33  0.11063
>   94  1.638   22  0.11567
>   41  1.616   32  0.11598
>   29  1.564   74  0.12219
>   30  1.546   64  0.12705
>   60  1.533   34  0.13462
>   75  1.510   20  0.14672
>   33  1.464   54  0.14893
>   66  1.451   35  0.15580
>   52  1.404   38  0.16830
>   74  1.394   25  0.17562
>   50  1.240   40  0.22236
>   32  1.216   47  0.22989
>   67  1.186   40  0.24280
>   63  1.147   28  0.26105
>   38  1.084   38  0.28513
>   53  1.065   33  0.29463
>   40  1.053   46  0.29789
>   68  1.064   19  0.30072
>   77  0.998   28  0.32687
>   58  0.989   32  0.32996
>   76  0.950   22  0.35222
>   48  0.873   38  0.38816
>   43  0.860   33  0.39616
>   80  0.807   31  0.42564
>   46  0.766   30  0.44947
>   87  0.717   17  0.48337
>   56  0.679   31  0.50249
>   45  0.677   29  0.50349
>   83  0.659   19  0.51765
>   96  0.644   23  0.52619
>   59  0.537   24  0.59645
>   61  0.490   39  0.62669
>   36  0.454   57  0.65159
>   39  0.443   30  0.66063
>   65  0.424   21  0.67621
>  120  0.390   16  0.70203
>   95  0.325   12  0.75075
>   51  0.288   45  0.77443
>  108  0.270   14  0.79079
>   31  0.234   65  0.81572
>   90  0.169   14  0.86841
>  111  0.124   18  0.90264
>   34  0.078   73  0.93820
>   47  0.065   45  0.94811
>   89  0.061   11  0.95249
>   42  0.002   31  0.99881
>
> III. You wrote:
>
> "In particular, you should not be considering any results from
> regressing c on (u,u^2) if e matters".
>
> I'm sorry to plead ignorance but nothing you've ever posted before has
> prepared me to understand you here at all. What I mean by this is the
> following.
>
> From the beginning we have been using a regression involving e, a
> regression involving u, and a regression involving (e,u) IN CONCERT,
> NOT as mutually exclusive alternatives.
>
> First we had:
>
> 1a) ln(c/L) on ln(c/e)
> 1b) ln(c/L) on ln(c/u)
> 1c) ln(c/L) on (ln(c/e), ln(c/u))
>
> Then, because of your reservations about these regressions, we
> simplified to
>
> 2a) c on e
> 2b) c on u
> 2c) c on (e,u)
>
> and that actually improved matters.
>
> And then finally, because of your very remarkable intuition that the
> "L/H" dichotomization of u should be replaced by adding u-related
> factors to the regressions themselves, we have arrived at
>
> 3a) c on (e,u,u*e), by addition of a u-factor to c on e
> 3b) c on (u,u^2), by addition of a u-factor to c on u
> 3c) c on (e,u,u*e,u^2), by addition of two u factors to c on (e,u)
>
> So ... if we never intended 1a-c as mutually exclusive alternatives,
> nor 2a-c as mutually exclusive alternatives, why all of a sudden do we
> have to treat 3a-c as mutually exclusive alternatives? Please
> recall here that the ultimate goal was always to develop predictors
> for logistic regressions, and back when we were doing logistic
> regressions, you said it's best to throw everything into the soup that
> one can think of ... that's why we had logistic regression predictors
> based on MORE THAN ONE linear regression.
>
> Also, why is it NOT statistically legitimate to postulate that there
> are BOTH:
>
> a) a relationship between c and u that, as you suspected, is best
> expressed by c on (u,u^2) because the relationship changes with
> increasing u
>
> b) a relationship between c and e that, again as you expected, is best
> expressed by c on (e,u,u*e) because again, the relationship changes
> with increasing u.
>
> [...]
Let me focus initially on 2a-c: the regressions of c on e, on u, and on (e,u). There are two problems. First, c is a count, with no measurement error, but both e and u contain measurement error. The usual regression model, which we have been using all along, assumes the opposite: that the predictors are known exactly, and that only the d.v. (dependent variable) contains measurement error. (I mentioned this in a post on Oct 25 @ 12:54 pm.) However, I have been (and still am) willing to ignore this problem because I believe the measurement errors are probably negligible compared to random sampling error.
The other problem is something that I thought I had mentioned before, but apparently I never got beyond thinking about it. If you wanted the results of 2a-c for purely descriptive purposes, or to use as input for other computations, then I would see nothing wrong with doing all three. The problem comes when you ask for p-values. Then you need to specify a probability model, and the models for 2a-c are mutually exclusive (except in special cases, such as when at least one of the regression coefficients in 2c is zero).
We have been using the "conditional regression" model: for 2c, it says that for every (e,u) pair in the domain of interest, c|(e,u) = a0 + a1*e + a2*u + error, where the errors are independent identically-distributed zero-mean normal random variables. There are no distributional assumptions about (e,u); their values are taken as given and arbitrary. If this model holds then neither 2a nor 2b can hold (unless a coefficient in 2c is zero), and so we cannot get valid p-values for their coefficients.
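To spell out why 2c excludes 2a, here is a short worked derivation. The symbols b0, b1, delta, sigma and tau are notation introduced only for this illustration; they do not appear in the earlier posts.

```latex
\begin{align*}
\text{2c:}\quad & c = a_0 + a_1 e + a_2 u + \varepsilon,
    \qquad \varepsilon \sim N(0,\sigma^2)\ \text{i.i.d.} \\
\text{2a:}\quad & c = b_0 + b_1 e + \delta,
    \qquad\qquad\;\; \delta \sim N(0,\tau^2)\ \text{i.i.d.} \\
\text{Subtracting:}\quad & \delta = (a_0 - b_0) + (a_1 - b_1)\,e + a_2 u + \varepsilon \\
\text{so}\quad & E[\delta \mid e,u] = (a_0 - b_0) + (a_1 - b_1)\,e + a_2 u .
\end{align*}
```

For 2a to hold, the last expression would have to be zero at every (e,u) in the domain. Since the u values are arbitrary, that forces a2 = 0 (and then b0 = a0, b1 = a1), which is the special case noted above. The same argument with e and u interchanged applies to 2b.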
One way to legitimize p-values for 2a-c would be to switch to a completely random model, in which the sample triples (c,e,u) are assumed to come from a trivariate normal distribution. (The trivariate normal model is equivalent to augmenting the conditional regression model with the assumption that the sample pairs (e,u) come from a bivariate normal distribution.) However, that would rule out 3a-c, because all the regressions in any multivariate normal distribution are purely linear; there are no product terms or squared terms.
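For reference, these are the standard conditional-mean formulas behind that statement, written for a trivariate normal (c,e,u). The mu and sigma symbols are notation introduced for this sketch only.

```latex
\begin{align*}
E[c \mid e,u] &= \mu_c +
  \begin{pmatrix} \sigma_{ce} & \sigma_{cu} \end{pmatrix}
  \begin{pmatrix} \sigma_{ee} & \sigma_{eu} \\ \sigma_{eu} & \sigma_{uu} \end{pmatrix}^{-1}
  \begin{pmatrix} e - \mu_e \\ u - \mu_u \end{pmatrix}, \\
E[c \mid e] &= \mu_c + \frac{\sigma_{ce}}{\sigma_{ee}}\,(e - \mu_e), \qquad
E[c \mid u] = \mu_c + \frac{\sigma_{cu}}{\sigma_{uu}}\,(u - \mu_u).
\end{align*}
```

Each conditional mean is linear in the conditioning variables and each conditional variance is constant, so 2a, 2b and 2c can all hold at once under this model. None of the three can contain a u*e or u^2 term, which is why 3a-c are ruled out.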
A plot of the ordered p's from point II against their ranks is sufficiently different (by the IOT test) from plots of ordered random Uniform[0,1] variables against their ranks to allow the conclusion that the coefficient of u^2 is generally nonzero when subset=S, method=N, fold=a1, set=1. Accordingly, I see no defensible way to attach p-values to coefficients in models that omit u^2 in that cell.
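For anyone who wants to check the departure from uniformity numerically, here is a minimal sketch using the 66 p-values from the table in point II. It does NOT reproduce the IOT test; it substitutes two simpler checks, a one-sample Kolmogorov-Smirnov distance from Uniform[0,1] and a binomial count of p's below 0.05. Both treat the 66 p-values as independent, which is an assumption. All function and variable names are made up for this example.

```python
import math

# p-values for the u^2 coefficient, copied from the table in point II
# (subset=S, method=N, fold=a1, set=1), one per Len.
P_VALUES = [
    0.00063, 0.00131, 0.00258, 0.00274, 0.00297, 0.00369, 0.00494, 0.00569,
    0.02058, 0.02108, 0.02216, 0.03654, 0.05474, 0.05828, 0.06193, 0.06294,
    0.09186, 0.10004, 0.10697, 0.11063, 0.11567, 0.11598, 0.12219, 0.12705,
    0.13462, 0.14672, 0.14893, 0.15580, 0.16830, 0.17562, 0.22236, 0.22989,
    0.24280, 0.26105, 0.28513, 0.29463, 0.29789, 0.30072, 0.32687, 0.32996,
    0.35222, 0.38816, 0.39616, 0.42564, 0.44947, 0.48337, 0.50249, 0.50349,
    0.51765, 0.52619, 0.59645, 0.62669, 0.65159, 0.66063, 0.67621, 0.70203,
    0.75075, 0.77443, 0.79079, 0.81572, 0.86841, 0.90264, 0.93820, 0.94811,
    0.95249, 0.99881,
]


def ks_uniform_statistic(p_values):
    """Largest gap between the empirical CDF of the p's and the Uniform[0,1] CDF."""
    ordered = sorted(p_values)
    n = len(ordered)
    d = 0.0
    for rank, p in enumerate(ordered, start=1):
        # Check the gap just after and just before each jump of the empirical CDF.
        d = max(d, rank / n - p, p - (rank - 1) / n)
    return d


def binomial_upper_tail(k, n, prob):
    """P(X >= k) for X ~ Binomial(n, prob)."""
    return sum(
        math.comb(n, j) * prob**j * (1 - prob) ** (n - j) for j in range(k, n + 1)
    )


n = len(P_VALUES)
d_stat = ks_uniform_statistic(P_VALUES)
ks_critical_5pct = 1.36 / math.sqrt(n)  # usual large-sample 5% critical value
n_below_05 = sum(p < 0.05 for p in P_VALUES)
tail_prob = binomial_upper_tail(n_below_05, n, 0.05)

print(f"n = {n}")
print(f"KS distance from uniform = {d_stat:.4f} (5% critical ~ {ks_critical_5pct:.4f})")
print(f"p's below 0.05: {n_below_05} of {n}; P(at least that many | uniform) = {tail_prob:.2e}")
```

If the coefficient of u^2 were zero at every length, the p's should look like a Uniform[0,1] sample, so a KS distance above the critical value and an excess of small p's both point the same way as the plot described above.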