
Re: Response to your last
Posted: Dec 8, 2012 6:32 PM


On Dec 7, 6:07 pm, djh <halitsk...@att.net> wrote:
> [...]
>
> II. You wrote:
>
> "2. Something's wrong somewhere. Those p's are too similar to one
> another, and are too large to be consistent with the other results
> you've been reporting."
>
> No; it's just that the "good" and "great" p's for u^2 are very length
> specific, as is shown by the following table for u^2 in regression c
> on (e,u,u*e,u^2) for Len subset = S, method = N, fold = a1, set = 1.
> (Note that this table is sorted by increasing p.)
>
> So the question posed by the following table is the same basic
> question I actually asked several posts ago, namely: for (S, N, a1,
> 1), do we have ENOUGH "good" and "great" p's to claim that the model c
> on (e,u,u*e,u^2) "works" in a sufficient number of cases to "keep" it,
> at least for the factor combination (S, N, a1, 1)?
>
> Also, please note that similar tables exist for all of the factor
> combinations equivalent to (S, N, a1, 1), so is it possible we should
> actually be comparing the distributions of p for u^2 from all these
> different factor combinations ... to see which distributions of p are
> "left" of others and "right" of others in the horizontal sense (i.e.
> with p as the x-axis)?
> u^2 (t, df, p) Table: t, df, and p values for u^2 in regression c on
> (e,u,u*e,u^2) for Len subset=S, method=N, fold=a1, set=1
>
> Len     t     df   p
>  71   3.930   24   0.00063
>  26   3.434   44   0.00131
> 122   3.565   16   0.00258
>  24   3.162   47   0.00274
>  27   3.101   58   0.00297
> 110   3.396   16   0.00369
> 101   3.179   19   0.00494
>  35   2.870   59   0.00569
>  84   2.460   27   0.02058
> 109   2.462   25   0.02108
>  25   2.343   66   0.02216
>  73   2.185   31   0.03654
>  69   1.989   34   0.05474
>  62   1.988   24   0.05828
>  49   1.922   39   0.06193
>  55   1.929   31   0.06294
>  44   1.733   35   0.09186
>  37   1.667   68   0.10004
>  28   1.635   64   0.10697
>  54   1.639   33   0.11063
>  94   1.638   22   0.11567
>  41   1.616   32   0.11598
>  29   1.564   74   0.12219
>  30   1.546   64   0.12705
>  60   1.533   34   0.13462
>  75   1.510   20   0.14672
>  33   1.464   54   0.14893
>  66   1.451   35   0.15580
>  52   1.404   38   0.16830
>  74   1.394   25   0.17562
>  50   1.240   40   0.22236
>  32   1.216   47   0.22989
>  67   1.186   40   0.24280
>  63   1.147   28   0.26105
>  38   1.084   38   0.28513
>  53   1.065   33   0.29463
>  40   1.053   46   0.29789
>  68   1.064   19   0.30072
>  77   0.998   28   0.32687
>  58   0.989   32   0.32996
>  76   0.950   22   0.35222
>  48   0.873   38   0.38816
>  43   0.860   33   0.39616
>  80   0.807   31   0.42564
>  46   0.766   30   0.44947
>  87   0.717   17   0.48337
>  56   0.679   31   0.50249
>  45   0.677   29   0.50349
>  83   0.659   19   0.51765
>  96   0.644   23   0.52619
>  59   0.537   24   0.59645
>  61   0.490   39   0.62669
>  36   0.454   57   0.65159
>  39   0.443   30   0.66063
>  65   0.424   21   0.67621
> 120   0.390   16   0.70203
>  95   0.325   12   0.75075
>  51   0.288   45   0.77443
> 108   0.270   14   0.79079
>  31   0.234   65   0.81572
>  90   0.169   14   0.86841
> 111   0.124   18   0.90264
>  34   0.078   73   0.93820
>  47   0.065   45   0.94811
>  89   0.061   11   0.95249
>  42   0.002   31   0.99881
>
> III. You wrote:
>
> "In particular, you should not be considering any results from
> regressing c on (u,u^2) if e matters."
>
> I'm sorry to plead ignorance, but nothing you've ever posted before
> has prepared me to understand you here at all. What I mean by this is
> the following.
> From the beginning we have been using a regression involving e, a
> regression involving u, and a regression involving (e,u) IN CONCERT,
> NOT as mutually exclusive alternatives.
>
> First we had:
>
> 1a) ln(c/L) on ln(c/e)
> 1b) ln(c/L) on ln(c/u)
> 1c) ln(c/L) on (ln(c/e), ln(c/u))
>
> Then, because of your reservations about these regressions, we
> simplified to
>
> 2a) c on e
> 2b) c on u
> 2c) c on (e,u)
>
> and that actually improved matters.
>
> And then finally, because of your very remarkable intuition that the
> "L/H" dichotomization of u should be replaced by adding u-related
> factors to the regressions themselves, we have arrived at
>
> 3a) c on (e,u,u*e), by addition of a u-factor to c on e
> 3b) c on (u,u^2), by addition of a u-factor to c on u
> 3c) c on (e,u,u*e,u^2), by addition of two u-factors to c on (e,u)
>
> So ... if we never intended 1a-1c as mutually exclusive alternatives,
> nor 2a-2c as mutually exclusive alternatives, why all of a sudden do
> we have to treat 3a-3c as mutually exclusive alternatives? Please
> recall here that the ultimate goal was always to develop predictors
> for logistic regressions, and back when we were doing logistic
> regressions, you said it's best to throw everything into the soup that
> one can think of ... that's why we had logistic regression predictors
> based on MORE THAN ONE linear regression.
>
> Also, why is it NOT statistically legitimate to postulate that there
> are BOTH:
>
> a) a relationship between c and u that, as you suspected, is best
> expressed by c on (u,u^2), because the relationship changes with
> increasing u
>
> b) a relationship between c and e that, again as you expected, is best
> expressed by c on (e,u,u*e), because again, the relationship changes
> with increasing u.
>
> [...]
Let me focus initially on 2a-c: the regressions of c on e, on u, and on (e,u). There are two problems. First, c is a count, with no measurement error, but both e and u contain measurement error. The usual regression model, which we have been using all along, assumes the opposite: that the predictors are known exactly, and that only the dependent variable contains measurement error. (I mentioned this in a post on Oct 25 @ 12:54 pm.) However, I have been (and still am) willing to ignore this problem, because I believe the measurement errors are probably negligible compared to the random sampling error.
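To see why predictor measurement error matters in principle, here is a minimal pure-Python simulation sketch (all numbers invented for illustration): when the predictor is observed with error, the fitted slope is attenuated toward zero, which is exactly the distortion being waved off as negligible above.

```python
import random

random.seed(1)
n = 500
b0, b1 = 2.0, 3.0  # hypothetical "true" intercept and slope

e_true = [random.uniform(0.0, 10.0) for _ in range(n)]
# Response has its own error, as the usual regression model assumes:
c = [b0 + b1 * e + random.gauss(0.0, 1.0) for e in e_true]
# But suppose e is *measured* with error (sd = 2), violating that model:
e_obs = [e + random.gauss(0.0, 2.0) for e in e_true]

def ols_slope(x, y):
    """Simple-regression OLS slope: cov(x, y) / var(x)."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx

slope_clean = ols_slope(e_true, c)  # close to the true slope b1 = 3
slope_noisy = ols_slope(e_obs, c)   # attenuated toward zero
print(slope_clean, slope_noisy)
```

The attenuation factor is roughly var(e)/(var(e) + var(measurement error)); whether it is truly negligible here depends on how large the measurement errors in e and u actually are, which the simulation obviously cannot settle.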
The other problem is something that I thought I had mentioned before, but apparently I never got beyond thinking about it. If you wanted the results of 2a-c for purely descriptive purposes, or to use as input for other computations, then I would see nothing wrong with doing all three. The problem comes when you ask for p-values. Then you need to specify a probability model, and the models for 2a-c are mutually exclusive (except in special cases, such as when at least one of the regression coefficients in 2c is zero).
We have been using the "conditional regression" model: for 2c, it says that for every (e,u) pair in the domain of interest, c(e,u) = a0 + a1*e + a2*u + error, where the errors are independent, identically distributed, zero-mean normal random variables. There are no distributional assumptions about (e,u); their values are taken to be given, arbitrary. If this model holds, then neither 2a nor 2b can hold, and so we cannot get p-values for their coefficients.
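The mutual exclusivity can be made concrete with a small pure-Python sketch (coefficients and correlations invented): if the 2c model holds with a2 != 0 and e and u are correlated, then fitting 2a (c on e alone) leaves residuals that still track u, so 2a's "errors" are not the iid noise its p-values would require.

```python
import random

random.seed(2)
n = 400
a0, a1, a2 = 1.0, 2.0, 1.5  # hypothetical coefficients for model 2c

e = [random.gauss(0.0, 1.0) for _ in range(n)]
u = [0.7 * ei + random.gauss(0.0, 1.0) for ei in e]  # u correlated with e
c = [a0 + a1 * ei + a2 * ui + random.gauss(0.0, 1.0)
     for ei, ui in zip(e, u)]

def fit_simple(x, y):
    """OLS intercept and slope for y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def corr(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

b0, b1 = fit_simple(e, c)  # model 2a: c on e alone
resid = [ci - (b0 + b1 * ei) for ci, ei in zip(c, e)]
# Residuals are orthogonal to e by construction, but still track u:
print(corr(resid, e), corr(resid, u))
```

The second printed correlation is far from zero, which is the systematic structure in the "errors" that invalidates 2a's probability model when 2c is the truth.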
One way to legitimize p-values for 2a-c would be to switch to a completely random model, in which the sample triples (c,e,u) are assumed to come from a trivariate normal distribution. (The trivariate normal model is equivalent to augmenting the conditional regression model with the assumption that the sample pairs (e,u) come from a bivariate normal distribution.) However, that would rule out 3a-c, because all the regressions in any multivariate normal distribution are purely linear; there are no product terms or squared terms.
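The linearity claim is just the standard conditional-distribution property of the multivariate normal; as a sketch (partitioning notation mine, with x = (e,u) and Sigma partitioned conformably):

```latex
% Conditional mean of c given (e,u) under a trivariate normal model,
% with mu = (mu_c, mu_x) and Sigma = [[s_cc, S_cx], [S_xc, S_xx]]:
\[
  E[\,c \mid e, u\,]
    = \mu_c + \Sigma_{cx}\,\Sigma_{xx}^{-1}
      \begin{pmatrix} e - \mu_e \\ u - \mu_u \end{pmatrix}
\]
```

This is an affine function of (e,u): no u*e or u^2 term can arise, which is why the trivariate-normal route is incompatible with 3a-3c.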
A plot of the ordered p's from point II against their ranks is sufficiently different (by the IOT test) from plots of ordered random Uniform[0,1] variables against their ranks to allow the conclusion that the coefficient of u^2 is generally nonzero when subset=S, method=N, fold=a1, set=1. Accordingly, I see no defensible way to attach p-values to coefficients in models that omit u^2 in that cell.
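I don't know the internals of the IOT test, but as a rough stand-in, a one-sample Kolmogorov-Smirnov statistic for the 66 ordered p's from the table in point II against Uniform[0,1] points the same way: the maximum deviation D comfortably exceeds the usual approximate 5% critical value 1.36/sqrt(n).

```python
import math

# The 66 p-values for u^2 from the table in point II
# (subset=S, method=N, fold=a1, set=1), already sorted ascending.
p = [0.00063, 0.00131, 0.00258, 0.00274, 0.00297, 0.00369, 0.00494,
     0.00569, 0.02058, 0.02108, 0.02216, 0.03654, 0.05474, 0.05828,
     0.06193, 0.06294, 0.09186, 0.10004, 0.10697, 0.11063, 0.11567,
     0.11598, 0.12219, 0.12705, 0.13462, 0.14672, 0.14893, 0.15580,
     0.16830, 0.17562, 0.22236, 0.22989, 0.24280, 0.26105, 0.28513,
     0.29463, 0.29789, 0.30072, 0.32687, 0.32996, 0.35222, 0.38816,
     0.39616, 0.42564, 0.44947, 0.48337, 0.50249, 0.50349, 0.51765,
     0.52619, 0.59645, 0.62669, 0.65159, 0.66063, 0.67621, 0.70203,
     0.75075, 0.77443, 0.79079, 0.81572, 0.86841, 0.90264, 0.93820,
     0.94811, 0.95249, 0.99881]

n = len(p)
# One-sample KS statistic against the Uniform[0,1] CDF F(x) = x:
#   D = max_i max( i/n - p_(i), p_(i) - (i-1)/n )
D = max(max((i + 1) / n - pi, pi - i / n) for i, pi in enumerate(p))
crit = 1.36 / math.sqrt(n)  # approximate 5% critical value for large n
print(n, round(D, 3), round(crit, 3), D > crit)
```

This is only a crude check (the p's here are not independent draws in any strict sense, since they come from related regressions), but it is consistent with the conclusion that the ordered p's sit well away from the uniform reference line.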

