Date: Dec 8, 2012 6:32 PM
Author: Ray Koopman
Subject: Re: Response to your last

On Dec 7, 6:07 pm, djh <halitsk...@att.net> wrote:
> [...]
>
> II. You wrote:
>
> 2. Something's wrong somewhere. Those p's are too similar to one
> another, and are too large to be consistent with the other results
> you've been reporting.
>
> No; it's just that the "good" and "great" p's for u^2 are very length-
> specific, as is shown by the following table for u^2 in regression c
> on (e,u,u*e,u^2) for Len | subset = S, method = N, fold = a1, set = 1.
> (Note that this table is sorted by increasing p.)
>
> So the question posed by the following table is the same basic
> question I actually asked several posts ago, namely: for (S, N, a1,
> 1), do we have ENOUGH "good" and "great" p's to claim that the model c
> on (e,u,u*e,u^2) "works" in a sufficient number of cases to "keep" it,
> at least for the factor combination (S, N, a1, 1)?
>
> Also, please note that similar tables exist for all of the factor
> combinations equivalent to (S, N, a1, 1), so is it possible we should
> actually be comparing the distributions of p for u^2 from all these
> different factor combinations ... to see which distributions of p are
> "left" of others and "right" of others in the horizontal sense (i.e.
> with p as the x-axis)?
>
> u^2 (t, df, p) Table: t, df, and p values for u^2 in regression c on
> (e,u,u*e,u^2) for Len | subset=S, method=N, fold=a1, set=1
>
> Len t df p
>
> 71 3.930 24 0.00063
> 26 3.434 44 0.00131
> 122 3.565 16 0.00258
> 24 3.162 47 0.00274
> 27 3.101 58 0.00297
> 110 3.396 16 0.00369
> 101 3.179 19 0.00494
> 35 2.870 59 0.00569
> 84 2.460 27 0.02058
> 109 2.462 25 0.02108
> 25 2.343 66 0.02216
> 73 2.185 31 0.03654
> 69 1.989 34 0.05474
> 62 1.988 24 0.05828
> 49 1.922 39 0.06193
> 55 1.929 31 0.06294
> 44 1.733 35 0.09186
> 37 1.667 68 0.10004
> 28 1.635 64 0.10697
> 54 1.639 33 0.11063
> 94 1.638 22 0.11567
> 41 1.616 32 0.11598
> 29 1.564 74 0.12219
> 30 1.546 64 0.12705
> 60 1.533 34 0.13462
> 75 1.510 20 0.14672
> 33 1.464 54 0.14893
> 66 1.451 35 0.15580
> 52 1.404 38 0.16830
> 74 1.394 25 0.17562
> 50 1.240 40 0.22236
> 32 1.216 47 0.22989
> 67 1.186 40 0.24280
> 63 1.147 28 0.26105
> 38 1.084 38 0.28513
> 53 1.065 33 0.29463
> 40 1.053 46 0.29789
> 68 1.064 19 0.30072
> 77 0.998 28 0.32687
> 58 0.989 32 0.32996
> 76 0.950 22 0.35222
> 48 0.873 38 0.38816
> 43 0.860 33 0.39616
> 80 0.807 31 0.42564
> 46 0.766 30 0.44947
> 87 0.717 17 0.48337
> 56 0.679 31 0.50249
> 45 0.677 29 0.50349
> 83 0.659 19 0.51765
> 96 0.644 23 0.52619
> 59 0.537 24 0.59645
> 61 0.490 39 0.62669
> 36 0.454 57 0.65159
> 39 0.443 30 0.66063
> 65 0.424 21 0.67621
> 120 0.390 16 0.70203
> 95 0.325 12 0.75075
> 51 0.288 45 0.77443
> 108 0.270 14 0.79079
> 31 0.234 65 0.81572
> 90 0.169 14 0.86841
> 111 0.124 18 0.90264
> 34 0.078 73 0.93820
> 47 0.065 45 0.94811
> 89 0.061 11 0.95249
> 42 0.002 31 0.99881
>
> III. You wrote:
>
> "In particular, you should not be considering any results from
> regressing c on (u,u^2) if e matters".
>
> I'm sorry to plead ignorance but nothing you've ever posted before has
> prepared me to understand you here at all. What I mean by this is the
> following.
>
> From the beginning we have been using a regression involving e, a
> regression involving u, and a regression involving (e,u) IN CONCERT,
> NOT as mutually exclusive alternatives.
>
> First we had:
>
> 1a) ln(c/L) on ln(c/e)
> 1b) ln(c/L) on ln(c/u)
> 1c) ln(c/L) on (ln(c/e), ln(c/u))
>
> Then, because of your reservations about these regressions, we
> simplified to
>
> 2a) c on e
> 2b) c on u
> 2c) c on (e,u)
>
> and that actually improved matters.
>
> And then finally, because of your very remarkable intuition that the
> "L/H" dichotomization of u should be replaced by adding u-related
> factors to the regressions themselves, we have arrived at
>
> 3a) c on (e,u,u*e), by addition of a u-factor to c on e
> 3b) c on (u,u^2), by addition of a u-factor to c on u
> 3c) c on (e,u,u*e,u^2), by addition of two u factors to c on (e,u)
>
> So ... if we never intended 1(a-c) as mutually exclusive alternatives,
> nor 2a-c as mutually exclusive alternatives, why all of a sudden do we
> have to treat (3a-3c) as mutually exclusive alternatives? Please
> recall here that the ultimate goal was always to develop predictors
> for logistic regressions, and back when we were doing logistic
> regressions, you said it's best to throw everything into the soup that
> one can think of ... that's why we had logistic regression predictors
> based on MORE THAN ONE linear regression.
>
> Also, why is it NOT statistically legitimate to postulate that there are
> BOTH:
>
> a) a relationship between c and u that, as you suspected, is best
> expressed by c on (u,u^2) because the relationship changes with
> increasing u
>
> b) a relationship between c and e that, again as you expected, is best
> expressed by c on (e,u,u*e) because again, the relationship changes
> with increasing u.
>
> [...]


Let me focus initially on 2a-c: the regressions of c on e, on u,
and on (e,u). There are two problems. First, c is a count, with no
measurement error, but both e and u contain measurement error. The
usual regression model, that we have been using all along, assumes
the opposite: that the predictors are known exactly, and that only
the d.v. contains measurement error. (I mentioned this in a post on
Oct 25 @ 12:54 pm.) However, I have been (and still am) willing to
ignore this problem because I believe the measurement errors are
probably negligible compared to random sampling error.
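
To make that concrete, here is a minimal simulation sketch (the sample
size, slope, and error sizes are invented for illustration, not taken
from your data) of how measurement error in a predictor attenuates an
OLS slope, and why a small amount of it is ignorable next to the
residual scatter:

  import numpy as np

  rng = np.random.default_rng(0)
  n = 200
  u_true = rng.normal(0.0, 1.0, n)                   # hypothetical "true" predictor
  c = 2.0 + 1.5 * u_true + rng.normal(0.0, 1.0, n)   # d.v. with sampling error only

  def ols_slope(x, y):
      # simple least-squares slope of y on x
      x = x - x.mean()
      return np.dot(x, y - y.mean()) / np.dot(x, x)

  for sd in (0.0, 0.1, 0.5):                         # measurement error added to the predictor
      u_obs = u_true + rng.normal(0.0, sd, n)
      print(f"predictor error sd={sd}: slope ~ {ols_slope(u_obs, c):.3f}")

  # Classical errors-in-variables theory: the slope is attenuated by the
  # factor var(u_true)/(var(u_true) + sd^2), so sd=0.1 shifts it by ~1%,
  # which is swamped by the sampling error in the slope itself.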

The other problem is something that I thought I had mentioned before,
but apparently I never got beyond thinking about it. If you wanted
the results of 2a-c for purely descriptive purposes, or to use as
input for other computations, then I would see nothing wrong with
doing all three. The problem comes when you ask for p-values. Then
you need to specify a probability model, and the models for 2a-c are
mutually exclusive (except in special cases, such as when at least
one of the regression coefficients in 2c is zero).

We have been using the "conditional regression" model: for 2c,
it says that for every (e,u) pair in the domain of interest,
c|(e,u) = a0 + a1*e + a2*u + error, where the errors are independent
identically-distributed zero-mean normal random variables. There are
no distributional assumptions about (e,u); their values are taken to be
given, arbitrary. If this model holds then neither 2a nor 2b can
hold, and so we cannot get p-values for their coefficients.
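
If it helps to see the model in code, here is a minimal sketch of
fitting 2c under exactly those conditional-regression assumptions (the
arrays below are placeholders, not your data; the p-values come out of
the same t-statistics you have been tabulating):

  import numpy as np
  import statsmodels.api as sm

  # placeholder data standing in for one (subset, method, fold, set, Len) cell
  rng = np.random.default_rng(1)
  e = rng.normal(size=40)
  u = rng.normal(size=40)
  c = 1.0 + 0.5 * e + 0.3 * u + rng.normal(size=40)

  # conditional regression: c | (e,u) = a0 + a1*e + a2*u + normal error
  X = sm.add_constant(np.column_stack([e, u]))
  fit = sm.OLS(c, X).fit()
  print(fit.params)    # a0, a1, a2
  print(fit.pvalues)   # valid only if this conditional model is the one that holds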

One way to legitimize p-values for 2a-c would be to switch to a
completely random model, in which the sample triples (c,e,u) are
assumed to come from a trivariate normal distribution. (The trivariate
normal model is equivalent to augmenting the conditional regression
model with the assumption that the sample pairs (e,u) come from a
bivariate normal distribution.) However, that would rule out 3a-c,
because all the regressions in any multivariate normal distribution
are purely linear; there are no product terms or squared terms.
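
A quick way to see the linearity point is to simulate a large
trivariate-normal sample (the means and covariance below are made up)
and fit 3c anyway; the u*e and u^2 coefficients come out essentially
zero:

  import numpy as np
  import statsmodels.api as sm

  rng = np.random.default_rng(2)
  mean = [0.0, 0.0, 0.0]
  cov = [[1.0, 0.6, 0.4],
         [0.6, 1.0, 0.3],
         [0.4, 0.3, 1.0]]
  c, e, u = rng.multivariate_normal(mean, cov, 100_000).T

  # every regression within a multivariate normal is purely linear,
  # so the fitted u*e and u^2 coefficients should be ~0
  X = sm.add_constant(np.column_stack([e, u, u * e, u ** 2]))
  print(sm.OLS(c, X).fit().params.round(3))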

A plot of the ordered p's from point II against their ranks is
sufficiently different (by the IOT test) from plots of ordered random
Uniform[0,1] variables against their ranks to allow the conclusion
that the coefficient of u^2 is generally nonzero when subset=S,
method=N, fold=a1, set=1. Accordingly, I see no defensible way to
attach p-values to coefficients in models that omit u^2 in that cell.
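
For what it's worth, here is a minimal sketch of that comparison
(assuming the p column of your table has been read into pvals; the
Kolmogorov-Smirnov test at the end is only a formal stand-in for the
eyeball comparison of the two plots):

  import numpy as np
  from scipy import stats

  pvals = np.array([0.00063, 0.00131, 0.00258])  # ... the full p column from the table above

  p_sorted = np.sort(pvals)
  n = len(p_sorted)
  ranks = np.arange(1, n + 1) / (n + 1)   # expected values of Uniform[0,1] order statistics

  # visual check: plot p_sorted against ranks; if every u^2 coefficient
  # were zero, the points should hug the 45-degree line
  print(stats.kstest(p_sorted, "uniform"))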