Date: Dec 8, 2012 6:32 PM
Author: Ray Koopman
Subject: Re: Response to your last

On Dec 7, 6:07 pm, djh <halitsk...@att.net> wrote:

> [...]
>
> II. You wrote:
>
> "2. Something's wrong somewhere. Those p's are too similar to one
> another, and are too large to be consistent with the other results
> you've been reporting."
>
> No; it's just that the "good" and "great" p's for u^2 are very
> length-specific, as is shown by the following table for u^2 in
> regression c on (e,u,u*e,u^2) for Len | subset = S, method = N,
> fold = a1, set = 1. (Note that this table is sorted by increasing p.)
>
> So the question posed by the following table is the same basic
> question I actually asked several posts ago, namely: for (S, N, a1,
> 1), do we have ENOUGH "good" and "great" p's to claim that the model
> c on (e,u,u*e,u^2) "works" in a sufficient number of cases to "keep"
> it, at least for the factor combination (S, N, a1, 1)?
>
> Also, please note that similar tables exist for all of the factor
> combinations equivalent to (S, N, a1, 1), so is it possible we should
> actually be comparing the distributions of p for u^2 from all these
> different factor combinations ... to see which distributions of p are
> "left" of others and "right" of others in the horizontal sense (i.e.
> with p as the x-axis)?
>
> u^2 (t, df, p) Table: t, df, and p values for u^2 in regression c on
> (e,u,u*e,u^2) for Len | subset=S, method=N, fold=a1, set=1
>
>  Len      t  df        p
>   71  3.930  24  0.00063
>   26  3.434  44  0.00131
>  122  3.565  16  0.00258
>   24  3.162  47  0.00274
>   27  3.101  58  0.00297
>  110  3.396  16  0.00369
>  101  3.179  19  0.00494
>   35  2.870  59  0.00569
>   84  2.460  27  0.02058
>  109  2.462  25  0.02108
>   25  2.343  66  0.02216
>   73  2.185  31  0.03654
>   69  1.989  34  0.05474
>   62  1.988  24  0.05828
>   49  1.922  39  0.06193
>   55  1.929  31  0.06294
>   44  1.733  35  0.09186
>   37  1.667  68  0.10004
>   28  1.635  64  0.10697
>   54  1.639  33  0.11063
>   94  1.638  22  0.11567
>   41  1.616  32  0.11598
>   29  1.564  74  0.12219
>   30  1.546  64  0.12705
>   60  1.533  34  0.13462
>   75  1.510  20  0.14672
>   33  1.464  54  0.14893
>   66  1.451  35  0.15580
>   52  1.404  38  0.16830
>   74  1.394  25  0.17562
>   50  1.240  40  0.22236
>   32  1.216  47  0.22989
>   67  1.186  40  0.24280
>   63  1.147  28  0.26105
>   38  1.084  38  0.28513
>   53  1.065  33  0.29463
>   40  1.053  46  0.29789
>   68  1.064  19  0.30072
>   77  0.998  28  0.32687
>   58  0.989  32  0.32996
>   76  0.950  22  0.35222
>   48  0.873  38  0.38816
>   43  0.860  33  0.39616
>   80  0.807  31  0.42564
>   46  0.766  30  0.44947
>   87  0.717  17  0.48337
>   56  0.679  31  0.50249
>   45  0.677  29  0.50349
>   83  0.659  19  0.51765
>   96  0.644  23  0.52619
>   59  0.537  24  0.59645
>   61  0.490  39  0.62669
>   36  0.454  57  0.65159
>   39  0.443  30  0.66063
>   65  0.424  21  0.67621
>  120  0.390  16  0.70203
>   95  0.325  12  0.75075
>   51  0.288  45  0.77443
>  108  0.270  14  0.79079
>   31  0.234  65  0.81572
>   90  0.169  14  0.86841
>  111  0.124  18  0.90264
>   34  0.078  73  0.93820
>   47  0.065  45  0.94811
>   89  0.061  11  0.95249
>   42  0.002  31  0.99881
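
For reference, each p in the table above is just the two-sided tail
probability of the tabled t at the tabled df. A minimal sketch of the
computation (Python with scipy assumed available; the first row of the
table is used as the example):

    from scipy import stats

    t, df = 3.930, 24                # row for Len = 71
    p = 2 * stats.t.sf(abs(t), df)   # two-sided p-value
    print(round(p, 5))               # 0.00063, matching the table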

>
> III. You wrote:
>
> "In particular, you should not be considering any results from
> regressing c on (u,u^2) if e matters."
>
> I'm sorry to plead ignorance, but nothing you've ever posted before
> has prepared me to understand you here at all. What I mean by this
> is the following.
>
> From the beginning we have been using a regression involving e, a
> regression involving u, and a regression involving (e,u) IN CONCERT,
> NOT as mutually exclusive alternatives.
>
> First we had:
>
> 1a) ln(c/L) on ln(c/e)
> 1b) ln(c/L) on ln(c/u)
> 1c) ln(c/L) on (ln(c/e), ln(c/u))
>
> Then, because of your reservations about these regressions, we
> simplified to
>
> 2a) c on e
> 2b) c on u
> 2c) c on (e,u)
>
> and that actually improved matters.
>
>

> And then finally, because of your very remarkable intuition that the

> ?L/H? dichotomization of u should be replaced by adding u-related

> factors to the regressions themselves, we have arrived at

>

> 3a) c on (e,u,u*e), by addition of a u-factor to c on e

> 3b) c on (u,u^2), by addition of a u-factor to c on u

> 3c) c on (e,u,u*e,u^2), by addition of two u factors to c on (e,u)

>

> So ... if we never intended 1(a-c) as mutually exclusive alternatives,

> nor 2a-c as mutually exclusive alternatives, why all of a sudden do we

> have to treat (3a-3c) as mutually exclusive alternatives? Please

> recall here that the ultimate goal was always to develop predictors

> for logistic regressions, and back when we were doing logistic

> regressions, you said it?s best to throw everything into the soup that

> one can think of ... that?s why we had logistic regression predictors

> based on MORE THAN ONE linear regression.

>

> Also, why is NOT statistically legitimate to postulate that there are

> BOTH:

>

> a) a relationship between c and u that, as you suspected, is best

> expressed by c on (u,u^2) because the relationship changes with

> increasing u

>

> b) a relationship between c and e that, again as you expected, is best

> expressed by c on (e,u,u*e) because again, the relationship changes

> with increasing u.

>

> [...]

Let me focus initially on 2a-c: the regressions of c on e, on u,
and on (e,u). There are two problems. First, c is a count, with no
measurement error, but both e and u contain measurement error. The
usual regression model, which we have been using all along, assumes
the opposite: that the predictors are known exactly, and that only
the d.v. contains measurement error. (I mentioned this in a post on
Oct 25 @ 12:54 pm.) However, I have been (and still am) willing to
ignore this problem because I believe the measurement errors are
probably negligible compared to random sampling error.

The other problem is something that I thought I had mentioned before,
but apparently I never got beyond thinking about it. If you wanted
the results of 2a-c for purely descriptive purposes, or to use as
input for other computations, then I would see nothing wrong with
doing all three. The problem comes when you ask for p-values. Then
you need to specify a probability model, and the models for 2a-c are
mutually exclusive (except in special cases, such as when at least
one of the regression coefficients in 2c is zero).
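
To make the mutual exclusivity concrete: if 2c holds with both
coefficients nonzero, then the "error" left over when c is regressed
on e alone still contains a2*u, so it cannot satisfy 2a's iid-error
assumption. A minimal simulation sketch (Python with numpy assumed;
the coefficients and data are invented purely for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    e = rng.normal(size=n)
    u = 0.6*e + rng.normal(size=n)                # e and u correlated
    c = 1.0 + 2.0*e + 3.0*u + rng.normal(size=n)  # model 2c holds

    b1, b0 = np.polyfit(e, c, 1)                  # fit model 2a: c on e
    resid = c - (b0 + b1*e)
    print(np.corrcoef(resid, u)[0, 1])            # clearly nonzero, so
                                                  # 2a's error model fails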

We have been using the "conditional regression" model: for 2c,
it says that for every (e,u) pair in the domain of interest,
c|(e,u) = a0 + a1*e + a2*u + error, where the errors are independent
identically-distributed zero-mean normal random variables. There are
no distributional assumptions about (e,u); their values are taken to
be given, arbitrary. If this model holds then neither 2a nor 2b can
hold, and so we cannot get p-values for their coefficients.
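
Under this conditional model, the t, df, and p reported for a
coefficient are the usual OLS quantities. A minimal sketch of 2c in
that form (Python with numpy and statsmodels assumed; the data are
simulated only to make the sketch runnable):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 50
    e = rng.normal(size=n)                  # taken as given, arbitrary
    u = rng.normal(size=n)
    c = 1.0 + 2.0*e + 3.0*u + rng.normal(size=n)

    X = sm.add_constant(np.column_stack([e, u]))
    fit = sm.OLS(c, X).fit()
    print(fit.tvalues[2],                   # t for the u coefficient
          int(fit.df_resid),                # df = n - 3 here
          fit.pvalues[2])                   # two-sided p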

One way to legitimize p-values for 2a-c would be to switch to a
completely random model, in which the sample triples (c,e,u) are
assumed to come from a trivariate normal distribution. (The trivariate
normal model is equivalent to augmenting the conditional regression
model with the assumption that the sample pairs (e,u) come from a
bivariate normal distribution.) However, that would rule out 3a-c,
because all the regressions in any multivariate normal distribution
are purely linear; there are no product terms or squared terms.
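
That linearity is easy to check by simulation: draw (e,u,c) from a
trivariate normal and fit 3c; the estimated coefficients of u*e and
u^2 go to zero. A minimal sketch (Python with numpy and statsmodels
assumed; the covariance matrix is arbitrary but positive definite):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(2)
    cov = [[1.0, 0.5, 0.4],
           [0.5, 1.0, 0.3],
           [0.4, 0.3, 1.0]]
    e, u, c = rng.multivariate_normal([0, 0, 0], cov, size=20000).T

    # fit 3c: c on (e,u,u*e,u^2)
    X = sm.add_constant(np.column_stack([e, u, u*e, u**2]))
    fit = sm.OLS(c, X).fit()
    print(fit.params[3:])                   # u*e and u^2 terms ~ 0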

A plot of the ordered p's from point II against their ranks is
sufficiently different (by the IOT test) from plots of ordered random
Uniform[0,1] variables against their ranks to allow the conclusion
that the coefficient of u^2 is generally nonzero when subset=S,
method=N, fold=a1, set=1. Accordingly, I see no defensible way to
attach p-values to coefficients in models that omit u^2 in that cell.
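
The comparison of ordered p's to their ranks can be made reproducible.
A minimal sketch (Python with numpy and scipy assumed; only the first
few tabled p's are shown, and the Kolmogorov-Smirnov test is used here
as a formal stand-in for the IOT comparison, not as the IOT test
itself):

    import numpy as np
    from scipy import stats

    # fill in all 66 tabled p's; only the first three are shown
    p = np.sort(np.array([0.00063, 0.00131, 0.00258]))

    # under H0 (all u^2 coefficients zero) the ordered p's should hug
    # the line p = rank/(n+1); a KS test quantifies the departure
    print(stats.kstest(p, "uniform"))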