Date: Dec 12, 2012 7:57 PM
Author: Ray Koopman
Subject: Re: OK then, how ‘bout “hetness”? Are you amenable to its further investigation?

On Dec 11, 7:12 pm, djh wrote:
> To date, the only results we have that are both "good" and fully
> cross-fold are the "het"-related results:
>               a1    a3    b1    b47   c1    c2
>               C S   C S   c S   c S   C S   C S
> Het
>   1N Aubqe    H H   L L   H H   H H   L L   L L    0
>   3N Aubqe    L L   H H   L L   H H   H H   L L    0
> LH-Het
>   3N Aubqu    L H   H H   H H   L H   L H   L H    4
> HL-Het
>   1N CVubq    H L   H L   H L   H L   H L   H L    6
>   2N CVubq    H L   H L   H L   H L   H L   H L    6
> where Aubqe is the average slope of e in regression Rubq = c on
> (e,u,u*e,u^2)
> Aubqu is the average slope of u in regression Rubq = c on
> (e,u,u*e,u^2)
> CVubq is the covar of e and u in regression Rubq = c on
> (e,u,u*e,u^2)
> These results are "good" not only because:
> a) your MonteCarlo-ing indicated p's for Het=0, LH-Het=4, HL-Het=6
> of .022, .049, and .001 respectively;
> but also because:
> b) no set x MoSS combination involving MoSS=R exhibits a value for
> Het, LH-Het, or HL-Het with an associated probability of < .05.
> And therefore, the three flavors of "hetness" can certainly be said
> to successfully distinguish our non-random dicodon subsets from our
> random dicodon subsets (an outcome we have not been able to achieve
> ACROSS ALL FOLDS via computation of 2-ways or Q-associated p's etc.)
> On the other hand, you've expressed two kinds of reservations about
> "hetness":
> c) it involves a dichotomization of slopes obtained when Aubqe or
> Aubqu or CVubq is regressed on length;
> d) you yourself have no intuition at all about what CVubq might
> actually "mean", and only a vague intuition about what Aubqe or Aubqu
> might actually "mean".
> So, given these reservations, are you amenable to further
> investigation of "hetness", or is that somewhere you don't
> particularly want to go?

1. The variables on which Het is defined -- Aubqe, Aubqu, CVubq --
are all sensitive to the distribution of the predictors. A different
distribution that gave the same regression equation could give quite
different values of Aubqe, Aubqu, and especially CVubq.
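
A small sketch of point 1, under one reading of "average slope" that
is an assumption here: the partial derivative of the fitted equation
with respect to e, averaged over the observed predictor values. The
coefficients and the u samples are made up purely for illustration.
The regression equation is held fixed and only the distribution of u
changes.

```python
from statistics import mean

# Hypothetical coefficients from R = c + b_e*e + b_u*u + b_ue*u*e + b_uu*u^2.
# Only b_e and b_ue enter the slope with respect to e.
b_e, b_ue = 0.5, -0.2


def average_slope_of_e(u_values):
    """Assumed definition: mean over the sample of dR/de = b_e + b_ue*u."""
    return mean(b_e + b_ue * u for u in u_values)


u_low = [0.0, 1.0, 2.0]   # hypothetical sample of u, mean 1
u_high = [3.0, 4.0, 5.0]  # same equation, different distribution, mean 4

slope_low = average_slope_of_e(u_low)    # b_e + b_ue*1
slope_high = average_slope_of_e(u_high)  # b_e + b_ue*4
print(slope_low, slope_high)
```

With these made-up numbers the average slope even changes sign, so
any H/L dichotomy built on it could change as well.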

2. You defined Het for a Subset x Fold table with fixed values of Set
and MoSS, but what happened to Length? Are the values whose median
you get averages over all Lengths? What if different cells in the
table have different distributions of Length?

3. You can't look at just p_4; you need p_4 + p_5 + p_6 = .056056,
and you should probably double that, to make it two-tailed.
Moreover, some sort of Bonferroni correction is needed.
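
The arithmetic in point 3 can be sketched as follows. The upper-tail
value is the one given above; the number of tests used for the
Bonferroni step is only an assumption (one test per Het flavor), and
the real count may well be larger.

```python
def adjusted_p(upper_tail, n_tests):
    """Double a one-tailed p, then apply a Bonferroni multiplier.

    Both results are capped at 1.0.
    """
    two_tailed = min(1.0, 2.0 * upper_tail)
    bonferroni = min(1.0, two_tailed * n_tests)
    return two_tailed, bonferroni


upper_tail = 0.056056  # p_4 + p_5 + p_6, as stated above
n_tests = 3            # ASSUMPTION: one test per Het flavor

two_tailed, bonferroni = adjusted_p(upper_tail, n_tests)
print(two_tailed, bonferroni)
```

Under that assumption neither adjusted value falls below .05.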

4. Two continuous variables that might work better than Het are the
SD and RMS of the differences between the values for each fold; i.e.,
get 6 differences, then take either their SD or RMS. However, they
would still have the problems noted in points 1 and 2.
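
A minimal sketch of the two statistics in point 4, assuming the
"difference" in each fold is the C value minus the S value. The six
pairs of values are hypothetical, chosen so that the two statistics
disagree.

```python
from math import sqrt
from statistics import stdev

# Hypothetical values for the two subsets in each of the 6 folds.
c_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
s_values = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]

# One difference per fold (assumed to be C minus S).
diffs = [c - s for c, s in zip(c_values, s_values)]

# Sample SD (n - 1 denominator): spread of the differences about their mean.
sd = stdev(diffs)

# RMS: typical size of the differences measured from zero.
rms = sqrt(sum(d * d for d in diffs) / len(diffs))

print(sd, rms)
```

Here every fold differs by the same amount, so the SD is zero while
the RMS is not: the SD ignores a constant shift between the subsets
and the RMS does not.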

> Thanks as always for considering this question, and please forgive
> the apparent "numerology". (I should have introduced the matter in
> the context of local properties of surfaces in the neighborhoods of
> different points ... standard differential geometry.)