"But here the SEs of Aubque are roughly linear in Length, even tho the regressions are for individual lengths, not length intervals. Does that hold for the other Axxxx's?"
I will check the other Axxxx's and also the other folds at N, since I only gave you b1,C and b1,S at N.
To check linearity, I assume I plot the (length, SE) pairs as (x, y) points and see whether the plot looks "straight"? (If not, please let me know the check you want me to do; otherwise, no need to respond to this question.)
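In case a number is more useful than an eyeball judgment of "straightness", here is a minimal sketch of one way the check could be done: fit SE = a + b * length by ordinary least squares and report R^2. This is only an assumption about what the check might look like, not an agreed procedure. The function name `linearity_check` is hypothetical, and the example values are made up purely to show the call; they are not results.

```python
import numpy as np


def linearity_check(lengths, ses):
    """Fit SE = intercept + slope * length by least squares.

    Returns (slope, intercept, r_squared). An r_squared close to 1
    means the (length, SE) points lie close to a straight line.
    """
    x = np.asarray(lengths, dtype=float)
    y = np.asarray(ses, dtype=float)

    # Degree-1 polynomial fit; polyfit returns highest power first.
    slope, intercept = np.polyfit(x, y, 1)

    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    residuals = y - (slope * x + intercept)
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return float(slope), float(intercept), r_squared


# Made-up placeholder values, for illustration only (NOT real SEs).
example_lengths = [100, 200, 300, 400, 500]
example_ses = [0.011, 0.019, 0.031, 0.040, 0.049]

slope, intercept, r2 = linearity_check(example_lengths, example_ses)
print(f"slope={slope:.6f}, intercept={intercept:.6f}, R^2={r2:.4f}")
```

A high R^2 alone does not rule out gentle curvature, so plotting the residuals against length would still be a sensible companion to the number.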
II. You wrote:
"This needs to be understood. Is there something intrinsically different going on at the longer lengths, or is it just length per se (as in more opportunities)?"
Answering this question would first require a great, great deal of what might be called "distributional analysis" of the frequency of "dipeptides of interest" and their "dicodons of interest" relative to length. (Recall here that for method N and set 1, for example, there are 63 "dicodons of interest" encoding 49 "dipeptides of interest".)
And then, once we knew the frequency of "dipeptides of interest" and "dicodons of interest" relative to length, we would have to compute the possibilities of u-variation and e-variation within each n-tuple of dicodons of interest that encodes each dipeptide of interest. Only then would we be able to answer the question of whether it's as simple a matter as "possibilities increase with length".
But I would much prefer not to "go there" now, for two reasons:
a) the requisite "distributional analysis" and subsequent u/e-variation analysis could easily take months if not years;
b) I'd like to see first whether linearity of SE with length is essentially constant across method x set x subset x fold, or whether the 72 (MoSS, Set, Subset, Fold) combinations exhibit different degrees of linearity of SE with length in some systematic way(s).
For example, it would be a highly desirable outcome (though of course "too good to be true") if MoSS = R combinations (and/or Subset = C combinations, or (R,C) combinations) exhibited MORE constancy of SE with length than MoSS = N combinations (and/or Subset = S, or (N,S) combinations). Such an outcome would suggest that the system might over-represent certain dicodons precisely because doing so keeps the "enthalpic profile" of messages invariant with length.
So, I hope you'll permit me to first do the linearity checks for all of the special average slopes (and the special covar) across the 36 (Set, Subset, Fold) combinations at N, before going to "distributional analysis".
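To make the comparison across combinations concrete, here is a rough sketch of how the per-combination linearity scores might be tabulated and ranked. Everything in it is an assumption for illustration: the function names `r_squared` and `rank_by_linearity`, the tuple layout of the keys, the fold labels, and the numbers themselves (which are invented placeholders, not results).

```python
import numpy as np


def r_squared(lengths, ses):
    """R^2 of a straight-line fit of SE on length.

    For a simple (one-predictor) linear regression, R^2 equals the
    squared Pearson correlation between x and y.
    """
    x = np.asarray(lengths, dtype=float)
    y = np.asarray(ses, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    return float(r ** 2)


def rank_by_linearity(se_by_combo):
    """Score each combination and sort from most to least linear.

    se_by_combo maps a (MoSS, Set, Subset, Fold) tuple to a pair
    (lengths, ses). Returns a list of (combo, r_squared) pairs.
    """
    scores = {
        combo: r_squared(lengths, ses)
        for combo, (lengths, ses) in se_by_combo.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented placeholder data; keys and fold labels are hypothetical.
placeholder = {
    ("N", 1, "C", "fold_a"): ([100, 200, 300, 400], [0.010, 0.020, 0.030, 0.040]),
    ("N", 1, "S", "fold_a"): ([100, 200, 300, 400], [0.010, 0.012, 0.025, 0.060]),
}

for combo, score in rank_by_linearity(placeholder):
    print(combo, f"R^2={score:.4f}")
```

If the scores were to cluster by Subset or by MoSS, that would be the kind of systematic difference described above; if they are all about the same, linearity would look like a property of length alone.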