Sorry - here's the actual text that should have been in my last (empty) post
This is a post of an offline email I sent to the team yesterday, with cc to you. I'm posting it here for two reasons.
First, so that it can be more easily referenced later if need be.
Second, because the reported results show that there's no point in trying to reach any final conclusions from intra-fold significance analysis (e.g. as in my last post) until significance analysis has been done for all folds - otherwise, any intra-fold conclusions may simply not be sufficiently generalizable.
Off-line email (7/10/2011):
[Note to Ray Koopman: the following will be of interest to you primarily because it explains the "mystery" of why our driver based on "u" was not as strong as our driver based on "e" when we did our logistic regressions on yields from AML's programs]
[Note to Robert Lewis: the following will be of interest to you only because it greatly strengthens the case that AT LEAST one pair of linear regressions informs the overall system]
I have now completed Ray's approved test protocol on folds a1/a3/b1 (1728 out of 3456 regressions), and the data yield the following 3 results in a completely unambiguous manner, with the possible avenues of protest available to statistical referees greatly reduced due to a "significance-reduction" technique called the "Bonferroni correction" which Ray deliberately built into the protocol.
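For readers who have not met the Bonferroni correction: it divides the overall significance level by the number of tests in the family, so each individual test has to clear a much stricter threshold. The following is only a minimal sketch of that idea. The function names, the alpha of 0.05, and the p-values are illustrative assumptions and are not taken from Ray's protocol, which may define the family of tests differently.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag a p-value as significant only if p <= alpha / (number of tests)."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [p <= threshold for p in p_values]


# Illustrative p-values only (assumed), not results from the protocol.
example_p = [0.0001, 0.004, 0.03, 0.2]
flags = bonferroni_significant(example_p)  # per-test threshold is 0.05 / 4 = 0.0125
print(flags)
```

The point of the design is that a result which survives the divided threshold is much harder for a referee to dismiss as a multiple-testing artifact.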
Results ACROSS ALL THREE FOLDS:
1) we get our linear regression involving the ENTHALPIC level of dicodons if and only if we use our original set of 63 dicodons;
2) we get our linear regression involving the REPRESENTATION level of dicodons if and only if we use the augmented set of 119 dicodons containing the original 63 PLUS their stop-free reverse complements (MINUS four duplicates, i.e. 119 = 123 - 4);
3) we get neither regression if we use just the set of 60 stop-free dicodons that are the reverse complements (and therefore, the energetic equivalents) of the 63.
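The set arithmetic behind the three dicodon sets (63 originals, 60 stop-free reverse complements, 4 duplicates, hence 63 + 60 - 4 = 119) can be sketched as follows. This is only an illustration: the five dicodons in the toy list are made-up examples, not members of our actual 63, the helper names are hypothetical, and "stop-free" is assumed here to mean that neither in-frame codon is TAA, TAG or TGA.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def is_stop_free(dicodon):
    """True if neither of the two in-frame codons is a stop codon."""
    return dicodon[:3] not in STOP_CODONS and dicodon[3:] not in STOP_CODONS


def augment(original):
    """Return (augmented set, stop-free reverse complements, duplicates)."""
    original = set(original)
    stop_free_rc = {rc for rc in map(reverse_complement, original) if is_stop_free(rc)}
    duplicates = original & stop_free_rc
    return original | stop_free_rc, stop_free_rc, duplicates


# Toy input (assumed for illustration; NOT the real 63 dicodons).
toy = ["GCTGCA", "TGCAGC", "TTAGCC", "GAAGCT", "CCGATC"]
augmented, stop_free_rc, duplicates = augment(toy)

# Same bookkeeping as 119 = 63 + 60 - 4 in the email.
print(len(toy), len(stop_free_rc), len(duplicates), len(augmented))
```

In the toy list, TTAGCC drops out of the reverse-complement set because its reverse complement GGCTAA contains the stop codon TAA, and GCTGCA/TGCAGC are each other's reverse complements, so they count as duplicates.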
These results raise the question of why we must augment the original 63 with their non-duplicatory stop-free reverse complements in order to obtain our linear regression on REPRESENTATION level, but not our linear regression on ENTHALPIC level.
And of course, one answer to this question is that the data contain inverted repeats in which the 63 original dicodons and their stop-free reverse complements figure heavily.
Fortunately, we can use a portion of our existing test protocol to investigate this possibility - if we see more inverted repeats using the set of 119 dicodons than using the natural alternative group of 1058 dicodons which encode the same 82 dipeptides, then we will know that inverted repeats are contributing to result (2) noted above ...
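To make the idea of "counting inverted repeats for a given dicodon set" concrete, here is one rough way it could be done. This is a sketch under my own assumptions, not the procedure in the test protocol: it scans every position (not just one reading frame), treats an inverted repeat as a dicodon from the set followed within max_gap bases by its exact reverse complement, and uses a made-up sequence. The function name and the max_gap parameter are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]


def count_inverted_repeats(sequence, dicodons, max_gap=20):
    """Count dicodon occurrences followed, within max_gap bases, by their reverse complement."""
    dicodons = set(dicodons)
    k = 6  # a dicodon is six bases long
    count = 0
    for i in range(len(sequence) - k + 1):
        left = sequence[i:i + k]
        if left not in dicodons:
            continue
        target = reverse_complement(left)
        start = i + k  # first position where the right arm can begin
        stop = min(start + max_gap, len(sequence) - k)
        for j in range(start, stop + 1):
            if sequence[j:j + k] == target:
                count += 1
    return count


# Made-up sequence (assumed): GAAGCT, a 4-base spacer, then its reverse complement AGCTTC.
toy_sequence = "GAAGCT" + "AAAA" + "AGCTTC"
n_found = count_inverted_repeats(toy_sequence, {"GAAGCT"}, max_gap=20)
print(n_found)
```

The same function would be run once with the 119 dicodons and once with the 1058-dicodon alternative group; how the two counts should be normalised for the very different set sizes is a question this sketch does not address.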
If this investigation yields the desired result, then this result would certainly be an item to include in Paper I, along with the central notion of "bounds on mutation" implied by our regressions. (And any results we may get relating our 63 dicodons to hexanucleotides in eukaryotic promoters, as discussed in a prior email.)