[Note to JRF/AML/MS: for you, the "executive summary" of the following is that: i) we've made a major conceptual step forward; ii) this advance will require a reinvestigation of the data that will extend the schedule by at least several weeks, if not a couple of months.]
[Note to RL: for you, the "executive summary" of the following is that nothing changes operationally for you and your students; the only difference will be in the actual content of the data provided (the actual slopes and intercepts plus aSD's, etc.).]
[Note to MS: if you read the two paragraphs below starting with "The reason why ..." and "Second ...", you will immediately see why I think it's fair to say that I've finally managed to rationalize the choice of c/e, c/u, and c/L, as opposed to e/c, u/c, and L/c.]
If I'd only thought more carefully about a suggestion you made two months ago, I would have immediately seen that from a purely scientific perspective:
1) the simple linear regressions that must be investigated are ln(c/L) on ln(c/e) and ln(c/L) on ln(c/u), NOT ln(c/u) on ln(c/L) nor ln(c/e) on ln(c/L).
2) the multiple linear regression ln(c/L) on ln(c/e) AND ln(c/u) must also be investigated in comparison to ln(c/L) on ln(c/e) and ln(c/L) on ln(c/u).
I would also have seen that from a purely statistical/methodological perspective, these three regressions can be investigated via a strategy that is entirely sample-size independent. This is because:
a) we can use your custom two-stage heteroscedastic t-test to compare the three regressions pairwise (three comparisons in all);
b) in each such triple of t-tests, the 24 input data sets will have exactly the same N's, so sample size cannot influence the results of the three pairwise comparisons in any way.
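A minimal sketch of the pairwise structure in (a-b), under loud assumptions: the custom two-stage heteroscedastic t-test is not reproduced here, so Welch's unequal-variance t-test stands in for it, and the numbers are made-up placeholders for whatever per-data-set quantity the real test consumes. The only points illustrated are that three regressions give exactly three pairwise comparisons and that every input has the same N.

```python
from itertools import combinations
import math

def welch_t(a, b):
    """Welch's unequal-variance t statistic and approximate degrees of freedom.

    Stand-in only: this is NOT the custom two-stage test mentioned in (a).
    """
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)  # sample variance of a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)  # sample variance of b
    se2 = va / na + vb / nb                        # squared standard error
    t = (ma - mb) / math.sqrt(se2)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical placeholder values, one list per regression, all the same length.
results = {
    "ln(c/L) on ln(c/e)": [0.81, 0.79, 0.84, 0.80],
    "ln(c/L) on ln(c/u)": [0.62, 0.66, 0.59, 0.65],
    "ln(c/L) on ln(c/e) AND ln(c/u)": [0.88, 0.86, 0.90, 0.87],
}

# Equal N across all inputs, mirroring point (b).
assert len({len(v) for v in results.values()}) == 1

# Three regressions give three pairwise comparisons, as in point (a).
pairs = list(combinations(results, 2))
comparisons = {(p, q): welch_t(results[p], results[q]) for p, q in pairs}

for (p, q), (t, df) in comparisons.items():
    print(f"{p} vs {q}: t = {t:.3f}, df = {df:.2f}")
```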
The reason why (1-2) comprise the correct scientific strategy consists first in the fact that c/L is simply a measure of how many dipeptides of a certain type occur per length interval. For example, if we're dealing with dicodon set S63, this set encodes 49 dipeptides and c/L therefore measures the number of occurrences of these 49 dipeptides per length interval L.
Second, c/e is a measure of how many dipeptides of a certain type occur per enthalpic level, and c/u is a measure of how many dipeptides of a certain type occur per representation level. And therefore:
c) when we investigate the regression of ln(c/L) on ln(c/e), we're simply asking whether the number of dipeptides per length interval might correlate with enthalpic level per length interval;
d) when we investigate the regression of ln(c/L) on ln(c/u), we're simply asking whether the number of dipeptides per length interval might correlate with representation level per length interval (for reasons to be determined by JRF/AML/MS);
e) when we investigate the regression of ln(c/L) on ln(c/u) AND ln(c/e), we're simply asking whether the number of dipeptides per length interval might correlate with representation AND enthalpic levels per length interval.
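The three regressions in (c-e) can be sketched as ordinary least-squares fits. This is only an illustration: the data are synthetic placeholders (not our results), the helper name fit_ols is hypothetical, and R-squared is used here merely as a convenient summary, not as the comparison criterion we will actually apply.

```python
import numpy as np

def fit_ols(y, *predictors):
    """Least-squares fit of y on the given predictors plus an intercept.

    Returns (coefficients, r_squared); coefficients[0] is the intercept.
    """
    X = np.column_stack([np.ones_like(y), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, 1.0 - ss_res / ss_tot

# Synthetic placeholder data (assumption): one entry per observation.
rng = np.random.default_rng(0)
n = 50
ln_c_e = rng.normal(size=n)
ln_c_u = rng.normal(size=n)
ln_c_L = 0.5 + 0.8 * ln_c_e + 0.3 * ln_c_u + rng.normal(scale=0.1, size=n)

coef_e, r2_e = fit_ols(ln_c_L, ln_c_e)            # (c) ln(c/L) on ln(c/e)
coef_u, r2_u = fit_ols(ln_c_L, ln_c_u)            # (d) ln(c/L) on ln(c/u)
coef_eu, r2_eu = fit_ols(ln_c_L, ln_c_e, ln_c_u)  # (e) ln(c/L) on both

print("(c) intercept, slope:", coef_e)
print("(d) intercept, slope:", coef_u)
print("(e) intercept, slopes:", coef_eu)
```

Because (c) and (d) are each nested in (e), the multiple regression can never fit worse in-sample than either simple regression, which is one reason the comparison among the three needs a proper test rather than a raw goodness-of-fit ranking.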
Of course, we will still ask these questions relative to our various "frames" consisting of different choices of: i) our three different families of dicodons (S63/C711, S119/C1058, S60/C493) plus our true control set (S63R/C673R); ii) our three types of restrictions on u (uL, uH, and uA). But regardless of the frame in which we execute the investigations (c-e), factors (a-b) will guarantee that our results will be sample-size independent.
I imagine it will take me several weeks to carry out the strategy outlined above, but I believe it is absolutely critical to do so. Even with the "backwards" strategy I've been employing, I can see relations between results on fold-discrimination and results from the logistic regressions we obtained earlier this year. So, I am hopeful that execution of the strategy outlined above will automatically define a new set of logistic regression predictors that do even better at: i) fold-discrimination; ii) prediction of structural alignability yields.