Your questions go to the "moment of truth" that Jacques and Arthur may shortly be facing
Posted: Apr 29, 2012 2:38 PM
You wrote:
"Part of the applied statistics lore is that such simple linear combinations often -- some would say almost always -- work better than they "ought" to, but that's after an intentional decision to weight. The problem with accidental weightings that work is that they can just as accidentally quit working.
I was hoping for an a priori theoretical justification. Forget the logging, and ask yourself why approximating c/e well should be more important than approximating c/u well, where both approximations are of the form a*(c/L)^b, each approximation has its own (empirically determined) a & b, and "importance" means utility for predicting the probability of a match. Under what conditions should the relative importance be approximately constant? Might it ever reverse? Etc. "
In response, let me say first that because I chose to answer your question about "weighting" narrowly instead of broadly, I left you with the impression that there is no a priori theoretical justification for favoring "dicodon enthalpic level" (c/e) over "dicodon representation level" (c/u).
In fact, the primary problem which may shortly confront Jacques and Arthur is just the reverse: choosing among several different hypotheses as to why "dicodon enthalpic level" should be more important than "dicodon representation level". In this regard, I have been championing the "evolutionary hypothesis" which I've mentioned earlier: the importance of "dicodon enthalpic level" is a left-over from constraints on the earliest systems capable of making proteins from genes. On the other hand, our colleague Marvin Stodolsky has been championing a "mechanistic hypothesis", i.e. a hypothesis which asserts that "dicodon enthalpic level" is relevant to various processes which can and do affect the manufacture of proteins from genes even today. (Note that these two hypotheses are not mutually exclusive: the question is which Jacques and Arthur should choose to include or omit, and which to emphasize.)
Further, the secondary problem which may shortly confront Jacques and Arthur is deciding how to explain why "dicodon representation level" (c/u) should matter at all. (Recall that in your first proposed model {lnL,x1,lnLx1}, which worked fairly well in certain cases, we disregarded c/u entirely, since we disregarded x2 as a predictor.)
Regarding this question, I have an idea which is amenable to empirical investigation: it has to do with a "supply chain" constraint involving the numbers of auxiliary molecules (tRNAs) of various types that are available to an organism during the process by which a protein is made from a gene. If necessary, I am prepared to analyze the relevant available databases to see how this idea does or doesn't play out. And, of course, Jacques and Arthur and Marvin may have their own ideas as to why "dicodon representation level" should matter at all, as opposed to "dicodon enthalpic level".
Finally, let me explain why I have chosen the locution "Jacques and Arthur MAY have to confront ..."
We have three folds left to go (b47, c1, and c2) to see if the new {lnL,mv,lnLmv} model holds up with respect to the nice differentiation which it makes between study group and control group data for each fold.
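For anyone following along who wants to see what the predictor set {lnL,mv,lnLmv} amounts to in practice, here is a minimal sketch. It rests on assumptions that are mine and not stated above: that lnL is the natural log of a positive quantity L, that lnLmv is the simple product of lnL and mv, and that an intercept column is wanted. The function names are hypothetical, and no particular fitting procedure is implied.

```python
import math


def design_row(L, mv):
    """Build one row [1, lnL, mv, lnL*mv] of the predictor matrix.

    Assumptions (not from the post): L is positive, lnL is its natural
    log, and the third predictor lnLmv is the product lnL * mv.
    """
    if L <= 0:
        raise ValueError("L must be positive to take its logarithm")
    ln_l = math.log(L)
    return [1.0, ln_l, mv, ln_l * mv]


def design_matrix(L_values, mv_values):
    """Stack one row per observation; inputs must be the same length."""
    if len(L_values) != len(mv_values):
        raise ValueError("L_values and mv_values must have equal length")
    return [design_row(L, mv) for L, mv in zip(L_values, mv_values)]
```

The same matrix would be built separately for the study group and control group data of each fold, so that whatever fit is run sees identical predictors in both cases.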
If the apparent success of this new model is an "accident" of the sort you indicate it might be, then it will fail on one of the three remaining folds and there will be no problem at all for Jacques and Arthur to confront.
But if this new model does differentiate study group from control group data on the remaining three folds, then there is reason to expect that it will "always" work, inasmuch as our six folds comprise two different examples of each of the three basic fold types: helical (a1, a3), sheet (b1, b47), and helix/sheet (c1, c2). So, if the new model fails to work on future data after succeeding for our present six folds, then we will have reason to suspect a readily explainable "exception to the rule."
In any event, everything depends right now on what happens with our remaining three folds ... I will be posting the b47 fold results sometime later tonight after I finish the control group runs.
Thanks so much again, Ray.