Re: Correct way to normalize an rmsd-based distance metric used in repeated trials of pairs
Posted: Apr 9, 2012 5:51 PM
Thanks for considering the question.
First, please be assured that I didn't mean to imply that an overall logistic regression model relating yield to i,j (plus "helper" predictors) is no longer the ultimate goal.
What I was asking is what, if anything, can be safely said while in pursuit of that ultimate goal. If something can be safely said in the absence of an overall model, then we have our "Paper II" conceptually in hand, and submission/acceptance of this paper might well lead to funding of the type required to pursue the overall model with sufficient resources and skill sets at our disposal.
Second, I don't want to get too hopeful here too soon, so permit me to present the matter in sufficient detail for you to decide whether some elaboration of a basic Pearson chi-square approach will actually tell us whether we can establish the "base-camp" that we'd like to establish.
For length 40, the two new count variables p,q can each range from 2 to 15 (these are not "quantized" ... they're actual counts of elements.) And as before, the original variables i,j can each range from 0 to 9.
What we would like to be the case is that for some "acceptable" majority of the (14**2 + 14)/2 = 105 possible non-redundant choices of (p,q), the yields from the 55 choices of (i,j) are higher to an "acceptably significant" extent when j-i <= k than when j-i > k (where we would hope k to be as small as 1). From an evolutionary perspective, it doesn't matter how high (i,j) are in the absolute sense, i.e. how close they both are to 9; what matters is that the yield from Arthur's program is greater when i,j are closer (again, when L(ength), p, and q are held constant).
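As a quick check on the counts above, here is a minimal sketch that enumerates the non-redundant (p,q) and (i,j) pairs and splits the (i,j) pairs by the gap j-i. The names `pq_pairs`, `ij_pairs`, and `split_by_gap` are made up for illustration, and the sketch assumes "non-redundant" means unordered pairs with p <= q and i <= j.

```python
from itertools import combinations_with_replacement

P_VALUES = range(2, 16)   # p and q each range from 2 to 15 (14 values)
I_VALUES = range(0, 10)   # i and j each range from 0 to 9 (10 values)

# Unordered pairs with repetition allowed: (p, q) with p <= q, (i, j) with i <= j.
pq_pairs = list(combinations_with_replacement(P_VALUES, 2))
ij_pairs = list(combinations_with_replacement(I_VALUES, 2))

def split_by_gap(pairs, k):
    """Split (i, j) pairs into those with j - i <= k and those with j - i > k."""
    near = [(i, j) for i, j in pairs if j - i <= k]
    far = [(i, j) for i, j in pairs if j - i > k]
    return near, far

near, far = split_by_gap(ij_pairs, k=1)
print(len(pq_pairs), len(ij_pairs), len(near), len(far))  # 105 55 19 36
```

With k = 1, the "close" group holds only 19 of the 55 (i,j) choices, so the two groups being compared within each (p,q) cell are quite unequal in size.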
Also, the thrust of our central hypothesis would allow it to be the case that the data are "bi-modal", i.e. that:
a) for some "acceptable" number of the (14**2 + 14)/2 = 105 possible non-redundant choices of (p,q), the yields from the 55 choices of (i,j) are HIGHER to an "acceptably significant" extent when j-i <= k than when j-i > k;
b) for some equally "acceptable" number of the (14**2 + 14)/2 = 105 possible non-redundant choices of (p,q), the yields from the 55 choices of (i,j) are LOWER to an "acceptably significant" extent when j-i <= k than when j-i > k.
According to our hypothesis, case (a) results when the input pairs to Arthur's program reflect a highly-conserved condition dating back to the earliest stages of protein evolution ORDERLY (non-random) fashion.
So - bottom-line: would there be some appropriate set of Pearson chi-squares that could be performed to see if (a) is true and, if so, whether (b) is also true?
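To make the question concrete, here is one possible shape such a set of tests could take, offered as a sketch under stated assumptions rather than as the recommended analysis. It assumes that "yield" can be expressed as a count of successes out of a known number of trials, and that for each (p,q) cell the counts are pooled into a "close" group (j-i <= k) and a "far" group (j-i > k). Each cell then gives a 2x2 table, a Pearson chi-square with 1 degree of freedom, and a direction. The function names, the alpha level, and the example counts are all hypothetical.

```python
from collections import Counter
from math import erfc, sqrt

def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("a marginal total is zero; the test is undefined")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper-tail probability of a chi-square variable with 1 df.
    p_value = erfc(sqrt(stat / 2.0))
    return stat, p_value

def classify(near_hits, near_trials, far_hits, far_trials, alpha=0.05):
    """Label one (p,q) cell: 'a' if the close group yields significantly more,
    'b' if significantly less, 'neither' otherwise."""
    _, p = pearson_chi2_2x2(near_hits, near_trials - near_hits,
                            far_hits, far_trials - far_hits)
    if p >= alpha:
        return "neither"
    near_rate = near_hits / near_trials
    far_rate = far_hits / far_trials
    return "a" if near_rate > far_rate else "b"

# Made-up counts for three (p,q) cells, for illustration only:
# (near_hits, near_trials, far_hits, far_trials)
example = {
    (2, 2): (60, 100, 40, 100),
    (2, 3): (40, 100, 60, 100),
    (3, 3): (50, 100, 50, 100),
}
tally = Counter(classify(*counts) for counts in example.values())
print(tally)
```

Tallying the labels over all 105 (p,q) cells would show how many fall under (a), under (b), or under neither. Note that running 105 separate tests at a fixed alpha will flag some cells by chance alone, so some multiple-comparison adjustment would presumably be needed before calling either count "acceptable".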