
Re: Correct way to normalize an rmsd-based distance metric used in repeated trials of pairs
Posted:
Apr 24, 2012 12:24 AM


On Apr 20, 8:59 pm, gimpe...@hotmail.com wrote:
> Very interesting ... Arthur has suspected all along that submitting
> "short" subsequences to his alignment program would not be wise,
> because at least in the case of alpha-helical proteins, his program
> will always find something to align. And when I ran the length
> interval 23-32 for fold a1 (which includes helical proteins only)
> on the refined "00" data ... sure enough ... the yield ratio was 1,
> i.e. output = input. So there is no sense in running the remaining
> 01/10/11 data for length interval 33-42 through Arthur's program, nor
> for running any of the 00/01/10/11 a1 data for the last remaining
> length interval 13-22.
>
> Therefore, I have added length as a third predictor and done a summary
> run on 32 points (the 00/01/10/11 results for each of the eight length
> intervals from 33-42 thru 103-112). The results appear below, and it
> seems like the critical predictor (which is now predictor 2) holds up
> very well.
>
> But ... that is of course my naive and untutored judgement ... please
> take a look at your earliest convenience at the data below and tell me
> what you think. Should I move on to the next of the five folds using
> the same methodology? Or can you tell that this attempt using the
> "00/01/10/11" game plan has also failed?
Let me call the predictors L, x1, x2 instead of first, second, third.
With all three predictors in the model, the weight on x2 is so small and so far from significance that x2 should be dropped. As is usual in such cases, dropping it changes things only negligibly.
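The "so far from significance" judgment can be checked directly from the quoted output, which reports a coefficient of 0.0074 with standard error 0.0247 for x2. A minimal sketch of the Wald test this implies (the function name is mine; only stdlib is used, and the sign of the coefficient does not affect the two-sided p-value):

```python
import math

def wald_test(coef, se):
    """Wald z statistic and two-sided p-value for H0: coefficient = 0."""
    z = coef / se
    # two-sided tail probability of the standard normal, via erfc
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# x2's coefficient and standard error, as reported in the quoted output
z, p = wald_test(0.0074, 0.0247)
```

This reproduces the reported p of about 0.76 for x2, which is why dropping it changes essentially nothing.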
Using log L instead of L improves the fit (increases the maximized likelihood) and makes x2 even more nonsignificant.
Dropping x2 and using L, x1, and L*x1 gives almost as good a fit, and changing L to log L improves the fit even more.
Here is a table giving {chi-square, df} for various models against the corresponding saturated model. (The saturated model with x2 is not the same as the saturated model without x2.)
        L,x1,x2        L,x1           L,x1,L*x1
raw L   {1136.41, 28}  {533.755, 13}  {455.82, 12}
log L   {1048.03, 28}  {445.424, 13}  {348.74, 12}
Chi-square/df is a measure of the misfit of the model.
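For grouped binary data like this, the chi-square against the saturated model is the deviance G^2, computed from observed versus fitted counts in each covariate pattern. A small sketch, with my own function names and toy counts (not the posted data):

```python
import math

def xlogy(x, y):
    # x * log(y), taken as 0 when x == 0 (the usual deviance convention)
    return 0.0 if x == 0 else x * math.log(y)

def deviance(groups, p_hat):
    """G^2 = 2 * sum of observed-vs-fitted log-likelihood contributions.

    groups: list of (y, n) pairs -- successes and trials per covariate pattern
    p_hat:  fitted success probability for each pattern
    """
    g2 = 0.0
    for (y, n), p in zip(groups, p_hat):
        g2 += xlogy(y, y / (n * p)) + xlogy(n - y, (n - y) / (n * (1.0 - p)))
    return 2.0 * g2

# Toy example: a pooled constant probability fit to three patterns
groups = [(30, 100), (50, 100), (70, 100)]
p_pool = sum(y for y, _ in groups) / sum(n for _, n in groups)  # = 0.5
g2 = deviance(groups, [p_pool] * 3)
```

When the fitted probabilities equal the observed proportions (the saturated model), G^2 is exactly zero; large G^2 relative to the residual df signals misfit.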
Here are the results for the best-fitting model. (ASE is the standard acronym for Asymptotic Standard Error. In this case it refers to the estimated ASE of the estimated coefficient.)
Predictor     Coefficient   ASE
Log L         4.3080        0.3609
x1            0.8996        0.0915
(Log L)*x1    6.5615        0.2587
intercept     6.5615        0.2587
Including the product term effectively does two separate regressions: one for x1 = 1, one for x1 = 0. The two regression curves are predicted to cross at about L = 120. Does that seem reasonable?
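The crossing point follows directly from the model form: the two curves coincide where the x1 main effect and the interaction cancel, i.e. where b_x1 + b_inter * log(L) = 0, so L* = exp(-b_x1 / b_inter). A sketch of that calculation; the x1 coefficient is taken from the fitted model above, but the interaction value here is HYPOTHETICAL, back-solved purely to illustrate a crossing near L = 120:

```python
import math

def crossing_length(b_x1, b_inter):
    """L at which the x1 = 1 and x1 = 0 logit curves meet.

    Model: logit(p) = b0 + bL*log(L) + b_x1*x1 + b_inter*log(L)*x1,
    so the curves coincide where b_x1 + b_inter*log(L) = 0.
    """
    return math.exp(-b_x1 / b_inter)

b_x1 = 0.8996                          # x1 coefficient from the fitted model
b_inter = -b_x1 / math.log(120.0)      # HYPOTHETICAL: chosen so L* = 120
L_cross = crossing_length(b_x1, b_inter)
```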
Conclusions:
1. Both L and x1 are necessary.
2. x2 is useless.
3. Log L is better than L.
4. The effect of x1 varies as a function of L.
> 40,0,0,1416,1485
> 40,0,1,1919,1017
> 40,1,0,1053,2575
> 40,1,1,787,2072
> 50,0,0,416,1068
> 50,0,1,701,1040
> 50,1,0,642,1250
> 50,1,1,484,1231
> 60,0,0,304,645
> 60,0,1,366,714
> 60,1,0,343,1087
> 60,1,1,107,714
> 70,0,0,252,534
> 70,0,1,278,758
> 70,1,0,182,765
> 70,1,1,128,954
> 80,0,0,160,430
> 80,0,1,103,197
> 80,1,0,171,310
> 80,1,1,108,520
> 90,0,0,27,24
> 90,0,1,46,152
> 90,1,0,30,304
> 90,1,1,26,137
> 100,0,0,67,98
> 100,0,1,50,189
> 100,1,0,93,196
> 100,1,1,1,230
> 110,0,0,32,39
> 110,0,1,20,157
> 110,1,0,81,238
> 110,1,1,0,125
>
> Descriptives...
>
> 10393 cases have Y=0; 21255 cases have Y=1.
>
> Variable   Avg       SD
> 1          55.5015   17.8084
> 2          0.5354    0.4987
> 3          0.4844    0.4998
>
> Iteration History...
> -2 Log Likelihood = 40068.5930 (Null Model)
> -2 Log Likelihood = 38305.1477
> -2 Log Likelihood = 38278.7631
> -2 Log Likelihood = 38278.7296
> -2 Log Likelihood = 38278.7296 (Converged)
>
> Overall Model Fit...
> Chi Square = 1789.8634; df = 3; p = 0.0000
>
> Coefficients and Standard Errors...
> Variable    Coeff.    StdErr   p
> 1           0.0205    0.0008   0.0000
> 2           0.7643    0.0247   0.0000
> 3           -0.0074   0.0247   0.7649
> Intercept   -0.7802
>
> Odds Ratios and 95% Confidence Intervals...
> Variable   O.R.     Low      High
> 1          1.0208   1.0192   1.0223
> 2          2.1476   2.0460   2.2542
> 3          0.9926   0.9457   1.0419
>
> X1         X2       X3       n0     n1     Calc Prob
> 40.0000    0.0000   0.0000   1416   1485   0.5105
> 40.0000    0.0000   1.0000   1919   1017   0.5086
> 40.0000    1.0000   0.0000   1053   2575   0.6913
> 40.0000    1.0000   1.0000   787    2072   0.6897
> 50.0000    0.0000   0.0000   416    1068   0.5615
> 50.0000    0.0000   1.0000   701    1040   0.5597
> 50.0000    1.0000   0.0000   642    1250   0.7333
> 50.0000    1.0000   1.0000   484    1231   0.7319
> 60.0000    0.0000   0.0000   304    645    0.6113
> 60.0000    0.0000   1.0000   366    714    0.6096
> 60.0000    1.0000   0.0000   343    1087   0.7716
> 60.0000    1.0000   1.0000   107    714    0.7703
> 70.0000    0.0000   0.0000   252    534    0.6589
> 70.0000    0.0000   1.0000   278    758    0.6572
> 70.0000    1.0000   0.0000   182    765    0.8058
> 70.0000    1.0000   1.0000   128    954    0.8046
> 80.0000    0.0000   0.0000   160    430    0.7035
> 80.0000    0.0000   1.0000   103    197    0.7019
> 80.0000    1.0000   0.0000   171    310    0.8359
> 80.0000    1.0000   1.0000   108    520    0.8349
> 90.0000    0.0000   0.0000   27     24     0.7445
> 90.0000    0.0000   1.0000   46     152    0.7431
> 90.0000    1.0000   0.0000   30     304    0.8622
> 90.0000    1.0000   1.0000   26     137    0.8613
> 100.0000   0.0000   0.0000   67     98     0.7816
> 100.0000   0.0000   1.0000   50     189    0.7803
> 100.0000   1.0000   0.0000   93     196    0.8848
> 100.0000   1.0000   1.0000   1      230    0.8841
> 110.0000   0.0000   0.0000   32     39     0.8146
> 110.0000   0.0000   1.0000   20     157    0.8135
> 110.0000   1.0000   0.0000   81     238    0.9042
> 110.0000   1.0000   1.0000   0      125    0.9035
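As a sanity check on the quoted output, the null-model -2 log-likelihood can be recomputed from the descriptives alone (10393 cases with Y=0, 21255 with Y=1), since the null model fits a single pooled probability:

```python
import math

# Totals from the quoted descriptives
n0, n1 = 10393, 21255
p = n1 / (n0 + n1)  # null-model fitted probability of Y=1
neg2ll = -2.0 * (n1 * math.log(p) + n0 * math.log(1.0 - p))
```

This matches the "40068.5930 (Null Model)" line in the quoted iteration history, and the overall model chi-square is simply this value minus the converged -2 log-likelihood (40068.593 - 38278.730 = 1789.863).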

