Thanks very much for looking at the files and, in particular, at u on c*L.
1. Some information re c and L
"There are 101 different values of L, 24,...,124. The number of sequences with any given length drops irregularly but roughly linearly with the log of the length."
It may be important for you to know that, in order to avoid potential "Bayesian" objections, I do not allow overlapping message segments when I collect message segments of a given length for a given protein message. So, for example, if a protein chain has 120 peptides (amino acids) and therefore an associated message of 120 codons, the maximum numbers of message segments I can extract at various lengths are:
length 60: just 2 segments
length 40: just 3 segments
length 30: just 4 segments
length 24: just 5 segments
(But note that: i) if a chain segment of a given length does NOT contain any dipeptides of interest, i.e. if "c" = 0 for the segment, then I ignore the segment; ii) collected segments do not have to be consecutive: there can be gaps between them, e.g. a segment of length 24 from position 3 to 26 followed by a segment of length 24 from position 32 to 55.)
This "non-overlapping" method of segment collection may of course be contributing to the roughly linear relationship between N and the log of L.
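The segment counts in the 120-codon example follow from integer division of the chain length by the segment length. A minimal sketch of that arithmetic, assuming nothing beyond the non-overlap rule (the function name is hypothetical, and the count is only an upper bound, since segments with c = 0 are ignored and gaps are allowed):

```python
def max_nonoverlapping_segments(chain_length: int, segment_length: int) -> int:
    """Upper bound on the non-overlapping segments of one length in one chain.

    Hypothetical helper for illustration only. The real count can be lower,
    because segments with c = 0 are dropped and gaps between segments are allowed.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return chain_length // segment_length


# The 120-codon example from the text.
for length in (60, 40, 30, 24):
    print(f"length {length}: at most {max_nonoverlapping_segments(120, length)} segments")
```

Because the bound is chain_length // segment_length, the number of segments available per chain falls as L grows, which is one way the collection method could shape how N varies with L.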
2. Distribution of c relative to L.
"There are 856 different values of c, from 1.785714286 to 34.66666667. Since only 31 of those are integers, I guess you're using the averaged c. (The frequency distribution of c resembles Figure 2 in Clemons, T., & Pagano, M. (1999). Are babies normal? American Statistician, 53(4), 298-302, in that it is clearly a mixture of two different systems.) c increases roughly linearly with L, but the scatterplot bifurcates above about L = 90. Two 'families'?"
Perhaps something like "two families", Ray, in the following sense.
I have sent you offline two PDFs for two "a1" (hemoglobin) protein chains that have segments in the a1 data set you've just analyzed. The "RasWin" PDF shows the 3D shapes: both chains are "helix-bundles", each containing 3 "helix-turn-helix" components (the blue/light-blue, green/green, and orange/red components). The "Stride" PDF shows how these helix-turn-helix components are positioned relative to the linear chains of the two proteins.
So, you can see from these examples that when we sample the hemoglobins at shorter lengths, we may well be sampling "like" structural units, whereas when we sample at longer lengths, we may well be sampling composites of these units.
And I suspect this kind of thing may well be responsible for the bifurcation you've noticed, which would be an extremely important and interesting idea to tease out later on.
3. Why I looked at u on c*L
"Finally, a comment and question. It is unusual to regress a d.v. on just the product of two variables. The usual case has the product as the third predictor, with the first two being the two variables involved in the product. The usual reason is that the slope and/or intercept in the regression of the d.v. on one of the variables are suspected of changing approximately linearly with the other variable. Is that what you suspect, or does c*L have meaning all by itself?"
I wish I could say that my reason for looking at c*L was as informed as you thought it might have been, but no.
What happened is that I did the multiple linear regressions that you requested
u on c AND L
e on c AND L
and found nothing with coefficients above .10.
So I tried
u on c/L (different from our standard c/u on c/L)
e on c/L (different from our standard c/e on c/L)
and again found nothing.
So, I said to myself, well, if u or e on c/L gives nothing, what about c*L? If I had any coherent thought in mind here, it was to heighten differences between shorter and longer segments (inasmuch as c generally increases with L.)
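For reference, the "usual case" Ray describes (the product as a third predictor alongside c and L) can be compared with a product-only regression in a few lines. This is only a sketch under stated assumptions: the data are a synthetic, noise-free grid generated from made-up coefficients, not values from the a1 data set, and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def rss(rows, y, coeffs):
    """Residual sum of squares of a fitted model."""
    return sum((yi - sum(b * x for b, x in zip(coeffs, r))) ** 2
               for r, yi in zip(rows, y))


# Synthetic grid of (c, L) pairs; NOT real data.
pairs = [(c, l) for c in (2.0, 10.0, 20.0) for l in (24.0, 74.0, 124.0)]
# Made-up coefficients: u = 1 + 2c - 0.5L + 0.25*c*L (no noise).
u = [1.0 + 2.0 * c - 0.5 * l + 0.25 * c * l for c, l in pairs]

# Usual interaction model: intercept, c, L, and the product c*L.
full_rows = [[1.0, c, l, c * l] for c, l in pairs]
b0, b_c, b_l, b_cl = fit_ols(full_rows, u)
rss_full = rss(full_rows, u, [b0, b_c, b_l, b_cl])

# Product-only model: intercept and c*L.
product_rows = [[1.0, c * l] for c, l in pairs]
rss_product = rss(product_rows, u, fit_ols(product_rows, u))

print("full model coefficients:", b0, b_c, b_l, b_cl)
print("RSS full:", rss_full, " RSS product-only:", rss_product)
```

On data like this, where c and L each have their own effect, the product-only fit leaves a large residual while the full model does not; when neither main effect matters, the two fits would be close.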
4. My next post.
I will shortly be posting a "summary" post in which I collect results from several previous posts regarding your findings for
4.1 ln(c/e) on ln(c/L) for S63 vs S63R (significant)
4.2 ln(c/u) on ln(c/L) for S63 vs S63R (significant overall but not relative to e)
4.3 ln(c/e) on ln(c/L) for S63 vs S63R using simplified c vs averaged c (very very similar)
In addition, I will post the parallel to 4.3 for u, i.e.:
4.4 ln(c/u) on ln(c/L) for S63 vs S63R using simplified c vs averaged c
so that you can determine if the result is the same as in 4.3.
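For anyone wanting to reproduce the form of the regressions in 4.1-4.4, here is a minimal sketch of a log-log fit of ln(c/e) on ln(c/L); passing u in place of e gives the ln(c/u) version. The function name is hypothetical and the numbers are synthetic, constructed so that the relationship holds exactly; they are not results from any of the data sets.

```python
import math


def log_ratio_fit(c_values, denom_values, l_values):
    """Regress ln(c/denom) on ln(c/L); return (slope, intercept, r).

    denom_values would be e for 4.1/4.3 or u for 4.2/4.4 (hypothetical helper).
    """
    xs = [math.log(c / l) for c, l in zip(c_values, l_values)]
    ys = [math.log(c / d) for c, d in zip(c_values, denom_values)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic inputs (NOT real data): e is built so that
# ln(c/e) = 0.3 + 0.8 * ln(c/L) holds exactly.
c_values = [2.0, 4.0, 6.0, 9.0, 15.0, 30.0]
l_values = [24.0, 30.0, 48.0, 60.0, 90.0, 124.0]
e_values = [c / math.exp(0.3 + 0.8 * math.log(c / l))
            for c, l in zip(c_values, l_values)]

slope, intercept, r = log_ratio_fit(c_values, e_values, l_values)
print(slope, intercept, r)
```

Running the same function separately on the S63 and S63R samples, or with simplified versus averaged c, would give the pairs of slopes and intercepts to compare.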
Finally, I will come back to the question of "reliability" of results involving ln(c/u) on ln(c/L) "relative to e", in particular, to ask you what would be "reliable" differences.
The reason I want to collect these findings into one post is to have a single reference post on which you can "sign off" (or ask for more info) before I "move out" and run 4.1-4.4 on the remaining 71 cases (12 length intervals * 6 folds). This is because 4.1-4.4 will provide the critical evidence for the "Paper I" that I now think can be written, i.e. a paper written around the following abstract (also in an email on which you were recently cc'd):
Abstract
The literature is now replete with research reporting that choice of specific individual dicodons in protein messages (mRNAs) has practical consequences ranging from improvement of translational efficiency to impact on protein folding via the phenomenon of translational pause. It has not, however, been previously reported that if dicodon choice is defined as selection of dicodons from a small group of the possible 3721 stop-free dicodons, then this "group-wise" definition of "dicodon choice" appears to make it possible to state certain general bounds on the degree to which mutation can alter certain fundamental properties of protein messages.
Using a set of only 63 stop-free dicodons which we have previously shown to be significantly over-represented in protein messages, we adduce two statistically significant correlations involving energetic properties of protein messages, and show that mutation has not operated in such a way as to render these correlations unobtainable in many and varied samples of protein messages.
Furthermore, numerical differences among the instances of these correlations which we have obtained for six different SCOP folds (helical globin (a1), helical cytochrome (a3), beta IG (b1), beta trypsin (b47), TIM-barrel (c1), and NADP-binding (c2)) indicate that the properties of protein messages expressed by our correlations may, in fact, correlate with aspects of protein structure that characterize different classes and subclasses of protein folds.