Topic:
The same four proportional weighting factors work for each 00/01/10/11 when 0.25 is subtracted from each !!!
Replies:
506
Last Post:
Nov 20, 2012 9:21 PM
> The problem is that I see no role for anything like the 63 two-word sequences. I don't understand anything about them -- where they came from, how they relate to the notion of over-representation, why they are necessary, what constraints there may be on them, etc.
I'm sorry - that's my fault, because I assumed you would recall some very important information communicated here at sci.stat.math in a reply to Art Kendall (not to you).
The point is that we already know from the research we did in 2004 and 2005 that the 63 two-word sequences are statistically overrepresented relative to the rest of the 3721 (61 x 61) two-word sequences that do not contain one of the "stop" words taa/tag/tga.
This research was described in a post to Art Kendall on April 11 at 5:05pm, in response to a question of his posted April 11 at 3:59pm.
Below is that post to Art, reprinted here for your convenience so you don't have to look for it. I am hoping that once having read this post, you will understand:
a) where the 63 come from, and in particular, that: i) they arose from execution of a very rigorous set of procedures established for us by a Vandy biostatistician (now at Harvard); ii) these procedures were carried out on data entirely independent of the data with which we're now working (the "a1/a3/b1/b47/c1/c2" data);
b) why we are therefore interested in the level of over-representation of these 63 in our current data, and how this level compares to the level of over-representation in our current data of other "natural alternatives" to the 63, e.g. its corresponding control group C711, and the two alternative study/control group pairs S60/C537 and S119/C1058. (I can send you the tables for the C711, S60, C537, S119, and C1058 groups, if you would like to see them ... they're constructed according to the same logic as the S63 table I gave you above.)
So, please take whatever time you need to read the post to Art that I've reprinted below, and to formulate any questions you have. Until you are absolutely clear as to why we're focussing solely on over-representation of S63 in comparison to C711, and then on the comparison of S63/C711 to S60/C537 and S119/C1058, we really can't proceed further.
Thanks very much for taking the time you will need to spend absorbing the information below, and reaching a decision as to whether you agree with the procedures which the biostatistician recommended.
Post to Art of April 11 5:05pm:
Hi Art - nice to hear from you again.
The choice of "study" vs "control" codons was actually forced by the procedure we used to find our core set of 63 significant dicodons (out of the 61X61 possible dicodons excluding dicodons containing stop codons.)
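As a quick check of the 61X61 figure, here is a minimal sketch that enumerates the codons, drops the three stop codons taa/tag/tga, and counts the resulting dicodons. The variable names are illustrative only and are not taken from our original scripts.

```python
from itertools import product

# The three "stop" codons named earlier in the thread, in upper case.
STOP_CODONS = {"TAA", "TAG", "TGA"}

# All 64 nucleotide triplets.
codons = ["".join(c) for c in product("ACGT", repeat=3)]

# The 61 sense codons remain after removing the stop codons.
sense_codons = [c for c in codons if c not in STOP_CODONS]

# Every ordered pair of sense codons is one stop-free dicodon.
dicodons = [a + b for a, b in product(sense_codons, repeat=2)]

print(len(sense_codons), len(dicodons))
```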
On the advice of a Vanderbilt biostatistician who's now at Harvard, we used the following procedure to find these 63 significant dicodons.
We started with 20000 protein messages (strings of codons) which we divided into 100 sets of 200 each.
We then randomized each of the 200 messages in each of the 100 sets via randomized "codon shuffling" 100 times, to ensure that our results weren't going to be affected by known codon frequency biases (since codon shuffling doesn't affect codon frequency, just arrangement.)
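To illustrate why codon shuffling leaves codon frequencies untouched, here is a minimal sketch. It assumes a message is a string whose length is a multiple of three and that every codon is free to move; whether the real procedure pinned the start codon or handled other details differently is not stated here. The function names and the toy message are hypothetical.

```python
import random
from collections import Counter

def split_codons(message):
    """Cut a message into consecutive codons, ignoring any trailing partial codon."""
    usable = len(message) - len(message) % 3
    return [message[i:i + 3] for i in range(0, usable, 3)]

def shuffle_codons(message, rng):
    """Return the message with its codons in random order (assumption: all codons may move)."""
    codons = split_codons(message)
    rng.shuffle(codons)
    return "".join(codons)

rng = random.Random(0)              # fixed seed only to make the sketch repeatable
real = "ATGGCCCATCCGGCCAAA"         # toy message, not real data
shuffled = shuffle_codons(real, rng)

# Codon counts are identical before and after shuffling; only the arrangement can differ.
print(Counter(split_codons(real)) == Counter(split_codons(shuffled)))
```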
So at this point, for any real set R out of the 100 real sets, we could t-test any mean computed for R 100 times (against the 100 randomizations of R.) And we could repeat these 100 "inner trials" 100 times in 100 "outer trials" (because we had 100 real sets, each randomized 100 times.)
So you can see why this gave us the opportunity to sieve results through 10000 t-tests, i.e. our standard criterion was that in > 95% of the 100 "outer" trials (one for each real set R), a mean from R had to test as significant in > 95% of the "inner" trials (the mean from R against the means from the 100 randomizations of R).
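The nested "> 95% of > 95%" criterion can be sketched as follows. This assumes the outcome of each inner trial has already been reduced to True (significant) or False, however the t-test itself was run; the function name and the toy tables are hypothetical and are not our data.

```python
def passes_nested_criterion(significant, inner_cut=0.95, outer_cut=0.95):
    """significant[r][k] is True if real set r tested significant in inner trial k.

    A real set passes if strictly more than inner_cut of its inner trials are
    significant; the result passes overall if strictly more than outer_cut of
    the real sets pass.
    """
    sets_passing = 0
    for inner_results in significant:
        if sum(inner_results) / len(inner_results) > inner_cut:
            sets_passing += 1
    return sets_passing / len(significant) > outer_cut

# Toy tables: 100 outer trials by 100 inner trials.
good_row = [True] * 96 + [False] * 4    # 96% of inner trials significant
bad_row = [True] * 90 + [False] * 10    # 90% of inner trials significant

table_with_4_bad = [good_row] * 96 + [bad_row] * 4   # 96% of real sets pass
table_with_5_bad = [good_row] * 95 + [bad_row] * 5   # exactly 95%, which is not > 95%

print(passes_nested_criterion(table_with_4_bad))
print(passes_nested_criterion(table_with_5_bad))
```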
And in fact, the actual number of t-tests was 30K, not 10K, for the following reason.
We first used 10K t-tests to find all significant "34" dinucleotides (a "34" dinucleotide occupies positions 3 and 4 in a dicodon consisting of nucleotides 123456, "123" belonging to the first codon of the dicodon and "456" belonging to the second codon of the dicodon.)
Then we used another 10K t-tests to find all significant "2345" quadri-nucleotides centered on the significant "34" dinucleotides we found in the previous step.
Then we used another 10K t-tests to find all the significant dicodons "123456" centered on the significant "2345" quadri-nucleotides we found in the previous step.
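The three nested windows can be made concrete with a small sketch. It only shows how the "34" and "2345" windows sit inside a dicodon "123456" and how each step narrows the candidates for the next; the significance testing itself is not reproduced. The "significant" sets below are hypothetical placeholders, not our results.

```python
def window_34(dicodon):
    """Positions 3 and 4: last base of the first codon, first base of the second."""
    return dicodon[2:4]

def window_2345(dicodon):
    """Positions 2 through 5, centered on the "34" dinucleotide."""
    return dicodon[1:5]

def candidates_for_next_step(dicodons, significant_34, significant_2345):
    """Keep only dicodons whose inner windows survived the earlier steps."""
    return [d for d in dicodons
            if window_34(d) in significant_34
            and window_2345(d) in significant_2345]

# Hypothetical placeholder sets, chosen only to exercise the functions.
significant_34 = {"TC"}
significant_2345 = {"ATCC"}

pool = ["CATCCG", "CACCCG", "GATCCA"]
print(window_34("CATCCG"), window_2345("CATCCG"))
print(candidates_for_next_step(pool, significant_34, significant_2345))
```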
This yielded 63 dicodons encoding 49 dipeptides, so the natural "control" group is the set of 711 remaining dicodons which encode the same 49 dipeptides.
This is the "natural" control group because any effect we get CANNOT be attributed to dipeptide identity, but only to dicodon identity (since the 63 and 711 encode the same dipeptides.)
In any event, you can see from the above how the choice of codons was forced by the dicodons we obtained via the above procedure - in the case of the HP dipeptide, only one of the eight possible HP dicodons is in the set of 63.
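The construction of a control group from a study group can be sketched as below, assuming the standard genetic code. The one-element study set is a hypothetical stand-in (the actual HP dicodon in the 63 is not named in this post); the point is only that the HP dipeptide has eight synonymous dicodons, so one in the study group leaves seven for the control group.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" marks a stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_dicodons(dipeptide):
    """All dicodons that encode the given two-letter dipeptide."""
    first, second = dipeptide
    firsts = [c for c, aa in CODON_TO_AA.items() if aa == first]
    seconds = [c for c, aa in CODON_TO_AA.items() if aa == second]
    return {c1 + c2 for c1 in firsts for c2 in seconds}

def control_group(study):
    """Dicodons encoding the same dipeptides as the study set, minus the study set."""
    control = set()
    for dicodon in study:
        dipeptide = CODON_TO_AA[dicodon[:3]] + CODON_TO_AA[dicodon[3:]]
        control |= synonymous_dicodons(dipeptide)
    return control - set(study)

hypothetical_study = {"CATCCG"}     # an HP dicodon chosen only for illustration
hp_all = synonymous_dicodons("HP")
hp_control = control_group(hypothetical_study)

print(len(hp_all), len(hp_control))
```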
Also, it is probably worth noting that the statistically significant over-representation of our 63 dicodons was confirmed in tests on 6 separate single-species genomes carried out by our colleague Marvin Stodolsky (formerly of DOE), even though the original set of 20K messages was obtained from a larger cross-species sample kindly provided by Temple Smith (BU BMERC.)