On Jun 2, 8:30 am, djh <halitsk...@att.net> wrote:
> I assume from your last question that you're now either able or close
> to able to assess the current method of computing "u" relative to a
> given group (e.g. S63). By "current method", I mean this one (taken
> from a previous post):
>
> ******
> Suppose your boss gives you an encoded announcement in which:
>
> a) a 2-word sequence for AP appears just once and this 2-word sequence
> is att ccc (one of the above 63 2-word sequences)
>
> b) a 2-word sequence for AL appears just once and this 2-word sequence
> is gcg ctg (one of the above 63 2-word sequences)
>
> c) no other 2-word sequence from the above 63 occurs in the
> announcement (equivalently, you will typeset an announcement in which
> none of the above 49 2-letter sequences appears other than one AP and
> one AL.)
>
> Then the degree "u" of average over-representation of the 63 2-word
> sequences in the announcement is computed by my PERL code as:
>
> 1+1 = 2 (1 of the 63 for AL and 1 of the 63 for AP)
>
> 1/12 + 1/8 = 5/24 (expected frequency for AL and expected frequency
> for AP)
>
> 2/(5/24) = 48/5 (actual freq / expected freq)
>
> (48/5)/2 = 48/10 = 4.8 = u (because only two dipeptide positions in
> the announcement were occupied by dipeptides in the 49 encoded by S63).
>
> *******
>
> If anything is unclear in the above, please let me know. And of
> course, if you think there's a more correct or better way to compute
> "u", I assume you'll let me know what method you would prefer. So long
> as the method gives us a measure u of over-representation relative to
> a given group (e.g. any one of S63, C711, S60, C537, S119, C1058), I
> will use whatever method you specify.
>
> Thanks again, Ray.
>
> ******
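Before I go further, here is how I read that arithmetic, written out as a minimal Perl sketch. The hash names and the hard-coded values are just stand-ins for this one example, not a claim about how your actual code is organized:

use strict;
use warnings;

# Sketch of the "u" arithmetic in the quoted example (not the real code).
# %observed holds the S63 dicodons actually seen in the announcement;
# %expected holds their expected frequencies, as given in the example.
my %observed = ( 'att ccc' => 1,   'gcg ctg' => 1    );   # AP once, AL once
my %expected = ( 'att ccc' => 1/8, 'gcg ctg' => 1/12 );   # 1/8 for AP, 1/12 for AL

my ($obs_total, $exp_total, $positions) = (0, 0, 0);
for my $dicodon (keys %observed) {
    $obs_total += $observed{$dicodon};    # 1 + 1 = 2
    $exp_total += $expected{$dicodon};    # 1/12 + 1/8 = 5/24
    $positions += $observed{$dicodon};    # dipeptide positions occupied by S63 dicodons
}

my $u = ($obs_total / $exp_total) / $positions;   # (2 / (5/24)) / 2 = 4.8
printf "u = %.1f\n", $u;

If that sketch misrepresents what your code does, please correct me.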
I'm getting there, but there are some intermediate steps I need to take first to make sure I really understand things.

__________
If you still have the original 20K real messages that S63 is based on, and if you find yourself with nothing else to do, you might try the following. It occurred to me last night while I was thinking about the fact that S63 involved three arbitrary values (the inner and outer cutoffs, both of which were 95%, and the significance level, which wasn't mentioned but was presumably 5%) and wondering whether there was a way to avoid them.
Make a 61 x 61 table whose entries are frequency counts of all the dicodons in the 20K messages. In each dicodon, the left codon picks the row and the right codon picks the column. (Side question: what's the standard way to refer to the left and right codons in a dicodon?)
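In Perl the tally could be as simple as the sketch below. I'm assuming each message is available as one string of space-separated codons, that a dicodon means each adjacent (overlapping) pair of codons, and I'm using a hash of hashes in place of a literal 61 x 61 array; adjust to however the data are really stored (and step $i by 2 if dicodons are supposed to be non-overlapping pairs):

use strict;
use warnings;

# Assumed input: one string per message, codons separated by spaces,
# e.g. 'atg gcg ctg att ccc ...', for all 20K messages.
my @messages;

# $count{$left}{$right} = frequency of the dicodon "left right"
my %count;
for my $msg (@messages) {
    my @codons = split ' ', $msg;
    for my $i (0 .. $#codons - 1) {
        my ($left, $right) = @codons[ $i, $i + 1 ];
        $count{$left}{$right}++;
    }
}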
Let fjk denote the frequency count in row j, column k. Let rj denote the total in row j. Let ck denote the total in col k. Let n denote the grand total. Make a new 61 x 61 table whose entries are

gjk = (n*fjk - rj*ck) / sqrt[ rj*(n - rj) * ck*(n - ck) ]

i.e. gjk is the correlation, over the n dicodons, between the 0/1 indicator "left codon is j" and the 0/1 indicator "right codon is k".
Each g is a correlation, so -1 <= g <= 1. (For significance testing, refer g*sqrt[n-1] to the standard normal distribution.) g is a measure of over/under-representation (+/-). Sort the g's, keeping track of which dicodons they correspond to. Look at the distribution. Is there any kind of break between the largest values and the not-so-large ones that would let us "carve nature at its joints" if we wanted to pick the most over-represented dicodons, or would an arbitrary cutoff have to be used? Where do the S63 dicodons come in the distribution?
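Continuing the sketch above, computing the marginals, the g's, the z-scores g*sqrt[n-1], and a sorted listing might go roughly like this (illustrative only; %count is the hash from the tally above, and the printout format is arbitrary):

# Row totals, column totals, and the grand total from %count.
my (%row, %col);
my $n = 0;
for my $left (keys %count) {
    for my $right (keys %{ $count{$left} }) {
        my $f = $count{$left}{$right};
        $row{$left}  += $f;
        $col{$right} += $f;
        $n           += $f;
    }
}

# g and z = g*sqrt(n-1) for every dicodon, sorted by g, largest first.
my @scored;
for my $left (keys %row) {
    for my $right (keys %col) {
        my $f     = $count{$left}{$right} || 0;
        my $denom = sqrt( $row{$left} * ($n - $row{$left})
                        * $col{$right} * ($n - $col{$right}) );
        next unless $denom;                    # skip degenerate margins
        my $g = ($n * $f - $row{$left} * $col{$right}) / $denom;
        push @scored, [ "$left $right", $g, $g * sqrt($n - 1) ];
    }
}
@scored = sort { $b->[1] <=> $a->[1] } @scored;

# Eyeball (or plot) the sorted g's for a natural break, and note
# where the S63 dicodons fall in the list.
printf "%-8s  g = %+.4f  z = %+.2f\n", @$_ for @scored;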