On Jun 4, 10:03 am, djh <halitsk...@att.net> wrote:
> I think I can cut thru the confusion that I've caused here if I simply
> tell you exactly what my code is doing, so that you can tell me if
> this code is calculating "u" in a proper and acceptable way, and if
> not, what you would have me do instead.
>
> So first, let's agree to call a dicodon a "dicodon of interest" if
> it's in the reference set of dicodons, and to call a dipeptide a
> "dipeptide of interest" if it can be encoded by a "dicodon of
> interest". In the case we're dealing with now, we have 63 dicodons of
> interest that encode 49 dipeptides of interest, and each of these 49
> dipeptides of interest has a UCP associated with it, namely, the
> probability that the dipeptide of interest will be encoded by a
> dicodon of interest.
>
> Second, suppose I have a string Sp of n peptides encoded by a string
> Sc of n codons, where Sp contains n-1 dipeptides and Sc contains n-1
> dicodons.
>
> Third, suppose I compute a table T for the string Sp, where T has 49
> rows and each row has the following columns:
>
> C1: Dipeptide label (e.g. "AP" or "AL")
>
> C2: Number of times the dipeptide is encoded in Sp by a dicodon of
> interest (call this N-yes)
>
> C3: Number of times the dipeptide is not encoded in Sp by a dicodon
> of interest (call this N-no)
>
> C4: Number of times the dipeptide appears in Sp, i.e. N-yes plus N-no
> (call this N-tot)
>
> C5: UCP for the dipeptide
>
> C6: N-tot * UCP = expected number of times the dipeptide should be
> encoded in Sp by a dicodon of interest
>
> Then my code currently computes "u" by:
>
> 1) summing C2 down all 49 rows to get sum(C2), i.e. actual across all
> dipeptides of interest
>
> 2) summing C6 down all 49 rows to get sum(C6), i.e. expected across
> all dipeptides of interest
>
> 3) dividing sum(C2) by sum(C6) to get actual/expected for all
> dipeptides of interest
>
> If you don't agree that the final steps (1-3) are the correct thing
> to do to get "u", please tell me how you would compute "u" from the
> table T.
>
> The reason I'm asking you to tell me how you'd compute "u" from T is
> that I really don't want to change my code that computes the table T,
> and I do think that this table has all the information needed to
> compute "u" correctly.
First, note that your sums are over the 49 (present case) dipeptides of interest, but mine are over the individual dicodons that encode a dipeptide of interest. In your notation, my sums have sum(N-tot) = sum(C4) terms.
Your sum(N-yes) = sum(C2) = my sum w, and your sum(N-tot * UCP) = sum(C6) = my sum UCP, so your u = sum(C2) / sum(C6) = sum w / sum UCP, as I thought.
The relation between the two systems may become clearer if you think of my sums as two-level nested sums: an outer sum, over the 49 dipeptides of interest; and 49 inner sums, each one being over the N-tot dicodons that encode one of the dipeptides of interest.
For any one dipeptide, your N-yes = my innersum w, and your N-yes/UCP = my innersum v. I didn't define anything equivalent to your N-tot, so let me borrow it. Then can we agree that an over-representation index for a single dipeptide would be
t = N-yes / (N-tot * UCP) = C2 / C6 = innersum v / N-tot.
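Here's a rough Python sketch of that per-dipeptide index for a single row of T (the counts and the UCP are invented, just to show the bookkeeping; the variable names are mine, not yours):

    # Hypothetical numbers for one dipeptide of interest (one row of T).
    n_yes = 7                # C2: times encoded by a dicodon of interest
    n_no = 18                # C3: times encoded by some other dicodon
    n_tot = n_yes + n_no     # C4
    ucp = 0.22               # C5

    expected = n_tot * ucp   # C6: expected interesting encodings

    # Per-dipeptide over-representation index t = C2 / C6.
    t = n_yes / expected

    # Equivalent inner-sum view: each of the n_tot dicodons contributes
    # w = 1 if it is a dicodon of interest (0 otherwise) and v = w / ucp,
    # so innersum(w) = n_yes and innersum(v) = n_yes / ucp = t * n_tot.
    inner_v = n_yes / ucp
    assert abs(t - inner_v / n_tot) < 1e-12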
If that's ok then here are three different ways to get an overall u by averaging the t's:
1. A simple average: sum t / 49.
2. The one you have been using -- a weighted average, using the expected values of the N-yes's as weights: sum(t * N-tot * UCP) / sum(N-tot * UCP).
3. The one I suggested -- a weighted average, using the N-tot's as weights: sum(t * N-tot) / sum(N-tot). (This can also be interpreted as a simple average of the v's, which are over-representation indices of each of the sum(N-tot) dicodons.)
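To make the three options concrete, here is a rough Python sketch computing all three from a toy table T (the three rows and their numbers are invented; your real T has 49 rows):

    # Toy table T: one dict per dipeptide of interest, columns as above.
    # C2 = n_yes, C4 = n_tot, C5 = ucp; C6 = n_tot * ucp is derived.
    T = [
        {"dipeptide": "AP", "n_yes": 7,  "n_tot": 25, "ucp": 0.22},
        {"dipeptide": "AL", "n_yes": 3,  "n_tot": 10, "ucp": 0.40},
        {"dipeptide": "GV", "n_yes": 12, "n_tot": 30, "ucp": 0.35},
    ]

    def t_index(row):
        # Per-dipeptide over-representation index t = C2 / C6.
        return row["n_yes"] / (row["n_tot"] * row["ucp"])

    # 1. Simple average of the t's.
    u1 = sum(t_index(r) for r in T) / len(T)

    # 2. Weighted by expected counts N-tot * UCP; this collapses to
    #    sum(C2) / sum(C6), i.e. the u you have been computing.
    u2 = (sum(t_index(r) * r["n_tot"] * r["ucp"] for r in T)
          / sum(r["n_tot"] * r["ucp"] for r in T))

    # 3. Weighted by N-tot; equivalently a simple average of the
    #    per-dicodon v's.
    u3 = (sum(t_index(r) * r["n_tot"] for r in T)
          / sum(r["n_tot"] for r in T))

    print(u1, u2, u3)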
Other weightings are possible, such as the number of possible encodings of each dipeptide or the number of interesting encodings of each dipeptide, both of which differ from 2 and 3 above in that they do not depend on the message whose over-representation index is sought. All of them are "right". The question is which one is best at both ignoring features of the data that you want to ignore and detecting features of the data that you want to detect.
The first part of that question is easier. If encoding is random, how far is the expected value of the index from 1 (or whatever value is supposed to characterize the null case)? There is often a tradeoff between bias and standard error, so the rms error is usually what gets examined. But there is a complication, because the randomness in question is conditional: the probability of a dicodon, given a dipeptide. It says nothing about the dipeptides. Are all 49 equally likely? Or are their probabilities proportional to the number of possible encodings? Or are they given by some well-defined but essentially arbitrary function? Or ... ? And how does the rms error behave under different dipeptide distributions? Also, if something like ln(u), rather than u itself, is to be used in further computations, then perhaps the comparisons of alternate definitions of u should be done using ln(u) (or whatever).
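If it helps, here is a rough Monte Carlo sketch of that comparison under my reading of the conditional null model: hold the N-tot's and the UCPs fixed, draw each N-yes as Binomial(N-tot, UCP), and look at the rms deviation of each candidate u from 1. The N-tot's, UCPs, trial count, and seed below are all invented, and the choice of N-tot's is exactly the arbitrary dipeptide distribution I was complaining about:

    import random

    def rms_from_null(n_tot_list, ucp_list, n_trials=10000, seed=0):
        # RMS deviation from 1 of the three candidate u's, under the null
        # model N-yes ~ Binomial(N-tot, UCP) with N-tot and UCP held fixed.
        rng = random.Random(seed)
        sq_err = [0.0, 0.0, 0.0]
        for _ in range(n_trials):
            n_yes = [sum(rng.random() < p for _ in range(n))
                     for n, p in zip(n_tot_list, ucp_list)]
            t = [y / (n * p) for y, n, p in zip(n_yes, n_tot_list, ucp_list)]
            u1 = sum(t) / len(t)                              # simple average
            u2 = sum(n_yes) / sum(n * p for n, p in
                                  zip(n_tot_list, ucp_list))  # your u
            u3 = (sum(ti * n for ti, n in zip(t, n_tot_list))
                  / sum(n_tot_list))                          # my suggestion
            for i, u in enumerate((u1, u2, u3)):
                sq_err[i] += (u - 1.0) ** 2
        return [(s / n_trials) ** 0.5 for s in sq_err]

    # Invented N-tot's and UCPs, just to exercise the comparison.
    print(rms_from_null([25, 10, 30, 15], [0.22, 0.40, 0.35, 0.10]))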
The second part of the question is harder. If over-representation were a simple unidimensional phenomenon then you would ask about the "information function" of the index: the derivative of the expected value of the index with respect to the true value of the parameter, divided by the standard error of the index. But over-representation can happen in many ways; the information function becomes an information manifold, and the question becomes "which index is best at detecting which kinds of non-randomness?" Very messy, thesis-level stuff. I suspect your time would be better spent thinking about the first part of the question.
There is a whole class of measures that I haven't mentioned, ones that treat the question symmetrically, paying as much attention to under- as to over-representation. If you've spent much time around biostatisticians, you'll have heard about risk ratios and odds ratios. What you've been using is analogous to a risk ratio, and it might be better to work with the analog of an odds ratio. However, all the questions mentioned above about weighting would still need to be answered. My prime candidate would be something like ln[(N_yes/N_no)/(UCP/(1-UCP))] = ln[(N_yes/UCP)/(N_no/(1-UCP))].
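A rough sketch of that candidate, per dipeptide (the counts are invented, and I am ignoring the zero-count problem that ln and the N_no denominator would create):

    import math

    def log_odds_index(n_yes, n_no, ucp):
        # ln[(N_yes/N_no) / (UCP/(1-UCP))]: 0 under the null, positive for
        # over-representation, negative for under-representation.
        # Assumes n_yes > 0, n_no > 0, and 0 < ucp < 1.
        return math.log((n_yes / n_no) / (ucp / (1.0 - ucp)))

    print(log_odds_index(7, 18, 0.22))   # one dipeptide's C2, C3, C5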
Finally, I'm not sure how, or even if, the fact that dicodons can overlap affects all this. I'm operating as if it doesn't.