On May 20, 3:46 am, djh <halitsk...@att.net> wrote:
> [Note to MS (and JRF/AML also) - see section 3 below, dealing with
> the new sets of dicodons with the reverse complements factored in.]
>
> Four quasi-related new questions:
>
> 1. Should I generate a sample "jackknife" table for you?
>
> For the a1 fold, I have now computed the data for:
>
> a) the original study and control groups (63 and 711 dicodons)
> without the intuitive omissions of outliers that I originally
> thought should be excluded
>
> b) the new study and control groups (119 and 1058 dicodons)
>
> While you are trying to decide the overall strategy (jackknifing or
> random-pairing or something else), I can simply continue to do the
> same computation for the other five folds, since this will have to
> be done eventually anyway.
>
> Or, I can do the minor modifications to my "wrapper" program that
> calls Arthur's alignment program so that I can run these a1 datasets
> in such a way as to get the counts of inputs vs matches that you
> specified as necessary in order to try the jackknifing approach.
> (Not sure if you intended to manipulate these data or tell me what
> to do with it, to save you the grunt-work.)
>
> Which would be the best use of my time right now?
Start learning R. Jackknifing requires combining the results of n+1 analyses, where n is the number of segments. Here's how it's done.
Let b denote the vector of estimated coefficients (including the intercept) you get from the usual analysis. For i = 1,...,n, let b'i denote the vector of coefficients you get when you omit the comparisons involving segment i, and define the pseudovalue b"i = n*b - (n-1)*b'i. For a 7-predictor model you would end up with an n x 8 matrix B". The trick is to treat the rows of B" as if they were independent estimates of the coefficients: the column means are reduced-bias estimates of the coefficients, and 1/n times the sample covariance matrix of the rows is an empirical estimate of the covariance matrix of the estimated coefficients.
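The recipe above can be sketched in code. Here is an illustrative Python sketch (the same steps translate directly to R) using an invented one-predictor toy regression; the data and helper names are made up for the example, and with one predictor B" is n x 2 rather than n x 8.

```python
import math
import statistics

def ols(xs, ys):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return (my - slope * mx, slope)

# Toy data: y = 2 + 3x plus a small deterministic wiggle.
xs = list(range(20))
ys = [2 + 3 * x + math.sin(x) for x in xs]
n = len(xs)

b = ols(xs, ys)  # coefficients from the usual full-data analysis

# Pseudovalues: b"_i = n*b - (n-1)*b'_i, where b'_i omits observation i.
pseudo = []
for i in range(n):
    b_i = ols(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
    pseudo.append(tuple(n * bj - (n - 1) * bij for bj, bij in zip(b, b_i)))

# Column means of B": reduced-bias estimates of the coefficients.
jack_est = [statistics.fmean(col) for col in zip(*pseudo)]

# 1/n times the sample (co)variance of the pseudovalues estimates the
# variance of each coefficient (diagonal terms shown here).
jack_se = [(statistics.variance(col) / n) ** 0.5 for col in zip(*pseudo)]
print(jack_est, jack_se)
```

The leave-one-out loop is the expensive part: it reruns the whole fit n times, which is why doing this by hand through a program's interface would be awkward.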
I suppose you could do all that via the interface to John's program, but it would certainly be awkward.
(By comparison, the usual analysis takes the negative of the inverse of the matrix of second partial derivatives of the log likelihood with respect to the coefficients, evaluated at the values of the coefficients that maximize the log likelihood, as an estimate of the asymptotic covariance matrix of the estimated coefficients. I distrust that matrix for your data because it assumes the input proportions have binomial sampling distributions, which yours do not.)
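To make that parenthetical concrete in the simplest possible case, here is a hedged Python sketch for a single binomial proportion: the second derivative (curvature) of the log likelihood at the MLE, negated and inverted, reproduces the familiar binomial variance estimate. The counts are invented for illustration.

```python
import math

def loglik(p, k, n):
    """Binomial log likelihood (dropping the constant binomial coefficient)."""
    return k * math.log(p) + (n - k) * math.log(1 - p)

k, n = 30, 50   # invented data: 30 successes in 50 trials
p_hat = k / n   # MLE of the proportion

# Second derivative of the log likelihood at the MLE, by central difference.
h = 1e-5
d2 = (loglik(p_hat + h, k, n) - 2 * loglik(p_hat, k, n)
      + loglik(p_hat - h, k, n)) / h**2

var_hessian = -1 / d2                    # negative inverse "Hessian" (1x1 here)
var_binomial = p_hat * (1 - p_hat) / n   # textbook binomial variance

print(var_hessian, var_binomial)         # essentially equal
```

This equivalence is exactly what breaks down when the input proportions do not actually have binomial sampling distributions, which is the objection raised above.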
> 2. Requisite size of n's.
>
> The runs with the new study and control dicodons generate a lot
> more control group data, as would be expected from the increase
> in dicodons from 711 to 1058 and the corresponding increase in
> encoded dipeptides from 49 to 82.
>
> So I think that these runs may give you satisfactory n's for
> all cases, i.e. all length intervals for all folds. But what
> actually ARE the n's you're looking for, both for inputs per
> cell to Arthur's program and matches/non-matches?
If I knew exactly what you're looking for then I might be able to give you a straight answer. For instance, if the goal were to demonstrate a simple mean difference between two groups then you might say something like: If the true difference between the study and control groups is X units then the probability should be P that a difference will be detected, with a type I error risk of A; furthermore, in pilot studies the average error variance has been V, with D df. Given all that, the required sample size could be solved for.
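Filling in that sentence with numbers: under a normal approximation, the per-group n that detects a true difference X with power P at two-sided type I risk A, given error variance V, is n = 2V(z_{1-A/2} + z_P)^2 / X^2. A Python sketch with invented values for X, P, A, and V:

```python
import math
from statistics import NormalDist

def per_group_n(X, P, A, V):
    """Per-group sample size for a two-group mean comparison
    (normal approximation; a t-based answer would be slightly larger)."""
    z_alpha = NormalDist().inv_cdf(1 - A / 2)  # two-sided type I risk A
    z_power = NormalDist().inv_cdf(P)          # desired power P
    return math.ceil(2 * V * (z_alpha + z_power) ** 2 / X ** 2)

# Invented pilot values: difference X = 0.5, power 0.80, alpha 0.05, V = 1.
print(per_group_n(X=0.5, P=0.80, A=0.05, V=1.0))   # → 63
```

Note how every input to the formula is a substantive commitment about the research question; none of them can be supplied by the statistician alone, which is the point of the paragraph above.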
For complicated research questions, the best way to determine the sample size is by Monte Carlo simulations. You create populations with known properties, take many samples of some specified size, analyze them as if they were real data, and estimate the probability of answering the question. Then you do it over with different sample sizes and/or populations, until you're satisfied that you have identified a sample size that will give you a decent chance of answering the research question with real data. All told, you spend more time designing the study than you do analyzing real data.
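The simulation loop described above can be sketched as follows. This is an illustrative Python stand-in (normal populations and a two-group z-test take the place of whatever the real analysis would be); the function name and parameter values are invented.

```python
import random
from statistics import NormalDist, fmean, variance

def power_by_simulation(n, true_diff, sd=1.0, alpha=0.05, reps=2000, seed=0):
    """Estimate the probability that a two-group z-test at level alpha
    detects a true mean difference, with n observations per group."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(reps):
        # Create two populations with a known difference and sample them.
        a = [rng.gauss(0.0, sd) for _ in range(n)]
        b = [rng.gauss(true_diff, sd) for _ in range(n)]
        # Analyze the fake data as if it were real.
        se = ((variance(a) + variance(b)) / n) ** 0.5
        if abs(fmean(b) - fmean(a)) / se > z_crit:
            hits += 1
    return hits / reps

# Try candidate sample sizes until the estimated power is acceptable.
for n in (20, 40, 63):
    print(n, power_by_simulation(n, true_diff=0.5))
```

With the invented values above, the estimated power should climb toward roughly 0.8 as n approaches 63, matching the closed-form answer for this simple case; the payoff of the simulation approach is that the same loop works when no closed form exists.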
> 3. Comparison of simple linear regressions.
>
> For the a1 fold at least, the two driver correlations SEEM much
> stronger (lower significance and higher R-Square) when the new
> sets of study and control dicodons are used (119 and 1058 instead
> of 63 and 711).
The smaller p-values would follow from the increased n's, even if the R^2's were the same.
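A quick illustration of that point: hold R^2 fixed and let n grow, and the test statistic for the slope in a simple regression, t = sqrt(R^2 (n-2) / (1 - R^2)), grows like sqrt(n), so the p-value shrinks with no change in the strength of the relationship. The sketch below uses invented numbers and a normal approximation to the t distribution (adequate at these n's).

```python
import math
from statistics import NormalDist

def approx_p_value(r_squared, n):
    """Two-sided p-value for the slope in a simple regression with the
    given R^2 and n, via a normal approximation to the t distribution."""
    t = math.sqrt(r_squared * (n - 2) / (1 - r_squared))
    return 2 * (1 - NormalDist().cdf(t))

# Same R^2 = 0.09 (r = 0.3); only n changes.
for n in (50, 500):
    print(n, approx_p_value(0.09, n))
```

With R^2 fixed at 0.09, the p-value drops from roughly 0.03 at n = 50 to far below 0.001 at n = 500, so a smaller p-value alone is not evidence that the new dicodon sets give a stronger relationship.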
> And this could be a very important fact scientifically.
>
> But I don't know how to properly compare the relevant pairs of linear
> regressions, or even whether such pairs are comparable, i.e.
>
> c) run on the same data with the new study dicodons vs the old study
> dicodons (63 vs 119)
>
> d) run on the same data with the new control dicodons vs the old
> control dicodons (711 vs 1058).
>
> Is the pair of regressions in c or d comparable, and if so, how to do
> it correctly?
Again, I'm not sure what the actual comparisons would be. If all else fails, you could always use the jackknife. (Many statisticians would suggest the bootstrap instead of the jackknife, but I prefer the jackknife's mild conservatism to the bootstrap's mild liberalism.)
> 4. Sample Equalization
>
> Should I be thinking about selecting a subset of study group data
> that is roughly equal in size to the set of control group data?
>
> In particular, I told you that study group data always exhibits u
> ranging from 1.01 to 4.00, while control group data always exhibits
> u ranging from 1.01 to 1.36.
>
> So I could break the study group data into ranges of u that would
> yield counts equal to those for the control group data, if you think
> this would help matters in the long run. And actually, it would be
> interesting to see what ranges of u have to be selected for the study
> group data in order to make each subset comparable in size to the
> size of the control group data for u from 1.01 to 1.36.
There's a general principle that you should never throw away data just to equate sample sizes. However, if splitting the study group on the basis of u answers an interesting scientific question, then fine, but I'm in no position to judge. The statistical implications depend on exactly how u enters the analyses, whatever they may be.