On Thu, 10 Oct 2013 13:38:33 -0700 (PDT), firstname.lastname@example.org wrote:
>I've had my head into intro stats and simple hypothesis testing for about a year. It all seems to make sense, especially if you want to demonstrate an effect for a drug and H0 corresponds to no effect. For example, if you set a 95% confidence limit, then the rejection region for the H0's PDF is 5%. If your sample statistic lies outside of the confidence limits, then under the scenario that H0 is true, there is only a 5% chance that you could have gotten your result by chance. The closer you set your confidence limit to 100%, the less likely that a result outside of that limit is due to chance, under the H0 scenario. So you're more justified in rejecting H0.
>
>What I don't get is the common practice of choosing high confidence limits when you're interested in showing that rejection of H0 is *not* justified.

[break]
I don't know that this is a "common practice" in good literature. But I do agree that this is an area where non-statisticians are much more likely to apply their rules of thumb in ways that turn out to be backwards ("small p-levels are always better"), and where even real statisticians have to notice that they need to be careful about where their logic is leading them.
The situation seems to be better these days in "bioequivalence studies", where the purpose is to show that meds are similar. Nowadays they insist on the study showing some efficacy, and they put a CI on the similarity.
===start of Wikipedia section on bioequivalence (Oct 10, 2013)===
Regulatory definition: Australia

In Australia, the Therapeutic Goods Administration (TGA) considers preparations to be bioequivalent if the 90% confidence intervals (90% CI) of the rate ratios of Cmax and AUC between the two preparations lie in the range 0.80-1.25. Tmax should also be similar between the products.
===end of quote; entries are fairly similar for other countries.===
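To make that criterion concrete, here is a rough sketch in Python of the 90%-CI check (made-up numbers, not real pharmacokinetic data; the crossover design and the log-normal AUC model are my assumptions):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 24                                          # subjects in a crossover study
    log_ref = rng.normal(4.0, 0.20, n)              # log AUC, reference product
    log_test = log_ref + rng.normal(0.02, 0.10, n)  # log AUC, test product

    d = log_test - log_ref                          # within-subject log ratios
    se = d.std(ddof=1) / np.sqrt(n)
    tcrit = stats.t.ppf(0.95, df=n - 1)             # 90% CI = two one-sided 5% tests
    lo, hi = np.exp(d.mean() - tcrit * se), np.exp(d.mean() + tcrit * se)
    print(f"90% CI for the AUC ratio: ({lo:.3f}, {hi:.3f})")
    print("bioequivalent by the 0.80-1.25 rule:", lo >= 0.80 and hi <= 1.25)

Note the direction of the logic: a noisier or smaller study widens the CI and *fails* the check, rather than passing by default.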
I can mention that the FDA papers that come up in a Google search are difficult to read. When I looked into this 10 years ago, it seemed to me that the formal standards used by the FDA in the US in the 1980s were almost as silly as what you describe. I don't know whether they were actually applied so loosely, but it *looked* like someone could have declared bioequivalence by collecting data where the sample was too small (or the study too sloppy) to show an effect in either group.
>... For example, at http://www.r-bloggers.com/the-many-uses-of-q-q-plots, 95% confidence bands are chosen for a Q-Q plot, with H0 being that the data are normally distributed. To highlight why I'm confused by this, let's consider the extreme case of 99.99% confidence limits. That means the rejection region is only 0.01%, and the bands widen out. Almost any scattering of data will fit within the confidence limits. So if all my data points are within the confidence limits, it says very little. In contrast, if I chose 50% confidence limits, having all my data points within those limits is a harder test to meet. I know that hypothesis testing does not allow one to determine whether H0 is true, but having all data points within 50% confidence limits allows one to definitively rule out any evidence against H0, which is a lot more info than what can be gleaned from 99.99% confidence limits. From browsing the web, weak evidence starts to build up if p-values drop as low as 0.10.
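As an aside on that link: pointwise Q-Q bands are often built like this (one common construction, which I'm assuming rather than quoting from the post itself). Under H0, the i-th plotting position is Beta(i, n-i+1) on the uniform scale, so the band edges are Beta quantiles pushed through the normal inverse CDF:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 100
    x = np.sort(rng.normal(size=n))       # data, already standardized
    i = np.arange(1, n + 1)
    conf = 0.95                           # try 0.50 vs 0.9999 here
    band_lo = stats.norm.ppf(stats.beta.ppf((1 - conf) / 2, i, n - i + 1))
    band_hi = stats.norm.ppf(stats.beta.ppf((1 + conf) / 2, i, n - i + 1))
    inside = (x >= band_lo) & (x <= band_hi)
    print(f"{inside.mean():.0%} of points inside the {conf:.0%} bands")

Setting conf to 0.9999 visibly widens the bands so that nearly anything passes; setting it to 0.50 makes containment a much stiffer requirement, which is exactly the tension being described.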
As a very general rule, p-levels are not good indicators of effect size, or, conversely, of lack-of-effect size. N is the huge confounder; beyond that, it depends on how sensitive your test is at finding an "effect."
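A quick simulation makes the point (a sketch; the effect size, d = 0.2, is held fixed while only N changes):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    for n in (20, 200, 2000):
        ps = [stats.ttest_ind(rng.normal(0.0, 1.0, n),
                              rng.normal(0.2, 1.0, n)).pvalue
              for _ in range(200)]
        print(f"n = {n:4d}   median p = {np.median(ps):.4f}")

The identical small effect moves from "nothing to see" to "highly significant" purely on N. So a large p is no certificate of a small effect, and a small p no certificate of a large one.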
In Repeated Measures testing, there is a Mauchly test for sphericity that is regularly reported by stats programs, and which bears directly on the confidence that one should place in the sub-contrasts. Naively, one would think that "5%" is the cutoff that warns of hazard. In practice, for most data with reasonable power for the main hypothesis, "50%" is the cutoff that gives the assurance that you want.
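A sketch of that screen, assuming the third-party pingouin package (base scipy has no Mauchly test) and wide-format data with one row per subject:

    import numpy as np
    import pandas as pd
    import pingouin as pg

    rng = np.random.default_rng(0)
    subj = rng.normal(0.0, 1.0, (30, 1))          # subject random effect
    data = pd.DataFrame(subj + rng.normal(0.0, 1.0, (30, 4)),
                        columns=list("ABCD"))     # compound symmetry: spherical
    spher, W, chi2, dof, pval = pg.sphericity(data)
    print(f"Mauchly W = {W:.3f}, p = {pval:.3f}")
    print("trust the sub-contrasts at the 0.50 screen:", pval > 0.50)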
Or, going in the opposite direction, a two-sample t-test with equal Ns is so robust that it hardly matters what a test of equal variances tells you. Further, it is always a *bad* idea to use such a test to decide whether you use the "test that assumes equal variances."
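You can see the robustness in one run (a sketch: equal n per group, no true mean difference, SDs of 1 vs 3):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, alpha = 5000, 0.05
    rej_student = rej_welch = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, 50)
        b = rng.normal(0.0, 3.0, 50)     # same mean, triple the SD
        rej_student += stats.ttest_ind(a, b, equal_var=True).pvalue < alpha
        rej_welch += stats.ttest_ind(a, b, equal_var=False).pvalue < alpha
    print(f"Type I error: Student {rej_student / n_sims:.3f}, "
          f"Welch {rej_welch / n_sims:.3f}")

Both land near the nominal 5% despite the 9:1 variance ratio, so the pre-test buys you nothing except a data-dependent choice whose consequences are hard to track.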
>So if one wants to demonstrate a lack of evidence against H0, then why wouldn't one choose to use as low a confidence limit as possible? I described my confusion in the context of Q-Q plots, but the same conceptual question dogs me for simple scalar hypothesis testing.
When you want to show similarity, you want to show a narrow CI on some relevant difference, for some meaningful measure... not merely an absence of "demonstrated" difference. Crappy measures can give you apparent similarities.
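One way to see the difference (a sketch; the +/- 0.5 equivalence margin and the noisy measure are my assumptions):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a = rng.normal(10.0, 5.0, 10)        # sloppy measure, tiny sample
    b = rng.normal(10.0, 5.0, 10)
    p = stats.ttest_ind(a, b).pvalue     # usually non-significant here
    diff = a.mean() - b.mean()
    se = np.sqrt(a.var(ddof=1) / 10 + b.var(ddof=1) / 10)
    tcrit = stats.t.ppf(0.95, df=18)
    lo, hi = diff - tcrit * se, diff + tcrit * se
    print(f"t-test p = {p:.2f}; 90% CI for the difference = ({lo:.1f}, {hi:.1f})")
    print("equivalent within +/- 0.5:", lo > -0.5 and hi < 0.5)

In a typical run the t-test finds nothing, yet the CI is far too wide to certify similarity. A crappy measure can buy you a non-significant p, but it can never buy you a narrow interval.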