Search All of the Math Forum:
Views expressed in these public forums are not endorsed by
NCTM or The Math Forum.


Luis A. Afonso
Posts:
4,758
From:
LIsbon (Portugal)
Registered:
2/16/05


Monte Carlo Gof ChiSquare Test for continuous Distributions
Posted:
Jul 24, 2013 6:42 PM


Monte Carlo Gof ChiSquare Test for continuous Distributions
0 ___Preliminary This K. Pearson´s test (Goodness of Fit) is pointed as historically the first one to achieve the goal H0: data X and F(x) are not incompatible? The algorithm starts by forecasting, from the Distribution F(x), Model j (1>=j<=k), the sample items content, for each one of the k same amplitude intervals (bins) data is divided. In spite k be in principle arbitrary, statisticians had noted that the bins contents, Observed j, should not be less than 5 or 6 in order that the sample statistics: Q = Sum (Observed j  Model j)^2/Model j, do follow reasonably well the theoretical Chi  Squared Distribution when H0 is not inraisonable odd.
1 ___New approach Let be n the sample under analysis size. And suppose that we set each interval the same content n/k, which contrary to the classic way is not conditioned to a minimum: it could be any value. By other words: the interval amplitudes are adjusted such that is constant, k/n, their probability to contain the items. An even more important feature of the proposed algorithm is that we do not depend on any Distribution to reach the decision: the pvalue follows directly from Monte Carlo simulations. Therefore we define the Ij= (aj, bj) intervals such that we have F(bj)  F(aj) = k/n, possibily a1= infinity, bk= +infinity.
2___Application example Twohundred years ago, from the XVIII end to the beggining XIX siecle, Henry Cavendish (17311810) reported 23 measurements of the mean density of Earth, 5.488 g/cm3; today´s value is 5.513/gcm3 which differs from 0.45 % only. We will search, by the Chi squared test, if the former data could follow a Normal Distribution. Following the usual method we estimate the mean m= 5.488 and standard deviation s, then reduction data by Z(i)=(X(i)m)/s. Based on k=5 the intervals, 0.2 probability each, are: _____(infinity, 0.842)_____4 values_____bin 1 _____(0.842, 0.253)_______ 7_____________ 2 _____(0.253, 0.253)_______ 3_____________ 3 _____( 0.253. 0.842 )_______ 4_____________ 4 _____( 0.842, +infinity)_____5_____________ 5 Q0= 2.000 and prob(Q>Q0) = 0.632= pvalue with 521=2 Chisquare 2 degrees of freedom from NET software. A more correct pvalue can be obtained by Monte Carlo simulations as follows (see Program CAVDISH). From the statement c=int(RND*23/5)+1 we obtain with the same probability one of the 5 values c= 1, 2, 3, 4, 5 indicating the bin the value will fall. Performing 23 times we got a *vector* n1 to n5 which provide the simulated Q relative to the sample. This is the current sample which is repeated M times: the fraction of Q´s larger than the source sample Q0=2.000 do provide the exact pvalue. From 4 repetitions (M=1 million each) we find 0.644, 0.644, 0.643, 0.644. Conclusion: Both methods leads to not disapprove that density data is Normal, at the usual alpha = 0.05. The main feature attached to the Monte Carlo one here presented is that, surely, ChiSquare Distribution approximate methodology can be easily avoided which is particularly useful when alpha and pvalue do tend to be equal.
Luis A. Afonso
REM "Cavdish" CLS PRINT " <CAVDISH> Bhattacharyya , Johnson pg. 283 " DEFDBL AZ PRINT " k=5 "; PRINT " 0.842 , 0.253 , 0.253 , 0.842 " kk(1) = .842: kk(2) = .253: kk(3) = kk(2): kk(4) = kk(1) DIM X(23), z(23) DATA .36,.62,.27,.46 DATA .29,.39,.39,.30 DATA .58,.44,.42,.75 DATA .65,.34,.47,.68 DATA .57,.79,.63,.85 DATA .53,.10,.34 FOR i = 1 TO 23: READ X: X(i) = X + 5: mu = mu + X(i) / 23: sq = sq + X(i) * X(i): NEXT i ssd = sq  23 * mu * mu: var = ssd / (23  1) LOCATE 5, 50: PRINT USING " mean= #.### "; mu FOR i = 1 TO 23: z(i) = (X(i)  mu) / SQR(var) IF z(i) < kk(1) THEN obsv(1) = obsv(1) + 1 IF z(i) >= kk(1) AND z(i) < kk(2) THEN obsv(2) = obsv(2) + 1 IF z(i) >= kk(2) AND z(i) < kk(3) THEN obsv(3) = obsv(3) + 1 IF z(i) >= kk(3) AND z(i) < kk(4) THEN obsv(4) = obsv(4) + 1 IF z(i) >= kk(4) THEN obsv(5) = obsv(5) + 1 NEXT i: iexp = 23 / 5 FOR ki = 1 TO 5 PRINT USING "##########"; obsv(ki); : NEXT ki FOR kx = 1 TO 5: q0 = q0 + (obsv(kx)  iexp) ^ 2 / iexp NEXT kx COLOR 14: LOCATE 10, 50 PRINT USING " Q0 = ##.#### "; q0 COLOR 7 PRINT : PRINT RANDOMIZE TIMER all = 1000000 FOR rpt = 1 TO all LOCATE 9, 50: PRINT USING "########"; all  rpt FOR u = 1 TO 5: a(u) = 0: NEXT u: tot = 0 FOR t = 1 TO 23 c = INT(23 / 5 * RND) + 1 a(c) = a(c) + 1 NEXT t FOR v = 1 TO 5: tot = tot + (a(c)  iexp) ^ 2 / iexp NEXT v IF tot >= q0 THEN outt = outt + 1 NEXT rpt LOCATE 11, 50: COLOR 3 PRINT USING "alpha= .### "; outt / all END



