Search All of the Math Forum:

Views expressed in these public forums are not endorsed by NCTM or The Math Forum.

Notice: We are no longer accepting new posts, but the forums will continue to be readable.

Topic: Monte Carlo Gof Chi-Square Test for continuous Distributions
Replies: 1   Last Post: Jul 25, 2013 3:18 AM

 Messages: [ Previous | Next ]
 Luis A. Afonso Posts: 4,758 From: LIsbon (Portugal) Registered: 2/16/05
Monte Carlo Gof Chi-Square Test for continuous Distributions
Posted: Jul 24, 2013 6:42 PM

Monte Carlo Gof Chi-Square Test for continuous Distributions

0 ___Preliminary
This K. Pearson´s test (Goodness of Fit) is pointed as historically the first one to achieve the goal H0: data X and F(x) are not incompatible?
The algorithm starts by forecasting, from the Distribution F(x), Model j (1>=j<=k), the sample items content, for each one of the k same amplitude intervals (bins) data is divided. In spite k be in principle arbitrary, statisticians had noted that the bins contents, Observed j, should not be less than 5 or 6 in order that the sample statistics:
Q = Sum (Observed j - Model j)^2/Model j,
do follow reasonably well the theoretical Chi - Squared Distribution when H0 is not inraisonable odd.

1 ___New approach
Let be n the sample under analysis size. And suppose that we set each interval the same content n/k, which contrary to the classic way is not conditioned to a minimum: it could be any value. By other words: the interval amplitudes are adjusted such that is constant, k/n, their probability to contain the items. An even more important feature of the proposed algorithm is that we do not depend on any Distribution to reach the decision: the p-value follows directly from Monte Carlo simulations.
Therefore we define the Ij= (aj, bj) intervals such that we have F(bj) - F(aj) = k/n, possibily a1= -infinity, bk= +infinity.

2___Application example
Two-hundred years ago, from the XVIII end to the beggining XIX siecle, Henry Cavendish (1731-1810) reported 23 measurements of the mean density of Earth, 5.488 g/cm3; today´s value is 5.513/gcm3 which differs from 0.45 % only.
We will search, by the Chi squared test, if the former data could follow a Normal Distribution.
Following the usual method we estimate the mean m= 5.488 and standard deviation s, then reduction data by Z(i)=(X(i)-m)/s. Based on k=5 the intervals, 0.2 probability each, are:
_____(-infinity, -0.842)_____4 values_____bin 1
_____(-0.842, -0.253)_______ 7_____________ 2
_____(-0.253, 0.253)_______ 3_____________ 3
_____( 0.253. 0.842 )_______ 4_____________ 4
_____( 0.842, +infinity)_____5_____________ 5
Q0= 2.000 and prob(Q>Q0) = 0.632= p-value with 5-2-1=2 Chi-square 2 degrees of freedom from NET software.
A more correct p-value can be obtained by Monte Carlo simulations as follows (see Program CAV-DISH). From the statement c=int(RND*23/5)+1 we obtain with the same probability one of the 5 values c= 1, 2, 3, 4, 5 indicating the bin the value will fall. Performing 23 times we got a *vector* n1 to n5 which provide the simulated Q relative to the sample. This is the current sample which is repeated M times: the fraction of Q´s larger than the source sample Q0=2.000 do provide the exact p-value. From 4 repetitions (M=1 million each) we find 0.644, 0.644, 0.643, 0.644.
Conclusion: Both methods leads to not disapprove that density data is Normal, at the usual alpha = 0.05. The main feature attached to the Monte Carlo one here presented is that, surely, Chi-Square Distribution approximate methodology can be easily avoided which is particularly useful when alpha and p-value do tend to be equal.

Luis A. Afonso

REM "Cav-dish"
CLS
PRINT " <CAV-DISH> Bhattacharyya , Johnson pg. 283 "
DEFDBL A-Z
PRINT " k=5 ";
PRINT " -0.842 , -0.253 , 0.253 , 0.842 "
kk(1) = -.842: kk(2) = -.253: kk(3) = -kk(2): kk(4) = -kk(1)
DIM X(23), z(23)
DATA .36,.62,.27,.46
DATA .29,.39,.39,.30
DATA .58,.44,.42,.75
DATA .65,.34,.47,.68
DATA .57,.79,.63,.85
DATA .53,.10,.34
FOR i = 1 TO 23: READ X: X(i) = X + 5:
mu = mu + X(i) / 23: sq = sq + X(i) * X(i): NEXT i
ssd = sq - 23 * mu * mu: var = ssd / (23 - 1)
LOCATE 5, 50: PRINT USING " mean= #.### "; mu
FOR i = 1 TO 23: z(i) = (X(i) - mu) / SQR(var)
IF z(i) < kk(1) THEN obsv(1) = obsv(1) + 1
IF z(i) >= kk(1) AND z(i) < kk(2) THEN obsv(2) = obsv(2) + 1
IF z(i) >= kk(2) AND z(i) < kk(3) THEN obsv(3) = obsv(3) + 1
IF z(i) >= kk(3) AND z(i) < kk(4) THEN obsv(4) = obsv(4) + 1
IF z(i) >= kk(4) THEN obsv(5) = obsv(5) + 1
NEXT i: iexp = 23 / 5
FOR ki = 1 TO 5
PRINT USING "##########"; obsv(ki); : NEXT ki
FOR kx = 1 TO 5: q0 = q0 + (obsv(kx) - iexp) ^ 2 / iexp
NEXT kx
COLOR 14: LOCATE 10, 50
PRINT USING " Q0 = ##.#### "; q0
COLOR 7
PRINT : PRINT
RANDOMIZE TIMER
all = 1000000
FOR rpt = 1 TO all
LOCATE 9, 50: PRINT USING "########"; all - rpt
FOR u = 1 TO 5: a(u) = 0: NEXT u: tot = 0
FOR t = 1 TO 23
c = INT(23 / 5 * RND) + 1
a(c) = a(c) + 1
NEXT t
FOR v = 1 TO 5: tot = tot + (a(c) - iexp) ^ 2 / iexp
NEXT v
IF tot >= q0 THEN outt = outt + 1
NEXT rpt
LOCATE 11, 50: COLOR 3
PRINT USING "alpha= .### "; outt / all
END

Date Subject Author
7/24/13 Luis A. Afonso
7/25/13 Luis A. Afonso