OK, first I thought it was because the number of runs was too low. This could have been a problem because, in the tails of the distributions, the expected count n_expected(bin i) is a small fraction, and dividing the squared difference by such a small number can inflate the chi-square statistic.
No such luck.
A statistician friend of mine says that when the number of simulation runs is very high (>10^6), the chi-square test becomes so sensitive that even tiny, practically irrelevant deviations from ideality will cause it to reject.
All of which runs totally counter to my intuition that more sampling should yield more accurate/better results.
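To see what my friend means, here's a minimal sketch (not my actual simulation code, and the 0.05% bias is an invented illustration): the same fixed, tiny bias in a two-bin "coin" sails through the chi-square test at n = 10^4 but is rejected decisively at n = 10^8, because the statistic scales with n even when the relative deviation stays constant.

```python
from scipy.stats import chisquare

# Hypothetical example: a coin with a tiny, practically irrelevant bias.
bias = 0.0005  # deviation from the ideal p = 0.5

for n in (10_000, 100_000_000):
    # Observed counts carry the same *relative* bias at both sample sizes.
    observed = [n * (0.5 + bias), n * (0.5 - bias)]
    expected = [n * 0.5, n * 0.5]
    stat, pval = chisquare(observed, expected)
    print(f"n={n:>11,}  chi2={stat:10.4f}  p={pval:.3g}")
```

With these numbers the statistic works out to chi2 = n * 10^-6, so the small run gives chi2 = 0.01 (p ≈ 0.92, no rejection) while the large run gives chi2 = 100 (p around 10^-23, overwhelming rejection) even though the underlying deviation never changed. More sampling does give a more accurate estimate; it also gives the test more power to flag deviations too small to matter.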