> > You have made an incorrect independence assumption. As both
> > "naughty" and "money" are only present in "spam" documents, which
> > form half of the total number of documents, they are dependent
> > variables. But, you calculate p(e) as p(money) * p(naughty) which
> > is assuming that the variables are independent. Hence your
> > problem.
>
> Sorry, that's not an assumption, that's the way the problem
> definition goes; the words "naughty" and "money" are indeed only
> present in "spam".
>
> And they are independent variables; the presence of "naughty" is not
> dependent on "money" (and vice versa).
>
> The formula is
>
>                     P(C) P(F1|C) ... P(Fn|C)
>     P(C|F1...Fn) = --------------------------
>                        P(F1) ... P(Fn)
>
> So, given the problem in my original post, the result is not between
> 0 and 1.
Probability theory only gives you P(C | F1...Fn) = P(C) P(F1...Fn | C) / P(F1...Fn).
Then come the independence assumptions which allow you to expand P(F1...Fn | C) as P(F1 | C)...P(Fn | C) and P(F1...Fn) similarly. These give Naive Bayes its first name.
If "naughty" and "money" were exactly independent and probabilities exactly relative frequencies in your document collection, there should be half a document that contains them both. Half a document does not quite make sense, but there's worse: if "naughty" and "money" were exactly independent given "spam", there should be _one_ document that contains both "naughty" and "money" (and is classified as "spam").
Since we don't want to accept 1/2 = 1, and we do think that relative frequencies have the formal properties of probabilities, we blame the independence assumptions. I suppose they would come closer to the truth much of the time in a larger document collection.
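To make the failure concrete, here is a minimal sketch with exact arithmetic. The toy collection is my own assumption, chosen to match the setup discussed in the thread: two documents, one spam containing both "naughty" and "money", one non-spam containing neither. Expanding the denominator by independence pushes the "posterior" above 1, while using the true joint frequency gives a proper probability.

```python
from fractions import Fraction

# Hypothetical collection: 2 documents, 1 spam (has both words), 1 ham (has neither).
p_spam = Fraction(1, 2)

# Class-conditional relative frequencies: every spam contains both words.
p_naughty_given_spam = Fraction(1)
p_money_given_spam = Fraction(1)

# Marginal relative frequencies: each word appears in half the documents.
p_naughty = Fraction(1, 2)
p_money = Fraction(1, 2)

# Naive Bayes with the denominator ALSO expanded by independence:
#   P(spam | naughty, money) ~ P(spam) P(naughty|spam) P(money|spam)
#                              / (P(naughty) P(money))
posterior = (p_spam * p_naughty_given_spam * p_money_given_spam
             / (p_naughty * p_money))
print(posterior)  # 2 -- greater than 1, so not a valid probability

# The exact joint relative frequency of (naughty, money) is 1/2 (one
# document of two), not 1/2 * 1/2 = 1/4 as independence predicts.
# Using the true joint in the denominator recovers a sane answer:
exact = p_spam * p_naughty_given_spam * p_money_given_spam / Fraction(1, 2)
print(exact)  # 1 -- the correct posterior
```

This is exactly the thread's arithmetic: independence predicts half a document containing both words, while conditional independence given "spam" predicts one, and the ratio of those two incompatible counts is what overflows past 1.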