
Re: Probabilities always >= 0 and <= 1?
Posted: May 8, 2012 10:41 AM


FFMG writes:
> > You have made an incorrect independence assumption. As both
> > "naughty" and "money" are only present in "spam" documents, which
> > form half of the total number of documents, they are dependent
> > variables. But, you calculate p(e) as p(money) * p(naughty) which
> > is assuming that the variables are independent. Hence your
> > problem.
>
> Sorry, that's not an assumption, that's the way the problem
> definition goes, the words "naughty" and "money" are indeed only
> present in "spam".
>
> And they are independent variables, the presence of "naughty" is not
> dependent on "money", (and vice versa).
>
> The formula is P(C | F1...Fn) = P(C) P(F1 | C)...P(Fn | C) / (P(F1)...P(Fn))
>
> So, given the problem in my original post, the result is not between
> 0 and 1.

Probability theory only gives you P(C | F1...Fn) = P(C) P(F1...Fn | C) / P(F1...Fn).
Then come the independence assumptions which allow you to expand P(F1...Fn | C) as P(F1 | C)...P(Fn | C) and P(F1...Fn) similarly. These give Naive Bayes its first name.
If "naughty" and "money" were exactly independent and probabilities were exactly relative frequencies in your document collection, there should be half a document that contains them both. Half a document does not quite make sense, but there's worse: if "naughty" and "money" were exactly independent given "spam", there should be _one_ document that contains both "naughty" and "money" (and is classified as "spam").
Since we don't want to accept 1/2 = 1 and we think that relative frequencies do have the formal properties of probabilities, we blame the independence assumptions. I suppose they would be closer to the truth much of the time in a larger population.
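
To make the arithmetic concrete, here is a small Python sketch. The eight-document collection in it is made up for illustration (I am not claiming it matches the one in the original post); it is merely one collection in which the words occur only in "spam", "spam" is half the documents, and the "half a document" versus "one document" clash shows up. The helper names are hypothetical too.

```python
from fractions import Fraction

# Hypothetical collection (an assumption, not the original poster's data):
# 4 "spam" and 4 "ham" documents; both words occur only in "spam".
DOCS = [
    ("spam", {"naughty", "money"}),
    ("spam", {"naughty"}),
    ("spam", {"money"}),
    ("spam", set()),
    ("ham", set()),
    ("ham", set()),
    ("ham", set()),
    ("ham", set()),
]


def is_spam(doc):
    return doc[0] == "spam"


def has(*words):
    """Return a predicate that is true when a document contains all words."""
    return lambda doc: set(words) <= doc[1]


def p(event, given=None):
    """Relative frequency of `event` among the documents satisfying `given`."""
    pool = [d for d in DOCS if given is None or given(d)]
    return Fraction(sum(1 for d in pool if event(d)), len(pool))


p_spam = p(is_spam)

# Naive Bayes: independence assumed in the numerator and the denominator.
naive = (p_spam * p(has("naughty"), is_spam) * p(has("money"), is_spam)
         / (p(has("naughty")) * p(has("money"))))

# Plain Bayes: joint relative frequencies, no independence assumption.
exact = (p_spam * p(has("naughty", "money"), is_spam)
         / p(has("naughty", "money")))

# Number of documents with both words predicted by each assumption.
spam_count = sum(1 for d in DOCS if is_spam(d))
both_if_independent = len(DOCS) * p(has("naughty")) * p(has("money"))
both_if_independent_given_spam = (
    spam_count * p(has("naughty"), is_spam) * p(has("money"), is_spam))

print(naive)                           # 2
print(exact)                           # 1
print(both_if_independent)             # 1/2
print(both_if_independent_given_spam)  # 1
```

In this made-up collection the independence given "spam" happens to hold exactly, so the damage comes entirely from treating P(naughty, money) as P(naughty) P(money) in the denominator. With the joint relative frequencies, Bayes' rule stays within [0, 1].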

