
Re: Probabilities always >= 0 and <= 1?
Posted: May 9, 2012 12:14 AM


FFMG writes:
> On Tuesday, 8 May 2012 11:35:35 UTC+2, FFMG wrote:
> > Hi,
> >
> > I was looking at a site,
> > (http://bionicspirit.com/blog/2012/02/09/howtobuildnaivebayesclassifier.html),
> > basically talking about a Naive Bayes Classifier.
> >
> > But in some cases the formula gives me probabilities greater than 1.
> > How is that possible?
> >
> > // Total of 18 documents.
> > // * 9 documents out of a total of 18 are spam messages
> > // * 3 documents out of those 18 contain the word "naughty"
> > // * 3 documents containing the word "naughty" have been marked as spam
> > // * 3 documents out of the total contain the word "money"
> > // * 3 emails out of those have been marked as spam
> >
> > P(spam|naughty,money) = P(naughty|spam) * P(money|spam) * P(spam)
> >                         -----------------------------------------
> >                                   P(naughty) * P(money)
> >
> > P(spam|naughty,money) = 3/9 * 3/9 * 9/18
> >                         ---------------- = 2
> >                           3/18 * 3/18
> >
> > But how can a probability be outside of 0 and 1? Must I always
> > force the numbers to be between 0 and 1 and accept that in some
> > cases they will fall outside the range?
> >
> > Many thanks for suggestions as to where I might have gone wrong.
> >
> > Regards,
> >
> > FFMG
>
> Thanks for all the replies, I guess I will force the document
> classification between 0 and 1, because in my case I will have hundreds
> of thousands of documents, (we have +200000 currently), and
> hopefully it will not take more than 5000 'training' documents to get some
> meaningful data classification.
>
> I just thought that even with my 18 documents I should still get a
> probability between 0 and 1.
>
> My main task was to write unit tests, and if the correct result in
> my test with 18 documents is a probability of '2' then I guess the
> calculations are valid.
>
> Thanks for all inputs and suggestions.

Look again at Ray Vickson's post. I think he hit the nail on the head.
He used the law of total probability to expand the denominator as
P("naughty", "money") = P("naughty", "money", "spam" or "not spam") = P("naughty", "money", "spam") + P("naughty", "money", "not spam")
where each joint term equals P("naughty", "money" | class) * P(class), after which you can use the _same_ independence assumption in both the numerator and the denominator. That shouldn't lead to such a blatant contradiction.
Perhaps this is how Naive Bayes is always done. I haven't checked.
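
To make the difference concrete for the 18-document example, here is a minimal Python sketch. The function names are made up for illustration, and it assumes that none of the 9 non-spam documents contain "naughty" or "money" (which follows from the counts in the quoted post, since all 3 documents with each word are spam). It uses no smoothing, so it would divide by zero if every class had a zero joint term; treat it as a sketch, not a finished classifier.

```python
from fractions import Fraction


def blog_formula(likelihoods_spam, marginals, prior_spam):
    """Formula from the quoted post: independence is assumed in the
    numerator (given spam) AND, separately, for the unconditional
    marginals in the denominator. The result need not be <= 1."""
    numerator = prior_spam
    for p in likelihoods_spam:
        numerator *= p
    denominator = Fraction(1)
    for p in marginals:
        denominator *= p
    return numerator / denominator


def normalised_posterior(likelihoods_spam, likelihoods_ham, prior_spam):
    """Denominator expanded by the law of total probability, using the
    same class-conditional independence assumption as the numerator."""
    joint_spam = prior_spam
    for p in likelihoods_spam:
        joint_spam *= p
    joint_ham = 1 - prior_spam
    for p in likelihoods_ham:
        joint_ham *= p
    # No smoothing: raises ZeroDivisionError if both joint terms are zero.
    return joint_spam / (joint_spam + joint_ham)


# Counts from the quoted post: 18 documents, 9 spam,
# 3 contain "naughty" (all spam), 3 contain "money" (all spam).
p_spam = Fraction(9, 18)
given_spam = [Fraction(3, 9), Fraction(3, 9)]   # P(naughty|spam), P(money|spam)
given_ham = [Fraction(0, 9), Fraction(0, 9)]    # assumed: no non-spam doc has either word
marginals = [Fraction(3, 18), Fraction(3, 18)]  # P(naughty), P(money)

print(blog_formula(given_spam, marginals, p_spam))          # 2
print(normalised_posterior(given_spam, given_ham, p_spam))  # 1
```

With the denominator built from the same per-class terms as the numerator, the result is a ratio of one non-negative term to a sum that includes it, so it cannot exceed 1. Here it comes out as exactly 1, which matches the data: every document containing either word is spam.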

