The Math Forum

Topic: Probabilities always >= 0 and <= 1?
Replies: 20   Last Post: May 9, 2012 12:14 AM

Jussi Piitulainen
Re: Probabilities always >= 0 and <= 1?
Posted: May 8, 2012 10:41 AM

FFMG writes:

> > You have made an incorrect independence assumption. As both
> > "naughty" and "money" are only present in "spam" documents, which
> > form half of the total number of documents, they are dependent
> > variables. But, you calculate p(e) as p(money) * p(naughty) which
> > is assuming that the variables are independent. Hence your
> > problem.

> Sorry, that's not an assumption, that's the way the problem
> definition goes, the words "naughty" and "money" are indeed only
> present in "spam".
> And they are independent variables, the presence of "naughty" is not
> dependent on "money", (and vice versa).
> The formula is
>   P(C|F1...Fn) = P(C) P(F1|C)...P(Fn|C) / (P(F1)...P(Fn))
> So, given the problem in my original post, the result is not between
> 0 and 1.

Probability theory only gives you
P(C | F1...Fn) = P(C) P(F1...Fn | C) / P(F1...Fn).

Then come the independence assumptions which allow you to expand
P(F1...Fn | C) as P(F1 | C)...P(Fn | C) and P(F1...Fn) similarly.
These give Naive Bayes its first name.
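
As a minimal sketch of the difference between the two formulas, assume a
hypothetical four-document collection. The documents below are invented
for illustration and are not the data from the original post: two "spam"
and two "ham" documents, with "naughty" in one spam document and "money"
in both. With probabilities taken as exact relative frequencies, the
exact Bayes formula stays within [0, 1], while the fully factorised
version does not.

```python
from fractions import Fraction

# Hypothetical collection (illustrative only, not the data from the thread).
docs = [
    ("spam", {"naughty", "money"}),
    ("spam", {"money"}),
    ("ham", {"hello"}),
    ("ham", {"meeting"}),
]

def prob(pred, given=lambda label, words: True):
    """Relative frequency of pred among the documents satisfying given."""
    pool = [d for d in docs if given(*d)]
    return Fraction(sum(1 for d in pool if pred(*d)), len(pool))

def is_spam(label, words):
    return label == "spam"

def has(word):
    return lambda label, words: word in words

def has_both(label, words):
    return {"naughty", "money"} <= words

# Exact Bayes: P(C | F1 F2) = P(C) P(F1 F2 | C) / P(F1 F2)
exact = prob(is_spam) * prob(has_both, is_spam) / prob(has_both)

# Naive version: numerator and denominator both factorised
naive = (prob(is_spam)
         * prob(has("naughty"), is_spam) * prob(has("money"), is_spam)
         / (prob(has("naughty")) * prob(has("money"))))

print(exact, naive)  # 1 2
```

In this made-up collection the factorisation of the numerator happens to
be exact; the value 2 comes entirely from treating P(F1 F2) as
P(F1) P(F2) in the denominator.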

If "naughty" and "money" were exactly independent and probabilities
exactly relative frequencies in your document collection, there should
be half a document that contains them both. Half a document does not
quite make sense, but there's worse: if "naughty" and "money" were
exactly independent given "spam", there should be _one_ document that
contains both "naughty" and "money" (and is classified as "spam").

Since we don't want to accept 1/2 = 1 and we think that relative
frequencies do have the formal properties of probabilities, we blame
the independence assumptions. I suppose they would be closer to the
truth much of the time in a larger population.
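
The counting argument above can be checked with a few lines of
arithmetic. The counts here are assumptions chosen only so that the
result reproduces the 1/2 and 1 mentioned above; the actual counts from
the original post are not shown in this message.

```python
from fractions import Fraction

# Hypothetical counts (assumed for illustration; not taken from the thread).
n_docs = 4      # total documents
n_spam = 2      # documents labelled "spam"
n_naughty = 1   # documents containing "naughty" (all of them spam)
n_money = 2     # documents containing "money" (all of them spam)

# Unconditional independence: expected number of documents with both words
expected_overall = n_docs * Fraction(n_naughty, n_docs) * Fraction(n_money, n_docs)

# Independence given "spam": expected number of spam documents with both words
expected_in_spam = n_spam * Fraction(n_naughty, n_spam) * Fraction(n_money, n_spam)

print(expected_overall, expected_in_spam)  # 1/2 1
```

Both words occur only in spam, so the two numbers count the same set of
documents and ought to agree. They do not, so at least one of the two
independence assumptions fails for these frequencies.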
