Topic: Is a most-likely probability 'better' depending on the size of the
next-most-likely?

Replies: 4   Last Post: Feb 3, 2010 12:28 PM

Re: Is a most-likely probability 'better' depending on the size of
the next-most-likely?

Posted: Feb 2, 2010 10:32 AM

On 2 Feb, 12:59, "David Jones" <dajx...@ceh.ac.uk> wrote:
> Steve wrote:
> > Hi,
>
> > I'm working on an algorithm to guess the correct English word within
> > text in which some words have become illegible.
> > It boils down to creating a list of candidate words, along with their
> > probabilities, and choosing the most likely.
> > Alternating between training data, and new test data, I can establish
> > that the probability estimations are fairly accurate. (Though to be
> > useful, the algorithm needs to provide a shorter candidate list in the
> > first place!)
>
> > Suppose I have two cases:
> > A) There are 2 candidate words with probabilities 0.51 and 0.49.
> > B) There are 101 candidate words, one with P=0.51, and a hundred
> > others all with P = .0049.
>
> > One of the approaches the algorithm takes is based on the N recent
> > known words prior to the unknown word (its Ngram), so there are
> > inevitably situations when the Ngram contains words that have
> > themselves been corrected in a prior step. If this is the case, I need
> > to know how much I can rely on that previous result.
> > Is there any basis for believing that in case B) the result is more
> > trustworthy? After all, the choice with P=0.51 is more than 100 times
> > more likely than the next best word. But in case A) there's virtually
> > nothing to choose between them.
> > Rightly or wrongly, that's how I intuitively feel about the choices,
> > but then I remember... both 'best choices' will be wrong 49% of the
> > time, so it doesn't make any difference!
>
> > Is there a measure for this, or is it totally irrelevant?
> > ------------
>
> > Eventually the goal is to have a much higher confidence than 0.51 in a
> > single choice, but there will occasionally be situations with these
> > borderline results. In these cases I'll offer the user a drop-down
> > replacement list with all the choices and their probabilities, for
> > them to pick from.
> > way that they would be more confident making a choice in case B) than
> > A) .
>
> > Any thoughts?... Is this a bit of a Monty Hall problem?
>
> > Thanks
>
> > Steve
>
> Have you thought of involving a cost function? This would give a value/cost/utility to choosing word B, if word A is actually correct. Then some aspects of your problem would eventually become generalised to comparing a single alternative with its cost, with lots of small probabilities each having different costs. In such a case, you might prefer the second if a lot of the small probabilities are associated with small costs and only a few with high costs.
>
> Here "cost" might be used to distinguish similar words with similar meanings from similar words with different meanings.
>
> David Jones

Thanks David, I hadn't thought of that idea. There are a lot of
parameters linked to each candidate, such as meaning, part-of-speech,
usage-frequency, collocation-frequency, context likelihood etc., so I
could certainly shape some kind of cost for going against the grain of
these.
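
To make the cost idea concrete, here is a minimal Python sketch of the expected cost of committing to one candidate. The cost values are assumptions made up for illustration (nothing in the thread specifies them), and the names expected_costs, zero_one and near_miss are hypothetical helpers. The probabilities are the 'cents' and 'wife' figures from the examples in this post.

```python
def expected_costs(probs, cost):
    """Map each candidate to its expected cost if we commit to it.

    probs: dict of word -> probability that the word is the true one.
    cost:  function (chosen, true) -> cost of picking `chosen` when
           `true` is actually correct.
    """
    return {
        chosen: sum(p * cost(chosen, true) for true, p in probs.items())
        for chosen in probs
    }


def zero_one(chosen, true):
    # Every wrong choice is equally bad.
    return 0.0 if chosen == true else 1.0


def near_miss(chosen, true):
    # ASSUMPTION: a wrong word from the same family (one currency for
    # another) costs only 0.2. The figure is invented for illustration.
    return 0.0 if chosen == true else 0.2


# Probabilities from the two examples in this post. The ten unnamed
# alternatives stand in for rupees, yen and so on.
cents = {"cents": 0.51, **{f"currency_{i}": 0.049 for i in range(10)}}
wife = {"wife": 0.51, "husband": 0.49}

cents_01 = expected_costs(cents, zero_one)["cents"]
wife_01 = expected_costs(wife, zero_one)["wife"]
cents_soft = expected_costs(cents, near_miss)["cents"]

print(cents_01, wife_01, cents_soft)
```

If every error costs the same, the two best choices have the same expected cost, which is just the 49% error rate. They only separate once the errors are priced differently, which seems to be the point of the cost suggestion.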

I'm still wondering if there's some simple heuristic involved with
cases like these though.

An example might be if I collected millions of Usenet postings and
found significant numbers of examples like these:

"Just my two * worth", with * = cents 51% of the time and each of 10
alternatives (rupees, yen etc.) 4.9% of the time

and

"I married my * in a church" with * = wife 51% and husband 49%.

Even if you were sure the probabilities were very accurate, the
'cents' example just seems a safer bet because each alternative is
quite unlikely.
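
As a rough way of putting numbers on that intuition, the sketch below computes a few standard summaries for the two examples: the error rate of the top choice, the gap and ratio between the top two candidates, and the Shannon entropy of the whole list. Which of these, if any, is the right measure is exactly the open question, so treat it as an illustration and not an answer. The helper name summarize is hypothetical.

```python
import math


def summarize(probs):
    """Summaries of a candidate list given as a list of probabilities."""
    ranked = sorted(probs, reverse=True)
    top, second = ranked[0], ranked[1]
    return {
        # Chance that the top choice is wrong.
        "error_rate": 1.0 - top,
        # Absolute and relative lead over the runner-up.
        "margin": top - second,
        "ratio": top / second,
        # Shannon entropy in bits; larger means the mass is more spread out.
        "entropy_bits": -sum(p * math.log2(p) for p in probs if p > 0),
    }


cents = summarize([0.51] + [0.049] * 10)  # 'cents' plus ten alternatives
wife = summarize([0.51, 0.49])            # 'wife' vs 'husband'

for name, stats in (("cents", cents), ("wife", wife)):
    print(name, {k: round(v, 4) for k, v in stats.items()})
```

The error rate is the same in both cases, and the margin and ratio favour 'cents'. The entropy goes the other way: it is higher for 'cents' because the remaining 49% is spread over ten words. So the measures do not agree on which case is the safer one.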

Steve