Topic: Is a most-likely probability 'better' depending on the size of the
next-most-likely?

Replies: 4   Last Post: Feb 3, 2010 12:28 PM

Posted: Feb 2, 2010 7:17 AM

Hi,

I'm working on an algorithm to guess the correct English word within
text in which some words have become illegible.
It boils down to creating a list of candidate words, along with their
probabilities, and choosing the most likely.
By alternating between training data and new test data, I have
established that the probability estimates are fairly accurate.
(To be useful, though, the algorithm needs to produce a shorter
candidate list in the first place!)

Suppose I have two cases:
A) There are 2 candidate words with probabilities 0.51 and 0.49.
B) There are 101 candidate words: one with P = 0.51, and a hundred
others, each with P = 0.0049.

One of the approaches the algorithm takes is based on the N most
recent known words before the unknown word (its N-gram), so there are
inevitably situations where the N-gram contains words that were
themselves corrected in a prior step. When that happens, I need
to know how much I can rely on that earlier result.
Is there any basis for believing that in case B) the result is more
trustworthy? After all, the choice with P=0.51 is more than 100 times
more likely than the next best word. But in case A) there's virtually
nothing to choose between them.
Rightly or wrongly, that's how I intuitively feel about the choices,
but then I remember... both 'best choices' will be wrong 49% of the
time, so it doesn't make any difference!

Is there a measure for this, or is it totally irrelevant?
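
To make the comparison concrete, here is a rough Python sketch that
computes a few possible measures for cases A and B. The choice of
measures (error rate of the top pick, ratio of the top pick to the
runner-up, and Shannon entropy of the whole list) and the function
name confidence_summary are my own assumptions for illustration, not
an established answer to the question.

```python
import math

def confidence_summary(probs):
    """Return some candidate 'confidence' measures for a probability list.

    These measures are illustrative assumptions, not a recommended standard.
    """
    ranked = sorted(probs, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Shannon entropy in bits; higher means the distribution is more spread out.
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return {
        "best": best,
        "error_rate": 1 - best,                 # chance the top pick is wrong
        "ratio_to_runner_up": best / runner_up,  # how far ahead the top pick is
        "entropy_bits": entropy,
    }

case_a = [0.51, 0.49]                # case A: two candidates
case_b = [0.51] + [0.0049] * 100     # case B: one strong, a hundred weak

summary_a = confidence_summary(case_a)
summary_b = confidence_summary(case_b)

for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, summary)
```

The error rate of the top pick comes out the same in both cases, the
ratio to the runner-up is far larger in case B, and the entropy is
higher in case B because the remaining probability is spread over so
many words. So the measures disagree with each other, which is really
the heart of my question.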
------------

Eventually the goal is to have a much higher confidence than 0.51 in a
single choice, but there will occasionally be situations with these
borderline results. In these cases I'll offer the user a drop-down
replacement list with all the choices and their probabilities, for
them to pick from.
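
A minimal sketch of that fallback, assuming the candidates arrive as
a word-to-probability dictionary. The function name pick_or_ask, the
0.9 threshold, and the example words are all hypothetical, chosen
only to illustrate the idea.

```python
def pick_or_ask(candidates, threshold=0.9):
    """Auto-select the best word, or return a ranked list for the user.

    `candidates` maps word -> probability. `threshold` is a hypothetical
    cut-off; the real value would have to be tuned on test data.
    """
    # Rank candidates from most to least likely.
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    best_word, best_p = ranked[0]
    if best_p >= threshold:
        return best_word, []   # confident enough: no list needed
    return None, ranked        # borderline: show the ranked drop-down list

# A borderline result like case A falls back to the drop-down list.
choice, dropdown = pick_or_ask({"cat": 0.51, "cot": 0.49})
```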