
Is a most-likely probability 'better' depending on the size of the next-most-likely?
Posted:
Feb 2, 2010 7:17 AM


Hi,
I'm working on an algorithm to guess the correct English word within text in which some words have become illegible. It boils down to creating a list of candidate words, along with their probabilities, and choosing the most likely. By alternating between training data and new test data, I can establish that the probability estimates are fairly accurate. (Though to be useful, the algorithm needs to provide a shorter candidate list in the first place!)
Suppose I have two cases:
A) There are 2 candidate words with probabilities 0.51 and 0.49.
B) There are 101 candidate words, one with P = 0.51, and a hundred others all with P = 0.0049.
One of the approaches the algorithm takes is based on the N most recent known words prior to the unknown word (its n-gram), so there are inevitably situations where the n-gram contains words that have themselves been corrected in a prior step. If this is the case, I need to know how much I can rely on that previous result.
Is there any basis for believing that in case B) the result is more trustworthy? After all, the choice with P = 0.51 is more than 100 times more likely than the next best word, whereas in case A) there's virtually nothing to choose between them. Rightly or wrongly, that's how I intuitively feel about the choices, but then I remember... both 'best choices' will be wrong 49% of the time, so it doesn't make any difference!
Is there a measure for this, or is it totally irrelevant? 
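To make the two cases concrete, here is a rough Python sketch that computes a few candidate statistics for each one. It is only an illustration: the function name `summarize` and the choice of statistics (margin, ratio to the runner-up, Shannon entropy) are assumptions made for this example, not part of the algorithm described above, and none of them is claimed to be the right measure.

```python
import math

def summarize(probs):
    """Describe a candidate distribution (hypothetical helper, for illustration only)."""
    ranked = sorted(probs, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    return {
        # Chance that the top-ranked word is wrong, assuming the estimates are accurate
        "error_rate": 1.0 - best,
        # Absolute gap between the best and second-best candidate
        "margin": best - runner_up,
        # How many times more likely the best candidate is than the runner-up
        "ratio": best / runner_up,
        # Shannon entropy of the whole distribution, in bits
        "entropy_bits": -sum(p * math.log2(p) for p in ranked if p > 0),
    }

case_a = [0.51, 0.49]              # case A: two near-equal candidates
case_b = [0.51] + [0.0049] * 100   # case B: one leader and a hundred long shots

for name, probs in (("A", case_a), ("B", case_b)):
    stats = summarize(probs)
    print(name, {key: round(value, 4) for key, value in stats.items()})
```

The error rate of picking the top word comes out the same (0.49) in both cases, while the margin and ratio are much larger in case B and the entropy is also larger in case B, since the remaining probability is spread over many more words.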
Eventually the goal is to have a much higher confidence than 0.51 in a single choice, but there will occasionally be situations with these borderline results. In these cases I'll offer the user a dropdown replacement list with all the choices and their probabilities, for them to pick from. Talking to non-maths friends about this, most of them feel the same way: they would be more confident making a choice in case B) than in case A).
Any thoughts?... Is this a bit of a Monty Hall problem?
Thanks
Steve

