I'm working on an algorithm to guess the correct English word within text in which some words have become illegible. It boils down to creating a list of candidate words, along with their probabilities, and choosing the most likely. By alternating between training data and fresh test data, I can establish that the probability estimates are fairly accurate. (Though to be useful, the algorithm needs to produce a shorter candidate list in the first place!)
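For concreteness, the "choose the most likely" step amounts to something like the sketch below; the function name and the word → probability dict layout are just placeholders for illustration, not how my code is actually structured.

```python
# Placeholder sketch: candidates as a word -> probability mapping,
# and "choose the most likely" as a plain argmax over that mapping.
def best_candidate(candidates: dict[str, float]) -> tuple[str, float]:
    word = max(candidates, key=candidates.get)
    return word, candidates[word]
```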
Suppose I have two cases: A) there are two candidate words with probabilities 0.51 and 0.49; B) there are 101 candidate words, one with P = 0.51, and a hundred others all with P = 0.0049.
One of the approaches the algorithm takes is based on the N known words immediately preceding the unknown word (its N-gram), so there are inevitably situations where the N-gram contains words that were themselves corrected in an earlier step. When that happens, I need to know how much I can rely on that previous result. Is there any basis for believing that the result is more trustworthy in case B)? After all, the choice with P = 0.51 is more than 100 times more likely than the next best word, whereas in case A) there's virtually nothing to choose between them. Rightly or wrongly, that's how I intuitively feel about the choices, but then I remember: both 'best choices' will be wrong 49% of the time, so it doesn't make any difference!
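To make the comparison concrete, here's a throwaway sketch of the two cases (the words are placeholders; only the probability structure matters). The top choice and its probability are identical in both; only the margin over the runner-up changes:

```python
# Case A: two near-equal candidates. Case B: one at 0.51 plus 100 at 0.0049
# (0.51 + 100 * 0.0049 = 1.0).
case_a = {"w1": 0.51, "w2": 0.49}
case_b = {"w1": 0.51, **{f"alt{i}": 0.0049 for i in range(100)}}

for name, cands in (("A", case_a), ("B", case_b)):
    ranked = sorted(cands.items(), key=lambda kv: kv[1], reverse=True)
    (top, p_top), (_, p_next) = ranked[0], ranked[1]
    print(f"Case {name}: top p = {p_top}, ratio to runner-up = {p_top / p_next:.0f}")
# Case A: ratio ~1; Case B: ratio ~104 -- yet the top choice is wrong 49% of the time in both.
```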
Is there a measure for this, or is it totally irrelevant?

------------
Eventually the goal is to have a much higher confidence than 0.51 in a single choice, but there will occasionally be situations with these borderline results. In those cases I'll offer the user a drop-down replacement list with all the choices and their probabilities, for them to pick from. When I talk to non-maths friends about this, most of them feel the same way: they would be more confident making a choice in case B) than in case A).
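The fallback logic I have in mind is roughly the following; the 0.9 threshold is just a stand-in for whatever "much higher confidence" ends up meaning in practice.

```python
def dropdown_choices(candidates: dict[str, float], threshold: float = 0.9):
    """Return None when the top candidate is confident enough to auto-correct;
    otherwise return the full candidate list, sorted by probability, for the drop-down."""
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    if ranked and ranked[0][1] >= threshold:
        return None
    return ranked
```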
Any thoughts?... Is this a bit of a Monty Hall problem?