On 2 Feb, 12:59, "David Jones" <dajx...@ceh.ac.uk> wrote:
> Steve wrote:
> > Hi,
> >
> > I'm working on an algorithm to guess the correct English word within
> > text in which some words have become illegible.
> > It boils down to creating a list of candidate words, along with their
> > probabilities, and choosing the most likely.
> > Alternating between training data, and new test data, I can establish
> > that the probability estimations are fairly accurate. (Though to be
> > useful, the algorithm needs to provide a shorter candidate list in the
> > first place!)
> >
> > Suppose I have two cases:
> > A) There are 2 candidate words with probabilities 0.51 and 0.49.
> > B) There are 101 candidate words, one with P=0.51, and a hundred
> > others all with P = .0049.
> >
> > One of the approaches the algorithm takes is based on the N recent
> > known words prior to the unknown word (its Ngram), so there are
> > inevitably situations when the Ngram contains words that have
> > themselves been corrected in a prior step. If this is the case, I need
> > to know how much I can rely on that previous result.
> > Is there any basis for believing that in case B) the result is more
> > trustworthy? After all, the choice with P=0.51 is more than 100 times
> > more likely than the next best word. But in case A) there's virtually
> > nothing to choose between them.
> > Rightly or wrongly, that's how I intuitively feel about the choices,
> > but then I remember... both 'best choices' will be wrong 49% of the
> > time, so it doesn't make any difference!
> >
> > Is there a measure for this, or is it totally irrelevant?
> > ------------
> >
> > Eventually the goal is to have a much higher confidence than 0.51 in a
> > single choice, but there will occasionally be situations with these
> > borderline results. In these cases I'll offer the user a drop-down
> > replacement list with all the choices and their probabilities, for
> > them to pick from.
> > Talking to non-maths friends about this, most of them feel the same
> > way that they would be more confident making a choice in case B) than
> > A).
> >
> > Any thoughts?... Is this a bit of a Monty Hall problem?
> >
> > Thanks
> >
> > Steve
>
> Have you thought of involving a cost function? This would give a
> value/cost/utility to choosing word B, if word A is actually correct.
> Then some aspects of your problem would eventually become generalised
> to comparing a single alternative with its cost, with lots of small
> probabilities each having different costs. In such a case, you might
> prefer the second if a lot of the small probabilities are associated
> with small costs and only a few with high costs.
>
> Here "cost" might be used to distinguish similar words with similar
> meanings from similar words with different meanings.
>
> David Jones
Thanks David, I hadn't thought of that idea. There are a lot of parameters linked to each candidate, such as meaning, part of speech, usage frequency, collocation frequency, context likelihood, etc., so I could certainly shape some kind of cost for going against the grain of these.
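As a rough sketch of how such a cost might enter the choice: the expected cost of committing to a candidate is the sum, over every word that could actually be correct, of its probability times the cost of the mistake. Everything below is an assumption made for illustration: the function names, the placeholder candidates other1..other8, and especially the cost values (0.2 for swapping one currency for another, 1.0 for swapping wife and husband) are invented, not derived from any of the parameters listed above.

```python
def expected_cost(choice, probs, cost):
    """Expected cost of committing to `choice`.

    probs: dict mapping each candidate word to its probability.
    cost:  function (chosen, actual) -> penalty; hypothetical, 0 when equal.
    """
    return sum(p * cost(choice, actual) for actual, p in probs.items())


def best_choice(probs, cost):
    """Return the candidate with the lowest expected cost."""
    return min(probs, key=lambda word: expected_cost(word, probs, cost))


# "Just my two * worth": one favourite plus ten alternatives at 4.9% each.
# Only "rupees" and "yen" are named; the rest are placeholders.
alternatives = ["rupees", "yen"] + [f"other{i}" for i in range(1, 9)]
cents_probs = {"cents": 0.51, **{word: 0.049 for word in alternatives}}

# "I married my * in a church"
wife_probs = {"wife": 0.51, "husband": 0.49}


def currency_cost(chosen, actual):
    # Assumed: confusing one currency with another is a mild error.
    return 0.0 if chosen == actual else 0.2


def spouse_cost(chosen, actual):
    # Assumed: confusing wife and husband changes the meaning outright.
    return 0.0 if chosen == actual else 1.0


cents_risk = expected_cost("cents", cents_probs, currency_cost)
wife_risk = expected_cost("wife", wife_probs, spouse_cost)
print(f"cents: {cents_risk:.3f}  wife: {wife_risk:.3f}")
```

With these made-up costs, both best guesses are still wrong 49% of the time, but the expected cost of guessing "cents" comes out far lower than that of guessing "wife", because its mistakes are assumed to be cheap ones.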
I'm still wondering if there's some simple heuristic involved with cases like these though.
An example might be if I collected millions of usenet postings and found a significant number of examples like these:
"Just my two * worth", with * = cents 51% of the time, and rupees, yen, etc. each 4.9% of the time across 10 variations.
"I married my * in a church", with * = wife 51% of the time and husband 49%.
Even if you were sure the probabilities were very accurate, the 'cents' example just seems a safer bet because each alternative is quite unlikely.
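To put numbers on that intuition, here is a small sketch comparing three candidate measures on the two examples: the error rate of the best guess, the ratio of the best probability to the runner-up, and the Shannon entropy of the whole distribution. The choice of these three measures and the function names are assumptions for illustration only; none of them is offered as the right answer.

```python
import math


def error_rate(probs):
    """Chance that the single most likely candidate is wrong."""
    return 1.0 - max(probs)


def top_two_ratio(probs):
    """How many times more likely the best candidate is than the runner-up."""
    ranked = sorted(probs, reverse=True)
    return ranked[0] / ranked[1]


def entropy_bits(probs):
    """Shannon entropy in bits; higher means more overall uncertainty."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


wife = [0.51, 0.49]              # "I married my * in a church"
cents = [0.51] + [0.049] * 10    # "Just my two * worth"

for name, probs in (("wife", wife), ("cents", cents)):
    print(
        f"{name:5s} error={error_rate(probs):.2f} "
        f"ratio={top_two_ratio(probs):.2f} "
        f"entropy={entropy_bits(probs):.2f} bits"
    )
```

The three measures disagree. The error rate is 0.49 in both cases, which is the "it doesn't make any difference" view. The top-two ratio is roughly ten times larger for "cents", which matches the "safer bet" feeling. The entropy is actually higher for "cents", since the remaining 49% is spread over many more alternatives.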