On 3 Feb, 10:51, "David Jones" <dajx...@ceh.ac.uk> wrote:
> Steve wrote:
> > On 2 Feb, 12:59, "David Jones" <dajx...@ceh.ac.uk> wrote:
> >> Steve wrote:
> >>> Hi,
>
> >>> I'm working on an algorithm to guess the correct English word within
> >>> text in which some words have become illegible.
> >>> It boils down to creating a list of candidate words, along with
> >>> their probabilities, and choosing the most likely.
> >>> Alternating between training data, and new test data, I can
> >>> establish that the probability estimations are fairly accurate.
> >>> (Though to be useful, the algorithm needs to provide a shorter
> >>> candidate list in the first place!)
>
> >>> Suppose I have two cases:
> >>> A) There are 2 candidate words with probabilities 0.51 and 0.49.
> >>> B) There are 101 candidate words, one with P=0.51, and a hundred
> >>> others all with P = .0049.
>
> >>> One of the approaches the algorithm takes is based on the N recent
> >>> known words prior to the unknown word (its Ngram), so there are
> >>> inevitably situations when the Ngram contains words that have
> >>> themselves been corrected in a prior step. If this is the case, I
> >>> need to know how much I can rely on that previous result.
> >>> Is there any basis for believing that in case B) the result is more
> >>> trustworthy? After all, the choice with P=0.51 is more than 100
> >>> times more likely than the next best word. But in case A) there's
> >>> virtually nothing to choose between them.
> >>> Rightly or wrongly, that's how I intuitively feel about the choices,
> >>> but then I remember... both 'best choices' will be wrong 49% of the
> >>> time, so it doesn't make any difference!
>
> >>> Is there a measure for this, or is it totally irrelevant?
> >>> ------------
>
> >>> Eventually the goal is to have a much higher confidence than 0.51
> >>> in a single choice, but there will occasionally be situations with
> >>> these borderline results.
> >>> In these cases I'll offer the user a
> >>> drop-down replacement list with all the choices and their
> >>> probabilities, for them to pick from.
> >>> Talking to non-maths friends about this, most of them feel the same
> >>> way that they would be more confident making a choice in case B)
> >>> than A).
>
> >>> Any thoughts?... Is this a bit of a Monty Hall problem?
>
> >>> Thanks
>
> >>> Steve
>
> >> Have you thought of involving a cost function? This would give a
> >> value/cost/utility to choosing word B, if word A is actually
> >> correct. Then some aspects of your problem would eventually become
> >> generalised to comparing a single alternative with its cost, with
> >> lots of small probabilities each having different costs. In such a
> >> case, you might prefer the second if a lot of the small
> >> probabilities are associated with small costs and only a few with
> >> high costs.
>
> >> Here "cost" might be used to distinguish similar words with similar
> >> meanings from similar words with different meanings.
>
> >> David Jones
>
> > Thanks David, I hadn't thought of that idea. There's a lot of
> > parameters linked to each candidate, such as meaning, part-of-speech,
> > usage-frequency, collocation-frequency, context likelihood etc. so I
> > could certainly shape some kind of cost for going against the grain of
> > these.
>
> > I'm still wondering if there's some simple heuristic involved with
> > cases like these though.
>
> > An example might be if I collected millions of usenet postings and
> > found significant amounts of these examples:
>
> > "Just my two * worth" and found * = cents 51% of the time and found
> > rupees, yen etc 4.9% of the time for 10 variations.
>
> > and
>
> > "I married my * in a church" with * = wife 51% and husband 49%.
>
> > Even if you were sure the probabilities were very accurate, the
> > 'cents' example just seems a safer bet because each alternative is
> > quite unlikely.
>
> > Steve
>
> The cost approach has the potential that, with an extremely large
> amount of work, you could do a thorough application of decision theory
> to tell you what to choose in any given case. But it can also help to
> think about the problem. One of the essential parts is a probability
> like that of "being wrong if I choose this one", as this would weight
> the cost of the choice. If you think about these, rather than the
> probability that "this one is right", then it would help to justify
> your feeling about the interpretation to be made when you have lots of
> small probabilities... these would convert to lots of instances where
> the probability of being wrong is high.
>
> David Jones
It's definitely given me a new perspective on weighing these kinds of choices. There's no shortage of test data to try the cost approach and see how it performs compared to simple probability alone.
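As a first experiment, here is a minimal sketch of the cost idea applied to cases A and B from my original post. The function names, the placeholder word labels and the 0/1 cost (every wrong word costs 1, the right word costs 0) are my own assumptions for illustration, not anything David specified. It works out the expected cost of committing to each candidate, then compares the best candidate with the runner-up.

```python
def expected_cost(choice, probs, cost):
    """Expected cost of committing to `choice`, given P(actual word)."""
    return sum(p * cost(choice, actual) for actual, p in probs.items())


def zero_one_cost(choice, actual):
    # Assumed cost: any wrong word costs 1, the right word costs 0.
    return 0.0 if choice == actual else 1.0


def best_and_margin(probs, cost=zero_one_cost):
    """Return (best word, its expected cost, gap to the runner-up's cost)."""
    ranked = sorted(probs, key=lambda w: expected_cost(w, probs, cost))
    best, runner_up = ranked[0], ranked[1]
    best_cost = expected_cost(best, probs, cost)
    margin = expected_cost(runner_up, probs, cost) - best_cost
    return best, best_cost, margin


# Case A: two candidates. Case B: one at 0.51 plus a hundred at 0.0049.
# The word labels are placeholders, not real candidates.
case_a = {"word_1": 0.51, "word_2": 0.49}
case_b = {"word_1": 0.51, **{f"alt_{i}": 0.0049 for i in range(100)}}

for name, probs in (("A", case_a), ("B", case_b)):
    best, best_cost, margin = best_and_margin(probs)
    print(f"case {name}: choose {best}, expected cost {best_cost:.4f}, "
          f"margin over runner-up {margin:.4f}")
```

With the 0/1 cost, both top choices carry the same expected cost of about 0.49, which is just the "wrong 49% of the time" observation again. The gap to the next-best candidate is what differs: roughly 0.02 in case A against roughly 0.505 in case B. A cost function that charges less for near-synonyms could be passed in place of zero_one_cost to try the meaning-aware version David describes.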