The problem of document classification has almost as many methods of solution as people who have worked in the field. One common approach is to use the ``Naive Bayes'' classifier, which involves a little basic probability and very little guidance from a human director. In order to better understand the Naive Bayes classifier, I decided to write an implementation of it using Perl and the mysql database, tools with which I am very familiar.
The specific classification problem I am attacking is the categorization of questions emailed to the Math Forum's ``Ask Dr. Math'' service (http://www.mathforum.com/dr.math/), which I founded in 1994 and for which I have written much of the infrastructural software. The Math Forum hosts a categorized archive of several thousand previously answered questions, which provides an ample training data set for the classifier. It should be noted, however, that the training data is not completely representative of the test data, because documents in the archive have been selected for their quality and clarity, and they have been edited for grammar. Once this flaw has been noted, it will generally be ignored on the assumption that these differences do not significantly affect the ability to classify documents.
Before describing the classifier itself, it is necessary to cover the basic probability behind its operation.
Bayes' Theorem is a way of inverting a conditional probability. It states:
P(y|x) P(x) P(x|y) = ------------- P(y)
See also http://mathforum.org/dr.math/problems/battisfore.03.22.99.html for a simple but complete example of Bayes' Theorem.
In this case, we want to know the probability of a given category given a certain string of words in a document, so we have:
P(words | cat) P(cat) P(cat | words) = -------------------- P(words)
We have applied Bayes' Theorem because
P(cat | words) is a
difficult quantity to compute directly, but
P(words | cat) and
P(cat) are accessible (see below).
The greater the expression above, the greater the probability that the given document belongs to the given category. So we want to find the maximum value. We write this as
P(words | cat) P(cat) Best category = ArgMax ----------------------- cat in cats P(words)
P(words) doesn't change over the range of categories, we
can get rid of it. That's good, because we didn't want to have to compute
these values anyway. So our new formula is:
Best category = ArgMax P(words | cat) P(cat) cat in cats
Finally, we note that if w1, w2, ... wn are the words in the document, then this expression is equivalent to:
Best category = ArgMax P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat) cat in cats
That's the formula I use in my document categorization code. The last step is the only non-rigorous one in the derivation, and this is the ``naive'' part of the Naive Bayes technique. It assumes that the probability of each word appearing in a document is unaffected by the presence or absence of each other word in the document. We assume this even though we know this isn't true: for example, the word ``iodized'' is far more likely to appear in a document that contains the word ``salt'' than it is to appear in a document that contains the word ``subroutine''. Luckily, as it turns out, making this assumption even when it isn't true has little effect on our results, as the following paper explains: http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz
I have written an abstract object-oriented framework that I hope can be
easily adapted to other methods of document classification, and implemented
one subclass of this framework that does Naive Bayes classification. The
framework is called
AI::Categorize, and the Bayesian subclass is called
AI::Categorize::NaiveBayes allows the user to feed it the text of several documents (the training
set), which it will parse and add to the word frequency database (excluding
any words from a list of user-specified ``stopwords'', i.e. common words
like ``and'' or ``the'' that presumably carry no category information).
After all documents are added, the user directs the system to calculate the
probabilities that the system will use in classifying documents, i.e. the
probabilities seen in the final equation of the above section. These
probabilities are calculated directly from word frequencies - for example,
if there are 14,367 total words in a certain category ``geometry'' and 221
of them are ``vertex'', then
P(vertex|geometry) is equal to 221/14,367. Likewise, if there are 253,602 total words in the
entire archive and 14,367 of them are in the ``geometry'' category, then
P(geometry) is equal to 14,367/253,602. We calculate all such probabilities and store
them in the database for later use.
Once a sufficient number of training documents have been fed to the
database and the needed probabilities have been calculated, we can start
AI::Categorize::NaiveBayes to categorize new documents that it hasn't seen before. It returns to us an
ordered list of the most probable categories for that document.
A few sample documents, chosen at random from the pool of incoming Ask Dr.
Math questions, give an indication of how well the system works. For each
document, I give the top ten categories suggested by
AI::Categorize::NaiveBayes. The human-chosen best category is marked with a '*', reasonable
categories are marked with a '+', and unreasonable categories are marked
Document 1: Document 3: * golden.elem * golden.elem - exponent.college - exponent.college - symmetry.high + golden.high + golden.high + symmetry.high + grab.elem - logarithm.middle - squareroot.elem - ratio.middle + project.elem + history.middle + project.high - squareroot.elem + defs.middle + project.high - statistics.high - negative.high - graph.eq.middle - graph.eq.middle Document 2: Document 4: - exponent.college * probability.high - golden.elem + prob.stats.middle + geometry.middle + probability.college - graph.eq.middle - permutations.high + geometry.high + statistics.high + puzzle.middle - discrete.math.high + puzzle.high - negative.high - geometry.elem - exponent.college - advanced.middle - project.elem - symmetry.high - infinity.elem + discrete.math.high - squareroot.elem (Best: numberth.high) Document 5: - exponent.college + advanced.middle - puzzle.elem - subtraction.elem + grab.elem - addition.elem - puzzle.middle * permutations.high - logarithm.middle - about.numbers.middle - squareroot.elem
Considering that there are 94 categories in the archive and generally at most 4 or 5 categories that any one question might fit into, these results seem reasonable. It's worth noting that the ``exponent.college'' category turns up on every list, but isn't appropriate for any document. This seems due to the fact that this archive category contains only four documents, and all four seem miscategorized. In fact, I think this strange category shouldn't exist at all.
Several extensions and improvements are as yet unexplored. First, the stoplist currently contains only ``the'', ``a'', ``of'', ``to'', ``is'', ``that'', ``you'', ``for'', and ``and''. This could be expanded, which would improve performance.
Second, lemmatization could be applied to words so that words with the same root would be considered as the same word. This might improve the results while also improving running times, because fewer words would be present in the database.
Finally, by throwing away the
P(words) term in our calculation, we no longer have a probability, just a relative
score. It is therefore impossible to divine the certainty of the
categorization process. If we kept this term in, we could get better
information at the cost of more computation (although admittedly not very
much more). We would also largely eliminate the risk of floating-point
underflow, as our current scores are often numbers like 2.513E-188, which
have a lot of zeros after the decimal point until they get to significant
AI::Categorize code can be found at http://mathforum.org/~ken/bayes/AI.tar
Ken Williams, firstname.lastname@example.org