Automatic Document Classification With Perl

The problem of document classification has almost as many methods of solution as people who have worked in the field. One common approach is the ``Naive Bayes'' classifier, which requires only a little basic probability and very little guidance from a human supervisor. In order to better understand the Naive Bayes classifier, I decided to write an implementation of it using Perl and the MySQL database, tools with which I am very familiar.

The specific classification problem I am attacking is the categorization of questions emailed to the Math Forum's ``Ask Dr. Math'' service (http://www.mathforum.com/dr.math/), which I founded in 1994 and for which I have written much of the infrastructural software. The Math Forum hosts a categorized archive of several thousand previously answered questions, which provides an ample training data set for the classifier. It should be noted, however, that the training data is not completely representative of the test data, because documents in the archive have been selected for their quality and clarity, and they have been edited for grammar. Once this flaw has been noted, it will generally be ignored on the assumption that these differences do not significantly affect the ability to classify documents.

Before describing the classifier itself, it is necessary to cover the basic probability behind its operation.


A Little Naive Bayesian Theory

Bayes' Theorem is a way of inverting a conditional probability. It states:

            P(y|x) P(x)
  P(x|y) = -------------
               P(y)

See also http://mathforum.org/dr.math/problems/battisfore.03.22.99.html for a simple but complete example of Bayes' Theorem.
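For a feel of how the inversion works in practice, here is a small numeric sketch (in Python for convenience; the spam-filtering scenario and all the numbers are invented for the illustration):

```python
# Toy illustration of Bayes' Theorem: P(x|y) = P(y|x) P(x) / P(y).
# Suppose 1% of messages are spam, the word "free" appears in 40% of
# spam and in 5% of non-spam.  What is P(spam | "free")?

p_spam = 0.01
p_free_given_spam = 0.40
p_free_given_ham = 0.05

# Total probability of the evidence: P(y) = sum over x of P(y|x) P(x)
p_free = p_free_given_spam * p_spam + p_free_given_ham * (1 - p_spam)

p_spam_given_free = p_free_given_spam * p_spam / p_free
print(round(p_spam_given_free, 4))  # prints 0.0748
```

Note that even though "free" is eight times likelier in spam, the low prior P(spam) keeps the posterior small.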


Derivation

In this case, we want to know the probability of a given category given a certain string of words in a document, so we have:

                    P(words | cat) P(cat)
  P(cat | words) = -----------------------
                           P(words)

We have applied Bayes' Theorem because P(cat | words) is a difficult quantity to compute directly, but P(words | cat) and P(cat) are accessible (see below).

The greater the expression above, the greater the probability that the given document belongs to the given category. So we want to find the maximum value. We write this as

                                 P(words | cat) P(cat)
  Best category =   ArgMax      -----------------------
                   cat in cats          P(words)

Since P(words) doesn't change over the range of categories, we can get rid of it. That's good, because we didn't want to have to compute these values anyway. So our new formula is:

  Best category =   ArgMax      P(words | cat) P(cat)
                   cat in cats

Finally, we note that if w1, w2, ... wn are the words in the document, then this expression is equivalent to:

  Best category =   ArgMax      P(w1|cat)*P(w2|cat)*...*P(wn|cat)*P(cat)
                   cat in cats

That's the formula I use in my document categorization code. The last step is the only non-rigorous one in the derivation, and it is the ``naive'' part of the Naive Bayes technique: it assumes that the probability of each word appearing in a document is unaffected by the presence or absence of every other word in the document. We assume this even though we know it isn't true: for example, the word ``iodized'' is far more likely to appear in a document that contains the word ``salt'' than in a document that contains the word ``subroutine''. Luckily, making this assumption even when it isn't true has little effect on our results, as the following paper explains: http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz
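The final formula translates almost directly into code. Here is a sketch in Python rather than Perl, with made-up probability tables standing in for the real training database (the 1e-7 floor for unseen words is my own addition for the example, not something claimed by the derivation):

```python
# Hypothetical per-category word probabilities P(w|cat) and priors
# P(cat), purely for illustration.
word_probs = {
    'geometry':    {'vertex': 0.015,  'angle': 0.02,  'dice': 0.0001},
    'probability': {'vertex': 0.0001, 'angle': 0.001, 'dice': 0.03},
}
priors = {'geometry': 0.06, 'probability': 0.04}

def best_category(words):
    """ArgMax over categories of P(w1|cat)*...*P(wn|cat)*P(cat)."""
    scores = {}
    for cat, prior in priors.items():
        score = prior
        for w in words:
            # Small floor for words never seen in this category, so one
            # unknown word doesn't zero out the whole product.
            score *= word_probs[cat].get(w, 1e-7)
        scores[cat] = score
    return max(scores, key=scores.get)

print(best_category(['dice', 'dice', 'angle']))  # prints: probability
```

Repeated words simply multiply their probability in again, which is exactly what the w1...wn formula prescribes.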


The AI::Categorize Perl Module

I have written an abstract object-oriented framework that I hope can be easily adapted to other methods of document classification, and implemented one subclass of this framework that does Naive Bayes classification. The framework is called AI::Categorize, and the Bayesian subclass is called AI::Categorize::NaiveBayes.

AI::Categorize::NaiveBayes allows the user to feed it the text of several documents (the training set), which it will parse and add to the word frequency database (excluding any words from a list of user-specified ``stopwords'', i.e. common words like ``and'' or ``the'' that presumably carry no category information). After all documents are added, the user directs the system to calculate the probabilities that the system will use in classifying documents, i.e. the probabilities seen in the final equation of the above section. These probabilities are calculated directly from word frequencies - for example, if there are 14,367 total words in a certain category ``geometry'' and 221 of them are ``vertex'', then P(vertex|geometry) is equal to 221/14,367. Likewise, if there are 253,602 total words in the entire archive and 14,367 of them are in the ``geometry'' category, then P(geometry) is equal to 14,367/253,602. We calculate all such probabilities and store them in the database for later use.
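The frequency-to-probability step described above amounts to a pair of divisions over word counts. The following is a minimal Python sketch of that bookkeeping (not the module's actual Perl internals; the training sentences and category names are invented):

```python
from collections import Counter, defaultdict

STOPWORDS = {'the', 'a', 'of', 'to', 'is', 'that', 'you', 'for', 'and'}

# Word counts per category, accumulated while adding training documents.
category_counts = defaultdict(Counter)

def add_document(category, text):
    """Tokenize, drop stopwords, and add counts to the category's tally."""
    for word in text.lower().split():
        if word not in STOPWORDS:
            category_counts[category][word] += 1

add_document('geometry', 'the vertex of a polygon is a corner')
add_document('geometry', 'each vertex joins two edges')
add_document('probability', 'roll two dice and add the faces')

def p_word_given_cat(word, cat):
    """P(word|cat) = count of word in cat / total words in cat."""
    total = sum(category_counts[cat].values())
    return category_counts[cat][word] / total

def p_cat(cat):
    """P(cat) = total words in cat / total words in all categories."""
    grand = sum(sum(c.values()) for c in category_counts.values())
    return sum(category_counts[cat].values()) / grand
```

In the real system these ratios are computed once after training and stored in the database, rather than recomputed per query as here.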

Once a sufficient number of training documents have been fed to the database and the needed probabilities have been calculated, we can start asking AI::Categorize::NaiveBayes to categorize new documents that it hasn't seen before. It returns to us an ordered list of the most probable categories for that document.


Results

A few sample documents, chosen at random from the pool of incoming Ask Dr. Math questions, give an indication of how well the system works. For each document, I give the top ten categories suggested by AI::Categorize::NaiveBayes. The human-chosen best category is marked with a '*', reasonable categories are marked with a '+', and unreasonable categories are marked with '-'.

        Document 1:             Document 3:
         * golden.elem           * golden.elem
         - exponent.college      - exponent.college
         - symmetry.high         + golden.high
         + golden.high           + symmetry.high
         + grab.elem             - logarithm.middle
         - squareroot.elem       - ratio.middle
         + project.elem          + history.middle
         + project.high          - squareroot.elem
         + defs.middle           + project.high
         - statistics.high       - negative.high
         - graph.eq.middle       - graph.eq.middle
                               
        Document 2:             Document 4:
         - exponent.college      * probability.high
         - golden.elem           + prob.stats.middle
         + geometry.middle       + probability.college
         - graph.eq.middle       - permutations.high
         + geometry.high         + statistics.high
         + puzzle.middle         - discrete.math.high
         + puzzle.high           - negative.high
         - geometry.elem         - exponent.college
         - advanced.middle       - project.elem
         - symmetry.high         - infinity.elem
         + discrete.math.high    - squareroot.elem
       (Best: numberth.high)
                               
                   Document 5:
                    - exponent.college
                    + advanced.middle
                    - puzzle.elem
                    - subtraction.elem
                    + grab.elem
                    - addition.elem
                    - puzzle.middle
                    * permutations.high
                    - logarithm.middle
                    - about.numbers.middle
                    - squareroot.elem

Considering that there are 94 categories in the archive, and generally at most 4 or 5 categories that any one question might fit into, these results seem reasonable. It's worth noting that the ``exponent.college'' category turns up on every list but isn't appropriate for any document. This seems to be because the archive category contains only four documents, and all four seem miscategorized. In fact, I think this strange category shouldn't exist at all.


Further Work

Several extensions and improvements are as yet unexplored. First, the stoplist currently contains only ``the'', ``a'', ``of'', ``to'', ``is'', ``that'', ``you'', ``for'', and ``and''. Expanding it would likely improve both the quality of the results and the running time.

Second, lemmatization could be applied to words so that words with the same root would be considered as the same word. This might improve the results while also improving running times, because fewer words would be present in the database.
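To make the idea concrete, here is a deliberately crude suffix-stripper in Python; a real system would use a proper stemmer or lemmatizer, but even this toy version collapses some singular/plural and verb-form pairs onto one database entry:

```python
def crude_stem(word):
    # Extremely crude suffix stripping, just to illustrate the idea.
    # The length check avoids mangling short words like "is" or "ring".
    for suffix in ('ing', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

print(crude_stem('edges'), crude_stem('counting'))  # prints: edge count
```

Applied before the counts are stored, this would map ``edges'' and ``edge'' to the same word-frequency row.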

Finally, by throwing away the P(words) term in our calculation, we no longer have a probability, just a relative score, so it is impossible to tell how certain the categorization process is of its answer. If we kept this term in, we could get better information at the cost of a little more computation. We would also largely eliminate the risk of floating-point underflow: our current scores are often numbers like 2.513E-188, and multiplying in the word probabilities of a longer document could push them below the smallest magnitude a double-precision float can represent.
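Under the naive model, P(words) is just the sum of P(words|cat) P(cat) over all categories, so the discarded term can be recovered by normalizing the scores we already have. A Python sketch, with invented raw scores of the magnitude mentioned above:

```python
# Raw per-category scores P(words|cat) * P(cat); tiny magnitudes like
# these are typical once many word probabilities have been multiplied.
raw = {'geometry': 2.513e-188, 'probability': 4.1e-190, 'algebra': 9.0e-191}

# P(words) is the sum of the raw scores over all categories, so dividing
# by it turns each score into the posterior probability P(cat | words).
total = sum(raw.values())
posterior = {cat: score / total for cat, score in raw.items()}
```

The normalized values sum to 1 and are ordinary-sized numbers, so they can be reported to the user as genuine confidence figures.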


The Code

The AI::Categorize code can be found at http://mathforum.org/~ken/bayes/AI.tar .


Author

Ken Williams, ken@mathforum.org