```
Date: Feb 23, 2012 11:27 AM
Author: Jennifer Murphy
Subject: Please critique my scheme for re-weighting source data

I have a table of several thousand words showing how many times each
word occurs in a corpus of several hundred million words.

The table has 7 columns. Here is some sample data.

Word           Total     A     B     C     D     E
aardvark          30     3     9     8     8     2
aback            990   112   542   135   145    56
abacus           119     9    47    25    26    12
abalone          180     0    34    66    59    21
abattoir         116     4    22    24     3    63
abbess           171     1   125     7     6    32
abbey            376    35   138    78    29    96
abnormality     1261   153    37   387    83   601
acculturation   1613     1     2    23    18  1569
coefficient     4499     7    23    77     7  4385
covariate        668     0     0     0     0   668
curricular      1714     7     3    29    17  1658
operand          186     0     0     0     0   186
subscale        4160     1     0     3     0  4156

Columns A-E represent tallies from different types of sources:

   Col   Source
    A    Spoken sources (TV, radio, movies)
    B    Fiction (books)
    C    Popular magazines
    D    Newspapers
    E    Academic journals

The Total column represents the arithmetic sum of columns A-E.

The problem is that the sources contain very different types of words.

The biggest problem is the Academic genre. Those sources tend to use
highly technical terms and jargon, and they use some common words in
somewhat unusual ways. There are over 17,000 words with academic tallies
that are at least double the average of the other four genres, over
4,000 that are at least 10 times higher, over 900 that are at least 100
times higher, and almost 500 that occur only in the academic genre.
Several examples are included in the table above.

The Spoken genre is also skewed by slang and casual terminology, but to
a much lower degree.

I could just eliminate those two columns, but I would prefer to keep
them in the mix at a lower weight.
I would like to come up with some scheme for assigning weighting
factors to each column.

One scheme is to assign each column a relative weight. Let's say I want
to give column A 3/4 weight (0.75) and column E 1/4 weight (0.25) as
compared to the other columns. If I assigned weighting factors of
0.75 1.0 1.0 1.0 0.25, I could multiply each score in each column by
the corresponding weighting factor.

This actually produces better results, but it reduces the totals to
roughly 4/5 of their original size, because the weights now sum to 4
instead of 5. To keep the overall totals about the same, I could
multiply the result by 5/4 to compensate.

I would appreciate any comments on this method and any suggestions for a
better one. Specifically,

1. Is my discounting scheme a reasonable one?
2. Is my readjustment solution appropriate?
3. Is there a better way to do this?
```
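
The weighting scheme described in the post can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `reweighted_total` and the `SAMPLE` dictionary are hypothetical, the weights are the ones proposed in the post (0.75 for A, 0.25 for E, 1.0 elsewhere), and the rescaling factor is taken to be the number of columns divided by the sum of the weights (5/4 here).

```python
# Hypothetical names; weights are the ones proposed in the post (columns A-E).
WEIGHTS = (0.75, 1.0, 1.0, 1.0, 0.25)


def reweighted_total(counts, weights=WEIGHTS):
    """Weighted sum of per-genre counts, rescaled by n / sum(weights)."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    # Assumed compensation factor: 5 / 4 = 1.25 for the weights above.
    scale = len(weights) / sum(weights)
    return scale * sum(w * c for w, c in zip(weights, counts))


# A few rows copied from the sample table (columns A, B, C, D, E).
SAMPLE = {
    "aardvark": (3, 9, 8, 8, 2),
    "abbey": (35, 138, 78, 29, 96),
    "coefficient": (7, 23, 77, 7, 4385),
    "covariate": (0, 0, 0, 0, 668),
}

if __name__ == "__main__":
    # Print word, original total, and reweighted total side by side.
    for word, counts in SAMPLE.items():
        print(f"{word:<13} {sum(counts):>6} {reweighted_total(counts):>10.2f}")
```

Note that the 5/4 factor restores the original total exactly only for a word whose counts are spread evenly across the five columns. A word concentrated in column E, such as covariate, still ends up well below its raw total, which is the intended effect of the discount.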