Columns A-E represent tallies from different types of sources:
Col  Source
A    Spoken sources (TV, radio, movies)
B    Fiction (books)
C    Popular magazines
D    Newspapers
E    Academic journals
The Total column represents the arithmetic sum of columns A-E.
The problem is that the sources contain very different types of words. The biggest offender is the Academic genre. Those sources tend to use highly technical terms and jargon, and they use some common words in somewhat unusual ways. There are over 17,000 words with academic tallies that are at least double the average of the other four genres, over 4,000 that are at least 10 times higher, over 900 that are at least 100 times higher, and almost 500 that appear only in the academic genre. Several examples are included in the table above.
The Spoken genre is also skewed by slang and casual terminology, but to a much lesser degree.
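For reference, the skew measure used above (one genre's tally divided by the average of the other four) can be sketched as follows. This is only an illustration: the function name genre_skew and the sample tallies are made up for the example and do not come from the real data.

```python
def genre_skew(tallies, genre):
    """Ratio of one genre's tally to the mean tally of the other genres.

    tallies: dict mapping column letter ("A".."E") to a word's count.
    genre:   the column to compare against the rest.
    """
    others = [count for col, count in tallies.items() if col != genre]
    mean_others = sum(others) / len(others)
    if mean_others == 0:
        # The word appears only in this genre (or nowhere at all).
        return float("inf") if tallies[genre] > 0 else 0.0
    return tallies[genre] / mean_others


# Hypothetical tallies for a single word (invented numbers).
example = {"A": 2, "B": 1, "C": 3, "D": 2, "E": 40}
print(genre_skew(example, "E"))  # academic tally vs. mean of A-D
```

A word would count toward the "at least double" group when this ratio is 2 or more, and toward the "only in the academic genre" group when the other four tallies are all zero.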
I could just eliminate those two columns, but I would prefer to keep them in the mix at a lower weight. I would like to come up with some scheme for assigning a weighting factor to each column.
One scheme is to assign each column a relative weight. Let's say I want to give column A 3/4 weight (0.75) and column E 1/4 weight (0.25) as compared to the other columns. If I assigned weighting factors of 0.75, 1.0, 1.0, 1.0, and 0.25 to columns A-E, I could multiply each tally by its column's weighting factor and then sum the weighted tallies to get a new total.
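This actually produces better results, but it reduces the totals to roughly 4/5 of their original values, because the weights sum to 4 instead of 5. (The exact reduction varies from word to word: a word found mostly in the Academic column drops by much more, which is the point.) To keep the overall totals about the same, I could multiply the result by 5/4 to compensate for reducing the overall weight from 5 to 4.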
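A minimal sketch of this scheme, assuming each word's tallies are held in a dict keyed by column letter. The names WEIGHTS and weighted_total and the sample numbers are hypothetical, chosen only to illustrate the arithmetic.

```python
# Relative weights from the scheme above: A at 3/4, E at 1/4, the rest at full weight.
WEIGHTS = {"A": 0.75, "B": 1.0, "C": 1.0, "D": 1.0, "E": 0.25}


def weighted_total(tallies, weights=WEIGHTS, rescale=True):
    """Weighted sum of a word's tallies across the columns.

    With rescale=True the sum is multiplied by
    (number of columns) / (sum of weights), which is 5/4 for WEIGHTS,
    so a word spread evenly across all columns keeps its original total.
    """
    total = sum(weights[col] * tallies[col] for col in weights)
    if rescale:
        total *= len(weights) / sum(weights.values())
    return total


# Hypothetical words (invented numbers).
even_word = {"A": 100, "B": 100, "C": 100, "D": 100, "E": 100}
academic_word = {"A": 2, "B": 1, "C": 3, "D": 2, "E": 40}

print(weighted_total(even_word, rescale=False))  # 400.0 (plain total is 500)
print(weighted_total(even_word))                 # 500.0
print(weighted_total(academic_word))             # 21.875 (plain total is 48)
```

Computing the compensation factor from the weights, rather than hard-coding 5/4, means the weights can be changed without touching the rest of the calculation. The factor is the same for every word, so it changes the scale of the totals but not their ranking.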
I would appreciate any comments on this method and any suggestions for a better one.