Views expressed in these public forums are not endorsed by
Drexel University or The Math Forum.
Please critique my scheme for re-weighting source data
Posted: Feb 23, 2012 11:27 AM
I have a table of several thousand words showing how many times each word occurs in a corpus of several hundred million words.
The table has 7 columns. Here is some sample data.
Word            Total    A    B    C    D     E
aardvark           30    3    9    8    8     2
aback             990  112  542  135  145    56
abacus            119    9   47   25   26    12
abalone           180    0   34   66   59    21
abattoir          116    4   22   24    3    63
abbess            171    1  125    7    6    32
abbey             376   35  138   78   29    96
abnormality      1261  153   37  387   83   601
acculturation    1613    1    2   23   18  1569
coefficient      4499    7   23   77    7  4385
covariate         668    0    0    0    0   668
curricular       1714    7    3   29   17  1658
operand           186    0    0    0    0   186
subscale         4160    1    0    3    0  4156
Columns A-E represent tallies from different types of sources:
Col  Source
A    Spoken sources (TV, radio, movies)
B    Fiction (books)
C    Popular magazines
D    Newspapers
E    Academic journals
The Total column represents the arithmetic sum of columns A-E.
The problem is that the sources contain very different types of words. The biggest problem is the Academic genre. Those sources tend to use highly technical terms and jargon, and they use some common words in somewhat unusual ways. There are over 17,000 words with academic tallies that are at least double the average of the other four genres, over 4,000 that are at least 10 times higher, over 900 that are at least 100 times higher, and almost 500 that appear only in the academic genre. Several examples are included in the table above.
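As a rough illustration of that comparison, here is a minimal Python sketch that divides a word's academic tally by the average of its other four tallies. The function name `academic_ratio` and the dictionary `SAMPLE` are hypothetical, the rows are copied from the sample table, and the sketch assumes the raw tallies are directly comparable across genres (the post does not say whether the five sub-corpora are the same size).

```python
# Rows copied from the sample table: word -> tallies for columns (A, B, C, D, E).
SAMPLE = {
    "aardvark": (3, 9, 8, 8, 2),
    "abnormality": (153, 37, 387, 83, 601),
    "coefficient": (7, 23, 77, 7, 4385),
    "covariate": (0, 0, 0, 0, 668),
}


def academic_ratio(counts):
    """Return the academic tally (E) divided by the mean of the other four (A-D).

    Assumption: a word found only in the academic genre gets an infinite
    ratio, and a word with all-zero tallies gets 0.0.
    """
    *others, academic = counts
    mean_others = sum(others) / len(others)
    if mean_others == 0:
        return float("inf") if academic > 0 else 0.0
    return academic / mean_others


if __name__ == "__main__":
    for word, counts in SAMPLE.items():
        print(f"{word:14s} {academic_ratio(counts):10.2f}")
```

Words such as "covariate" have no tallies outside column E, so the ratio is undefined there; the sketch reports infinity for them, which is only one of several reasonable conventions.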
The Spoken genre is also skewed by slang and casual terminology, but to a much lower degree.
I could just eliminate those two columns, but I would prefer to keep them in the mix at a lower weight. I would like to come up with some scheme for assigning a weighting factor to each column.
One scheme is to assign each column a relative weight. Let's say I want to give column A 3/4 weight (0.75) and column E 1/4 weight (0.25) as compared to the other columns. With weighting factors of 0.75, 1.0, 1.0, 1.0, and 0.25 for columns A-E, I would multiply each tally by its column's weighting factor and then sum the results.
This actually produces better results, but because the weights now sum to 4 instead of 5, it reduces the totals to roughly 4/5 of what they were. To keep the overall totals about the same, I could multiply the result by 5/4 to compensate for reducing the overall weight from 5 to 4.
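The weighting and the 5/4 readjustment described above can be sketched in a few lines of Python. The names `WEIGHTS`, `RESCALE`, and `weighted_total` are hypothetical; the weights and the sample rows come from the post. This is only one way to write the scheme down, not a recommendation of it.

```python
# Weighting factors for columns A-E, as proposed in the post.
WEIGHTS = (0.75, 1.0, 1.0, 1.0, 0.25)

# Compensating factor: the weights sum to 4 instead of 5, so scale by 5/4.
RESCALE = len(WEIGHTS) / sum(WEIGHTS)


def weighted_total(counts, weights=WEIGHTS, rescale=RESCALE):
    """Multiply each tally by its column weight, sum, then rescale."""
    return rescale * sum(w * c for w, c in zip(weights, counts))


if __name__ == "__main__":
    # Rows copied from the sample table: word -> (A, B, C, D, E).
    rows = {
        "aardvark": (3, 9, 8, 8, 2),
        "coefficient": (7, 23, 77, 7, 4385),
        "covariate": (0, 0, 0, 0, 668),
    }
    for word, counts in rows.items():
        print(f"{word:12s} raw={sum(counts):5d} weighted={weighted_total(counts):9.2f}")
```

Note that the 5/4 factor restores the original total only for a word whose tallies are spread evenly across the five columns. A word concentrated in columns B-D ends up above its raw total, and a word concentrated in column E ends up well below it, which is the intended effect of the discount.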
I would appreciate any comments on this method and any suggestions for a better one.
Specifically,
1. Is my discounting scheme a reasonable one?
2. Is my readjustment solution appropriate?
3. Is there a better way to do this?