
Math Forum » Discussions » sci.math.* » sci.math.independent

Topic: Please critique my scheme for re-weighting source data
Replies: 8   Last Post: May 27, 2012 11:57 AM

Jennifer Murphy

Posted: Feb 23, 2012 11:27 AM

I have a table of several thousand words showing how many times each
word occurs in a corpus of several hundred million words.

The table has 7 columns. Here is some sample data.

Word            Total     A     B     C     D     E
aardvark           30     3     9     8     8     2
aback             990   112   542   135   145    56
abacus            119     9    47    25    26    12
abalone           180     0    34    66    59    21
abattoir          116     4    22    24     3    63
abbess            171     1   125     7     6    32
abbey             376    35   138    78    29    96
abnormality      1261   153    37   387    83   601
acculturation    1613     1     2    23    18  1569
coefficient      4499     7    23    77     7  4385
covariate         668     0     0     0     0   668
curricular       1714     7     3    29    17  1658
operand           186     0     0     0     0   186
subscale         4160     1     0     3     0  4156

Columns A-E represent tallies from different types of sources:

Col Source
A Spoken sources (TV, radio, movies)
B Fiction (books)
C Popular magazines
D Newspapers
E Academic journals

The Total column represents the arithmetic sum of columns A-E.

The problem is that the sources contain very different types of words.
The biggest problem is the Academic genre. Those sources tend to use
highly technical terms and jargon and they use some common words in
somewhat unusual ways. There are over 17,000 words with academic tallies
that are at least double the average of the other four genres, over
4,000 that are at least 10 times higher, over 900 that are at least 100
times higher, and almost 500 that are only in the academic genre.
Several examples are included in the table above.
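
A minimal sketch of how the "academic tally versus the average of the
other four genres" comparison could be computed, using a few of the
sample rows above. The function name academic_ratio and the handling of
words that occur only in the academic genre (returned as infinity) are
assumptions made for illustration, not part of the original data
preparation.

```python
# Ratio of the academic tally (E) to the mean of the other four genres (A-D).
# academic_ratio is a hypothetical helper; treating "academic only" words
# as an infinite ratio is an illustrative choice.

def academic_ratio(a, b, c, d, e):
    mean_other = (a + b + c + d) / 4
    if mean_other == 0:
        # Word never occurs outside the academic genre.
        return float("inf") if e > 0 else 0.0
    return e / mean_other

# Tallies (A, B, C, D, E) copied from the sample table.
sample = {
    "aback":       (112, 542, 135, 145, 56),
    "abnormality": (153, 37, 387, 83, 601),
    "coefficient": (7, 23, 77, 7, 4385),
    "covariate":   (0, 0, 0, 0, 668),
}

ratios = {word: academic_ratio(*tallies) for word, tallies in sample.items()}

for word, ratio in ratios.items():
    print(f"{word:12s} {ratio:.2f}")
```

With these rows, "aback" falls below 1, "abnormality" lands between 2
and 10, "coefficient" exceeds 100, and "covariate" is academic only.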

The Spoken genre is also skewed by slang and casual terminology, but to
a much lower degree.

I could just eliminate those two columns, but I would prefer to keep
them in the mix at a lower weight. I would like to come up with some
scheme for assigning a weighting factor to each column.

One scheme is to assign each column a relative weight. Let's say I want
to give column A 3/4 weight (0.75) and column E 1/4 weight (0.25) as
compared to the other columns. If I assigned weighting factors of 0.75,
1.0, 1.0, 1.0, and 0.25 to columns A-E, I could multiply each tally by
its column's weighting factor and add up the results to get a weighted
total.
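
A short sketch of applying the column weights to a few sample rows. The
weights are the ones proposed above; the name weighted_total is
hypothetical.

```python
# Column weights for A..E as proposed in the post.
WEIGHTS = (0.75, 1.0, 1.0, 1.0, 0.25)

def weighted_total(tallies, weights=WEIGHTS):
    """Sum of each column tally multiplied by its column weight."""
    return sum(t * w for t, w in zip(tallies, weights))

# Tallies (A, B, C, D, E) copied from the sample table.
rows = {
    "aardvark": (3, 9, 8, 8, 2),
    "aback":    (112, 542, 135, 145, 56),
    "subscale": (1, 0, 3, 0, 4156),
}

for word, tallies in rows.items():
    print(f"{word:10s} raw={sum(tallies):5d} weighted={weighted_total(tallies):8.2f}")
```

A word dominated by the academic column, such as "subscale", drops from
4160 to 1042.75, while "aardvark" only moves from 30 to 27.75.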

This actually produces better results, but it reduces the totals to
about 4/5 of their original size, because the weights sum to 4 instead
of 5. To keep the overall totals about the same, I could multiply the
result by 5/4 to compensate.
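
A sketch of the compensation step, assuming the scale factor is taken
as (number of columns) / (sum of weights), which gives 5/4 for the
weights above. The names rescaled_total and SCALE are hypothetical. The
rescaling restores the original total exactly only for a word whose
tallies are equal in every column; for other words the rescaled total
can come out higher or lower than the raw total.

```python
WEIGHTS = (0.75, 1.0, 1.0, 1.0, 0.25)   # columns A..E, from the post
SCALE = len(WEIGHTS) / sum(WEIGHTS)     # 5 / 4.0 = 1.25

def rescaled_total(tallies, weights=WEIGHTS, scale=SCALE):
    """Weighted sum of the column tallies, multiplied by the scale factor."""
    return scale * sum(t * w for t, w in zip(tallies, weights))

# A made-up word spread evenly across genres keeps its raw total of 50.
even = (10, 10, 10, 10, 10)
print(sum(even), rescaled_total(even))

# "covariate" (academic only) drops from 668 to 0.25 * 1.25 of that.
covariate = (0, 0, 0, 0, 668)
print(sum(covariate), rescaled_total(covariate))

# "aback" (mostly fiction) rises above its raw total of 990.
aback = (112, 542, 135, 145, 56)
print(sum(aback), rescaled_total(aback))
```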

I would appreciate any comments on this method and any suggestions for a
better one.


1. Is my discounting scheme a reasonable one?

2. Is my readjustment solution appropriate?

3. Is there a better way to do this?
