Topic: Please critique my scheme for re-weighting source data
 Richard Ulrich
Re: Please critique my scheme for re-weighting source data
Posted: Feb 23, 2012 1:56 PM

As far as I can see, you give no hint of what you are
trying to accomplish.

For most purposes of inference that come to my mind,
the extreme cases -- the ones that you seem to propose
to drop -- are the most informative and most interesting.
So I conclude that your interests are probably the opposite
(in some fashion) from what my naive interests would be.

I repeat: what are you trying to do?

--
Rich Ulrich

On Thu, 23 Feb 2012 08:27:19 -0800, Jennifer Murphy
<JenMurphy@jm.invalid> wrote:

>I have a table of several thousand words showing how many times each
>word occurs in a corpus of several hundred million words.
>
>The table has 7 columns. Here is some sample data.
>
>Word           Total    A    B    C    D     E
>aardvark          30    3    9    8    8     2
>aback            990  112  542  135  145    56
>abacus           119    9   47   25   26    12
>abalone          180    0   34   66   59    21
>abattoir         116    4   22   24    3    63
>abbess           171    1  125    7    6    32
>abbey            376   35  138   78   29    96
>abnormality     1261  153   37  387   83   601
>acculturation   1613    1    2   23   18  1569
>coefficient     4499    7   23   77    7  4385
>covariate        668    0    0    0    0   668
>curricular      1714    7    3   29   17  1658
>operand          186    0    0    0    0   186
>subscale        4160    1    0    3    0  4156
>
>Columns A-E represent tallies from different types of sources:
>
> Col  Source
>  A   Spoken sources (TV, radio, movies)
>  B   Fiction (books)
>  C   Popular magazines
>  D   Newspapers
>  E   Academic sources
>
>The Total column represents the arithmetic sum of columns A-E.
>
>The problem is that the sources contain very different types of words.
>The biggest problem is the Academic genre. Those sources tend to use
>highly technical terms and jargon and they use some common words in
>somewhat unusual ways. There are over 17,000 words with academic tallies
>that are at least double the average of the other four genres, over
>4,000 that are at least 10 times higher, over 900 that are at least 100
>times higher, and almost 500 that are only in the academic genre.
>Several examples are included in the table above.
>
>The Spoken genre is also skewed by slang and casual terminology, but to
>a much lower degree.
>
>I could just eliminate those two columns, but I would prefer to keep
>them in the mix at a lower weight. I would like to come up with
>some scheme for assigning weighting factors to each column.
>
>One scheme is to assign each column a relative weight. Let's say I want
>to give column A 3/4 weight (0.75) and column E 1/4 weight (0.25) as
>compared to the other columns. With weighting factors of 0.75, 1.0, 1.0,
>1.0, and 0.25 for columns A-E, I would multiply each tally by the
>weighting factor for its column.
>
>This actually produces better results, but it reduces the nominal total
>weight from 5 to 4, so totals shrink to roughly 4/5 of their former size.
>To keep the overall totals about the same, I could multiply the result
>by 5/4 to compensate.
>
>I would appreciate any comments on this method and any suggestions for a
>better one.
>
>Specifically,
>
>1. Is my discounting scheme a reasonable one?
>
>2. Is my readjustment solution appropriate?
>
>3. Is there a better way to do this?
>
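
For concreteness, the weighting scheme described in the quoted post could be sketched roughly as follows. This is only an illustration under stated assumptions: the names WEIGHTS and reweighted_total are hypothetical, the weights are the ones proposed in the post, and the 5/4 factor is computed as (number of columns) / (sum of weights). The two sample rows are taken from the table above.

```python
# Hypothetical sketch of the proposed scheme; names are illustrative only.
# Weights for columns A-E as proposed in the quoted post.
WEIGHTS = {"A": 0.75, "B": 1.0, "C": 1.0, "D": 1.0, "E": 0.25}


def reweighted_total(counts, weights=WEIGHTS, rescale=True):
    """Return the weighted sum of one word's per-column tallies.

    With rescale=True the result is multiplied by
    (number of columns) / (sum of weights), which is 5/4 for the
    weights above, so the nominal total weight stays at 5.
    """
    weighted = sum(weights[col] * counts[col] for col in weights)
    if rescale:
        weighted *= len(weights) / sum(weights.values())
    return weighted


# Two rows from the sample table in the quoted post.
aback = {"A": 112, "B": 542, "C": 135, "D": 145, "E": 56}
covariate = {"A": 0, "B": 0, "C": 0, "D": 0, "E": 668}

print(reweighted_total(aback))      # 1150.0 (raw total was 990)
print(reweighted_total(covariate))  # 208.75 (raw total was 668)
```

Note that the 5/4 factor restores the nominal total weight, not each word's total: a word found mostly in fiction ends up above its raw count, while a word found only in column E ends up well below it. Whether that is the desired effect depends on the purpose of the table.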

