Re: Please critique my scheme for re-weighting source data
Posted: Feb 23, 2012 1:56 PM
As far as I can see, you give no hint of what you are trying to accomplish.
For most purposes of inference that come to my mind, the extreme cases -- the ones that you seem to propose to drop -- are the most informative and most interesting. So I conclude that your interests are probably the opposite (in some fashion) of what my naive interests would be.
I repeat: what are you trying to do?
-- Rich Ulrich
On Thu, 23 Feb 2012 08:27:19 -0800, Jennifer Murphy <JenMurphy@jm.invalid> wrote:
>I have a table of several thousand words showing how many times each
>word occurs in a corpus of several hundred million words.
>
>The table has 7 columns. Here is some sample data.
>
>Word           Total    A    B    C    D     E
>aardvark          30    3    9    8    8     2
>aback            990  112  542  135  145    56
>abacus           119    9   47   25   26    12
>abalone          180    0   34   66   59    21
>abattoir         116    4   22   24    3    63
>abbess           171    1  125    7    6    32
>abbey            376   35  138   78   29    96
>abnormality     1261  153   37  387   83   601
>acculturation   1613    1    2   23   18  1569
>coefficient     4499    7   23   77    7  4385
>covariate        668    0    0    0    0   668
>curricular      1714    7    3   29   17  1658
>operand          186    0    0    0    0   186
>subscale        4160    1    0    3    0  4156
>
>Columns A-E represent tallies from different types of sources:
>
>  Col  Source
>   A   Spoken sources (TV, radio, movies)
>   B   Fiction (books)
>   C   Popular magazines
>   D   Newspapers
>   E   Academic journals
>
>The Total column represents the arithmetic sum of columns A-E.
>
>The problem is that the sources contain very different types of words.
>The biggest problem is the Academic genre. Those sources tend to use
>highly technical terms and jargon and they use some common words in
>somewhat unusual ways. There are over 17,000 words with academic tallies
>that are at least double the average of the other four genres, over
>4,000 that are at least 10 times higher, over 900 that are at least 100
>times higher, and almost 500 that are only in the academic genre.
>Several examples are included in the table above.
>
>The Spoken genre is also skewed by slang and casual terminology, but to
>a much lower degree.
>
>I could just eliminate those two columns, but I would prefer to keep
>them in the mix, but at a lower weight. I would like to come up with
>some scheme for assigning weighting factors to each column.
>
>One scheme is to assign each column a relative weight. Let's say I want
>to give column A 3/4 weight (0.75) and column E 1/4 weight (0.25) as
>compared to the other columns. If I assigned weighting factors of 0.75
>1.0 1.0 1.0 0.25, I could multiply each score in each column by the
>corresponding weighting factors.
>
>This actually produces better results, but it reduces the totals by 4/5.
>To keep the overall totals about the same, I could multiply the result
>by 5/4 to compensate for reducing the overall weight from 5 to 4.
>
>I would appreciate any comments on this method and any suggestions for a
>better one.
>
>Specifically,
>
>1. Is my discounting scheme a reasonable one?
>
>2. Is my readjustment solution appropriate?
>
>3. Is there a better way to do this?
>
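For concreteness, the weighting scheme described in the quoted post can be sketched in a few lines of Python. This is only an illustration of the arithmetic as I read it, not the poster's actual code: the names `WEIGHTS`, `SCALE`, and `reweighted_total` are made up here, and the three sample rows are copied from the table above.

```python
# Hypothetical sketch of the column-weighting scheme from the quoted post.
# All names here (GENRES, WEIGHTS, SCALE, reweighted_total) are assumptions
# introduced for illustration; they do not come from the original message.

GENRES = ("A", "B", "C", "D", "E")

# Relative weights proposed in the post: Spoken at 3/4, Academic at 1/4.
WEIGHTS = {"A": 0.75, "B": 1.0, "C": 1.0, "D": 1.0, "E": 0.25}

# The weights sum to 4 instead of 5, so rescale by 5/4 as the post suggests.
SCALE = len(GENRES) / sum(WEIGHTS.values())


def reweighted_total(counts):
    """Return the weighted sum of the genre counts, rescaled by SCALE."""
    weighted = sum(WEIGHTS[g] * counts[g] for g in GENRES)
    return weighted * SCALE


# Three rows taken from the sample table in the post.
sample = {
    "abbey": {"A": 35, "B": 138, "C": 78, "D": 29, "E": 96},
    "covariate": {"A": 0, "B": 0, "C": 0, "D": 0, "E": 668},
    "aback": {"A": 112, "B": 542, "C": 135, "D": 145, "E": 56},
}

for word, counts in sample.items():
    raw = sum(counts.values())
    print(f"{word:12s} raw={raw:5d} reweighted={reweighted_total(counts):9.2f}")
```

Note that the 5/4 factor restores the sum of the weights, not each word's total. A word with equal counts in every genre keeps its total, a word concentrated in column E shrinks, and a word concentrated in columns B, C, and D grows.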