The Math Forum


Math Forum » Discussions » sci.math.* » sci.math


Topic: Please critique my scheme for re-weighting source data
Replies: 8   Last Post: May 27, 2012 11:57 AM

James Beck

Posts: 19
Registered: 12/22/06
Re: Please critique my scheme for re-weighting source data
Posted: May 24, 2012 11:09 PM

On Fri, 24 Feb 2012 10:35:15 +0000 (UTC), JohnF
<> wrote:

>Jennifer Murphy <JenMurphy@jm.invalid> wrote:
>> Rich Ulrich <> wrote:

>>> What are you trying to do?
>> I am trying to calculate for each word the relative likeliness that it
>> would be encountered by an average well-educated person in their daily
>> activities: reading the paper, listening to the news, attending classes,
>> talking to other people, reading books, etc.
>> The raw scores that I have already do that, but I question the
>> weighting. I do not think that the average person encounters the types of
>> words typically found in academic journals at the same frequency as they
>> would those found in newspapers or magazines. Therefore, I want to
>> re-weight the five sources to reflect a more average experience.

>Don't weight the sources, weight the people.
>That is, define a person by a "state vector"
> p = <w_A,w_B,...,w_E>
>representing his inclination/weight to read each
>kind of source. You're now kind of using p=<.2,.2,.2,.2,.2>.
>Is that really "average"? Or maybe you can't define
>a single average person. College-educated will probably have
>a different vector than high-school dropouts.
> So you ultimately have a five-dimensional (that is,
>#sources-dimensional) people space, with each point in that
>space having its own "likelihood distribution" for coming
>across your words. ... Or something like that. The basic
>point, again, being to weight the people.
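The mixture idea above can be sketched in a few lines of Python. This is only an illustration of the proposal, not anything from the thread: the source names, frequencies, and inclination vectors are all made up.

```python
# Sketch of the "weight the people" idea: a word's overall encounter
# likelihood is a mixture of its per-source relative frequencies, weighted
# by a reader's inclination vector p = <w_A, ..., w_E>.
# Source names and all numbers below are hypothetical.

SOURCES = ["spoken", "fiction", "magazine", "newspaper", "academic"]

def encounter_likelihood(word_freqs, p):
    """word_freqs: per-source relative frequencies for one word.
    p: a reader's inclination weights over the same sources (sums to 1)."""
    assert abs(sum(p.values()) - 1.0) < 1e-9
    return sum(p[s] * word_freqs.get(s, 0.0) for s in SOURCES)

# Hypothetical frequencies for one word across the five sources:
freqs = {"spoken": 0.002, "fiction": 0.001, "magazine": 0.0015,
         "newspaper": 0.0012, "academic": 0.0001}

# The p = <.2,.2,.2,.2,.2> case vs. a reader who favors speech and news:
uniform = {s: 0.2 for s in SOURCES}
college = {"spoken": 0.3, "fiction": 0.15, "magazine": 0.15,
           "newspaper": 0.2, "academic": 0.2}

print(encounter_likelihood(freqs, uniform))  # 0.00116
print(encounter_likelihood(freqs, college))  # 0.001235
```

Averaging the likelihoods over a population of such vectors (rather than fixing one "average" reader) is the people-space version of the same computation.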

This thread is very stale, so you probably won't read this, but who knows.

The core vocabulary in academic writing is actually very small
compared to the others, about 3,000 words. The broad sub-categories of
academic writing each have a technical core of about 1,000 words
(also pretty common). However, each paper includes 3-5 specialized,
idiosyncratic words, usually familiar to the small group of people
interested in the paper, but not in wide use otherwise. Weighting the
people as you suggest preserves the worst aspect of the data and
perversely implies that the simple, core vocabulary is less likely to
be encountered than it is.

The OP is closer to the right track. As a practical matter, the
likelihood of encountering the idiosyncratic words at random is close
to zero. It would be more robust to extract the general and
subcategory cores and re-weight the rest.
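One way to make the core-extraction idea concrete is to call a word "core" if it appears in some minimum fraction of documents from a source, and down-weight only the idiosyncratic remainder. A minimal sketch, with an illustrative threshold and invented example words, none of which come from the thread:

```python
# Split a source's vocabulary into a common core (words appearing in many
# documents) and idiosyncratic, paper-specific words, then shrink only the
# frequencies of the idiosyncratic tail. The 50% threshold and the 0.1
# down-weight are arbitrary choices for illustration.

from collections import Counter

def split_core(doc_vocabularies, min_doc_fraction=0.5):
    """doc_vocabularies: list of word sets, one per document.
    Returns (core, idiosyncratic) word sets."""
    doc_counts = Counter(w for vocab in doc_vocabularies for w in vocab)
    cutoff = min_doc_fraction * len(doc_vocabularies)
    core = {w for w, n in doc_counts.items() if n >= cutoff}
    return core, set(doc_counts) - core

def reweight(freqs, core, idio_weight=0.1):
    """Keep core-word frequencies; shrink idiosyncratic ones."""
    return {w: f if w in core else f * idio_weight for w, f in freqs.items()}

# Three hypothetical papers' vocabularies:
docs = [{"model", "data", "result", "anharmonicity"},
        {"model", "data", "method"},
        {"model", "result", "method", "quasiparticle"}]
core, idio = split_core(docs)
print(sorted(core))  # words in at least half the papers
print(sorted(idio))  # paper-specific jargon, to be down-weighted
```

Here the small shared core ("model", "data", ...) keeps its full weight, while one-off terms like "anharmonicity" are shrunk, which is the opposite of what weighting whole sources (or whole readers) does.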


