Ask Dr. Math - Questions and Answers from our Archives

Data that Decay, and a Curve to Fit Them

Date: 07/09/2012 at 22:24:06
From: Paul
Subject: What is the equation for these data

I'm trying to derive the equation for a curve. 

The best way to describe how the data are collected is to imagine a Galton
machine and a normal distribution. As each data point is collected, it
falls into one of categories 1-14, with 1 being the most likely and 14 the
least likely.

Here are the sample data so far:

  Category     Count
  --------     ----- 
     1          430
     2          302
     3          253
     4          182
     5          152
     6          129
     7          105
     8           68
     9           48
    10           22
    11           13
    12            7
    13            2
    14            1

If this is graphed out, it creates a reasonably smooth curved line. 

Even though there are "anomalies" just like a Galton machine would
produce, there must be an equation that describes these data. I'd like to
know what it is, and need help figuring it out.

My math background, while decent, stops short of this level. I've had 
Calc I in college, but that was some years ago. I can still remember
derivatives, maximums and minimums, but even that's a little fuzzy. I have
a little statistics knowledge as well, but again, it's been a while.

I'm guessing here, so be nice ... but looking at the shape of the graphed
data, I think the formula has a model similar to y = ax^n + bx + c or
possibly something with a 1/ax. Either way, as x approaches infinity, y
approaches 0, so there's clearly an inverse relationship between them.

x^n looks good for the general curve, but the data are skewed in favor of
lower x's, so more are needed. I'm thinking y = ax^n + bx + c is in the
right direction; but I'm guessing, and anything more would be wild
speculation.

I'd really like to be able to figure out the equation that best describes
these data. I realize I may need to do some heavy reading/learning, but I
think I'm clever enough to get there. Can you help?



Date: 07/10/2012 at 12:33:17
From: Doctor Douglas
Subject: Re: What is the equation for these data

Hi Paul,

Thanks for submitting your question to the Math Forum.  

As you are probably aware, the task of modeling real world data with
mathematical equations isn't always a straightforward one. If you had some
physical explanation for where the data come from and how they *ought* to
be modeled, you might very well want to use that as a starting point.

But even without a physical basis from which to start, there are some
things we can do. A good starting point is to first graph y vs. x.

When we do this, we see that the shape of the curve is smooth and decaying
towards zero. This suggests that polynomial fitting isn't the best way to
go.

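The next thing to do is to re-plot the graph on logarithmic axes: first
with only the y axis logarithmic (semi-log), then with both axes
logarithmic (log-log). The x and y values are all positive, so this is
reasonable to do. A straight line on the semi-log plot points to
exponential decay, and a straight line on the log-log plot points to a
power law. Depending on how straight the curve is when you do this and
whether it curves up or down, you can infer various forms of the
dependence, such as the following: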

   y = A * exp(-K*x)     <-- exponential decay, K > 0

   y = B * x^(-P)        <-- power law decay, P > 0
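
A minimal sketch of this straightness check, assuming Python and an
ordinary least-squares line through the transformed points (the helper
r_squared is a name made up for this illustration). It relies on the fact
that y = A * exp(-K*x) is a straight line when only y is logarithmic, and
y = B * x^(-P) is a straight line when both axes are logarithmic. An R^2
well below 1 means the re-plotted points still bend.

```python
import math

# Counts from Paul's table, categories 1..14.
counts = [430, 302, 253, 182, 152, 129, 105, 68, 48, 22, 13, 7, 2, 1]
xs = list(range(1, len(counts) + 1))


def r_squared(u, v):
    """R^2 of the least-squares straight line through the points (u, v)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    suu = sum((a - mean_u) ** 2 for a in u)
    svv = sum((b - mean_v) ** 2 for b in v)
    suv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    return suv * suv / (suu * svv)


log_x = [math.log(x) for x in xs]
log_y = [math.log(y) for y in counts]

# Exponential decay: ln(y) = ln(A) - K*x, a straight line in (x, ln y).
r2_exponential = r_squared(xs, log_y)
# Power-law decay: ln(y) = ln(B) - P*ln(x), a straight line in (ln x, ln y).
r2_power_law = r_squared(log_x, log_y)

print(f"semi-log (exponential): R^2 = {r2_exponential:.3f}")
print(f"log-log  (power law):   R^2 = {r2_power_law:.3f}")
```

R^2 is only a rough guide here; looking at the plots for systematic
curvature is the more reliable test.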

When I did this with your data using a curve fitting program, I found that
neither of these rules described your data over the range given. Instead,
a logarithmic dependence seemed to work better:

   y = C - D*ln(x)

In particular, let C = 425.15 and D = 168.2.
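
One way to arrive at constants like these, sketched under the assumption
that the fit is an ordinary least-squares line of y against ln(x) over all
14 points (the curve fitting program actually used is not named, and
fit_log_model is a name made up for this illustration). Run on the table
above, it should give values close to the C and D just quoted.

```python
import math

# Counts from Paul's table, categories 1..14.
counts = [430, 302, 253, 182, 152, 129, 105, 68, 48, 22, 13, 7, 2, 1]


def fit_log_model(ys):
    """Least-squares fit of y = C - D*ln(x) for x = 1, 2, ..., len(ys)."""
    u = [math.log(x) for x in range(1, len(ys) + 1)]
    n = len(ys)
    mean_u = sum(u) / n
    mean_y = sum(ys) / n
    # Slope of the straight line through the points (ln x, y).
    slope = (sum((a - mean_u) * (b - mean_y) for a, b in zip(u, ys))
             / sum((a - mean_u) ** 2 for a in u))
    d = -slope                 # the model is written with a minus sign
    c = mean_y + d * mean_u    # the line passes through the mean point
    return c, d


C, D = fit_log_model(counts)
print(f"C = {C:.2f}, D = {D:.2f}")
```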

But even this equation might still not be right, since it predicts a sharp
cutoff: y goes negative when x exceeds about 12.52 (the point where
ln(x) = C/D), so the small nonzero counts at x = 13 and x = 14 would have
to be treated as noise. Still, this curve seems to work quite well up to
x = 9 or so.
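
The cutoff follows from setting y = 0, which gives x = exp(C/D). A short
sketch, assuming Python and the constants C = 425.15 and D = 168.2 quoted
above, that computes the cutoff and lists the predicted value beside each
observed count (the function name predicted is made up for this
illustration):

```python
import math

C, D = 425.15, 168.2   # constants quoted in the reply
counts = [430, 302, 253, 182, 152, 129, 105, 68, 48, 22, 13, 7, 2, 1]


def predicted(x):
    """Value of the fitted curve y = C - D*ln(x)."""
    return C - D * math.log(x)


# y = 0  =>  ln(x) = C/D  =>  x = exp(C/D)
cutoff = math.exp(C / D)
print(f"y reaches zero at x = {cutoff:.2f}")

for x, observed in enumerate(counts, start=1):
    print(f"{x:2d}  observed {observed:4d}  predicted {predicted(x):7.1f}")
```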

Distinguishing among all of the many possible modifications to this
equation will require more data: either more "counts" at each x location,
or extending the data to larger values of x -- or both.

I hope that this helps. Please write back if you need further help with
this.

- Doctor Douglas, The Math Forum
  http://mathforum.org/dr.math/ 



Date: 07/10/2012 at 15:10:20
From: Paul
Subject: Thank you (What is the equation for these data)

Dr. Douglas,  

Thank you so much for the response.  

I should have noted there were limits: x ranges from 1 to 14. My
apologies; that may have been useful information.

I'll repost with more specifics and see if it sheds light on the problem.

Again, a special thanks to you and all the volunteers who make this
resource great.

/respect



Date: 07/10/2012 at 15:34:19
From: Paul
Subject: What is the equation for these data

Again, thank you Dr. Douglas, for your response.

To be more specific about the data points ...

These data are calculated from a horse racing program I'm working on. 
Specifically, the x values range from 1-14, where 1 is the "favorite" in a
given race (has odds of, e.g., 2/1), 2 is the next favorite, and so on ...
to 14, which is the long shot. Each time one of these "ranks" (x value)
finishes in first place, I add it to the total. Thus (x1, y1), ..., 
(x14, y14).

In the short run, the curve that is generated by these data may not be
smooth, suggesting anomalies or just nuances of random distribution. In
the long run, the curve takes on a definite smooth shape.

The idea is to "know" the long run equation and apply it to the short run
data to identify anomalies. Perhaps having a "best fit" equation would be
more useful/likely than an improbable exact equation.
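
A rough sketch of that idea in Python, under several assumptions that are
not part of this thread: the long-run curve is y = C - D*ln(x) with the
constants quoted in the reply above, each rank's long-run win share is the
fitted value divided by the 1714 wins in the table, and a rank is flagged
when its short-run wins differ from the expected number by more than two
binomial standard deviations. The function names, the threshold, and the
short-run sample are all hypothetical.

```python
import math

# Long-run fit quoted earlier in this thread: y = C - D*ln(x).
C, D = 425.15, 168.2
LONG_RUN_TOTAL = 1714   # sum of the 14 long-run counts in the table


def expected_share(rank):
    """Long-run fraction of races won by this rank, per the fitted curve."""
    return (C - D * math.log(rank)) / LONG_RUN_TOTAL


def flag_anomalies(short_run_wins, n_races, threshold=2.0):
    """Return the ranks whose win count lies more than `threshold`
    binomial standard deviations from the long-run expectation."""
    flagged = []
    for rank, wins in sorted(short_run_wins.items()):
        p = expected_share(rank)
        if p <= 0:
            continue   # the fitted curve is not usable beyond x = 12
        expected = n_races * p
        sd = math.sqrt(n_races * p * (1 - p))
        if abs(wins - expected) > threshold * sd:
            flagged.append(rank)
    return flagged


# HYPOTHETICAL short-run sample, made up for illustration: wins by ranks
# 1..8 in 110 races (the remaining races were won by ranks above 8).
sample = {1: 25, 2: 17, 3: 15, 4: 11, 5: 20, 6: 7, 7: 5, 8: 4}
print(flag_anomalies(sample, n_races=110))
```

With this made-up sample, only rank 5 should stand out. A threshold of two
standard deviations will still flag some ranks by chance alone, so a flag
is a prompt to look closer, not proof of an anomaly.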

The "y = C - D*ln(x)" suggestion actually looks promising. To be honest, x
values greater than 8 aren't really important in my application of the
data; I just included them for a complete dataset.

Looks like I'm going to have to brush up on the "ln(x)" term!

Please respond if the additional information has produced fruit.

Thanks again!



Date: 07/11/2012 at 15:32:47
From: Doctor Douglas
Subject: Re: What is the equation for these data

Hi again, Paul.

Thanks for providing the context behind the data.  

Now we see that the cutoff at x = 14 arises from how the data are
provided, rather than as a result of fitting an equation. This removes the
worry that the logarithmic equation y = C - D ln(x) is somehow
misrepresenting small-but-finite probabilities at large values of x, so
overall it's much more satisfying.

Now, maybe you are curious as to why this logarithmic curve describes the
data. I know I'm certainly curious! It seems to me that it has to do with
the actual mechanics of races and horses (e.g., the variations in how
horses perform, and how to handle scratches or incomplete fields), as well
as the business and psychology of oddsmaking and betting (e.g., the
well-known bias of emphasizing longshots when placing bets).

Good Luck!

- Doctor Douglas, The Math Forum
  http://mathforum.org/dr.math/ 
Associated Topics:
High School Equations, Graphs, Translations

Ask Dr. Math™
© 1994-2013 The Math Forum
http://mathforum.org/dr.math/