Data that Decay, and a Curve to Fit Them
Date: 07/09/2012 at 22:24:06
From: Paul
Subject: What is the equation for these data

I'm trying to derive the equation for a curve.

The best way to describe how the data are collected is to imagine a Galton machine and normal distribution. As each point of data is collected, it fits into category 1-14, with 1 being the most likely, and 14 being least likely.

Here's sample data so far:

   Category   Count
   --------   -----
       1       430
       2       302
       3       253
       4       182
       5       152
       6       129
       7       105
       8        68
       9        48
      10        22
      11        13
      12         7
      13         2
      14         1

If this is graphed out, it creates a reasonably smooth curved line. Even though there are "anomalies" just like a Galton machine would produce, there must be an equation that describes these data. I'd like to know what it is, and need help figuring it out.

My math background, while decent, stops short of this level. I've had Calc I in college, but that was some years ago. I can still remember derivatives, maximums and minimums, but even that's a little fuzzy. I have a little statistics knowledge as well, but again, it's been a while.

I'm guessing here, so be nice ... but looking at the shape of the graphed data, I think the formula has a model similar to

   y = ax^n + bx + c

or possibly something with a 1/ax. Either way, as x approaches infinity, y approaches 0, so there's clearly an inverse relationship between them. x^n looks good for the general curve, but the data are skewed in favor of lower x's, so more are needed.

I'm thinking y = ax^n + bx + c is in the right direction; but I'm guessing, and anything more would be wild speculation.

I'd really like to be able to figure out the equation that best describes these data. I realize I may need to do some heavy reading/learning, but I think I'm clever enough to get there.

Can you help?
Date: 07/10/2012 at 12:33:17
From: Doctor Douglas
Subject: Re: What is the equation for these data

Hi Paul,

Thanks for submitting your question to the Math Forum.

As you are probably aware, the task of modeling real world data with mathematical equations isn't always a straightforward one. If you had some physical explanation for where the data comes from and how it *ought* to be modeled, you might very well want to use that as a starting point. But even without a physical basis from which to start, there are some things we can do.

A good starting point is to first graph y vs. x. When we do this, we see that the shape of the curve is smooth and decaying towards zero. This suggests that polynomial fitting isn't the best way to go.

The next thing to do is to re-plot the graph so that both axes are logarithmic. The x and y values are all positive, so this is reasonable to do. Depending on how straight the curve is when you do this and whether it curves up or down, you can infer various forms of the dependence, such as the following:

   y = A * exp(-K*x)    <-- exponential decay, K > 0
   y = B * x^(-P)       <-- power law decay, P > 0

When I did this with your data using a curve fitting program, I found that neither of these rules described your data over the range given. Instead, a logarithmic dependence seemed to work better:

   y = C - D*ln(x)

In particular, let C = 425.15 and D = 168.2.

But even this equation might still not be right, since it predicts a sharp cutoff: y goes negative when x exceeds 12.52, so that nonzero data values for x = 13 and x = 14 are in the noise. But this curve seems to work quite well up to x = 9 or so.

Distinguishing among all of the many possible modifications to this equation will require more data: either more "counts" at each x location, or extending the data to larger values of x -- or both.

I hope that this helps. Please write back if you need further help with this.

- Doctor Douglas, The Math Forum
  http://mathforum.org/dr.math/
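For readers who want to try the comparison themselves, here is a minimal sketch in plain Python. It is not the curve fitting program Doctor Douglas used, which the thread does not name. It assumes ordinary unweighted least squares, and it fits the exponential and power-law forms on log-transformed counts, which is a simplification chosen here for brevity. The helper names `linfit` and `sse` are made up for this example.

```python
import math

# Win counts per favorite rank, copied from Paul's first post.
ranks = list(range(1, 15))
counts = [430, 302, 253, 182, 152, 129, 105, 68, 48, 22, 13, 7, 2, 1]

def linfit(u, v):
    """Ordinary least-squares line v ~ intercept + slope*u (hypothetical helper)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    sxx = sum((a - mean_u) ** 2 for a in u)
    sxy = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    slope = sxy / sxx
    return slope, mean_v - slope * mean_u

def sse(predicted):
    """Sum of squared errors against the observed counts."""
    return sum((obs - p) ** 2 for obs, p in zip(counts, predicted))

ln_x = [math.log(r) for r in ranks]
ln_y = [math.log(c) for c in counts]

# Logarithmic model y = C - D*ln(x): a straight line in ln(x).
slope, C = linfit(ln_x, counts)
D = -slope
sse_log = sse([C - D * lx for lx in ln_x])

# Exponential model y = A*exp(-K*x): a straight line in (x, ln y).
# Assumption: fitted in log space, not by nonlinear least squares.
slope_exp, ln_A = linfit(ranks, ln_y)
sse_exp = sse([math.exp(ln_A + slope_exp * r) for r in ranks])

# Power-law model y = B*x^(-P): a straight line in (ln x, ln y).
# Same log-space assumption as above.
slope_pow, ln_B = linfit(ln_x, ln_y)
sse_pow = sse([math.exp(ln_B + slope_pow * lx) for lx in ln_x])

# The logarithmic curve crosses y = 0 at x = exp(C/D).
cutoff = math.exp(C / D)

print(f"log fit: C = {C:.2f}, D = {D:.2f}, y = 0 at x = {cutoff:.2f}")
print(f"SSE  log: {sse_log:.0f}  exponential: {sse_exp:.0f}  power law: {sse_pow:.0f}")
```

Under these assumptions the logarithmic fit lands close to the C and D quoted above and has the smallest squared error of the three. A different fitting method or weighting could shift the numbers somewhat.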
Date: 07/10/2012 at 15:10:20
From: Paul
Subject: Thank you (What is the equation for these data)

Dr. Douglas,

Thank you so much for the response.

I should have noted there were limits: x ranges from 1 to 14. My apologies; that may have been useful information.

I'll repost with more specifics and see if it sheds light on the problem.

Again, a special thanks to you and all the volunteers who make this resource great.

/respect
Date: 07/10/2012 at 15:34:19
From: Paul
Subject: What is the equation for these data

Again, thank you Dr. Douglas, for your response.

To be more specific about the data points ...

These data are calculated from a horse racing program I'm working on. Specifically, the x values range from 1-14, where 1 is the "favorite" in a given race (has odds of, e.g., 2/1), 2 is the next favorite, and so on ... to 14, which is the long shot.

Each time one of these "ranks" (x value) finishes in first place, I add it to the total. Thus (x1, y1), ..., (x14, y14).

In the short run, the curve that is generated by these data may not be smooth, suggesting anomalies or just nuances of random distribution. In the long run, the curve takes on a definite smooth shape.

The idea is to "know" the long run equation and apply it to the short run data to identify anomalies. Perhaps having a "best fit" equation would be more useful/likely than an improbable exact equation.

The "y = C - D*ln(x)" suggestion actually looks promising. To be honest, x values greater than 8 aren't really important in my application of the data; I just included them for a complete dataset.

Looks like I'm going to have to brush up on the "ln(x)" term!

Please respond if the additional information has produced fruit.

Thanks again!
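One possible way to apply the long-run curve to a sample, sketched under stated assumptions: the thread does not say how Paul planned to flag anomalies, so the Pearson residual (observed - expected) / sqrt(expected) and the cutoff of 2 used here are choices made for this illustration, as are the function names. The sample shown is simply the counts for ranks 1-8 from the first post; a real short-run sample would replace the `observed` list.

```python
import math

# Coefficients quoted in Doctor Douglas's reply for y = C - D*ln(x).
C, D = 425.15, 168.2

def curve(rank):
    """Long-run curve value at a given favorite rank."""
    return C - D * math.log(rank)

def expected_counts(total, ranks):
    """Rescale the curve so the expected counts sum to the sample total.

    Assumption: only ranks where the curve is positive are passed in.
    """
    weights = [curve(r) for r in ranks]
    scale = total / sum(weights)
    return [w * scale for w in weights]

def pearson_residuals(observed, expected):
    """(observed - expected) / sqrt(expected), a rough Poisson-style z-score."""
    return [(o - e) / math.sqrt(e) for o, e in zip(observed, expected)]

# Ranks 1-8 only, since Paul says larger ranks matter little to him.
ranks = list(range(1, 9))
observed = [430, 302, 253, 182, 152, 129, 105, 68]  # from the first post

expected = expected_counts(sum(observed), ranks)
residuals = pearson_residuals(observed, expected)

THRESHOLD = 2.0  # illustrative cutoff, not taken from the thread
for r, o, e, z in zip(ranks, observed, expected, residuals):
    flag = "  <-- look closer" if abs(z) > THRESHOLD else ""
    print(f"rank {r}: observed {o:4d}, expected {e:7.1f}, residual {z:+.2f}{flag}")
```

With a small sample the expected counts become small and the residuals noisy, so a flag is best read as a prompt to collect more races, not as proof of an anomaly.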
Date: 07/11/2012 at 15:32:47
From: Doctor Douglas
Subject: Re: What is the equation for these data

Hi again, Paul.

Thanks for providing the context behind the data. Now we see that the cutoff at x = 14 arises from how the data are provided, rather than as a result of fitting an equation. This removes the worry that the logarithmic equation y = C - D ln(x) is somehow misrepresenting small-but-finite probabilities at large values of x, so overall it's much more satisfying.

Now, maybe you are curious as to why this logarithmic curve describes the data. I know I'm certainly curious! It seems to me that it has to do with the actual mechanics of races and horses (e.g., the variations in how horses perform, and how to handle scratches or incomplete fields), as well as the business and psychology of oddsmaking and betting (e.g., the well-known bias of emphasizing longshots when placing bets).

Good Luck!

- Doctor Douglas, The Math Forum
  http://mathforum.org/dr.math/
© 1994- The Math Forum at NCTM. All rights reserved.