
Re: Explanation for why linear regression is a poor fit
Posted:
Feb 4, 2013 7:25 PM


"David Jones" wrote in message news:kepism$als$1@speranza.aioe.org...
"Paul" wrote in message news:3721e51d6de4490d923cf1c59f013bff@googlegroups.com...
On Monday, February 4, 2013 3:55:36 PM UTC-5, em.de...@gmail.com wrote:
> I haven't taken stats in a few years and recently there have been a lot
> thrown around my workplace, including the attached graph (and raw data).
> I realize that a low R^2 means that the linear regression is not a good fit,
First, as Dave notes, you have time series data here. Moreover, the spacing of the dates is irregular. If the regression is col/day vs. date, I hope whoever ran the regression used the actual dates and not an index (1, 2, 3, ...) as the predictor variable. (Also, as Dave mentions, there may be better tools than simple regression given that it's a time series.)
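To illustrate the point about dates versus an index, here is a minimal sketch in Python. The sampling dates and values are made up for illustration (they are not the poster's data), and `fit_line` is a hypothetical helper, not anything used in the original analysis. The values decline by exactly 0.1 per day, so regressing on elapsed days recovers that slope, while regressing on the index distorts both the slope and the fit.

```python
from datetime import date

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

# Hypothetical, irregularly spaced sampling dates (not the poster's data).
dates = [date(2012, 1, 1), date(2012, 1, 8), date(2012, 3, 1),
         date(2012, 9, 1), date(2013, 1, 1)]
days = [(d - dates[0]).days for d in dates]   # elapsed days since first sample
values = [100.0 - 0.1 * t for t in days]      # exact decline of 0.1 per day

slope_days, r2_days = fit_line(days, values)
slope_index, r2_index = fit_line(list(range(len(dates))), values)

print(f"dates as predictor: slope={slope_days:.4f} per day, R^2={r2_days:.4f}")
print(f"index as predictor: slope={slope_index:.4f} per step, R^2={r2_index:.4f}")
```

With the index as predictor, the slope is in units of "per sample" rather than "per day", and a trend that is perfectly linear in time no longer looks linear.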
Second, your data have high variance. A low R^2 does not necessarily signal a poor fit (in the sense of an incorrect model), although it may signal that the regression model does not have enough predictive power to do you much good. When the data are quite noisy, sometimes a low R^2 is the best you can do (and sometimes the model actually has some value).
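A small simulation can make this concrete. The sketch below is only an illustration under assumed numbers: the true slope, noise level, and sample size are invented, and `fit_line` is a hypothetical helper. The data are generated from a model that really is linear, yet the noise is large enough that R^2 comes out very low.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (slope, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)

random.seed(0)          # fixed seed so the run is repeatable
TRUE_SLOPE = -0.5       # assumed true trend (illustrative only)
NOISE_SD = 10.0         # assumed noise level, large relative to the trend
n = 500

x = [10.0 * i / (n - 1) for i in range(n)]   # evenly spaced on [0, 10]
y = [50.0 + TRUE_SLOPE * xi + random.gauss(0.0, NOISE_SD) for xi in x]

slope_hat, r2 = fit_line(x, y)
print(f"estimated slope = {slope_hat:.3f} (true slope = {TRUE_SLOPE})")
print(f"R^2 = {r2:.4f}")
```

Here the linear model is the correct one by construction, so the low R^2 reflects the noise, not a wrong model form.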
> but it produces a p-value of 0.025.
If this is the p-value of the usual F-test, all it says is that your trend model fits better than assuming a constant mean. It does not say the trend model is correct (or that a better model cannot be found).
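For simple linear regression with an intercept, the usual F statistic for "trend versus constant mean" can be written in terms of R^2 and the sample size as F = R^2 / (1 - R^2) * (n - 2), with 1 and n - 2 degrees of freedom. The sketch below plugs in the N = 498 and r2 = 0.01 quoted elsewhere in this thread. The p-value step is an assumption on my part: it uses a large-sample normal approximation (treating sqrt(F) as a standard normal z) rather than the exact F distribution, and the function names are hypothetical.

```python
import math

def f_from_r2(r2, n):
    """F statistic (1 and n-2 df) for simple linear regression vs. a constant mean."""
    return r2 / (1.0 - r2) * (n - 2)

def approx_p_value(f_stat):
    """Large-sample approximation: treat sqrt(F) as a standard normal z
    and return the two-sided tail probability."""
    z = math.sqrt(f_stat)
    return math.erfc(z / math.sqrt(2.0))

f_stat = f_from_r2(0.01, 498)   # figures quoted in the thread
p = approx_p_value(f_stat)

print(f"F = {f_stat:.3f} on 1 and 496 df")
print(f"approximate p-value = {p:.4f}")
```

This shows how a tiny R^2 and a small p-value can coexist: with nearly 500 observations, even a trend that explains about 1% of the variance can be distinguished from a flat line.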
> I can't formulate a solid argument because I don't understand the
> material well enough. Am I incorrect in saying this is a poor fit? Even
> visually to me it looks like a poor fit. Additionally, he says things
> like: "FC Count at Samish River/Thomas Road: N = 498, r2 = 0.01, p =
> 0.025, meaning it is significant at 97.5% confidence" I know you can't
> use p-values to describe stats like this.
Mixing "significant at" and "confidence" is IMHO sloppy use of terminology, but the underlying intent is not necessarily wrong.
> I need help explaining why this data isn't showing a significant
> declining trend with a linear regression (unless of course I am
> incorrect.)
It looks like a declining trend to me. Whether the rate is _practically_ significant is an open question. I would not be surprised if it proved to be statistically significant even with a more careful analysis.
=================================================
If something like an F-test is being used, the plots show that the assumptions necessary for the validity of this test clearly do not hold. However, something like a permutation test might be applicable, and it looks likely to find a significantly negative slope for the regression line. There also appear to be other important changes in pattern going on that are not well described by a linear trend in location. The plot suggests a change in behaviour at the lower end of the distribution. I see that previous reports from this project have used log-transformed data; this would be easy to try and might provide a better view of the lower end of the observation scale. Similarly, past work has at least looked at differences across seasons, and it would be worth extending this to look at the possibility of different trends in different parts of the year. (Found previous work at https://fortress.wa.gov/ecy/publications/publications/0803029.pdf page 41 (Nov 2008): this previous report also mixes significance and confidence badly, as noted by Paul above.)
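As a rough illustration of the permutation-test idea applied to log-transformed data, here is a minimal Python sketch. Everything in it is an assumption for illustration: the counts are simulated (not the Samish River data), the function names are hypothetical, and the test is one-sided for a negative slope. It also assumes the observations are exchangeable under the null hypothesis, which may not hold for autocorrelated or seasonal time series.

```python
import math
import random

def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx

def permutation_p_value(x, y, n_perm=2000, seed=0):
    """One-sided permutation p-value for a negative slope.
    Shuffles y against x and counts slopes at least as negative as observed."""
    rng = random.Random(seed)
    observed = slope(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if slope(x, shuffled) <= observed:
            hits += 1
    # Add 1 to numerator and denominator so the p-value is never exactly 0.
    return observed, (hits + 1) / (n_perm + 1)

# Simulated skewed counts with a declining trend (illustrative only).
rng = random.Random(1)
x = list(range(60))
counts = [math.exp(5.0 - 0.05 * xi + rng.gauss(0.0, 1.0)) for xi in x]
log_counts = [math.log(c) for c in counts]   # log-transform before fitting

obs_slope, p_value = permutation_p_value(x, log_counts)
print(f"slope on log scale = {obs_slope:.4f}, permutation p = {p_value:.4f}")
```

The appeal of this approach is that it does not rely on normally distributed, constant-variance errors, which is where the F-test assumptions look doubtful in the plots.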
David Jones
(addition: found also http://www.skagitcounty.net/PublicWorksCleanWater/Documents/wqreport2010/2010%20Annual%20Report%20Final.pdf , which has plots that show seasonal behaviour for various quantities including FC, possibly indicating a step change in behaviour for one site on Thomas Creek, data to 2009)
David Jones

