Topic: Help with understanding outliers
Replies: 8
Last Post: Jun 30, 2012 11:04 AM




Help with understanding outliers
Posted: Jun 22, 2012 9:51 PM


I am not a math person and am having trouble figuring out who is right in the exchange below. I have left that exchange intact, but here is a summary.
There is this data: http://tmp.gallopinginsanity.com/LinuxTrendMar2012Snitvscc.png and part of it seems to show an upward trend: http://tmp.gallopinginsanity.com/LinuxTrend20112ndhalf.png
One person claims the upward trend shown in the second link is worth considering, while the other says it should be ignored because those points are just "outliers". Far more detail is given below.
My questions: should the data in that second link be considered outliers? Can outliers form such a trend? Should the data for that time period be accepted as a sign that the numbers increased, even if only temporarily?
What do the 'real' math folks think?
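For anyone who wants to experiment with the question, here is a minimal Python sketch. The 24 values are invented for illustration (18 roughly flat points followed by 6 that climb); they are not the numbers behind the charts linked above. The sketch applies one common rule of thumb, Tukey's fences (1.5 times the interquartile range), which is only one of several conventions, and the helper name `tukey_outliers` is my own.

```python
import statistics

# Made-up monthly values, NOT the data from the linked charts:
# 18 roughly flat points, then 6 points that climb steadily.
baseline = [0.95, 1.00, 0.98, 1.02, 0.97, 1.01, 0.99, 1.03, 0.96,
            1.00, 1.02, 0.98, 1.01, 0.97, 1.00, 0.99, 1.02, 0.98]
ramp = [1.10, 1.18, 1.25, 1.33, 1.40, 1.48]
values = baseline + ramp


def tukey_outliers(data, k=1.5):
    """Return indices of points outside Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(data) if v < low or v > high]


flagged = tukey_outliers(values)
print("flagged indices:", flagged)
```

With these made-up numbers the rule flags only the last four points of the climbing run. So a mechanical rule can flag points that sit in an unbroken sequence, but the rule alone does not say whether those points are errors to discard or a real, possibly temporary, change.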
On Jun 22, 9:38 pm, Snit <use...@gallopinginsanity.com> wrote:
> On 6/22/12 8:00 AM, in article
> de637d5aefbc45e4a6eedd7fce6dbea4@googlegroups.com, "cc"
> <scatnu...@hotmail.com> wrote:
>
> > On Friday, June 22, 2012 10:34:26 AM UTC-4, Snit wrote:
> >> On 6/22/12 5:03 AM, in article
> >> 54c0a0e17e474cb0b1506c4d4ac48fed@googlegroups.com, "cc"
> >> <scatnu...@hotmail.com> wrote:
>
> >>> On Thursday, June 21, 2012 6:30:57 PM UTC-4, Snit wrote:
> >>>> On 6/21/12 1:50 PM, in article
> >>>> acf3401765244ba09b608d549e16fe92@googlegroups.com, "cc"
> >>>> <scatnu...@hotmail.com> wrote:
>
> >>>>> On Thursday, June 21, 2012 3:35:41 PM UTC-4, Steve Carroll wrote:
>
> >>>>>> Claiming that cc is wrong without knowing what he did is just
> >>>>>> stupid... and you've engaged in enough stupidity already... to the
> >>>>>> point where it looks like everyone has "better knowledge" on the topic
> >>>>>> than you do;)
>
> >>>>> Excellent link. Thanks. I wonder what sentence Snit will pull out and
> >>>>> misunderstand.
>
> >>>> Remember, Carroll just follows me around lying and trolling. He does not
> >>>> even try to make sense.
>
> >>> What in that link was a lie or a troll? I understand that probably alot of
> >>> what was written there doesn't make sense to you, but I'm not sure how
> >>> linking to a mathematical process is lying or trolling.
>
> >> This whole stupid debate has been filled with you lying.
>
> > I was unaware that you considered mathematical facts to be lies.
>
> > R^2 values of two trendlines over the same dataset can be compared to see
> > which trend line fits better. This is a fact. Mine had a better R^2 value.
>
> > There is a well-defined (several actually) method for determining outliers.
> > This is a fact. I found the outliers, you did not.
>
> We have already discussed how you missed the change in the trend... you call
> them "outliers" (though you have also used the term "erroneous" and
> others... all rather silly of you). Let us be more specific on why your
> claim that the data from the latter half of 2011 should be seen as
> "outliers"
>
> <http://en.wikipedia.org/wiki/Outlier>
> -----
>     An outlying observation, or outlier, is one that appears to
>     deviate markedly from other members of the sample in which it
>     occurs.
> -----
>
> But if you look at the data from the latter half of 2011:
>
> <http://tmp.gallopinginsanity.com/LinuxTrend20112ndhalf.png>
>
> Those data points show a clear and very strong trend (even if nobody
> predicted that trend would continue unchanged for any great length of time).
> Those data points do *not* deviate "markedly from other members of the
> sample". This can be seen with the high R^2 value. Even looking at the
> greater set of data:
> <http://tmp.gallopinginsanity.com/LinuxTrendMar2012Snitvscc.png>
>
> It is *very* clear that there is an upward trend at the latter half of
> 2011... those data points are forming a pattern. The same Wikipedia page
> speaks of using caution that you did not:
> -----
>     Caution: Unless it can be ascertained that the deviation is
>     not significant, it is ill-advised to ignore the presence of
>     outliers.
> -----
>
> From your description I have not understood what you did to "ascertained
> that the deviation is not significant". Maybe you can explain that.
>
> From my view, the fact that it was not a single data point that seemed "off"
> but a set of at least six concurrent ones in a very clear trend discount
> them as being ignored as meaningless "outliers". But I am open to your
> explanation... what makes you think those six data point with such a strong
> and clear trend (an R^2 value of over 0.98, and this is *without* weighing
> or assuming any outliers, etc.) is "*ONE* that appears to deviate markedly
> from the other members of the sample" (emphasis mine... but the importance
> of it being *ONE* data point is an important thing to keep in mind). This
> does not mean there cannot be more than one outlier in a sample - but that
> points that make a trend of their own are not occurring as a single
> "outlier".
>
> In case you do not want to accept the single definition from Wikipedia, I
> found those for you so you can better understand what an outlier is:
>
> <http://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm>
> -----
>     An outlier is an observation that lies an abnormal distance
>     from other values in a random sample from a population. In a
>     sense, this definition leaves it up to the analyst (or a
>     consensus process) to decide what will be considered
>     abnormal. Before abnormal observations can be singled out, it
>     is necessary to characterize normal observations.
> -----
>
> Again, "an observation"... and again it makes it clear that the
> determination of if something is "an outlier" (one) is subjective.
>
> <http://www.statsoft.com/textbook/basicstatistics/#Correlationse>
> -----
>     Outliers. Outliers are atypical (by definition), infrequent
>     observations.
> -----
>
> But you called these 6 of 24 data points "outliers"... and to keep your
> claim of 1% at all times, you might also include Feb 2012. Even if not, you
> are deeming 25% of the data as being "outliers". This is not consistent
> with the idea that they would be "atypical".
>
> -----
>     Needless to say, one should never base important conclusions
>     on the value of the correlation coefficient alone (i.e.,
>     examining the respective scatterplot is always recommended).
> -----
>
> This is what I have been telling you. Looking just at the linear trend line
> is *not* sufficient, esp. when you are assuming that 25% of your data points
> are "outliers" The same link gets even more clear:
>
> -----
>     Nonlinear Relations between Variables. Another potential
>     source of problems with the linear (Pearson r) correlation is
>     the shape of the relation. As mentioned before, Pearson r
>     measures a relation between two variables only to the extent
>     to which it is linear; deviations from linearity will
>     increase the total sum of squared distances from the
>     regression line even if they represent a "true" and very
>     close relationship between two variables. The possibility of
>     such nonlinear relationships is another reason why examining
>     scatterplots is a necessary step in evaluating every
>     correlation. For example, the following graph demonstrates an
>     extremely strong correlation between the two variables which
>     is not well described by the linear function.
> -----
>
> As I have been telling you: when the data is nonlinear, as the data in this
> case is not, then one *must* look at the data itself. You did not - hence
> the reason why you missed the upward trend of the latter half of 2011.
>
> But there are more resources to help you understand this:
>
> <http://mathworld.wolfram.com/Outlier.html>
> -----
>     An outlier is an observation that lies outside the overall
>     pattern of a distribution (Moore and McCabe 1999). Usually,
>     the presence of an outlier indicates some sort of problem.
>     This can be a case which does not fit the model under study,
>     or an error in measurement.
>
>     Outliers are often easy to spot in histograms. For example,
>     the point on the far left in the above figure is an outlier.
> -----
>
> If you look at the graph, you can see it shows what appears to be a *true*
> outlier... a single point that is significantly different from the rest of
> the data. Your "outliers" are 25% of the data and form a clear trend. This
> means they are not "outliers" at all, but a trend that is seen in the
> overall data. A trend that fit my vague prediction (which, to remind you,
> does not prove causation).
>
> But there is more:
> <http://www.experimentresources.com/statisticaloutliers.html>
> -----
>     Statistical outliers are data points that are far removed and
>     numerically distant from the rest of the points.
> -----
>
> Calling 25% of the data points "outliers" is a bit silly.... esp. when they
> show such a strong trend. Points that form such a strong tend *cannot* be
> "far removed and numerically distant from the rest of the points".
>
> And I found many more examples... pretty much any reasonable page that talks
> about outliers will make it clear why
>
> 1) Such determinations are largely subjective - contrary to your claim that
> they are "fact"
> 2) Cannot include 25% of the data - esp. when that 25% of the data are
> points in a direct series which show a *very* clear trend (even if a
> non-lasting trend).
>
> Your "outlier" claim is a bit absurd - and, again, shows how you do not
> really get the concept of what you are talking about. This is the same as
> when you insisted sigma lines could not be based on the distance from the
> mean (they can - they are based on the distance from the mean to the
> inflection points) and your claim that the depictions I showed you were fine
> when it was *very* clear they were not. And you *know* this... hence the
> reason you repeatedly snip your own comments and refuse to answer questions
> on these topics.
>
> You were shown to be wrong about sigma lines. Now you have shown yourself
> to be wrong about outliers in data.
>
> Really, is there anything you can point to where you can claim to be right?
>
> > The latter half of 2011, which you love to point to, is mostly made up of
> > outliers. This is a fact.
>
> A "fact"? Based on what. Also from the same Wikipedia page:
> -----
>     There is no rigid mathematical definition of what constitutes
>     an outlier; determining whether or not an observation is an
>     outlier is ultimately a subjective exercise.
> -----
>
> In other words, it is not a "fact" but a subjective *opinion*. And I find
> that opinion to be rather absurd given that we are *not* talking about *ONE*
> point but a set - and that set has a very clear trend. Amazingly clear,
> really.
>
> > You have changed datasets to just using the later half of 2011 to try and
> > prove your point, since my trendline refuted your original point using your
> > original data set. This is a fact.
>
> Incorrect.
>
> > You refuse to acknowledge that the 2011 data you now insist on using is made
> > up of almost entirely outliers, even though it's not a matter of opinion. This
> > is a fact.
>
> Incorrect.
>
> > You confuse R^2 with outliers and try to point to the high R^2 value for your
> > 2011 data as some sort of proof of nonexistence of outliers. This is a fact.
>
> Incorrect.
>
> > There have been no lies from me at all this entire time. I cannot say the same
> > for you though. You consistently try to pretend I've said things I have not,
> > and continue to repeat those things even though you've been correct numerous
> > times. I'm sorry you got your ass handed to you, but unfortunately for you,
> > it's math and cannot really be refuted.
>
> Your claims are incorrect. I am not suggesting the math is wrong.
>
> > Desktop Linux has been flatlined for quite some time now. Most people with
> > common sense realize this to be true, and now I've proven it to you.
>
> Let's try different a tact here. I have been very open with the places
> where I see where I have been wrong or did not handle things as well as you
> should have. I am an honest and open person. For example: I was wrong in
> my predictions for the trend in 2012 and I did not handle things as well as
> I should have when I did not note the nonlinear nature of the trend before
> I did. In one I was wrong - in the other I did not handle things as well as
> I did.
>
> Let us test your honesty and openness: where do you think *you* have been
> wrong... and no backhanded insults with that... just a sincere statement of
> where you admit you were wrong. Can you think of *any* place in this whole
> debate? Any at all?
>
> My guess: you will not be willing to admit to any. I sincerely hope you
> prove me wrong... (that would give me something else to add to my list). My
> guess though is you are so tied to having to "prove" you are right - no
> matter how wrong you have been - that you will simply avoid this question.
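Since much of the exchange above turns on R^2 values, here is a minimal Python sketch of how they behave on a full series versus a hand-picked run. The data is the same kind of invented series (18 flat points, then 6 rising ones), not the numbers from the linked charts, and `r_squared` is a helper written for this illustration. It computes R^2 for a straight-line least-squares fit as the squared Pearson correlation.

```python
def r_squared(xs, ys):
    """R^2 of a simple least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("R^2 is undefined when x or y has zero variance")
    return (sxy * sxy) / (sxx * syy)


# Made-up monthly values, NOT the data from the linked charts.
values = [0.95, 1.00, 0.98, 1.02, 0.97, 1.01, 0.99, 1.03, 0.96,
          1.00, 1.02, 0.98, 1.01, 0.97, 1.00, 0.99, 1.02, 0.98,
          1.10, 1.18, 1.25, 1.33, 1.40, 1.48]
months = list(range(len(values)))

r2_full = r_squared(months, values)            # line through all 24 points
r2_tail = r_squared(months[-6:], values[-6:])  # line through the last 6 only

print(f"R^2, all 24 points: {r2_full:.3f}")
print(f"R^2, last 6 points: {r2_tail:.3f}")
```

The fit over the last six points scores much higher here than the fit over all 24. That only says the six points lie close to a line of their own. It does not by itself settle whether they are outliers relative to the other 18, and R^2 values computed on different sets of points are not directly comparable the way two fits over the same dataset are.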



