Topic: Lin. regression, probability that a sample belongs to the data set?
Replies: 6   Last Post: Aug 14, 2014 9:13 PM

 Tikkuhirvi Tietavainen Posts: 99 Registered: 4/22/08
Re: Lin. regression, probability that a sample belongs to the data set?
Posted: Aug 14, 2014 6:33 AM

Jeff <milleratotago@yahoo.com> wrote in message <592a8fc0-182f-4596-a605-3c312014dab5@googlegroups.com>...
> On Thursday, August 14, 2014 2:16:10 AM UTC+12, Aino wrote:
>

> > Jeff, the main reason why the linear discriminant analysis is not working for me is simply because my boss didn't like it. :) Don't ask me why.
>
> It is tough to argue with that.

>
> > However, if I have understood correctly, linear discriminant analysis is not an ideal method to use if you have badly behaving regression lines, such as lines that are crossing. Correct me if I'm wrong.
>
> No, there is no problem with crossing lines. Linear discriminant analysis does rely on some assumptions, but I think they are just the same ones that your linear regression method would rely on. If these assumptions are far from correct, maybe 2-predictor logistic regression would be a better choice (predicting sample 1/2 from X and Y).

Hmm. This is what I mean: where would basic linear discriminant analysis place its single line to separate these two groups?

-----------------------------------------------------------
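clear; close all; clc;
rng(10)                              % fix the seed so the plot is reproducible
x1 = (1:100)';
x2 = (1:100)';
y1 =  x1 + 10*randn(100,1);          % group 1: slope +1
y2 = -x2 + 10*randn(100,1) + 100;    % group 2: slope -1, crosses group 1 near x = 50
figure; plot(x1,y1,'*'); hold on; plot(x2,y2,'ro')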
-----------------------------------------------------------

I'll have to look into the "2-predictor logistic regression".
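
For reference, a minimal sketch of such a 2-predictor logistic regression on the crossing-lines data above, fitted with plain Newton-Raphson steps so that no toolbox is needed. The 0/1 coding of the two samples, the fixed 25 iterations and all variable names are assumptions made for illustration; they do not come from Jeff's post.

```matlab
% Crossing-lines data as in the example above (unseeded, so numbers vary per run)
x = [(1:100)'; (1:100)'];
y = [x(1:100) + 10*randn(100,1); -x(101:200) + 10*randn(100,1) + 100];
g = [zeros(100,1); ones(100,1)];     % assumed coding: 0 = sample 1, 1 = sample 2

A = [ones(200,1) x y];               % intercept plus the two predictors X and Y
b = zeros(3,1);                      % logistic regression coefficients
for it = 1:25                        % Newton-Raphson (IRLS) updates
    p = 1./(1 + exp(-A*b));          % current estimate of P(sample 2 | x, y)
    W = p.*(1 - p);                  % weights of the Hessian A'*diag(W)*A
    b = b + (A'*(A.*[W W W])) \ (A'*(g - p));
end
p   = 1./(1 + exp(-A*b));            % fitted probability of sample 2 for each point
acc = mean((p > 0.5) == g);          % fraction of training points classified correctly
fprintf('training accuracy = %.2f\n', acc);
```

With only linear terms in x and y, the fitted probabilities are expected to stay close to 0.5 for this symmetric X-shaped data. Adding an interaction column x.*y to A would be one possible way to let the model follow the crossing.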

>
> A simple thought experiment shows that looking for "closeness to regression lines" can't be a very good solution in general. Suppose that sample 1 has X's and Y's in the range of (say) 20-50, whereas sample 2 has them in the range of
> 50-70. Suppose further that the best fitting regression line for each sample is Y=X (slope 1, intercept 0). Since the two samples produce the same regression line, your "closeness to lines" criterion won't discriminate at all. And yet, looking at the actual X and Y values of a new point should do quite well since the centroids of the two samples are so far apart (i.e., 35 vs 60 or so).

I need to think about this. At least I'm fortunate that both groups in my data have X's in the same range.
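
Jeff's thought experiment can be tried out in a few lines. The data below are made up to match his description (both samples close to Y=X, with values in 20-50 versus 50-70); the deterministic sin/cos wiggle standing in for noise and the test point (60,60) are assumptions of this sketch.

```matlab
% Hypothetical data: both samples follow Y = X but cover different ranges
x1 = (20:50)';  y1 = x1 + sin(1:31)';    % sample 1, small deterministic wiggle
x2 = (50:70)';  y2 = x2 + cos(1:21)';    % sample 2
p1 = polyfit(x1,y1,1);                   % [slope intercept] for sample 1
p2 = polyfit(x2,y2,1);                   % [slope intercept] for sample 2

pt = [60 60];                            % assumed new point, inside the range of sample 2
r1 = abs(pt(2) - polyval(p1,pt(1)));     % vertical distance to regression line 1
r2 = abs(pt(2) - polyval(p2,pt(1)));     % vertical distance to regression line 2
d1 = norm(pt - [mean(x1) mean(y1)]);     % distance to centroid of sample 1
d2 = norm(pt - [mean(x2) mean(y2)]);     % distance to centroid of sample 2

fprintf('distance to lines:     %.2f vs %.2f\n', r1, r2);
fprintf('distance to centroids: %.2f vs %.2f\n', d1, d2);
```

Both line distances come out small because the two fitted lines nearly coincide, whereas the centroid distances differ by a large margin, which is the point of the thought experiment.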
>
> Well, I suspect there is a lot about your application that I am misunderstanding. It might help to see a plot of the points from the two samples (plotted in different colors) on an X versus Y scattergram.

I have several data sets that all look a bit different. Some have equal slopes, some don't. In the majority of the data, x and y are approximately normally distributed. Some of the data are log-transformed to make them more normally distributed. Some outliers exist. The X's have the same range in both groups.

Here is an artist's view of one of the data sets. Maybe it gives someone some ideas. However, regardless of which method I should use, I would very much like to solve the problem of how to determine the probabilities that an individual data point belongs to group 1 and to group 2.

------------------------------------------------------------
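clear; close all; clc;
rng(10)                                   % fix the seed so the plot is reproducible
x1 = (1:100)';
x2 = (1:100)';
y1 =  1.6*x1 + 40*randn(100,1);      y1(80:3:100) = 2*y1(80:3:100);   % group 1; every 3rd point from 80 on doubled
y2 = -0.5*x2 + 15*randn(100,1) + 20; y2(60:3:100) = 4*y2(60:3:100);   % group 2; every 3rd point from 60 on quadrupled
figure; plot(x1,y1,'*'); lsline; hold on; plot(x2,y2,'ro'); lsline;   % lsline is in the Statistics Toolbox
plot(90,100,'go'); xlabel('x'); ylabel('y')                           % green circle: the new point to classify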
-------------------------------------------------------------
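
One possible way to turn the two regression fits into membership probabilities for the green point is sketched below. It rests on strong assumptions that are not established in this thread: residuals around each line are normal with constant variance (which the outliers in this example violate), the two groups are equally likely a priori, and the uncertainty of the fitted lines themselves is ignored.

```matlab
% Same kind of data as in the example above (unseeded, so numbers vary per run)
x1 = (1:100)';  x2 = (1:100)';
y1 =  1.6*x1 + 40*randn(100,1);      y1(80:3:100) = 2*y1(80:3:100);
y2 = -0.5*x2 + 15*randn(100,1) + 20; y2(60:3:100) = 4*y2(60:3:100);
xn = 90;  yn = 100;                      % the new point (green circle)

p1 = polyfit(x1,y1,1);                   % least-squares line for group 1
p2 = polyfit(x2,y2,1);                   % least-squares line for group 2
s1 = sqrt(sum((y1 - polyval(p1,x1)).^2)/(numel(x1) - 2));   % residual std, group 1
s2 = sqrt(sum((y2 - polyval(p2,x2)).^2)/(numel(x2) - 2));   % residual std, group 2

% Log normal density of the new point's residual under each line
% (the common constant -0.5*log(2*pi) cancels in the ratio and is dropped)
logL1 = -0.5*((yn - polyval(p1,xn))/s1)^2 - log(s1);
logL2 = -0.5*((yn - polyval(p2,xn))/s2)^2 - log(s2);

prior   = [0.5 0.5];                     % assumed equal prior probabilities
logpost = [logL1 logL2] + log(prior);    % Bayes' rule on the log scale
post    = exp(logpost - max(logpost));   % subtract the max for numerical stability
post    = post/sum(post);                % normalise so the two probabilities sum to 1

fprintf('P(group 1) = %.3f, P(group 2) = %.3f\n', post(1), post(2));
```

If the groups are known to have different sizes or base rates, the prior vector would be the place to put that information.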

-Aino
