
### Why Does the Least Squares Method Work for Line of Best Fit?

Date: 10/05/2007 at 15:11:51
From: Henry
Subject: Derivation of Least squares fit

Hi, my son is doing best fit for HS Algebra.  I have an MS in EE so I'm
in love w/ math.  Is there a place that explains the DERIVATION of the
formula, rather than just saying "Use this equation to solve for M and
C in y = Mx + C"?

Thanks.

Date: 10/13/2007 at 10:40:48
From: Doctor Fenton
Subject: Re: Derivation of Least squares fit

Hi Henry,

Thanks for writing to Dr. Math.

To find the "best" solution of an overdetermined (and inconsistent!)
system of linear equations requires a specification of the criterion
for "best".  For example, to determine the "best" fitting line to a
list of points (x_1,y_1), (x_2,y_2),...,(x_n,y_n), we are trying
to find a line y = mx + b such that

y_1 = m*x_1 + b
y_2 = m*x_2 + b
:  :   :     :
y_n = m*x_n + b   .

If there were just two points (with distinct x-values), there would be
a unique solution, but given more than two points, there is usually no
line through all the points.

Given the parameters m and b which determine a line y = mx + b, we can
describe how much the line fails to fit all the points by forming, for
each point (x_k,y_k), the "residual" r_k = y_k - (m*x_k + b) .  The
residual r_k is the signed vertical distance between the point
(x_k,y_k) and the point where the line y = mx+b crosses the vertical
line x = x_k.  This list of
numbers r_1, r_2,...,r_n characterizes how well the line fits the
points, but we would prefer to have a single number to characterize
the fit.  If we simply add the residuals, there may be cancellation,
and we could get a small number, even when the fit is very poor
(large positive residuals canceling large negative residuals).

To prevent the cancellation, we could take absolute values of the
residuals, but absolute values are not easy to work with.  A more
mathematically attractive approach is to square the residuals, and
find the line for which the sum of the squares of the residuals is a
minimum. This technique is called "least squares fitting".
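To make the idea concrete, here is a small Python sketch (my own
illustration, not part of the original exchange; the data points are
invented) that computes the sum of squared residuals for a candidate
line:

```python
# Invented sample data for illustration
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.9]

def sum_of_squares(m, b):
    """Sum of squared residuals r_k = y_k - (m*x_k + b)."""
    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    return sum(r * r for r in residuals)

# A line close to the data gives a small sum of squares,
# while a poor fit gives a large one.
print(sum_of_squares(1.0, 0.0))   # close fit
print(sum_of_squares(-1.0, 5.0))  # poor fit
```

Least squares fitting then asks: over all choices of m and b, which
one makes this number as small as possible?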

The usual approach is to use the fact that a necessary condition for a
minimum of a function of two or more variables is that the gradient of
the function be 0, which gives you a system of two equations in two
unknowns (the slope m and intercept b of the line).  However, this
approach requires calculus, so it isn't appropriate for an algebra course.

While the problem can be analyzed with just ordinary algebra, it is
extremely tedious, requiring "industrial strength algebra".  A matrix
formulation is much more compact, and illustrates the utility of the
matrix concept and notation.  (Essentially, the pure high school
algebra approach just writes out all the matrix algebra below without
writing the matrices.)

The equations

y_1 = b + m*x_1
y_2 = b + m*x_2
:  :   :     :
y_n = b + m*x_n

can be written in matrix notation as

[y_1]   [b + m*x_1]   [1 x_1]
[y_2]   [b + m*x_2]   [1 x_2] [b]
[ : ] = [    :    ] = [:  : ] [m]
[y_n]   [b + m*x_n]   [1 x_n]

If we write

[y_1]      [1 x_1]
[y_2]      [1 x_2]               [b]
[ : ] = Y ,[:  : ] = A , and X = [m] ,
[y_n]      [1 x_n]

then the equation is

Y = AX  .

The matrix of residuals is

[r_1]   [y_1 - (b + m*x_1)]
[r_2]   [y_2 - (b + m*x_2)]
[ : ] = [        :        ]
[r_n]   [y_n - (b + m*x_n)]

or in matrix notation,

R = Y - AX .

The sum of squares of the residuals is

                                                 [r_1]
(r_1)^2+(r_2)^2+ ... +(r_n)^2 = [r_1 r_2 ... r_n][r_2]
                                                 [ : ]
                                                 [r_n]

or using matrix transpose notation

R^t R  .

(M^t is the matrix M with rows and columns interchanged.)

The quantity to be minimized is then

R^t R = (Y-AX)^t(Y-AX)
= (Y^t-(AX)^t)(Y-AX)
= (Y^t - X^tA^t)(Y-AX)
= Y^t(Y-AX) - X^tA^t(Y-AX)
= Y^tY - Y^tAX - X^tA^tY + X^tA^tAX  .
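This expansion can be checked numerically.  A quick sketch using NumPy
(my choice of library; the data and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = np.column_stack([np.ones(n), rng.normal(size=n)])  # columns [1, x_k]
Y = rng.normal(size=n)
X = np.array([0.5, 2.0])                               # X = [b, m]^t

R = Y - A @ X
lhs = R @ R                                            # R^t R (a scalar)
rhs = Y @ Y - Y @ A @ X - X @ A.T @ Y + X @ A.T @ A @ X
assert np.isclose(lhs, rhs)
```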

Remember that X = [b m]^t is the matrix of unknowns; from the last
line, we see that the expression is what you might call a matrix
quadratic, which we want to minimize.

That should bring to mind a similar problem in ordinary algebra, of
minimizing a quadratic expression of one variable

ax^2 + bx + c

where a > 0.  The usual approach is to complete the square: write

a(x^2 + (b/a)x  ) + c ,

and noting that (x+h)^2 = x^2 + 2hx + h^2, we see that in order for
the parenthesized expression to be a perfect square, we must have

a(x^2 + (b/a)x + b^2/(4a^2)) = a(x + b/(2a))^2

as the first term.  This has added a*(b^2/(4a^2)) = b^2/(4a) to the
expression, so to keep the same value we must also subtract that amount:

a(x^2 + (b/a)x + b^2/(4a^2)) + c - b^2/(4a)

which is

a(x + b/(2a))^2 + (4ac - b^2)/(4a) .

Since the second term does not depend upon x, and the first term is a
times a perfect square (and a > 0), the expression is clearly
minimized when x = -b/(2a).
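A quick numerical sanity check of that conclusion (a Python sketch;
the coefficients are arbitrary choices of mine):

```python
# Verify that a*x^2 + b*x + c with a > 0 is smallest at x = -b/(2a).
a, b, c = 2.0, -8.0, 3.0

def f(x):
    return a * x * x + b * x + c

x_min = -b / (2 * a)   # the vertex found by completing the square
# Sample points around the vertex; none should do better than f(x_min).
samples = [x_min + k / 10.0 for k in range(-30, 31)]
assert all(f(x) >= f(x_min) for x in samples)
print(x_min, f(x_min))
```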

The least squares matrix equation has a "quadratic" term of X^tA^tAX,
which is more like the ordinary quadratic

a^2x^2 + bx + c .

To minimize this expression, we would want to write it in the form

(a(x+h))^2 + k .

We can use similar reasoning to see that

(a(x+h))^2 = a^2x^2 + 2a^2hx + a^2h^2 ,

so we need 2a^2h = b, or h = b/(2a^2), giving

a^2x^2 + bx + c = (a(x + b/(2a^2)))^2 + c - b^2/(4a^2) .

Now we need to choose x = -b/(2a^2) [which is still the negative of
the coefficient of x divided by twice the coefficient of x^2, just as
before].

With this idea in mind, we look at the matrix quadratic equation, and
try to write

(1) Y^tY - Y^tAX - X^tA^tY + X^tA^tAX

in the form

(A(C-X))^t(A(C-X)) + D ,  where  (A(C-X))^t(A(C-X)) = (AC-AX)^t(AC-AX) .

Expanding this squared term gives

(2) (C^tA^t - X^tA^t)(AC-AX) =

C^tA^tAC - C^tA^tAX - X^tA^tAC + X^tA^tAX .

Comparing the two terms with X and X^t in equations
(1) and (2), we see that we must choose C so that

C^tA^tAX = Y^tAX   and  X^tA^tAC = X^tA^tY .

However, since the second equation is just the transpose of the first
equation, this is just one equation, and we must have

X^t(A^tAC) = X^t(A^tY)

or

A^tAC =  A^tY  .

These equations are called the normal equations, and they are a system
of linear equations for C.  If A^tA is invertible, then

C = (A^tA)^(-1)A^tY

will be the value of X which minimizes the sum of squares of the
residuals.
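In code, one would solve the normal equations directly rather than
form the inverse.  A NumPy sketch (the library and the data are my own
choices, not from the discussion above):

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
ys = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

A = np.column_stack([np.ones_like(xs), xs])  # columns [1, x_k]
Y = ys

# Solve A^t A C = A^t Y for C = [b, m]^t.
C = np.linalg.solve(A.T @ A, A.T @ Y)
b, m = C
print("intercept b =", b, " slope m =", m)
```

Solving the system with np.linalg.solve avoids computing (A^tA)^(-1)
explicitly, which is both cheaper and numerically better behaved.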

For

[y_1]      [1 x_1]
[y_2]      [1 x_2]               [b]
[ : ] = Y ,[:  : ] = A , and X = [m] ,
[y_n]      [1 x_n]

we have

A^tA = [ 1   1  ...   1][1 x_1]
       [x_1 x_2 ... x_n][1 x_2]
                        [:  : ]
                        [1 x_n]

     = [   n        SUM x_k     ]
       [ SUM x_k    SUM (x_k)^2 ]

(each sum running from k = 1 to n),

and

A^tY = [ 1   1  ...   1][y_1]
       [x_1 x_2 ... x_n][y_2]
                        [ : ]
                        [y_n]

     = [ SUM y_k     ]
       [ SUM x_k y_k ]

(each sum again running from k = 1 to n),

so you wind up with a system of two equations for the two unknowns
b and m, whose solution gives the standard formulas for the least
squares regression line.
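Written out, that 2x2 system and its solution by Cramer's rule look
like this in plain Python (a sketch with invented data; no libraries
needed):

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(xs)
Sx = sum(xs)
Sy = sum(ys)
Sxx = sum(x * x for x in xs)
Sxy = sum(x * y for x, y in zip(xs, ys))

# The normal equations:  [ n   Sx ] [b]   [Sy ]
#                        [ Sx  Sxx] [m] = [Sxy]
m = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx)
b = (Sy - m * Sx) / n
print("slope m =", m, " intercept b =", b)
```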

If you have any questions, please write back and I will try to explain
further.

- Doctor Fenton, The Math Forum
http://mathforum.org/dr.math/

Date: 08/08/2013 at 08:48:41
From: Alan
Subject: Re: Derivation of Least squares fit

In your response above, you say that absolute values (of differences) are
difficult to work with.

But |a - b| seems a lot easier to calculate than (a - b)^2.
When attempting to fit a line to a set of data points, why isn't it easier
to take the absolute value of the differences instead of calculating their
squares?

Date: 08/08/2013 at 11:12:35
From: Doctor Fenton
Subject: Re: Derivation of Least squares fit

Hi Alan,

Thanks for writing to Dr. Math.

The problem with using the sum of the absolute values of the residuals
(the differences between the actual observed values and the values
predicted by a model) is not the evaluation of these residuals, but the
difficulty in finding the minimum of the sum of the absolute values.

When you use the sum of the squares, you are working with an expression
which is quadratic in the parameters, and it is much easier mathematically
to find the minimum of a quadratic expression than it is to find the
minimum of the sum of the absolute values of the residuals.

The result for absolute values may also not be unique. For example, given
a list of numbers x_1, x_2, ... x_n, consider the simpler problem of
finding the value of M which minimizes the sum of squares ...

 n
SUM (x_k - M)^2
k=1

... and also the problem of finding the value of m which minimizes

 n
SUM |x_k - m|
k=1

The first problem (minimizing the sum of squares) gives M to be the
arithmetic mean of the x_k's, while minimizing the second gives m to
be the median.  However, if n is even, then you can take m to be any
number between the two central observations (not just the mean of the
two central observations, as is conventionally done).
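A small Python sketch of that contrast (my own example data): the sum
of squares is minimized only at the mean, while the sum of absolute
values attains its minimum across the whole interval between the two
middle values.

```python
xs = [1.0, 2.0, 7.0, 10.0]   # n is even; middle values are 2 and 7

def sse(M):
    return sum((x - M) ** 2 for x in xs)

def sae(m):
    return sum(abs(x - m) for x in xs)

mean = sum(xs) / len(xs)     # 5.0
# No sampled point beats the mean for the sum of squares.
assert all(sse(mean) <= sse(t / 10.0) for t in range(0, 121))

# Every m between 2 and 7 gives the same minimal sum of absolute values.
print(sae(2.0), sae(4.5), sae(7.0))
```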

For example, in the simplest case, where we have two values x_1 = a and
x_2 = b, then

(x - a)^2 + (x - b)^2 = 2x^2 - (2a + 2b)x + a^2 + b^2

You can complete the square to show

= 2[x - (a + b)/2]^2 + {a^2 + b^2 - (a + b)^2/2}

From this formula, it is clear that the smallest possible value of the
expression occurs when x = (a + b)/2, the mean of a and b.

However, assuming that a < b, if we compute ...

|x - a| + |x - b|,

... then

for x < a,        the formula becomes (a + b) - 2x  (a downward sloping line)

for a <= x <= b,  the formula becomes b - a         (a constant)

for x > b,        the formula becomes 2x - (a + b)  (an upward sloping line)

The graph looks like this:

\                    /
\                  /
\________________/

So any point between a and b gives the minimum value of
|x - a| + |x - b|, since the bottom of the graph is a horizontal line
segment.

If you have any questions, please write back and I will try to explain
further.

- Doctor Fenton, The Math Forum

Associated Topics:
High School Linear Algebra
