Why Does the Least Squares Method Work for Line of Best Fit?
Date: 10/05/2007 at 15:11:51
From: Henry
Subject: Derivation of least squares fit

Hi, my son is doing line of best fit for HS Algebra. I have an MS in EE, so I'm in love with math. Is there a place that explains the DERIVATION of the formula, rather than just saying "Use this equation to solve for M and C in y = Mx + C"? Thanks.
Date: 10/13/2007 at 10:40:48
From: Doctor Fenton
Subject: Re: Derivation of least squares fit

Hi Henry,

Thanks for writing to Dr. Math. Finding the "best" solution of an overdetermined (and inconsistent!) system of linear equations requires a specification of the criterion for "best". For example, to determine the best-fitting line to a list of points (x_1,y_1), (x_2,y_2), ..., (x_n,y_n), we are trying to find a line y = mx + b such that

   y_1 = m*x_1 + b
   y_2 = m*x_2 + b
    :        :
   y_n = m*x_n + b .

If there were just two points, there would be a unique solution, but given more than two points, there is usually no line through all of them.

Given the parameters m and b which determine a line y = mx + b, we can describe how much the line fails to fit the points by forming, for each point (x_k,y_k), the "residual"

   r_k = y_k - (m*x_k + b) .

The residual r_k is the signed vertical distance between the point (x_k,y_k) and the point where the line y = mx + b crosses the vertical line x = x_k. The list of numbers r_1, r_2, ..., r_n characterizes how well the line fits the points, but we would prefer a single number to characterize the fit. If we simply add the residuals, there may be cancellation, and we could get a small number even when the fit is very poor (large positive residuals canceling large negative ones). To prevent the cancellation, we could take absolute values of the residuals, but absolute values are not easy to work with. A more mathematically attractive approach is to square the residuals and find the line for which the sum of the squares of the residuals is a minimum. This technique is called "least squares fitting".

The usual approach is to use the fact that a necessary condition for a minimum of a function of two or more variables is that the gradient of the function be 0, which gives a system of two equations in two unknowns (the slope m and intercept b of the line). However, this approach requires calculus, so it isn't appropriate for an algebra course.
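To make the criterion concrete, here is a small Python sketch (the data points, trial lines, and the function name `sse` are invented for illustration): it computes the sum of squared residuals for a candidate line, which is the single number least squares seeks to minimize.

```python
def sse(m, b, xs, ys):
    """Sum of squared residuals r_k = y_k - (m*x_k + b)."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Made-up data lying near the line y = x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]

# A line close to the data gives a small sum of squares;
# a badly chosen line gives a large one.
good = sse(1.0, 0.0, xs, ys)    # y = x, close to the data
bad = sse(-1.0, 5.0, xs, ys)    # y = -x + 5, a poor fit
print(good, bad)
```

Note that the residuals for the poor line are large in both directions; squaring prevents the cancellation that plain summation would allow.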
While the problem can be analyzed with ordinary algebra alone, doing so is extremely tedious, requiring "industrial strength algebra". A matrix formulation is much more compact and illustrates the utility of the matrix concept and notation. (Essentially, the pure high school algebra approach just writes out all of the matrix algebra below without writing the matrices.) The equations

   y_1 = b + m*x_1
   y_2 = b + m*x_2
    :        :
   y_n = b + m*x_n

can be written in matrix notation as

   [y_1]   [b + m*x_1]   [1  x_1]
   [y_2]   [b + m*x_2]   [1  x_2] [b]
   [ : ] = [    :    ] = [:   : ] [m] .
   [y_n]   [b + m*x_n]   [1  x_n]

If we write

   [y_1]           [1  x_1]
   [y_2]           [1  x_2]                   [b]
   [ : ] = Y ,     [:   : ] = A ,   and   X = [m] ,
   [y_n]           [1  x_n]

then the equation is Y = AX. The matrix of residuals is

   [r_1]   [y_1 - (b + m*x_1)]
   [r_2]   [y_2 - (b + m*x_2)]
   [ : ] = [        :        ]
   [r_n]   [y_n - (b + m*x_n)]

or, in matrix notation, R = Y - AX. The sum of squares of the residuals is

   (r_1)^2 + (r_2)^2 + ... + (r_n)^2
                        [r_1]
                        [r_2]
     = [r_1 r_2 ... r_n][ : ]
                        [r_n]

or, using matrix transpose notation, R^t R. (M^t is the matrix M with rows and columns interchanged.) The quantity to be minimized is then

   R^t R = (Y - AX)^t (Y - AX)
         = (Y^t - (AX)^t)(Y - AX)
         = (Y^t - X^t A^t)(Y - AX)
         = Y^t (Y - AX) - X^t A^t (Y - AX)
         = Y^t Y - Y^t A X - X^t A^t Y + X^t A^t A X .

Remember that X = [b m]^t is the matrix of unknowns, and from the last equation we see that the expression is what you might call a matrix quadratic, which we want to minimize. That should bring to mind a similar problem in ordinary algebra: minimizing a quadratic expression in one variable,

   ax^2 + bx + c ,   where a > 0.

The usual approach is to complete the square: write

   a(x^2 + (b/a)x) + c ,

and, noting that (x+h)^2 = x^2 + 2hx + h^2, we see that in order for the first term to contain a perfect square, we must have

   a(x^2 + (b/a)x + b^2/(4a^2)) = a(x + b/(2a))^2

as the first term.
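The matrix formulation above can be sketched numerically in Python with NumPy (the data points and the trial values of b and m are invented for illustration); the point is that R^t R reproduces the sum of squared residuals computed point by point.

```python
import numpy as np

# Made-up data lying near the line y = x.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([1.1, 1.9, 3.2, 3.8])

# A has a column of ones (for b) and a column of x_k (for m).
A = np.column_stack([np.ones_like(xs), xs])
Y = ys.reshape(-1, 1)

# X = [b, m]^t for a trial line y = 0 + 1*x.
X = np.array([[0.0], [1.0]])

# R = Y - AX, and R^t R is the sum of squared residuals.
R = Y - A @ X
sum_sq = float((R.T @ R)[0, 0])
print(sum_sq)
```

Changing the entries of X changes `sum_sq`; the least squares line is the X that makes it as small as possible.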
This has added a*(b^2/(4a^2)) = b^2/(4a) to the expression, so to keep the same value we must also subtract that amount:

   a(x^2 + (b/a)x + b^2/(4a^2)) + c - b^2/(4a) ,

which is

   a(x + b/(2a))^2 + (4ac - b^2)/(4a) .

Since the second term does not depend upon x, and the first term is a times a perfect square (and a > 0), the expression is clearly minimized when x = -b/(2a).

The least squares matrix equation has a "quadratic" term of X^t A^t A X, which is more like the ordinary quadratic

   a^2x^2 + bx + c .

To minimize this expression, we would want to write it in the form (a(x+h))^2 + k. We can use similar reasoning: since

   (a(x+h))^2 = a^2x^2 + 2a^2hx + a^2h^2 ,

we need 2a^2h = b, or h = b/(2a^2), giving

   a^2x^2 + bx + c = (a(x + b/(2a^2)))^2 + c - b^2/(4a^2) .

Now the minimum occurs at x = -b/(2a^2) [which is still the negative of the coefficient of x divided by twice the coefficient of x^2, just as before].

With this idea in mind, we look at the matrix quadratic and try to write

   (1)  Y^t Y - Y^t A X - X^t A^t Y + X^t A^t A X

in the form

   (A(C-X))^t (A(C-X)) + D = (AC - AX)^t (AC - AX) + D .

Expanding the quadratic part gives

   (2)  (C^t A^t - X^t A^t)(AC - AX)
          = C^t A^t A C - C^t A^t A X - X^t A^t A C + X^t A^t A X .

(The terms with no X are matched by choosing D = Y^t Y - C^t A^t A C.) Comparing the terms containing X and X^t in equations (1) and (2), we see that we must choose C so that

   C^t A^t A X = Y^t A X   and   X^t A^t A C = X^t A^t Y .

However, since the second equation is just the transpose of the first, this is really one equation, and we must have

   X^t (A^t A C) = X^t (A^t Y) ,   or   A^t A C = A^t Y .

These equations are called the normal equations, and they are a system of linear equations for C. If A^t A is invertible, then

   C = (A^t A)^(-1) A^t Y

will be the value of X which minimizes the sum of squares of the residuals. For

   [y_1]           [1  x_1]
   [y_2]           [1  x_2]                   [b]
   [ : ] = Y ,     [:   : ] = A ,   and   X = [m] ,
   [y_n]           [1  x_n]

we have

                               [1  x_1]
   A^t A = [ 1    1  ...  1  ] [1  x_2]
           [x_1  x_2 ... x_n ] [:   : ]
                               [1  x_n]

            [    n         SUM x_k     ]
         =  [ SUM x_k    SUM (x_k)^2   ]

(all sums running from k = 1 to n), and

                               [y_1]
   A^t Y = [ 1    1  ...  1  ] [y_2]
           [x_1  x_2 ... x_n ] [ : ]
                               [y_n]

            [  SUM y_k     ]
         =  [ SUM x_k y_k  ] ,

so you wind up with a system of two equations for the two unknowns b and m, whose solution gives the standard formulas for the least squares regression line.

If you have any questions, please write back and I will try to explain further.

- Doctor Fenton, The Math Forum
  http://mathforum.org/dr.math/
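As a numerical sketch of this final step (the data points are invented for illustration), one can solve the normal equations A^t A C = A^t Y directly and check the result against the summation formulas obtained by solving the 2x2 system by hand:

```python
import numpy as np

# Made-up data lying near the line y = x.
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = np.array([1.1, 1.9, 3.2, 3.8])

# Build A and solve the normal equations A^t A C = A^t Y for C = [b, m]^t.
A = np.column_stack([np.ones_like(xs), xs])
b, m = np.linalg.solve(A.T @ A, A.T @ ys)

# The standard formulas, from solving the 2x2 system by hand:
n = len(xs)
Sx, Sy = xs.sum(), ys.sum()
Sxx, Sxy = (xs * xs).sum(), (xs * ys).sum()
m2 = (n * Sxy - Sx * Sy) / (n * Sxx - Sx * Sx)
b2 = (Sy - m2 * Sx) / n

print(b, m)    # both routes give m ≈ 0.94, b ≈ 0.15 for this data
```

Both computations are the same normal equations; the explicit formulas are just the matrix solution written out entry by entry.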
© 1994-2013 The Math Forum