Linear regression is used to find linear relationships in data. In a set of data, it is the line (or plane) of best fit. You have probably seen linear regression in a scientific or financial graph where there is a set of plotted points and a line through the points.
To use linear regression, you need data with one or more features that are being used to predict a 1-dimensional outcome. For example, you could use location (2 coordinates), square feet and number of bathrooms to predict the price of a house. You will need at least one data point for each feature.
Suppose you have data with $n$ features that you use to predict a real-valued outcome, and suppose you have $m \ge n$ data points. Data point $d_i$ will have feature values $(a_{i1}, a_{i2}, \dots, a_{in})$ and outcome $b_i$. To approximate the data points with a linear function, you need values $x_1, x_2, \dots, x_n$ such that
$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\
&\ \ \vdots \\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m
\end{aligned}$$
Once you have found $x_1, x_2, \dots, x_n$, then given a data point $d = (c_1, c_2, \dots, c_n)$, you can predict that the outcome is
$$c_1 x_1 + c_2 x_2 + \cdots + c_n x_n$$
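This prediction is just a dot product of the data point's features with the fitted coefficients. A minimal sketch in Python (the function name and the example numbers are made up for illustration):

```python
def predict(coefficients, features):
    """Predict the outcome c1*x1 + c2*x2 + ... + cn*xn."""
    return sum(c * f for c, f in zip(coefficients, features))

# Hypothetical coefficients x = (2.0, -1.0, 0.5) and data point d = (3.0, 4.0, 2.0)
# give the prediction 2*3 - 1*4 + 0.5*2 = 3.0.
print(predict([2.0, -1.0, 0.5], [3.0, 4.0, 2.0]))  # 3.0
```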
Rewriting the formula for the known data in matrix notation, we want to solve for $x_1, x_2, \dots, x_n$ in the equation
$$\begin{bmatrix}
a_{11} & a_{12} & \dots & a_{1n} \\
a_{21} & a_{22} & \dots & a_{2n} \\
\vdots & & & \vdots \\
a_{m1} & a_{m2} & \dots & a_{mn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}$$
Writing the $(a_{ij})$ matrix as $A$, the $(x_i)$ vector as $\vec{x}$, and the $(b_i)$ vector as $\vec{b}$, we can write this as
$$A\vec{x} = \vec{b}$$
Since $m \ge n$, the system will have either 0 or 1 solutions, provided the columns of $A$ are linearly independent. When $m > n$, there will most likely be 0 solutions.
In the case that there is exactly 1 solution $\vec{x}$, that is the solution you should use to estimate outcomes.
In the case that there are 0 solutions, linear regression says you should use the vector $\hat{x}$ which minimizes the squared error. Here, given a candidate vector $\vec{x}$, the error term for the $i$th data point is $e_i(\vec{x})$, defined by
$$e_i(\vec{x}) = (a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n) - b_i$$
and the squared error of the vector $\vec{x} = (x_1, x_2, \dots, x_n)$ is the sum of the squared errors over all data points:
$$e_1(\vec{x})^2 + e_2(\vec{x})^2 + \cdots + e_m(\vec{x})^2$$
Since the squared error is a positive-definite quadratic function of $\vec{x}$, there is exactly one minimum, $\hat{x}$. It is known from linear algebra that this minimum is
$$\hat{x} = (A^\intercal A)^{-1} A^\intercal \vec{b}$$
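Numerically, computing $(A^\intercal A)^{-1} A^\intercal \vec{b}$ is the same as solving the normal equations $A^\intercal A \hat{x} = A^\intercal \vec{b}$. Here is a pure-Python sketch of that approach (the helper names are my own, and the elimination routine is a plain Gauss-Jordan solver, not a production method):

```python
def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def solve(M, v):
    """Solve the square system M x = v by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(M)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares(A, b):
    """Find x-hat by solving the normal equations A^T A x = A^T b."""
    At = transpose(A)
    AtA = matmul(At, A)
    Atb = [sum(At[i][j] * b[j] for j in range(len(b))) for i in range(len(At))]
    return solve(AtA, Atb)

# Four points lying exactly on y = 2a + 1, with a constant feature appended:
# the least-squares fit recovers slope 2 and intercept 1.
print(least_squares([[0, 1], [1, 1], [2, 1], [3, 1]], [1, 3, 5, 7]))
```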
In the case that you have one input and one output, you are trying to find the line of best fit. You need $m > 1$, i.e. $m \ge 2$ data points. Data point $d_i$ will have feature value $a_i$ and outcome $b_i$.
The slope-intercept form of the equation of a line is
$$y = mx + b$$
To match the notation used above, we will replace $m$ by $x_1$ and $b$ by $x_2$. So, to find an outcome that best fits the data, we need to find a slope $x_1$ and a $y$-intercept $x_2$. To incorporate the constant into the equation, we can add a constant feature that has no effect on the outcome: this feature always has value 1. So, data point $d_i$ will have features $(a_i, 1)$ and outcome $b_i$. (You can use this trick to incorporate a constant in any number of dimensions.)
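The constant-feature trick can be sketched in one line of Python (the helper name is my own):

```python
def add_constant_feature(points):
    """Append the constant feature 1 to each data point."""
    return [list(p) + [1] for p in points]

# Each 1-feature point (a_i,) becomes (a_i, 1).
print(add_constant_feature([[4], [7], [9]]))  # [[4, 1], [7, 1], [9, 1]]
```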
We want to find values $x_1$ and $x_2$ such that
$$\begin{aligned}
x_1 a_1 + x_2 &= b_1 \\
x_1 a_2 + x_2 &= b_2 \\
&\ \ \vdots \\
x_1 a_m + x_2 &= b_m
\end{aligned}$$
Written in matrix form, this is
$$\begin{bmatrix} a_1 & 1 \\ a_2 & 1 \\ \vdots & \vdots \\ a_m & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}$$
The least-squares solution is $\hat{x} = (A^\intercal A)^{-1} A^\intercal \vec{b}$. In this case,
$$A^\intercal A =
\begin{bmatrix} a_1 & a_2 & \dots & a_m \\ 1 & 1 & \dots & 1 \end{bmatrix}
\begin{bmatrix} a_1 & 1 \\ a_2 & 1 \\ \vdots & \vdots \\ a_m & 1 \end{bmatrix}
=
\begin{bmatrix} a_1^2 + a_2^2 + \cdots + a_m^2 & a_1 + a_2 + \cdots + a_m \\ a_1 + a_2 + \cdots + a_m & m \end{bmatrix}$$
We multiply and divide by $m$ so that the entries can be written more concisely. Namely, the mean of the squares of the data is
$$\overline{a^2} = \frac{a_1^2 + a_2^2 + \cdots + a_m^2}{m}$$
and the mean of the data is
$$\bar{a} = \frac{a_1 + a_2 + \cdots + a_m}{m}$$
Using this notation, we have
$$A^\intercal A = m \begin{bmatrix} \overline{a^2} & \bar{a} \\ \bar{a} & 1 \end{bmatrix}$$
Therefore the inverse is
$$(A^\intercal A)^{-1} = \frac{1}{m\left(\overline{a^2} - \bar{a}^2\right)} \begin{bmatrix} 1 & -\bar{a} \\ -\bar{a} & \overline{a^2} \end{bmatrix}$$
Let $\sigma_a^2$ be the variance of the data, so
$$\sigma_a^2 = \overline{a^2} - \bar{a}^2$$
Then
$$(A^\intercal A)^{-1} = \frac{1}{m\sigma_a^2} \begin{bmatrix} 1 & -\bar{a} \\ -\bar{a} & \overline{a^2} \end{bmatrix}$$
The rest of the expression is
$$A^\intercal \vec{b} =
\begin{bmatrix} a_1 & a_2 & \dots & a_m \\ 1 & 1 & \dots & 1 \end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}
=
\begin{bmatrix} a_1 b_1 + a_2 b_2 + \cdots + a_m b_m \\ b_1 + b_2 + \cdots + b_m \end{bmatrix}
= m \begin{bmatrix} \overline{ab} \\ \bar{b} \end{bmatrix}$$
Therefore, writing $\operatorname{Cov}(a,b) = \overline{ab} - \bar{a}\,\bar{b}$ for the covariance of the data,
$$\hat{x} = (A^\intercal A)^{-1} A^\intercal \vec{b}
= \frac{1}{m\sigma_a^2} \begin{bmatrix} 1 & -\bar{a} \\ -\bar{a} & \overline{a^2} \end{bmatrix} \cdot m \begin{bmatrix} \overline{ab} \\ \bar{b} \end{bmatrix}
= \frac{1}{\sigma_a^2} \begin{bmatrix} \overline{ab} - \bar{a}\,\bar{b} \\ -\bar{a}\,\overline{ab} + \overline{a^2}\,\bar{b} \end{bmatrix}
= \frac{1}{\sigma_a^2} \begin{bmatrix} \operatorname{Cov}(a,b) \\ -\bar{a}\left(\operatorname{Cov}(a,b) + \bar{a}\,\bar{b}\right) + \left(\sigma_a^2 + \bar{a}^2\right)\bar{b} \end{bmatrix}
= \begin{bmatrix} \operatorname{Cov}(a,b)/\sigma_a^2 \\ \bar{b} - \bar{a}\operatorname{Cov}(a,b)/\sigma_a^2 \end{bmatrix}$$
This solves for the slope and $y$-intercept:
$$\text{Slope: } x_1 = \operatorname{Cov}(a,b)/\sigma_a^2 \qquad \text{$y$-intercept: } x_2 = \bar{b} - \bar{a}\operatorname{Cov}(a,b)/\sigma_a^2$$
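The closed-form slope and intercept above can be computed directly from the means, covariance, and variance of the data. A pure-Python sketch (the function name is my own):

```python
def fit_line(a, b):
    """Fit y = x1 * a + x2 by the closed-form least-squares solution:
    slope x1 = Cov(a, b) / var(a), intercept x2 = mean(b) - mean(a) * x1."""
    m = len(a)
    mean_a = sum(a) / m
    mean_b = sum(b) / m
    cov_ab = sum(ai * bi for ai, bi in zip(a, b)) / m - mean_a * mean_b
    var_a = sum(ai * ai for ai in a) / m - mean_a ** 2
    slope = cov_ab / var_a
    intercept = mean_b - mean_a * slope
    return slope, intercept

# Points lying exactly on y = 2a + 1 recover slope 2 and intercept 1.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0)
```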
One way to measure how well the line fits the data is with Pearson's correlation coefficient, denoted $r$. The value $r$ measures the linear correlation between the $x$ and $y$ values, and can be computed as
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}$$
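A short Python sketch of this computational formula for $r$ (the function name is my own):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient via the computational formula
    r = (n*sum(xy) - sum(x)*sum(y))
        / sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Points lying exactly on a line with positive slope give r = 1.
print(pearson_r([0, 1, 2, 3], [1, 3, 5, 7]))  # 1.0
```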