Linear regression is used to find linear relationships in data. In a set of data, it is the line (or plane) of best fit. You have probably seen linear regression in a scientific or financial graph where there is a set of plotted points and a line through the points.
To use linear regression, you need data with one or more features that are being used to predict a 1-dimensional outcome. For example, you could use location (2 coordinates), square feet and number of bathrooms to predict the price of a house. You will need at least one data point for each feature.
Suppose you have data with $n$ features that you use to predict a real-valued outcome, and suppose you have $m \ge n$ data points. Data point $d_i$ will have feature values $(a_{i1}, a_{i2}, \dots, a_{in})$ and outcome $b_i$. To approximate the data points with a linear function, you need values $x_1, x_2, \dots, x_n$ such that
$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= b_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= b_2 \\
&\ \ \vdots \\
a_{m1}x_1 + a_{m2}x_2 + \cdots + a_{mn}x_n &= b_m
\end{aligned}$$
Once you have found $x_1, x_2, \dots, x_n$, then given a data point $d = (c_1, c_2, \dots, c_n)$, you can predict that the outcome is
$$c_1 x_1 + c_2 x_2 + \cdots + c_n x_n$$
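This prediction is just a dot product of the data point's features with the fitted coefficients. A minimal sketch in Python (the function name and the example numbers are made up for illustration):

```python
def predict(coefficients, features):
    """Predict the outcome c1*x1 + c2*x2 + ... + cn*xn."""
    return sum(c * f for c, f in zip(coefficients, features))

# Hypothetical coefficients x = (2.0, -1.0, 0.5) and data point d = (3.0, 4.0, 2.0)
# give the prediction 2*3 - 1*4 + 0.5*2 = 3.0.
print(predict([2.0, -1.0, 0.5], [3.0, 4.0, 2.0]))  # 3.0
```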
Rewriting the formula for the known data in matrix notation, we want to solve for $x_1, x_2, \dots, x_n$ in the equation
$$\begin{bmatrix}
a_{11} & a_{12} & \dots & a_{1n} \\
a_{21} & a_{22} & \dots & a_{2n} \\
\vdots & & & \vdots \\
a_{m1} & a_{m2} & \dots & a_{mn}
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}$$
Writing the $(a_{ij})$ matrix as $A$, the $(x_i)$ vector as $\vec{x}$, and the $(b_i)$ vector as $\vec{b}$, we can write this as
$$A\vec{x} = \vec{b}$$
Since $m \ge n$, the system will have either 0 or 1 solutions, provided the columns of $A$ are linearly independent. When $m > n$, there will most likely be 0 solutions.
In the case that there is exactly 1 solution $\vec{x}$, that is the solution you should use to estimate outcomes.
In the case that there are 0 solutions, linear regression says you should use the vector $\hat{x}$ which minimizes the squared error. Here, given a candidate vector $\vec{x}$, the error term for the $i$th data point is $e_i(\vec{x})$, defined by
$$e_i(\vec{x}) = (a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{in}x_n) - b_i$$
and the squared error of the vector $\vec{x} = (x_1, x_2, \dots, x_n)$ is the sum of the squared errors over all data points:
$$e_1(\vec{x})^2 + e_2(\vec{x})^2 + \cdots + e_m(\vec{x})^2$$
Since the squared error is a positive-definite quadratic function of $\vec{x}$, there is exactly one minimum, $\hat{x}$. It is known from linear algebra that this minimum is
$$\hat{x} = (A^\intercal A)^{-1} A^\intercal \vec{b}$$
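Numerically, computing $(A^\intercal A)^{-1} A^\intercal \vec{b}$ is the same as solving the normal equations $A^\intercal A \hat{x} = A^\intercal \vec{b}$. Here is a pure-Python sketch of that approach (the helper names are my own, and the elimination routine is a plain Gauss-Jordan solver, not a production method):

```python
def transpose(M):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*M)]

def matmul(M, N):
    """Multiply two matrices given as lists of rows."""
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def solve(M, v):
    """Solve the square system M x = v by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(M)
    aug = [row[:] + [v[i]] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def least_squares(A, b):
    """Find x-hat by solving the normal equations A^T A x = A^T b."""
    At = transpose(A)
    AtA = matmul(At, A)
    Atb = [sum(At[i][j] * b[j] for j in range(len(b))) for i in range(len(At))]
    return solve(AtA, Atb)

# Four points lying exactly on y = 2a + 1, with a constant feature appended:
# the least-squares fit recovers slope 2 and intercept 1.
print(least_squares([[0, 1], [1, 1], [2, 1], [3, 1]], [1, 3, 5, 7]))
```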
In the case that you have one input and one output, you are trying to find the line of best fit. You need $m > 1$, i.e. $m \ge 2$ data points. Data point $d_i$ will have feature value $a_i$ and outcome $b_i$.
The slope-intercept form of the equation of a line is
$$y = mx + b$$
To match the notation used above, we will replace $m$ by $x_1$ and $b$ by $x_2$. So, to find an outcome that best fits the data, we need to find a slope $x_1$ and a $y$-intercept $x_2$. To incorporate the constant into the equation, we can add a constant feature that has no effect on the outcome: this feature always has value 1. So, data point $d_i$ will have features $(a_i, 1)$ and outcome $b_i$. (You can use this trick to incorporate a constant in any number of dimensions.)
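The constant-feature trick can be sketched in one line of Python (the helper name is my own):

```python
def add_constant_feature(points):
    """Append the constant feature 1 to each data point."""
    return [list(p) + [1] for p in points]

# Each 1-feature point (a_i,) becomes (a_i, 1).
print(add_constant_feature([[4], [7], [9]]))  # [[4, 1], [7, 1], [9, 1]]
```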
We want to find values $x_1$ and $x_2$ such that
$$\begin{aligned}
x_1 a_1 + x_2 &= b_1 \\
x_1 a_2 + x_2 &= b_2 \\
&\ \ \vdots \\
x_1 a_m + x_2 &= b_m
\end{aligned}$$
Written in matrix form, this is
$$\begin{bmatrix} a_1 & 1 \\ a_2 & 1 \\ \vdots & \vdots \\ a_m & 1 \end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \end{bmatrix}
=
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}$$
The least-squares solution is $\hat{x} = (A^\intercal A)^{-1} A^\intercal \vec{b}$. In this case,
$$A^\intercal A =
\begin{bmatrix} a_1 & a_2 & \dots & a_m \\ 1 & 1 & \dots & 1 \end{bmatrix}
\begin{bmatrix} a_1 & 1 \\ a_2 & 1 \\ \vdots & \vdots \\ a_m & 1 \end{bmatrix}
=
\begin{bmatrix} a_1^2 + a_2^2 + \cdots + a_m^2 & a_1 + a_2 + \cdots + a_m \\ a_1 + a_2 + \cdots + a_m & m \end{bmatrix}$$
We multiply and divide by $m$ so that the entries can be written more concisely. Namely, the mean of the squares of the data is
$$\overline{a^2} = \frac{a_1^2 + a_2^2 + \cdots + a_m^2}{m}$$
and the mean of the data is
$$\bar{a} = \frac{a_1 + a_2 + \cdots + a_m}{m}$$
Using this notation, we have
$$A^\intercal A = m \begin{bmatrix} \overline{a^2} & \bar{a} \\ \bar{a} & 1 \end{bmatrix}$$
Therefore the inverse is
$$(A^\intercal A)^{-1} = \frac{1}{m\left(\overline{a^2} - \bar{a}^2\right)} \begin{bmatrix} 1 & -\bar{a} \\ -\bar{a} & \overline{a^2} \end{bmatrix}$$
Let $\sigma_a^2$ be the variance of the data, so
$$\sigma_a^2 = \overline{a^2} - \bar{a}^2$$
Then
$$(A^\intercal A)^{-1} = \frac{1}{m\sigma_a^2} \begin{bmatrix} 1 & -\bar{a} \\ -\bar{a} & \overline{a^2} \end{bmatrix}$$
The rest of the expression is
$$A^\intercal \vec{b} =
\begin{bmatrix} a_1 & a_2 & \dots & a_m \\ 1 & 1 & \dots & 1 \end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_m \end{bmatrix}
=
\begin{bmatrix} a_1 b_1 + a_2 b_2 + \cdots + a_m b_m \\ b_1 + b_2 + \cdots + b_m \end{bmatrix}
= m \begin{bmatrix} \overline{ab} \\ \bar{b} \end{bmatrix}$$
Therefore, writing $\operatorname{Cov}(a,b) = \overline{ab} - \bar{a}\,\bar{b}$ for the covariance of the data,
$$\hat{x} = (A^\intercal A)^{-1} A^\intercal \vec{b}
= \frac{1}{m\sigma_a^2} \begin{bmatrix} 1 & -\bar{a} \\ -\bar{a} & \overline{a^2} \end{bmatrix} \cdot m \begin{bmatrix} \overline{ab} \\ \bar{b} \end{bmatrix}
= \frac{1}{\sigma_a^2} \begin{bmatrix} \overline{ab} - \bar{a}\,\bar{b} \\ -\bar{a}\,\overline{ab} + \overline{a^2}\,\bar{b} \end{bmatrix}
= \frac{1}{\sigma_a^2} \begin{bmatrix} \operatorname{Cov}(a,b) \\ -\bar{a}\left(\operatorname{Cov}(a,b) + \bar{a}\,\bar{b}\right) + \left(\sigma_a^2 + \bar{a}^2\right)\bar{b} \end{bmatrix}
= \begin{bmatrix} \operatorname{Cov}(a,b)/\sigma_a^2 \\ \bar{b} - \bar{a}\operatorname{Cov}(a,b)/\sigma_a^2 \end{bmatrix}$$
This solves for the slope and $y$-intercept:
$$\text{Slope: } x_1 = \operatorname{Cov}(a,b)/\sigma_a^2 \qquad \text{$y$-intercept: } x_2 = \bar{b} - \bar{a}\operatorname{Cov}(a,b)/\sigma_a^2$$
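The closed-form slope and intercept above can be computed directly from the means, covariance, and variance of the data. A pure-Python sketch (the function name is my own):

```python
def fit_line(a, b):
    """Fit y = x1 * a + x2 by the closed-form least-squares solution:
    slope x1 = Cov(a, b) / var(a), intercept x2 = mean(b) - mean(a) * x1."""
    m = len(a)
    mean_a = sum(a) / m
    mean_b = sum(b) / m
    cov_ab = sum(ai * bi for ai, bi in zip(a, b)) / m - mean_a * mean_b
    var_a = sum(ai * ai for ai in a) / m - mean_a ** 2
    slope = cov_ab / var_a
    intercept = mean_b - mean_a * slope
    return slope, intercept

# Points lying exactly on y = 2a + 1 recover slope 2 and intercept 1.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (2.0, 1.0)
```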
One way to measure how well the line fits the data is with Pearson's correlation coefficient, denoted $r$. The value $r$ measures the linear correlation between the $x$ and $y$ values, and can be computed as
$$r = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{\left(n\sum x_i^2 - \left(\sum x_i\right)^2\right)\left(n\sum y_i^2 - \left(\sum y_i\right)^2\right)}}$$
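A short Python sketch of this computational formula for $r$ (the function name is my own):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient via the computational formula
    r = (n*sum(xy) - sum(x)*sum(y))
        / sqrt((n*sum(x^2) - sum(x)^2) * (n*sum(y^2) - sum(y)^2))."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

# Points lying exactly on a line with positive slope give r = 1.
print(pearson_r([0, 1, 2, 3], [1, 3, 5, 7]))  # 1.0
```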