Linear Regression#

Problem#

Given the training dataset \(x_i\in\mathbb{R}\), \(y_i\in\mathbb{R}\), \(i= 1,2,..., N\), we want to find the linear function

\[ y\approx f(x) = wx + b \]

that fits the relationship between \(x_i\) and \(y_i\). Then, given any \(x\), we can make the prediction

\[ \hat{y} = w x+b \]
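As a quick illustration (with hypothetical parameter values \(w\) and \(b\), not fitted from data; the fitted values come later), the prediction is a one-line computation:

import numpy as np

# Hypothetical parameters, for illustration only
w, b = 2.0, 1.0

x_new = np.array([0.5, 1.0, 1.5])
y_hat = w * x_new + b   # predictions y_hat = w x + b
print(y_hat)            # [2. 3. 4.]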

Loss Function and Optimization#

With the training dataset, we define the loss function \(L(w,b)\) of the parameters \(w\) and \(b\), also called the mean squared error (MSE):

\[L(w,b)=\frac{1}{N}\sum_{i=1}^N\big(\hat{y}^{(i)}-y_i\big)^2=\frac{1}{N}\sum_{i=1}^N\big((wx_i+b)-y_i\big)^2,\]

where \(\hat{y}^{(i)}\) denotes the predicted value of \(y\) at \(x_i\), i.e. \(\hat{y}^{(i)} = wx_i+b\).

Our goal is to find the optimal \(w\) and \(b\) that minimize the loss function \(L(w,b)\), i.e.

\[\min_{w,b} L(w,b)\]

Since \(L\) is a differentiable (in fact convex quadratic) function of \(w\) and \(b\), we can solve \(\partial_{w}L = \partial_{b}L = 0\) analytically, which yields

\[w^* = \frac{\sum_{i=1}^{N} (x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i=1}^{N} (x_i - \bar{X})^2} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}\]
\[b^* = \bar{Y} - w^*\bar{X}\]

where \(\bar{X}\) and \(\bar{Y}\) are the means of \(x\) and \(y\), \(\text{Cov}(X,Y)\) denotes the sample covariance between \(X\) and \(Y\), and \(\text{Var}(X)\) denotes the sample variance of \(X\).
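As a minimal sketch of this closed-form solution (using a small hypothetical dataset, not the one generated later in this notebook):

import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.2, 2.9, 5.1, 7.2])   # roughly y = 2x + 1 with noise

x_bar, y_bar = x.mean(), y.mean()
w_star = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b_star = y_bar - w_star * x_bar
print(w_star, b_star)   # w* ≈ 2.02, b* ≈ 1.07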

Evaluating the model#

  • MSE: A smaller MSE indicates better performance.

Note that MSE depends on the units of \(y\): for example, if we replace \(y\) in meters by \(y\) in millimeters, then the MSE becomes \(1000^2\) times the original MSE (see the sketch below).

  • \(R^{2}\) (coefficient of determination): A larger \(R^{2}\) (closer to 1) indicates better performance. Compared with MSE, \(R^{2}\) is dimensionless and does not depend on the units of the variables.

\[R^{2} = 1 - \frac{\sum_{i=1}^{N}(y_i-\hat{y}^{(i)})^{2}}{\sum_{i=1}^{N}(y_i-\bar{y})^{2}} = 1 - \frac{\frac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}^{(i)})^{2}}{\frac{1}{N}\sum_{i=1}^{N}(y_i-\bar{y})^{2}} = 1 - \frac{\text{MSE}}{\text{Var}(Y)}\]

Intuitively, MSE is the “unexplained” variance, and \(R^{2}\) is the proportion of the variance that is explained by the model: if our model is perfect, then MSE is 0 and \(R^{2}\) is 1; if we use a constant model, then the best prediction is the mean of \(y\), so MSE equals the variance of \(y\) and \(R^{2}\) is 0.
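The following is a minimal sketch (using small hypothetical arrays, not the dataset generated below) that computes both metrics and illustrates that converting \(y\) from meters to millimeters multiplies the MSE by \(1000^2\) while leaving \(R^{2}\) unchanged:

import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])   # targets, in meters
y_pred = np.array([1.1, 1.9, 3.2, 3.9])   # hypothetical predictions, in meters

def mse(y, yhat):
    return np.mean((y - yhat) ** 2)

def r2(y, yhat):
    return 1 - mse(y, yhat) / np.mean((y - np.mean(y)) ** 2)

print(mse(y_true, y_pred), r2(y_true, y_pred))   # ≈ 0.0175, ≈ 0.986
# Same data expressed in millimeters: MSE scales by 1000**2, R^2 is unchanged
print(mse(1e3 * y_true, 1e3 * y_pred), r2(1e3 * y_true, 1e3 * y_pred))   # ≈ 17500, ≈ 0.986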

import numpy as np
import matplotlib.pyplot as plt

class myLinearRegression:
    '''
    The single-variable linear regression estimator.
    This serves as an example of the regression models from sklearn, with methods fit, predict, and score.
    '''
    def __init__(self):
        # Parameters are initialized to None and set by fit()
        self.w = None
        self.b = None
    
    def fit(self, x, y):
        # covariance matrix, 
        # bias = True makes the factor 1/N, otherwise 1/(N-1)
        # but it doesn't matter here, since the factor will be cancelled out in the calculation of w
        
        cov_mat = np.cov(x, y, bias=True)
        # cov_mat[0, 1] is the covariance of x and y, and cov_mat[0, 0] is the variance of x

        self.w = cov_mat[0, 1] / cov_mat[0, 0]
        self.b = np.mean(y) - self.w * np.mean(x)

        # :.3f means 3 decimal places
        print(f'w = {self.w:.3f}, b = {self.b:.3f}')

    def predict(self, x):
        '''
        Predict the output values for the input value x, based on trained parameters

        '''
        ypred = self.w * x + self.b
        return ypred

    def score(self, x, y):
        '''
        Calculate the R^2 score of the model
        '''
        mse =  np.mean((y - self.predict(x))**2)
        var = np.mean((y - np.mean(y))**2)
        Rsquare = 1 - mse / var
        return Rsquare
# Generate synthetic data

np.random.seed(0) # for reproducibility

a = 0
b = 2
N = 20

x = np.random.uniform(a,b,N)
y = 2 * x + 1 + np.random.randn(N)
y_gt = 2 * x + 1

# Fit the model
lm = myLinearRegression()
lm.fit(x, y)
score = lm.score(x, y)


# plot data
plt.plot(x, y, 'o', label='data')
# plot ground truth
plt.plot(x, y_gt, label='ground truth')


# plot the linear regression model
xs = np.linspace(a, b, 100)
plt.plot(xs, lm.predict(xs), label=f'pred, $R^2$={score:.3f}')
plt.legend()

print(f'score = {score:.3f}')
w = 2.467, b = 0.462
score = 0.580
[Figure: scatter plot of the synthetic data, the ground-truth line, and the fitted regression line labeled with its \(R^2\).]

We can also use LinearRegression from sklearn.linear_model to fit the linear model.

# Use the following command to install scikit-learn
# %conda install scikit-learn
# or
# %pip install scikit-learn
from sklearn.linear_model import LinearRegression

reg = LinearRegression()

# x needs to have shape (n_samples, n_features)
reg.fit(x.reshape(-1, 1), y)
print(f'w = {reg.coef_[0]:.3f}, b = {reg.intercept_:.3f}')

score = reg.score(x.reshape(-1, 1), y)
print(f'score = {score:.3f}')
w = 2.467, b = 0.462
score = 0.580

What is the effect of centering the data?#

When \(X\) is centered, the slope \(w\) remains the same, but the intercept \(b\) changes.

The intercept now represents the predicted \(y\) when \(X\) equals its mean.

x_center = x - np.mean(x)

reg.fit(x_center.reshape(-1, 1), y)
print(f'w = {reg.coef_[0]:.3f}, b = {reg.intercept_:.3f}')
w = 2.467, b = 3.332

When \(Y\) is also centered, the intercept \(b\) becomes 0 and the slope \(w\) is unchanged. (If we go one step further and standardize both \(X\) and \(Y\), i.e. also divide each by its standard deviation, then the slope equals the correlation between \(X\) and \(Y\); see the sketch after this cell.)

y_center = y - np.mean(y)
reg.fit(x_center.reshape(-1, 1), y_center)
print(f'w = {reg.coef_[0]:.3f}, b = {reg.intercept_:.3f}')
w = 2.467, b = 0.000
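As a sketch of the standardization remark above (reusing the x, y, and reg objects already defined in this notebook):

# Standardize both variables: subtract the mean and divide by the standard deviation
x_std = (x - np.mean(x)) / np.std(x)
y_std = (y - np.mean(y)) / np.std(y)

reg.fit(x_std.reshape(-1, 1), y_std)
# For standardized data, the OLS slope equals the sample correlation between x and y
print(f'w = {reg.coef_[0]:.3f}, corr(x, y) = {np.corrcoef(x, y)[0, 1]:.3f}')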

What is the effect of scaling the data?#

When \(X/c\) is used as the input, the optimal slope becomes \(c\,w\), and the intercept does not change.

# suppose we want to scale x to [0, 1]
scale_factor = np.max(x)
x_scale = x / scale_factor
reg.fit(x_scale.reshape(-1, 1), y)
print(f'w = {reg.coef_[0]:.3f}, b = {reg.intercept_:.3f}')
w = 4.756, b = 0.462

For linear regression, centering and scaling the data (applied consistently to both the training and the test data) does not affect the performance of the model.

For some other models, centering and scaling can be essential; a common recipe is sketched below.
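As an illustration of that recipe (a sketch using StandardScaler from sklearn.preprocessing, not part of the original pipeline): fit the scaler on the training data only, then apply the same transformation to both the training and the (hypothetical) test data.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit the scaler on the training inputs only, so no test-set information leaks in
x_train_scaled = scaler.fit_transform(x.reshape(-1, 1))
# Apply the same transformation to hypothetical test inputs
x_test = np.linspace(a, b, 5).reshape(-1, 1)
x_test_scaled = scaler.transform(x_test)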

Appendix#

Detailed Derivation of the Optimal Parameters (Not exam material).

The loss function for simple linear regression is given by:

\[L(w, b) = \frac{1}{N} \sum_{i=1}^N ((wx_i + b) - y_i)^2\]

To find the optimal \(w\) and \(b\), we set the partial derivatives of \(L\) with respect to \(w\) and \(b\) to zero:

  1. Differentiate with respect to \(w\):

\[\frac{\partial L}{\partial w} = \frac{2}{N} \sum_{i=1}^{N} x_i ((wx_i + b) - y_i) = 0\]

Rearrange this equation to get:

\[w \sum_{i=1}^{N} x_i^2 + b N \bar{X} = \sum_{i=1}^{N} x_i y_i\quad\quad(*)\]
  2. Differentiate with respect to \(b\):

\[ \frac{\partial L}{\partial b} = \frac{2}{N} \sum_{i=1}^{N} ((wx_i + b) - y_i) = 0 \]

This simplifies to:

\[ w\sum_{i=1}^{N} x_i + Nb = \sum_{i=1}^{N} y_i \]

Or, equivalently:

\[ w\bar{X} + b = \bar{Y} \]

Substituting \(b = \bar{Y} - w\bar{X}\) back into equation (*), we get:

\[w \sum_{i=1}^{N} x_i^2 + (\bar{Y} - w \bar{X}) N \bar{X} = \sum_{i=1}^{N} x_i y_i\]

Expanding this:

\[w \sum_{i=1}^{N} x_i^2 + N \bar{Y} \bar{X}- w \bar{X} \sum_{i=1}^{N} x_i = \sum_{i=1}^{N} x_i y_i\]

Rearranging terms:

\[w (\sum_{i=1}^{N} x_i^2 - N\bar{X}^2) = \sum_{i=1}^{N} x_i y_i - N\bar{Y}\bar{X}\]

Now, solving for \(w\):

\[w = \frac{\sum_{i=1}^{N} x_i y_i - N\bar{Y}\bar{X}}{\sum_{i=1}^{N} x_i^2 - N\bar{X}^2}\]

Recall that \(\text{Cov}(X,Y) = E[XY]-E[X]E[Y]\) and \(\text{Var}(X) = E[X^2] - E[X]^2\), where \(E[\cdot]\) here denotes the sample average. Dividing both the numerator and the denominator by \(N\), we can rewrite the above equation in terms of the covariance and variance:

\[w^* = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}\]

This is the final expression for \(w^*\) in simple linear regression, representing the slope of the best-fit line.
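As a quick numerical sanity check of this derivation (a sketch reusing the synthetic x, y, and N defined earlier in the notebook), the sum-based expression for \(w^*\) agrees with the covariance-based computation used above:

# Sum-based formula from the derivation
w_sum = (np.sum(x * y) - N * np.mean(x) * np.mean(y)) / (np.sum(x ** 2) - N * np.mean(x) ** 2)

# Covariance-based formula; the common factor 1/N cancels out
cov_mat = np.cov(x, y, bias=True)
w_cov = cov_mat[0, 1] / cov_mat[0, 0]

print(np.isclose(w_sum, w_cov))   # True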