Lecture Week 4 Mon 10/21#

Problem#

Given the training dataset \(x_i\in\mathbb{R}\), \(y_i\in\mathbb{R}\), \(i= 1,2,..., N\), we want to find the linear function

\[ y\approx f(x) = wx + b \]

that fits the relation between \(x_i\) and \(y_i\), so that given any \(x\), we can make the prediction

\[ \hat{y} = f(x) = w x+b \]

Loss Function and Optimization#

With the training dataset, define the loss function \(L(w,b)\) of the parameters \(w\) and \(b\), also called the mean squared error (MSE):

\[L(w,b)=\frac{1}{N}\sum_{i=1}^N\big(\hat{y}^{(i)}-y_i\big)^2=\frac{1}{N}\sum_{i=1}^N\big((wx_i+b)-y_i\big)^2,\]

where \(\hat{y}^{(i)}\) denotes the predicted value of \(y\) at \(x_i\), i.e. \(\hat{y}^{(i)} = wx_i+b\).
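
As a quick illustration (a sketch, not part of the lecture code; the toy arrays and the trial values of \(w\) and \(b\) below are made up), the loss can be evaluated directly with numpy:

import numpy as np

def mse_loss(w, b, x, y):
    # mean squared error of the linear model y ≈ w*x + b on the data (x, y)
    return np.mean((w * x + b - y) ** 2)

# toy data and an arbitrary trial parameter pair
x_toy = np.array([0., 1., 2., 3.])
y_toy = np.array([0., 2., 0., 6.])
print(mse_loss(1.0, 0.0, x_toy, y_toy))   # 3.5 for this toy data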

Our goal is to find the optimal \(w\) and \(b\) that minimize the loss function \(L(w,b)\), i.e.

\[\min_{w,b} L(w,b)\]

This is a function of \(w\) and \(b\), and we can analytically solve \(\partial_{w}L = \partial_{b}L = 0\), which yields

\[w^* = \frac{\sum_{i=1}^{N} (x_i - \bar{X})(y_i - \bar{Y})}{\sum_{i=1}^{N} (x_i - \bar{X})^2} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)}\]
\[b^* = \bar{Y} - w^*\bar{X}\]

where \(\bar{X}\) and \(\bar{Y}\) are the means of \(x\) and \(y\), \(\text{Cov}(X,Y)\) denotes the estimated covariance (also called sample covariance) between \(X\) and \(Y\), and \(\text{Var}(X)\) denotes the sample variance of \(X\).
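
To make the closed-form solution concrete, here is a minimal numpy sketch (using the same toy data that appears later in this lecture) that computes \(w^*\) and \(b^*\) directly from the formulas above:

import numpy as np

x = np.array([0., 1., 2., 3.])
y = np.array([0., 2., 0., 6.])

xbar, ybar = x.mean(), y.mean()
# the 1/N factors in Cov(X,Y) and Var(X) cancel, so plain sums suffice
w_star = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b_star = ybar - w_star * xbar
print(w_star, b_star)   # 1.6 and -0.4 for this data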

What is the best constant model?#

What is the best constant \(b\) that minimizes the loss function \(L(b)\)?

\[L(b)=\frac{1}{N}\sum_{i=1}^N\big(b-y_i\big)^2\]

We can analytically solve \(\partial_{b}L = 0\), which yields

\[b^* = \bar{y}\]

where \(\bar{y}\) is the mean of \(y\).
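
A quick numerical check (a sketch with the same toy data; the offset 0.5 is arbitrary) that the mean beats nearby constants:

import numpy as np

y = np.array([0., 2., 0., 6.])
L = lambda b: np.mean((b - y) ** 2)    # loss of the constant model
print(L(y.mean()), L(y.mean() + 0.5))  # 6.0 vs 6.25: the mean gives the smaller loss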

Evaluating the model#

  • MSE: A smaller MSE indicates better performance.

MSE depends on the units of \(y\). For example, if we replace \(y\) measured in meters by \(y\) measured in millimeters, then the MSE becomes \(1000^2\) times the original MSE (see the sketch after the \(R^{2}\) discussion below).

  • R-Squared: A larger \(R^{2}\) (closer to 1) indicates better performance. Compared with MSE, R-squared is dimensionless and does not depend on the units of the variables.

\[R^{2} = 1 - \frac{\sum_{i=1}^{N}(y_i-\hat{y}^{(i)})^{2}}{\sum_{i=1}^{N}(y_i-\bar{y})^{2}} = 1 - \frac{\frac{1}{N}\sum_{i=1}^{N}(y_i-\hat{y}^{(i)})^{2}}{\frac{1}{N}\sum_{i=1}^{N}(y_i-\bar{y})^{2}} = 1 - \frac{\text{MSE}}{\text{Var}(Y)}\]

Intuitively, the MSE is the “unexplained” variance, and \(R^{2}\) is the proportion of the variance that is explained by the model. If our model is perfect, then the MSE is 0 and \(R^{2}\) is 1. If we use a constant model, then our best prediction is the mean of \(y\), the MSE equals the variance of \(y\), and \(R^{2}\) is 0.
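
The sketch below (reusing the toy data from the earlier sketch and its fitted line \(\hat{y} = 1.6x - 0.4\), with a hypothetical change of units) illustrates both points: rescaling \(y\) rescales the MSE by the square of the factor, while \(R^{2}\) is unchanged.

import numpy as np

x = np.array([0., 1., 2., 3.])
y = np.array([0., 2., 0., 6.])     # think of y as measured in meters
yhat = 1.6 * x - 0.4               # fitted linear model

mse = np.mean((y - yhat) ** 2)
r2 = 1 - mse / np.var(y)
print(mse, r2)                     # 2.8 and about 0.533

# convert y to millimeters: MSE grows by 1000^2, R^2 stays the same
mse_mm = np.mean((1000 * y - 1000 * yhat) ** 2)
r2_mm = 1 - mse_mm / np.var(1000 * y)
print(mse_mm, r2_mm)               # 2.8e6 and about 0.533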

import numpy as np
x = np.array([0,1,2,3])
y = np.array([0,2,0,6])

poll: What are the optimal slope and intercept for the data?#

class myLinearRegression:
    '''
    The single-variable linear regression estimator.
    This mimics the interface of the regression models from sklearn, with methods fit, predict, and score.
    '''
    def __init__(self):
        '''
        Initialize an unfitted model; the parameters w and b are set by fit.
        '''
        self.w = None
        self.b = None
    
    def fit(self, x, y):
        # covariance matrix, 
        # bias = True makes the factor 1/N, otherwise 1/(N-1)
        # but it doesn't matter here, since the factor will be cancelled out in the calculation of w
        
        cov_mat = np.cov(x, y, bias=True)
        # cov_mat[0, 1] is the covariance of x and y, and cov_mat[0, 0] is the variance of x

        self.w = cov_mat[0, 1] / cov_mat[0, 0]
        self.b = np.mean(y) - self.w * np.mean(x)

        # :.3f means 3 decimal places
        print(f'w = {self.w:.3f}, b = {self.b:.3f}')

    def predict(self, x):
        '''
        Predict the output values for the input value x, based on trained parameters

        '''
        ypred = self.w * x + self.b
        return ypred

    def score(self, x, y):
        '''
        Calculate the R^2 score of the model
        '''
        mse =  np.mean((y - self.predict(x))**2)
        var = np.mean((y - np.mean(y))**2)
        Rsquare = 1 - mse / var
        return Rsquare
lm = myLinearRegression()
lm.fit(x, y)
w = 1.600, b = -0.400
score = lm.score(x, y)
print(f'R^2 score = {score:.3f}')
R^2 score = 0.533
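
With the fitted parameters we can also call the predict method; the query point 1.5 below is just an arbitrary example value.

ypred = lm.predict(1.5)
print(f'prediction at x = 1.5: {ypred:.3f}')   # 1.6 * 1.5 - 0.4 = 2.0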

Use one of the following commands to install scikit-learn:

conda install scikit-learn

or

pip install scikit-learn

from sklearn.linear_model import LinearRegression

reg = LinearRegression()

# For sklearn, the input x has to be n-by-d, where n is the number of samples and d is the number of features
# x.reshape(-1, 1) reshapes the 1D array x to a 2D array with 1 column
# 

reg.fit(x.reshape(-1, 1), y)
print(f'w = {reg.coef_[0]:.3f}, b = {reg.intercept_:.3f}')

score = reg.score(x.reshape(-1, 1), y)
print(f'score = {score:.3f}')
w = 1.600, b = -0.400
score = 0.533
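As with fit, sklearn's predict expects a 2D input of shape n-by-d; here is a small usage sketch with arbitrary query points:

xnew = np.array([1.5, 4.0]).reshape(-1, 1)   # two query points as a 2-by-1 array
print(reg.predict(xnew))                     # [2. 6.] for the fitted line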
# side note on numpy.reshape

x = np.array([0,1,2,3,4,5])

# reshape to 2D array with 2 rows and 3 columns
print(x.reshape(2, 3))

# setting a dimension to -1 lets numpy infer it from the array length, here the number of rows
print(x.reshape(-1, 3))

# reshape to n-by-1 array
print(x.reshape(-1, 1))
[[0 1 2]
 [3 4 5]]
[[0 1 2]
 [3 4 5]]
[[0]
 [1]
 [2]
 [3]
 [4]
 [5]]