Lecture Week 5 Mon 10/7#

Poll

Generally speaking, if we have a new feature that is highly correlated with an existing feature, is it useful to include it in our linear regression model?

Poll

Given an input \(x\) in 1D and an output \(y\) in 1D, if we use the features \([x, 2x, 2^2x, \dots, 2^kx]\), how does the model performance change as we increase \(k\)?
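A quick empirical sketch of this poll (not part of the lecture code; it assumes a noisy linear relationship): every new column is a scalar multiple of \(x\), so the fitted predictions, and hence the score, should not change as \(k\) grows.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = 3 * x + rng.normal(0, 0.5, 200)

# Features [x, 2x, 4x, ..., 2^k x] are all scalar multiples of x
for k in range(5):
    X = np.column_stack([(2.0 ** j) * x for j in range(k + 1)])
    model = LinearRegression().fit(X, y)
    print(f"k={k}, train R^2={model.score(X, y):.6f}")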

Collinearity#

If two variables are highly correlated, intuitively, we can say that they are measuring the same thing. In the context of linear regression, this is called multicollinearity.

Suppose \(Y = aX_1\) and \(X_2 = bX_1\), then we can write \(Y = \beta_1 X_1 + \beta_2 X_2\) for any \(\beta_1\) and \(\beta_2\) such that \(\beta_1 + b \beta_2 = a\).

We can see that the coefficients are not unique.
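As a concrete check (a small sketch with made-up numbers, \(a = 3\) and \(b = 2\)): both \((\beta_1, \beta_2) = (3, 0)\) and \((1, 1)\) satisfy \(\beta_1 + 2\beta_2 = 3\) and give identical predictions.

import numpy as np

x1 = np.linspace(0, 1, 5)
x2 = 2 * x1          # X2 = b * X1 with b = 2
y = 3 * x1           # Y = a * X1 with a = 3

# Any (beta1, beta2) with beta1 + 2*beta2 = 3 reproduces y exactly
print(np.allclose(3 * x1 + 0 * x2, y))  # True
print(np.allclose(1 * x1 + 1 * x2, y))  # True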

This causes a few problems:

  1. Interpretation: it’s difficult to interpret the coefficients. Originally, we would say that a one unit increase in \(X_1\) would lead to a \(\beta_1\) increase in \(Y\). But now, we can’t say that, because we can’t change \(X_1\) without changing \(X_2\).

  2. Large variance of the parameters: since there can be infinitely many solutions, there is no guarantee which one the solver will find. Therefore, small changes in the data can lead to large changes in the coefficients.

  3. Numerical stability: sometimes the solver cannot find a solution at all, because \(X^\top X\) is singular (or nearly so). We will verify this numerically after the demo below.

However, this does not affect the prediction: even though the coefficients are not unique, the prediction will be the same.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.uniform(0,1,(100,1))

# append collinear columns
X = np.column_stack((X, X[:,0]*2))
y = X[:,0] + np.random.randn(100)  # y depends only on X0, plus Gaussian noise
X
array([[0.62365027, 1.24730053],
       [0.4253973 , 0.85079461],
       [0.43361357, 0.86722714],
       ...,
       [0.39119296, 0.78238592],
       [0.24553983, 0.49107966],
       [0.74915235, 1.49830469]])
# regression with X0 and X1=2*X0
lreg_sklearn = LinearRegression()
lreg_sklearn.fit(X,y)
score = lreg_sklearn.score(X,y)
print(f'coefs: {lreg_sklearn.coef_}, intercept: {lreg_sklearn.intercept_}, score: {score}')


# regression with X0
lreg_sklearn.fit(X[:,0:1],y)
score = lreg_sklearn.score(X[:,0:1],y)
print(f'coefs: {lreg_sklearn.coef_}, intercept: {lreg_sklearn.intercept_}, score: {score}')
coefs: [0.29626866 0.59253733], intercept: -0.08254725576306376, score: 0.13419087806713048
coefs: [1.48134332], intercept: -0.08254725576306343, score: 0.1341908780671307
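Notice that the two fits give the same intercept and the same \(R^2\), and the coefficients are consistent with the theory above: \(\beta_1 + 2\beta_2 = 0.29626866 + 2 \times 0.59253733 = 1.48134332\), which is exactly the single-feature coefficient.

We can also see the numerical-stability issue directly by checking the rank and condition number of the design matrix (a quick sketch reusing the X above):

# The two columns are exact multiples of each other, so X has rank 1
# and X.T @ X is singular: the normal equations have no unique solution
print(np.linalg.matrix_rank(X))   # 1
print(np.linalg.cond(X.T @ X))    # very large (numerically singular)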

Cross-validation#

Training, Validation, and Testing#

Note that we have the following separate goals:

  • Model selection: estimate the performance of different models in order to choose the best one.

  • Model assessment: after choosing the best model, estimate its prediction error on new data.

If we have plenty of data, we can split it into three sets: training, validation, and test.

  • The training set is used to fit the models.

  • The validation set is used to estimate prediction error, which we use to select the model or tune the hyperparameters (in our example, the degree of the polynomial). Notice that in this process, the models “see” the validation set.

  • The test set is used to assess the generalization error of the final chosen model. This set is never seen by the models; we should not go back and choose the model based on its test set performance.

One common way of splitting the data is 60% training, 20% validation, and 20% test.
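As a sketch of how one might produce such a split with scikit-learn's train_test_split (the variable names are just for illustration; the second split uses 0.25 because 25% of the remaining 80% is 20% of the whole):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.uniform(0, 1, (100, 1))
y = X[:, 0] + np.random.randn(100)

# 80/20 split first, then split the 80% into 60/20
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20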

Sometimes people use “validation” and “test” interchangeably. This is fine if we are doing only one of the tasks above (model selection or model assessment). However, if we are doing both, we should keep two separate sets.

Question:#

Take the penguins dataset. Use the flipper length to predict the body mass. Perform a 5-fold cross-validation. What is the average mean squared error?

Feel free to look up the documentation of KFold, ask Google or ChatGPT, or discuss with your classmates.

# example from scikit-learn
# https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.KFold.html
from sklearn.model_selection import KFold
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4])
kf = KFold(n_splits=4)
type(kf)
sklearn.model_selection._split.KFold

# kf.split(X) returns a generator, an object similar to the range object
# that we can iterate over
type(kf.split(X))
generator
# get the indices of the train and test sets
for train_index, test_index in kf.split(X):
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")
  Train: index=[1 2 3]
  Test:  index=[0]
  Train: index=[0 2 3]
  Test:  index=[1]
  Train: index=[0 1 3]
  Test:  index=[2]
  Train: index=[0 1 2]
  Test:  index=[3]
# using enumerate to get the counter i
for i, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"  i: {i}")
    print(f"  Train: index={train_index}")
    print(f"  Test:  index={test_index}")
  i: 0
  Train: index=[1 2 3]
  Test:  index=[0]
  i: 1
  Train: index=[0 2 3]
  Test:  index=[1]
  i: 2
  Train: index=[0 1 3]
  Test:  index=[2]
  i: 3
  Train: index=[0 1 2]
  Test:  index=[3]
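An aside not in the original demo: KFold does not shuffle by default, which is why the folds above are contiguous blocks of indices. Passing shuffle=True (as the penguins example below does) randomizes which rows land in each fold:

# Same data, but the fold assignment is now randomized
kf_shuffled = KFold(n_splits=4, shuffle=True, random_state=0)
for train_index, test_index in kf_shuffled.split(X):
    print(f"  Train: index={train_index}, Test: index={test_index}")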
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Load the Penguins dataset
df = sns.load_dataset('penguins')
df.dropna(inplace=True)  # Remove missing values

# features = ['bill_length_mm', 'bill_depth_mm','flipper_length_mm']
features = ['flipper_length_mm']
target = ['body_mass_g']  

# Initialize linear regression model
model = LinearRegression()

kf = KFold(n_splits=5, shuffle=True, random_state=1)


all_scores = []

for i, (train_index, test_index) in enumerate(kf.split(df)):
    X_train = df[features].iloc[train_index] 
    X_test  = df[features].iloc[test_index]

    y_train = df[target].iloc[train_index]
    y_test = df[target].iloc[test_index]

    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)

    all_scores.append(score)

    print(f"Fold {i} R^2 score:", score)

print(f"Mean R^2 score: {np.mean(all_scores)}")
Fold 0 R^2 score: 0.725084353204537
Fold 1 R^2 score: 0.6993225273807326
Fold 2 R^2 score: 0.7895360534716703
Fold 3 R^2 score: 0.7782229609972658
Fold 4 R^2 score: 0.794602971686194
Mean R^2 score: 0.75735377334808
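The question asks for the mean squared error, while the loop above reports \(R^2\). A minimal variation, reusing the df, features, target, model, and kf objects defined above (and, as an aside, scikit-learn's cross_val_score can do the whole loop in one call):

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score

mse_scores = []
for train_index, test_index in kf.split(df):
    X_train, X_test = df[features].iloc[train_index], df[features].iloc[test_index]
    y_train, y_test = df[target].iloc[train_index], df[target].iloc[test_index]

    model.fit(X_train, y_train)
    mse_scores.append(mean_squared_error(y_test, model.predict(X_test)))

print(f"Mean MSE: {np.mean(mse_scores)}")

# Equivalent one-liner; scores are negated because scikit-learn always maximizes
neg_mse = cross_val_score(model, df[features], df[target],
                          cv=kf, scoring='neg_mean_squared_error')
print(f"Mean MSE via cross_val_score: {-np.mean(neg_mse)}")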