Cross Validation#
Training, Validation, and Testing#
Note that we have two separate goals:
Model selection: estimate the performance of different models in order to choose the best one.
Model assessment: after choosing the best model, estimate its prediction error on new data.
If we have plenty of data, we can split it into three sets: training, validation, and test.
The training set is used to fit the models. The validation set is used to estimate prediction error, which guides model selection or hyperparameter tuning; in our example, the hyperparameter is the degree of the polynomial. Notice that in this process, the models “see” the validation set. The test set is used to assess the generalization error of the final chosen model. This set is never seen by the models, and we should not go back and choose the model based on test set performance.
One common way of splitting the data is 60% training, 20% validation, and 20% test.
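For concreteness, below is a minimal sketch of how such a 60/20/20 split can be produced by calling scikit-learn's train_test_split twice: first splitting off the test set, then splitting the remainder into training and validation sets. The variable names and the use of the penguins data (loaded again later in this section) are illustrative choices, not part of the original example.
import seaborn as sns
from sklearn.model_selection import train_test_split
# Load the penguins data and keep rows with the columns we need
df_split = sns.load_dataset('penguins').dropna(subset=['flipper_length_mm', 'body_mass_g'])
X = df_split[['flipper_length_mm']]
y = df_split['body_mass_g']
# First split off 20% of the data as the test set
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
# Then split the remaining 80% into training and validation (0.25 x 0.8 = 0.2 of the full data)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=1)
print(len(X_train), len(X_val), len(X_test))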
Sometimes people use “validation” and “test” interchangeably. This is fine if we are doing only one of the tasks above (model selection or model assessment). However, if we are doing both, we should have two separate sets.
Cross-Validation#
When the data is limited, we can use cross-validation. The most common method is k-fold cross-validation. For example, for 5-fold cross-validation, the data is randomly split into 5 equal parts. We train on 4 parts and validate on the remaining part. We repeat this process 5 times (the folds), each time with a different validation part. The final score is the average of the 5 validation scores.
import numpy as np
import seaborn as sns
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.metrics import r2_score
# Load the Penguins dataset
df = sns.load_dataset('penguins')
# features = ['bill_length_mm', 'bill_depth_mm','flipper_length_mm']
features = ['flipper_length_mm']
target = ['body_mass_g']
# Remove missing values based on the features and target
df.dropna(subset=features + target, inplace=True) # Remove missing values
# Initialize linear regression model
model = LinearRegression()
# Set up 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=1)
# Initialize a list to store the R^2 scores for each fold
scores = []
# Manually perform cross-validation
k = 1
for train_index, test_index in kf.split(df):
    # Split the data into training and test sets for this fold
    X_train, X_test = df[features].iloc[train_index], df[features].iloc[test_index]
    y_train, y_test = df[target].iloc[train_index], df[target].iloc[test_index]
    # Fit the model to the training data
    model.fit(X_train, y_train)
    # Predict on the test data
    y_pred = model.predict(X_test)
    # Calculate the R^2 score and append to list
    score = r2_score(y_test, y_pred)
    scores.append(score)
    # Output the score for this fold
    print(f"Fold {k} R^2 score:", score)
    k += 1
# Output the average score across folds
print("Average R^2 score:", np.mean(scores))
Fold 1 R^2 score: 0.758969714080479
Fold 2 R^2 score: 0.6996496443204514
Fold 3 R^2 score: 0.7813098171294692
Fold 4 R^2 score: 0.7538535583920862
Fold 5 R^2 score: 0.7554820079335888
Average R^2 score: 0.7498529483712149
We can also use the cross_val_score function to streamline the process.
from sklearn.model_selection import cross_val_score
# Initialize linear regression model
model = LinearRegression()
# Set up 5-fold cross-validation
kf = KFold(n_splits=5, shuffle=True, random_state=1)
# Perform cross-validation
scores = cross_val_score(model, df[features], df[target], cv=kf, scoring='r2')
# Output the scores for each fold
print("R^2 scores for each fold:", scores)
print("Average R^2 score:", scores.mean())
R^2 scores for each fold: [0.75896971 0.69964964 0.78130982 0.75385356 0.75548201]
Average R^2 score: 0.7498529483712149
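As an illustration of model selection (the first goal above), we can compare candidate models by their average cross-validated score and pick the best one. The sketch below reuses df, target, LinearRegression, KFold, and cross_val_score from the cells above, and adds the three-feature set from the commented-out line; the comparison itself is an illustrative assumption, not part of the original example.
# Candidate feature sets: the single feature used above and the three-feature
# set from the commented-out line (illustrative comparison)
candidates = {
    'flipper only': ['flipper_length_mm'],
    'all three': ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'],
}
for name, cols in candidates.items():
    # Drop rows missing any of the columns needed for this candidate
    sub = df.dropna(subset=cols + target)
    cv_scores = cross_val_score(LinearRegression(), sub[cols], sub[target[0]],
                                cv=KFold(n_splits=5, shuffle=True, random_state=1),
                                scoring='r2')
    print(f"{name}: average R^2 = {cv_scores.mean():.3f}")
The candidate with the highest average validation score would be selected, and its generalization error would then be assessed on a held-out test set that played no role in the comparison.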