Homework 7 (Due 11/25/2024 at 11:59pm)
Name:
ID:
Submission instructions:
Download the file as .ipynb (see the top right corner of the webpage).
Write your name and ID in the fields above.
Answer the questions in the .ipynb file in either markdown or code cells.
Before submission, make sure to rerun all cells by clicking Kernel -> Restart & Run All, and check all the outputs.
Upload the .ipynb file to Gradescope.
Q1
Use the multiclass logistic regression model to classify the penguins dataset.
Use the features bill_length_mm and bill_depth_mm to predict the species.
(1) Load the data. Remove missing values. Standardize the features.
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.inspection import DecisionBoundaryDisplay
# Load the dataset
df = sns.load_dataset('penguins')
# Drop rows with missing values
df.dropna(inplace=True)
features = ['bill_length_mm', 'bill_depth_mm']
# scale the features
scaler = StandardScaler()
df[features] = scaler.fit_transform(df[features])
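As a quick sanity check (optional, not part of the question), the standardized columns should now have mean approximately 0 and standard deviation approximately 1:
# Optional sanity check: standardized features should have mean ~0 and std ~1
print(df[features].mean().round(3))
print(df[features].std().round(3))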
(2) Split the data 50:50 into a training set and a test set. In the train_test_split function, use stratified sampling so that the proportion of each species is the same in the training set and the test set. Set random_state=0 for reproducibility.
df_train, df_test = train_test_split(df, test_size=0.5, random_state=0, stratify=df['species'])
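As an optional check of the stratification, the species proportions should be (nearly) identical in the two splits:
# Optional check: stratified sampling keeps species proportions equal across splits
print(df_train['species'].value_counts(normalize=True))
print(df_test['species'].value_counts(normalize=True))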
(3) Look at the documentation of LogisticRegression in sklearn. Notice that the default is to use L2 regularization, as in ridge regression.
Here, let’s fit a multiclass logistic regression model without regularization on the training set. Report the training and testing accuracy.
# Initialize and train the logistic regression model
clf = LogisticRegression(penalty=None)
clf.fit(df_train[features], df_train['species'])
train_accuracy = accuracy_score(df_train['species'], clf.predict(df_train[features]))
test_accuracy = accuracy_score(df_test['species'], clf.predict(df_test[features]))
print(f'Train accuracy: {train_accuracy:.2f}, Test accuracy: {test_accuracy:.2f}')
Train accuracy: 0.98, Test accuracy: 0.96
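For reference, in the multiclass case scikit-learn (with the default lbfgs solver) fits one linear score per class and applies a softmax, so the fitted model stores one row of coefficients per species. An optional look at the parameters:
# Optional: inspect the fitted parameters
print(clf.classes_)    # the three species labels
print(clf.coef_)       # shape (3, 2): one weight per class per feature
print(clf.intercept_)  # shape (3,): one bias per class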
(4) Visualize the confusion matrix on the test set.
# Evaluate the model
conf_matrix = confusion_matrix(df_test['species'], clf.predict(df_test[features]))
# Plotting the confusion matrix
ConfusionMatrixDisplay(conf_matrix, display_labels=clf.classes_).plot()
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix of Species Classification')
plt.show()
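Since the diagonal of the confusion matrix counts the correctly classified test points, per-class recall can be read off as the diagonal divided by the row sums. A small optional sketch:
# Per-class recall: correct counts (diagonal) divided by true counts (row sums)
recall = conf_matrix.diagonal() / conf_matrix.sum(axis=1)
for species, r in zip(clf.classes_, recall):
    print(f'{species}: recall = {r:.2f}')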
(5) Use DecisionBoundaryDisplay to visualize the decision boundaries and the test set. The decision boundaries are obtained from the classifier trained on the training set; the scatter plot should show the data points from the test set.
# Plot the decision boundaries using DecisionBoundaryDisplay
fig, ax = plt.subplots()
db_display = DecisionBoundaryDisplay.from_estimator(
    clf,
    df_test[features],
    grid_resolution=200,
    response_method="predict",  # can be "predict_proba" for probability contours
    alpha=0.5,
    cmap='coolwarm',
    ax=ax
)
# Scatter plot of the data points
sns.scatterplot(data=df_test, x='bill_length_mm', y='bill_depth_mm', hue='species', ax=ax)
# Adding title and labels
ax.set_title('3-Class Logistic Regression Decision Boundary')
ax.set_xlabel('bill_length_mm (standardized)')
ax.set_ylabel('bill_depth_mm (standardized)')
# Show plot
plt.show()
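As the in-line comment notes, response_method="predict_proba" would draw probability contours instead; at any single point the three class probabilities come from a softmax over the linear scores and sum to 1. A small illustration on one hypothetical standardized query point (the coordinates are chosen arbitrarily):
# Class probabilities at one hypothetical standardized point; they sum to 1
query = pd.DataFrame({'bill_length_mm': [0.0], 'bill_depth_mm': [0.0]})
print(dict(zip(clf.classes_, clf.predict_proba(query)[0].round(3))))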
Q2. Continue from Q1 (use the same training and testing set)
This time, let’s use a kNN classifier to classify the species using bill_length_mm and bill_depth_mm.
(1) Plot the training and testing accuracy for k = 1, 2, 3, …, 100. That is, for each k, fit a kNN classifier on the training set and compute the training and testing accuracy. Plot the training and testing accuracy as a function of k.
from sklearn.neighbors import KNeighborsClassifier
# Try k = 1, 2, ..., 100 as the question asks
k_values = range(1, 101)
scores_test = []
scores_train = []
for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(df_train[features], df_train['species'])
    # Compute accuracy on the test set and the training set
    score_test = knn.score(df_test[features], df_test['species'])
    score_train = knn.score(df_train[features], df_train['species'])
    scores_test.append(score_test)
    scores_train.append(score_train)
# Plot the test and train accuracy as a function of k
plt.plot(k_values, scores_test, label='Test')
plt.plot(k_values, scores_train, label='Train')
plt.xlabel('k')
plt.ylabel('Accuracy')
plt.legend()
plt.show()
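A methodological aside: the question asks us to pick k by test-set accuracy, which reuses the test set for model selection. A more standard alternative (shown only as a sketch, not required here) selects k by cross-validation on the training set:
import numpy as np
from sklearn.model_selection import cross_val_score
# Sketch: choose k by 5-fold cross-validation on the training set only
cv_means = [cross_val_score(KNeighborsClassifier(n_neighbors=k),
                            df_train[features], df_train['species'], cv=5).mean()
            for k in k_values]
print('Best k by cross-validation:', k_values[int(np.argmax(cv_means))])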
(2) Find the best k from (1) based on the testing set. Fit the kNN classifier with the best k. Visualize the decision boundaries and the testing set.
# Best k: the smallest k attaining the maximum test accuracy
best_k = k_values[scores_test.index(max(scores_test))]
# Initialize and fit the KNeighborsClassifier with the best k
knn = KNeighborsClassifier(n_neighbors=best_k)
knn.fit(df_train[features], df_train['species'])
# Visualize the decision boundary
fig, ax = plt.subplots()
db_display = DecisionBoundaryDisplay.from_estimator(
    knn,
    df_test[features],
    grid_resolution=200,
    response_method="predict",  # can be "predict_proba" for probability contours
    alpha=0.5,
    cmap='coolwarm',
    ax=ax
)
# Scatter plot of the test set (the question asks for the testing set, not the training set)
sns.scatterplot(data=df_test, x='bill_length_mm', y='bill_depth_mm', hue='species', ax=ax)
plt.title(f'Decision Boundary of k-Nearest Neighbors (k={best_k})')
plt.show()
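For completeness (not strictly asked by the question), the test accuracy at the selected k can be reported the same way as in Q1:
# Report the test accuracy of the refit model at the selected k
print(f'Best k = {best_k}, test accuracy = {knn.score(df_test[features], df_test["species"]):.2f}')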