Homework 7 (Due 11/25/2024 at 11:59pm)


Name:

ID:

Submission instructions:

  • Download the file as .ipynb (see top right corner on the webpage).

  • Write your name and ID in the field above.

  • Answer the questions in the .ipynb file in either markdown or code cells.

  • Before submission, make sure to rerun all cells by clicking Kernel -> Restart & Run All and check all the outputs.

  • Upload the .ipynb file to Gradescope.

Q1

Use the multiclass logistic regression model to classify the penguins dataset.

Use the features bill_length_mm, bill_depth_mm to predict the species.

(1) Load the data. Remove missing values. Standardize the features.

# code here
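A minimal sketch of one approach, assuming the dataset is available through seaborn's `load_dataset`; for simplicity the scaler is fit on the full cleaned dataset, as the problem statement suggests:

```python
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the penguins dataset (assumed available via seaborn)
penguins = sns.load_dataset("penguins")

# Keep the two features and the target; drop rows with missing values
df = penguins[["bill_length_mm", "bill_depth_mm", "species"]].dropna()

# Standardize the features to zero mean and unit variance
X = StandardScaler().fit_transform(df[["bill_length_mm", "bill_depth_mm"]])
y = df["species"].values
```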

(2) Split the data 50:50 into a training set and a test set. In the train_test_split function, use stratified sampling, so that the proportion of different species is the same in the training set and the test set. Set random_state=0 for reproducibility.

# code here
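One way to do this, continuing from the `X` and `y` defined in part (1):

```python
from sklearn.model_selection import train_test_split

# 50:50 split; stratify=y keeps the species proportions equal
# in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0
)
```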

(3) Look at the documentation of LogisticRegression in sklearn. Notice that the default is to use L2 regularization as in ridge regression.

Here, let’s fit a multiclass logistic regression model without regularization on the training set. Report the training and testing accuracy.

# code here
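A sketch, assuming scikit-learn >= 1.2, where `penalty=None` disables the default L2 regularization (older versions use `penalty='none'`):

```python
from sklearn.linear_model import LogisticRegression

# Multiclass logistic regression with the default L2 penalty turned off
clf = LogisticRegression(penalty=None)
clf.fit(X_train, y_train)

print("training accuracy:", clf.score(X_train, y_train))
print("testing accuracy:", clf.score(X_test, y_test))
```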

(4) Visualize the confusion matrix on the test set.

# code here
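For example, using `ConfusionMatrixDisplay` with the classifier `clf` from part (3):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Rows are true species, columns are predicted species
ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test)
plt.show()
```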

(5) Use DecisionBoundaryDisplay to visualize the decision boundaries together with the test set. The decision boundaries are obtained from the classifier trained on the training set; the scatter plot should show the data points from the test set.

# code here
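A sketch (assuming scikit-learn >= 1.1, where `DecisionBoundaryDisplay` was introduced). The plotting grid is built from the training data, so the regions reflect the trained classifier, while the scatter shows the test points:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay

# Decision regions from the classifier fitted on the training set;
# response_method="predict" is required for multiclass contour plots
disp = DecisionBoundaryDisplay.from_estimator(
    clf, X_train, response_method="predict", alpha=0.3,
    xlabel="bill_length_mm (standardized)",
    ylabel="bill_depth_mm (standardized)",
)

# Overlay the test set, colored by species
disp.ax_.scatter(X_test[:, 0], X_test[:, 1],
                 c=pd.Categorical(y_test).codes, edgecolor="k")
plt.show()
```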

Q2. Continue from Q1 (use the same training and test sets)

This time, let's use a kNN classifier to classify the species using bill_length_mm, bill_depth_mm.

(1) For each k = 1, 2, 3, …, 100, fit a kNN classifier on the training set and compute the training and testing accuracy. Plot both accuracies as a function of k.

# code here
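One possible implementation, reusing the split from Q1:

```python
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

ks = range(1, 101)
train_acc, test_acc = [], []
for k in ks:
    # Fit a kNN classifier with k neighbors on the training set
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_acc.append(knn.score(X_train, y_train))
    test_acc.append(knn.score(X_test, y_test))

plt.plot(ks, train_acc, label="training accuracy")
plt.plot(ks, test_acc, label="testing accuracy")
plt.xlabel("k (number of neighbors)")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```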

(2) Find the best k from (1) based on the testing set. Fit the kNN classifier with the best k. Visualize the decision boundaries and the testing set.

# code here
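A sketch continuing from part (1); ties in testing accuracy are broken here by taking the smallest such k, which is one reasonable convention:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.inspection import DecisionBoundaryDisplay

# Best k = the k with the highest testing accuracy (smallest k on ties)
best_k = ks[int(np.argmax(test_acc))]
print("best k:", best_k)

# Refit with the best k and plot its decision regions,
# with the test points overlaid as in Q1(5)
best_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
disp = DecisionBoundaryDisplay.from_estimator(
    best_knn, X_train, response_method="predict", alpha=0.3,
    xlabel="bill_length_mm (standardized)",
    ylabel="bill_depth_mm (standardized)",
)
disp.ax_.scatter(X_test[:, 0], X_test[:, 1],
                 c=pd.Categorical(y_test).codes, edgecolor="k")
plt.show()
```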