Homework 5 (Due 11/1/2024 at 11:59pm)
Name:
ID:
Submission instructions:
Download the file as .ipynb (see the top right corner of the webpage).
Write your name and ID in the fields above.
Answer the questions in the .ipynb file in either markdown or code cells.
Before submission, make sure to rerun all cells by clicking Kernel -> Restart & Run All, and check all the outputs.
Upload the .ipynb file to Gradescope.
Q1 Linear regression on exponential growth
Suppose we want to measure the growth of a population of bacteria. We culture the bacteria in a test tube. We measure the volume of the cell culture at various times (hr) and obtain the following dataframe:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df_bac = pd.DataFrame({'time': [0, 1.5, 5, 9, 10], 'Volume':[0.37, 1.63, 2.4, 6.2, 10.66]})
print(df_bac)
   time  Volume
0   0.0    0.37
1   1.5    1.63
2   5.0    2.40
3   9.0    6.20
4  10.0   10.66
(1) Use linear regression to fit the data. Compute the \(R^2\). Plot the data and the fitted line on the same figure.
from sklearn.linear_model import LinearRegression
import numpy as np
# scatter plot of the data
sns.scatterplot(data=df_bac, x='time', y='Volume')
# ordinary least-squares fit: Volume ~ time
lm = LinearRegression()
lm.fit(df_bac[['time']], df_bac['Volume'])
# overlay the fitted line on the scatter plot
sns.lineplot(x=df_bac['time'], y=lm.predict(df_bac[['time']]), color='red')
# R2 of the linear fit
score = lm.score(df_bac[['time']], df_bac['Volume'])
print('R2', score)
R2 0.8523390725174672
(2) We know that when nutrients are abundant, the bacteria grow exponentially. Therefore, a better model might be
\[ y = a e^{b x}, \]
where \(y\) is the volume of the bacteria culture and \(x\) is the time.
Can we turn this into a linear regression problem?
Hint: Take the natural logarithm of both sides of the equation.
After taking the natural logarithm of both sides, we have
\[ \ln y = \ln a + b x. \]
This is a linear regression problem: regressing \(\ln y\) on \(x\) gives slope \(b\) and intercept \(\ln a\).
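As a quick sanity check (with illustrative values a_true and b_true, not fitted from the data), the logarithm of an exponential curve is exactly linear in \(x\):
import numpy as np
# if y = a * exp(b*x), then ln(y) = ln(a) + b*x is linear in x
a_true, b_true = 2.0, 0.5  # illustrative values only
x = np.linspace(0, 5, 6)
y = a_true * np.exp(b_true * x)
print(np.allclose(np.log(y), np.log(a_true) + b_true * x))  # True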
(3) After the transformation, find the optimal values of \(a\) and \(b\). Compute the \(R^2\). Plot the data and the fitted curve in the same figure. Is it better than the linear model?
# exponential growth: fit the log-transformed model ln(Volume) = ln(a) + b * time
lm2 = LinearRegression()
lm2.fit(df_bac[['time']], np.log(df_bac['Volume']))
# back-transform: intercept = ln(a), slope = b
a = np.exp(lm2.intercept_)
b = lm2.coef_[0]
print(a, b)
# compute R2 of the exponential fit in the original (untransformed) space
y_pred = a * np.exp(b * df_bac['time'])
mse = np.mean((df_bac['Volume'] - y_pred) ** 2)
var = np.var(df_bac['Volume'])
r2 = 1 - mse / var
print('R2 = ', r2)
# the R2 is higher than that of the linear model
# plot the exponential fit
sns.scatterplot(data=df_bac, x='time', y='Volume')
xx = np.linspace(0, 10, 100)
yy = a * np.exp(b * xx)
sns.lineplot(x=xx, y=yy)
# add legend
_ = plt.legend(labels=["data","fit"])
0.5954551711492091 0.2805094137824158
R2 = 0.9605944747794366
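As an optional cross-check (not required by the problem), the exponential model can also be fitted directly in the original space. The log-transformed fit minimizes squared error in log space, so the two approaches give slightly different parameters. A minimal sketch, assuming SciPy is available:
from scipy.optimize import curve_fit

def exp_model(x, a, b):
    return a * np.exp(b * x)

# start the nonlinear fit from the log-space estimates a, b found above
popt, _ = curve_fit(exp_model, df_bac['time'], df_bac['Volume'], p0=[a, b])
print('direct fit: a, b =', popt)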
Q2 Multiple linear regression
In this problem, we would like to predict the body_mass_g of a penguin based on the other features in the dataset.
# Do not modify this cell
df = sns.load_dataset("penguins")
df.dropna(inplace=True)
from sklearn.model_selection import train_test_split
# the stratify parameter makes sure that the train and test sets have the same proportion of species
df_train, df_test = train_test_split(df, test_size=0.5, random_state=0, stratify=df['species'])
In this exercise, let's consider which subset of features best predicts body_mass_g. This is called the best subset selection problem.
Since we have 3 features, there are \(2^3 - 1 = 7\) non-empty subsets, so we can fit 7 models, each with a different subset of features:
Model 1: bill_length_mm
Model 2: bill_depth_mm
Model 3: flipper_length_mm
Model 4: bill_length_mm, bill_depth_mm
Model 5: bill_length_mm, flipper_length_mm
Model 6: bill_depth_mm, flipper_length_mm
Model 7: bill_length_mm, bill_depth_mm, flipper_length_mm
For each model, fit a linear regression model and compute the \(R^2\) on the training and testing data. Which model has the best \(R^2\) on the testing data? Is the best model the one with the most features?
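If you prefer not to type the seven subsets by hand, the hint below about itertools.combinations can generate the same list; a minimal sketch (all names taken from this problem):
from itertools import combinations

feature_names = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']
all_subsets = [list(c) for k in range(1, len(feature_names) + 1)
               for c in combinations(feature_names, k)]
print(len(all_subsets))  # 7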
# Feel free to use this variable.
# all_subsets = [ ['bill_length_mm'], ['bill_depth_mm'], ['flipper_length_mm'],
# ['bill_length_mm', 'bill_depth_mm'], ['bill_length_mm', 'flipper_length_mm'], ['bill_depth_mm', 'flipper_length_mm'],
# ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']]
# Alternatively, you can check out itertools.combinations (part of the Python standard library) to generate all combinations of a list
from itertools import combinations
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']
best_subset = []
best_score = 0
for L in range(1, len(features) + 1):
    for subset in combinations(features, L):
        lm = LinearRegression()
        lm.fit(df_train[list(subset)], df_train['body_mass_g'])
        test_score = lm.score(df_test[list(subset)], df_test['body_mass_g'])
        train_score = lm.score(df_train[list(subset)], df_train['body_mass_g'])
        print(f'{subset}: {train_score:.4f} {test_score:.4f}')
        if test_score > best_score:
            best_score = test_score
            best_subset = subset
print(f'best subset: {best_subset} {best_score:.4f}')
('bill_length_mm',): 0.3792 0.3101
('bill_depth_mm',): 0.2185 0.2134
('flipper_length_mm',): 0.7489 0.7731
('bill_length_mm', 'bill_depth_mm'): 0.4890 0.4385
('bill_length_mm', 'flipper_length_mm'): 0.7523 0.7689
('bill_depth_mm', 'flipper_length_mm'): 0.7508 0.7745
('bill_length_mm', 'bill_depth_mm', 'flipper_length_mm'): 0.7532 0.7705
best subset: ('bill_depth_mm', 'flipper_length_mm') 0.7745
Q3 Working with categorical variables
Let's try to make use of the species feature in predicting body_mass_g.
# Do not modify this cell
df_penguins = sns.load_dataset("penguins")
df_penguins.dropna(inplace=True)
We can think of two different methods to encode the categorical feature:
Method 1: using integers 1, 2, 3, ...
Method 2: using dummy variables such as is_category_A, where is_category_A is 1 if the category is A and 0 otherwise.
In this problem, we will compare the two methods.
For our dataset, the species are ['Adelie', 'Chinstrap', 'Gentoo'].
(1) Let's generate a few new features in df_penguins:
code is 0 if the species is Adelie, 1 if the species is Chinstrap, 2 if the species is Gentoo
is_Adelie is 1 if the species is Adelie, 0 otherwise
is_Chinstrap is 1 if the species is Chinstrap, 0 otherwise
The is_Adelie and is_Chinstrap features are called dummy variables.
# dummy variables (booleans behave like 0/1 in the regression)
df_penguins['is_Adelie'] = df_penguins['species'] == 'Adelie'
df_penguins['is_Chinstrap'] = df_penguins['species'] == 'Chinstrap'
# encode species as integers 0, 1, 2
code = {'Adelie': 0, 'Chinstrap': 1, 'Gentoo': 2}
df_penguins['code'] = df_penguins['species'].map(code)
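As an aside, pandas can build the dummy columns automatically with pd.get_dummies; a minimal sketch (it produces one column per species, and any two of the three carry the same information, since the third is determined by the other two):
# alternative: one dummy column per species
dummies = pd.get_dummies(df_penguins['species'], prefix='is')
print(dummies.head())  # columns: is_Adelie, is_Chinstrap, is_Gentoo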
After generating these new features, split the augmented dataframe into training and testing sets using the following command:
# Do not modify this cell
df_penguins_train, df_penguins_test = train_test_split(df_penguins, test_size=0.5, random_state=0, stratify=df_penguins['species'])
(2) Compare the performance using different combinations of features. Compute the \(R^2\) on training and testing sets. Which one is the best?
Baseline: ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']
Use code: ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'code']
Use dummy: ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'is_Adelie', 'is_Chinstrap']
# we define a helper function for our workflow: take some features, fit a linear
# regression model, and return the R2 scores on the training and testing sets
def get_score(df_train, df_test, features):
    lm = LinearRegression()
    lm.fit(df_train[features], df_train['body_mass_g'])
    train_score = lm.score(df_train[features], df_train['body_mass_g'])
    test_score = lm.score(df_test[features], df_test['body_mass_g'])
    return train_score, test_score
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm']
train_score, test_score = get_score(df_penguins_train, df_penguins_test, features)
print(f'Baseline: Train score: {train_score:.3f}, Test score: {test_score:.3f}')
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'code']
train_score, test_score = get_score(df_penguins_train, df_penguins_test, features)
print(f'Use code: Train score: {train_score:.3f}, Test score: {test_score:.3f}')
features = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'is_Adelie', 'is_Chinstrap']
train_score, test_score = get_score(df_penguins_train, df_penguins_test, features)
print(f'Use dummy: Train score: {train_score:.3f}, Test score: {test_score:.3f}')
# the dummy variables work better than the integer code
# In general, encoding categorical variables as integers 1, 2, 3 is not a good idea,
# because it implies an ordering between the categories that does not exist.
# However, in some cases, e.g. when the categories are age groups, it can make sense
# to encode them as 1, 2, 3, since there is a natural order between the categories.
Baseline: Train score: 0.753, Test score: 0.770
Use code: Train score: 0.763, Test score: 0.760
Use dummy: Train score: 0.851, Test score: 0.839
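To see why the dummy encoding helps, one can inspect the fitted coefficients: the dummy coefficients act as per-species intercept shifts, instead of forcing a single slope across the artificial 0/1/2 codes. A minimal illustrative sketch:
# optional: inspect the coefficients of the dummy model
feats = ['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'is_Adelie', 'is_Chinstrap']
lm_dummy = LinearRegression()
lm_dummy.fit(df_penguins_train[feats], df_penguins_train['body_mass_g'])
for f, c in zip(feats, lm_dummy.coef_):
    print(f'{f}: {c:.1f}')
# the is_Adelie / is_Chinstrap coefficients shift the intercept per species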