Analysis of Mushroom Attributes#
Author: Crystal Tran
Course Project, UC Irvine, Math 10, S24
I would like to post my notebook on the course’s website. [Yes]
Introduction#
The ultimate goal of this analysis is to classify mushrooms as edible or poisonous given select features of the mushroom dataset. We will use feature selection, data scaling, data balancing, and cross-validation on different regression and classification models in order to create the best model to classify mushrooms.
Additionally, we will explore the data to look for other interesting correlations among the selected features.
Import#
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression,LinearRegression
from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor
from sklearn import metrics
from sklearn.utils import resample
Data Cleaning#
The dataset has a total of 20 features in addition to mushroom class, but for simplicity, we will only use about half of those features. The features that we select will be features that are easily determined by examining the mushroom, such as the shape, surface, and color of the mushroom, as well as features that can be measured, such as cap diameter and stem dimensions.
df = pd.read_csv("mushroom.csv")
df.drop(["gill-spacing", "spore-print-color", "stem-root", "veil-type", "veil-color", "habitat", "gill-attachment", "gill-color", "season", "ring-type"], axis=1, inplace=True)
df = df.dropna()  # dropna returns a new frame, so reassign to actually drop missing rows
df.head()
|   | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | stem-height | stem-width | stem-surface | stem-color | has-ring |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | 15.26 | x | g | o | f | 16.95 | 17.09 | y | w | t |
| 1 | p | 16.60 | x | g | o | f | 17.99 | 18.19 | y | w | t |
| 2 | p | 14.07 | x | g | o | f | 17.80 | 17.74 | y | w | t |
| 3 | p | 14.17 | f | h | e | f | 15.77 | 15.98 | y | w | t |
| 4 | p | 14.64 | x | h | o | f | 16.53 | 17.20 | y | w | t |
Data Preprocessing#
To get the data ready for analysis, we will convert boolean-style values to 0 or 1 and expand the string-valued features into dummy columns.
isPoisonous = {"p": 1, "e": 0}
boolean = {"t": 1, "f": 0}
df = df.replace({"class": isPoisonous, "has-ring": boolean, "does-bruise-or-bleed": boolean})
dummy_features = ["cap-shape", "cap-surface", "cap-color", "stem-surface", "stem-color"]
df_dummies = pd.get_dummies(df[dummy_features], drop_first=True)
df_dummies = df_dummies.astype(int)  # convert the boolean dummy columns to 0/1
df = df.join(df_dummies)
df.drop(dummy_features, axis=1, inplace=True)
df.head()
|   | class | cap-diameter | does-bruise-or-bleed | stem-height | stem-width | has-ring | cap-shape_c | cap-shape_f | cap-shape_o | cap-shape_p | ... | stem-color_g | stem-color_k | stem-color_l | stem-color_n | stem-color_o | stem-color_p | stem-color_r | stem-color_u | stem-color_w | stem-color_y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15.26 | 0 | 16.95 | 17.09 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1 | 16.60 | 0 | 17.99 | 18.19 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 1 | 14.07 | 0 | 17.80 | 17.74 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 1 | 14.17 | 0 | 15.77 | 15.98 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 1 | 14.64 | 0 | 16.53 | 17.20 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |

5 rows × 52 columns
Now we can feed the data into classification models to predict mushroom class from the features.
Classification Models#
Data Balancing#
From the distribution of mushroom class, we can see that there are about 11% more poisonous mushrooms in our dataset than edible ones. We will oversample the edible class to balance the dataset. Additionally, we will pass balanced class weights to the logistic regression model (the KNN classifier instead weights neighbors by distance).
fig, ax = plt.subplots(1, 2, figsize=(10, 15))
count_class = df["class"].value_counts().sort_index()  # counts for edible (0) and poisonous (1)
ax[0].pie(count_class, labels=['edible', 'poisonous'], autopct='%1.1f%%', colors=['#66b3ff','#ff9999'])
ax[0].set_title('Distribution of Mushroom Class')
df_poisonous = df[df["class"] == 1]
df_edible = df[df["class"] == 0]
edible_group_oversampled = resample(df_edible, replace=True, n_samples=len(df_poisonous.index), random_state=0)
df = pd.concat([edible_group_oversampled, df_poisonous])
df = df.reset_index(drop=True)
count_class = df["class"].value_counts().sort_index()  # recount after oversampling
ax[1].pie(count_class, labels=['edible', 'poisonous'], autopct='%1.1f%%', colors=['#66b3ff','#ff9999'])
ax[1].set_title('Oversampled Distribution of Mushroom Class')
plt.show()
Training and Testing Sets#
features = df.columns.tolist()
features.pop(0)  # remove "class" (the first column) so only predictor columns remain
target = "class"
X = df[features]
y = df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=True, random_state=1)
Logistic Regression Model#
log_reg = LogisticRegression(max_iter=1000, class_weight="balanced")
log_reg.fit(X_train, y_train)
print(f"Accuracy for Logistic Regression: {log_reg.score(X_test, y_test):.4f}")
Accuracy for Logistic Regression: 0.7305
KNN Classification Model#
We determine the best value of k and then fit the final KNN classification model.
best_k = 0
best_accuracy = 0
for k in range(1, 6):
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    knn.fit(X_train, y_train)
    accuracy = knn.score(X_test, y_test)
    print(f'k = {k}: {accuracy:.4f}')
    if accuracy > best_accuracy:
        best_k = k
        best_accuracy = accuracy
print()
print(f"Best k is k = {best_k} with an accuracy of {best_accuracy:.4f}.")
knn = KNeighborsClassifier(n_neighbors=best_k, weights="distance")
knn.fit(X_train, y_train)
print()
print(f"Accuracy for KNN Classification: {knn.score(X_test, y_test):.4f}")
k = 1: 0.9969
k = 2: 0.9969
k = 3: 0.9960
k = 4: 0.9963
k = 5: 0.9954
Best k is k = 1 with an accuracy of 0.9969.
Accuracy for KNN Classification: 0.9969
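The introduction also mentions cross-validation. As a sanity check on picking k from a single train/test split, k could instead be chosen by cross-validated accuracy on the training set alone. A minimal sketch, assuming the same X_train and y_train as above:

from sklearn.model_selection import cross_val_score
# Score each candidate k by 5-fold cross-validated accuracy on the training set.
cv_scores = {}
for k in range(1, 6):
    knn_cv = KNeighborsClassifier(n_neighbors=k, weights="distance")
    cv_scores[k] = cross_val_score(knn_cv, X_train, y_train, cv=5).mean()
best_k_cv = max(cv_scores, key=cv_scores.get)
print(f"Best k by 5-fold cross-validation: {best_k_cv}")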
Comparison of the Models#
cnf_matrix_lr = metrics.confusion_matrix(y_test, log_reg.predict(X_test))
cnf_matrix_knn = metrics.confusion_matrix(y_test, knn.predict(X_test))
fig, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.heatmap(pd.DataFrame(cnf_matrix_lr), annot=True, cmap="YlGnBu", fmt='g', ax=ax[0])
ax[0].set_title('Logistic Regression Confusion Matrix')
ax[0].set_xlabel('Predicted label')
ax[0].set_ylabel('Actual label')
sns.heatmap(pd.DataFrame(cnf_matrix_knn), annot=True, cmap="YlGnBu", fmt='g', ax=ax[1])
ax[1].set_title('KNN Confusion Matrix')
ax[1].set_xlabel('Predicted label')
ax[1].set_ylabel('Actual label')
plt.tight_layout()
plt.show()
TN = cnf_matrix_lr[0][0]  # edible correctly predicted as edible
FN = cnf_matrix_lr[1][0]  # poisonous incorrectly predicted as edible
TP = cnf_matrix_lr[1][1]  # poisonous correctly predicted as poisonous
FP = cnf_matrix_lr[0][1]  # edible incorrectly predicted as poisonous
print(metrics.classification_report(y_test, log_reg.predict(X_test)))
print()
print(f"The logistic regression model predicted {TN} edible mushrooms and {FN} poisonus mushrooms as edible.")
print(f"The logistic regression model predicted {TP} poisonous mushrooms and {FP} edible mushrooms as poisonous.")
              precision    recall  f1-score   support

           0       0.71      0.78      0.74     17047
           1       0.75      0.68      0.72     16841

    accuracy                           0.73     33888
   macro avg       0.73      0.73      0.73     33888
weighted avg       0.73      0.73      0.73     33888
The logistic regression model predicted 13224 edible mushrooms and 5310 poisonous mushrooms as edible.
The logistic regression model predicted 11531 poisonous mushrooms and 3823 edible mushrooms as poisonous.
TN = cnf_matrix_knn[0][0]  # edible correctly predicted as edible
FN = cnf_matrix_knn[1][0]  # poisonous incorrectly predicted as edible
TP = cnf_matrix_knn[1][1]  # poisonous correctly predicted as poisonous
FP = cnf_matrix_knn[0][1]  # edible incorrectly predicted as poisonous
print(metrics.classification_report(y_test, knn.predict(X_test)))
print()
print(f"The KNN classification model predicted {TN} edible mushrooms and {FN} poisonus mushrooms as edible.")
print(f"The KNN classification model predicted {TP} poisonous mushrooms and {FP} edible mushrooms as poisonous.")
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     17047
           1       1.00      1.00      1.00     16841

    accuracy                           1.00     33888
   macro avg       1.00      1.00      1.00     33888
weighted avg       1.00      1.00      1.00     33888
The KNN classification model predicted 17006 edible mushrooms and 65 poisonous mushrooms as edible.
The KNN classification model predicted 16776 poisonous mushrooms and 41 edible mushrooms as poisonous.
Conclusion#
From the classification reports and confusion matrices, we can conclude that the best model for classifying edible and poisonous mushrooms is the KNN classification model with k = 1, which reaches an accuracy of 0.9969. We can apply this model to new mushroom data to classify each mushroom as edible or poisonous from the selected features, all of which we chose to be observable or measurable.
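As an illustration, the fitted model could classify a new mushroom encoded the same way as the training data. A minimal sketch, where the measurements and the chosen dummy column are hypothetical and every unset dummy feature defaults to 0:

# Build a one-row frame with the same feature columns used in training.
new_mushroom = pd.DataFrame([{col: 0 for col in X_train.columns}])
new_mushroom["cap-diameter"] = 6.2  # hypothetical measurement
new_mushroom["stem-height"] = 5.1   # hypothetical measurement
new_mushroom["stem-width"] = 10.0   # hypothetical measurement
new_mushroom["cap-shape_f"] = 1     # hypothetical choice among the dummy columns above
prediction = knn.predict(new_mushroom)[0]
print("poisonous" if prediction == 1 else "edible")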
Now, we can use a correlation matrix to see which other features are highly correlated.
Correlation Matrix#
corr_labels = ["class", "cap-diameter", "does-bruise-or-bleed", "stem-height", "stem-width", "has-ring"]
corr = df[corr_labels].corr()
sns.heatmap(corr, annot=False)
plt.title("Correlation Matrix")
plt.show()
We can see that there is a noticeable correlation between stem height, stem width, and cap diameter. We can explore this further with some regression models.
Regression Models#
Data Preprocessing#
features = ["cap-diameter", "stem-height", "stem-width"]
fig = plt.figure()
plt.scatter(df["stem-height"], df["cap-diameter"], label='Stem Height')
plt.scatter(df["stem-width"], df["cap-diameter"], label='Stem Width')
plt.title('Stem Dimensions with Cap Diameter')
plt.xlabel('Stem Dimensions (Height/Width)')
plt.ylabel('Cap Diameter')
plt.legend()
plt.show()
There seems to be a clear divide between cap diameters. This could be the result of a different species or type of mushroom in the dataset. For simplicity, we will omit this data when training the model, since we are not losing a significant number of data points.
df_no_outliers = df[df["cap-diameter"] < 35].copy()  # .copy() avoids a SettingWithCopyWarning when scaling below
Data Scaling#
From the graph of unscaled mushroom dimensions, we can see that the data is heavily right-skewed, similar to a Pareto distribution. Therefore, to minimize the effect of outliers while scaling, we will use robust scaling.
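By default, sklearn's RobustScaler subtracts each feature's median and divides by its interquartile range, so extreme values have little influence on the resulting scale:

\[ x' = \frac{x - \operatorname{median}(x)}{\mathrm{IQR}(x)} \]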
fig,ax = plt.subplots(1, 2, figsize=(10, 5))
ax[0].hist(df_no_outliers[features], label=features)
ax[0].legend()
ax[0].set_title("Unscaled Mushroom Dimensions")
scaler = RobustScaler()
df_no_outliers[features] = scaler.fit_transform(df_no_outliers[features])
ax[1].hist(df_no_outliers[features], label=features)
ax[1].legend()
ax[1].set_title("Robust Scaled Mushroom Dimensions")
plt.show()
Training and Testing Sets#
features = ["stem-height", "stem-width"]
target = "cap-diameter"
X = df_no_outliers[features]
y = df_no_outliers[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, shuffle=True, random_state=0)
Linear Regression Model#
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
kf = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(lin_reg, X, y, cv=kf, scoring="r2")  # 5-fold cross-validated R^2
print("R2 score for linear regression:")
print(f"-train score: {lin_reg.score(X_train, y_train):.4f} \n-test score: {lin_reg.score(X_test, y_test):.4f}")
print(f"-5-fold CV mean: {scores.mean():.4f}")
R2 score for linear regression:
-train score: 0.6297
-test score: 0.6446
KNN Regression Model#
knn = KNeighborsRegressor(n_neighbors=30)
knn.fit(X_train, y_train)
print(f"R2 score for KNN regression:")
print(f"-train score: {knn.score(X_train, y_train):.4f} \n-test score: {knn.score(X_test, y_test):.4f}")
R2 score for KNN regression:
-train score: 0.7979
-test score: 0.7849
Comparison#
fig = plt.figure(figsize=(15, 30))
xx, yy = np.meshgrid(np.linspace(-2, 8, 10), np.linspace(-2, 8, 10))
ax = fig.add_subplot(1,2,1, projection='3d')
ax.scatter(df_no_outliers[features[0]], df_no_outliers[features[1]], y, color='b', label='Data Points')
coefficients = lin_reg.coef_
intercept = lin_reg.intercept_
zz = coefficients[0] * xx + coefficients[1] * yy + intercept
ax.plot_surface(xx, yy, zz, alpha=0.5, color='r', label='Regression Plane')
ax.set_xlabel(f'{features[0]}')
ax.set_ylabel(f'{features[1]}')
ax.set_zlabel(target)
ax.set_title('Linear Regression: Stem Dimensions vs Cap Diameter')
ax = fig.add_subplot(1,2,2, projection='3d')
ax.scatter(df_no_outliers[features[0]], df_no_outliers[features[1]], y, label='Data Points')
Z = knn.predict(pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=features))  # pass a DataFrame so feature names match the fitted model
Z = Z.reshape(xx.shape)
ax.plot_surface(xx, yy, Z, alpha=0.5, cmap='viridis')
ax.set_xlabel(f'{features[0]}')
ax.set_ylabel(f'{features[1]}')
ax.set_zlabel(target)
ax.set_title('KNN Regression: Stem Dimensions vs Cap Diameter')
plt.show()
Conclusion#
We conclude that the KNN regression model with k = 30, which produces an \(R^2\) score of 0.7849, is the better predictor of mushroom cap diameter given stem height and width. It is still far from a perfect fit, possibly because the dataset mixes mushroom species that are not labeled. Therefore, we can only say that there is a moderate correlation between mushroom cap diameter and stem dimensions, and the relationship may depend on the species of mushroom or other factors.
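As with the classifier, the fitted pipeline could be applied to new measurements. One subtlety is that the scaler was fit on all three dimension columns, so new data must pass through it in that same shape. A minimal sketch, with hypothetical stem measurements:

# Include a placeholder cap-diameter so the frame matches the scaler's training columns.
new_dims = pd.DataFrame({"cap-diameter": [0.0],  # placeholder, ignored below
                         "stem-height": [6.0],   # hypothetical, original units
                         "stem-width": [12.0]})  # hypothetical, original units
scaled = pd.DataFrame(scaler.transform(new_dims), columns=new_dims.columns)
pred_scaled = knn.predict(scaled[["stem-height", "stem-width"]])[0]
# The prediction is in robust-scaled units; invert the scaling (stem slots are placeholders).
inv = scaler.inverse_transform([[pred_scaled, 0.0, 0.0]])
print(f"Predicted cap diameter: {inv[0][0]:.2f}")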
References#
Nearest Neighbors Classification and Regression
A special thanks to StackOverflow and Math 10 Notes.