{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Analysis of Mushroom Attributes\n", "Author: Crystal Tran\n", "\n", "Course Project, UC Irvine, Math 10, S24\n", "\n", "I would like to post my notebook on the course’s website. [Yes]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Introduction**\n", "\n", "The ultimate goal of this analysis is to classify mushrooms as edible or poisonous given select features of the [mushroom dataset](https://www.kaggle.com/datasets/vishalpnaik/mushroom-classification-edible-or-poisonous/data). We will use feature selection, data scaling, data balancing, and cross-validation on different regression and classification models in order to create the best model to classify mushrooms. \n", "\n", "Additionally, we will explore the data to possibly determine other interesting correlations between the selected features." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Import**" ] }, { "cell_type": "code", "execution_count": 807, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from sklearn.preprocessing import RobustScaler\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression,LinearRegression\n", "from sklearn.neighbors import KNeighborsClassifier,KNeighborsRegressor\n", "from sklearn import metrics\n", "from sklearn.utils import resample" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## **Data Cleaning**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset has a total of **20** features in addition to mushroom class, but for simplicity, we will only use about half of those features. 
We keep features that are easy to determine by examining the mushroom, such as its shape, surface, and color, along with features that can be measured, such as cap diameter and stem dimensions." ] }, { "cell_type": "code", "execution_count": 808, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 | class | cap-diameter | cap-shape | cap-surface | cap-color | does-bruise-or-bleed | stem-height | stem-width | stem-surface | stem-color | has-ring |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | p | 15.26 | x | g | o | f | 16.95 | 17.09 | y | w | t |
1 | p | 16.60 | x | g | o | f | 17.99 | 18.19 | y | w | t |
2 | p | 14.07 | x | g | o | f | 17.80 | 17.74 | y | w | t |
3 | p | 14.17 | f | h | e | f | 15.77 | 15.98 | y | w | t |
4 | p | 14.64 | x | h | o | f | 16.53 | 17.20 | y | w | t |

 | class | cap-diameter | does-bruise-or-bleed | stem-height | stem-width | has-ring | cap-shape_c | cap-shape_f | cap-shape_o | cap-shape_p | ... | stem-color_g | stem-color_k | stem-color_l | stem-color_n | stem-color_o | stem-color_p | stem-color_r | stem-color_u | stem-color_w | stem-color_y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 15.26 | 0 | 16.95 | 17.09 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1 | 1 | 16.60 | 0 | 17.99 | 18.19 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
2 | 1 | 14.07 | 0 | 17.80 | 17.74 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
3 | 1 | 14.17 | 0 | 15.77 | 15.98 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
4 | 1 | 14.64 | 0 | 16.53 | 17.20 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
5 rows × 52 columns
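The encoded table above replaces the letter codes with numbers: `class` becomes 1 for `p`, the `t`/`f` columns become 1/0, and each multi-valued column is expanded into indicator columns such as `cap-shape_f`. The cell that produced it is not shown here, so the following is only a minimal sketch of one way such an encoding might be done with `pd.get_dummies`. The helper name `encode`, the `CATEGORICAL` and `BINARY` lists, and the three-row sample frame are assumptions made for illustration, not the notebook's own code.

```python
import pandas as pd

# Hypothetical three-row sample; values copied from rows 0, 1, and 3 of the
# selected-feature table, with only a subset of its columns.
raw = pd.DataFrame({
    'class': ['p', 'p', 'p'],
    'cap-diameter': [15.26, 16.60, 14.17],
    'cap-shape': ['x', 'x', 'f'],
    'does-bruise-or-bleed': ['f', 'f', 'f'],
    'has-ring': ['t', 't', 't'],
    'stem-color': ['w', 'w', 'w'],
})

# Assumed grouping of the selected columns by how they are encoded.
BINARY = ['does-bruise-or-bleed', 'has-ring']
CATEGORICAL = ['cap-shape', 'cap-surface', 'cap-color',
               'stem-surface', 'stem-color']


def encode(df):
    """Return a numeric copy of df (illustrative helper, not from the notebook)."""
    out = df.copy()
    # Poisonous ('p') maps to 1, anything else to 0.
    out['class'] = (out['class'] == 'p').astype(int)
    # True/false flags ('t'/'f') map to 1/0.
    for col in BINARY:
        if col in out.columns:
            out[col] = (out[col] == 't').astype(int)
    # One indicator column per category value, e.g. cap-shape_f.
    present = [c for c in CATEGORICAL if c in out.columns]
    return pd.get_dummies(out, columns=present, dtype=int)


encoded = encode(raw)
print(encoded.head())
```

On the full dataset the same call would create one indicator column for every value that occurs in each categorical feature, which is how a table with a dozen columns can grow to several dozen.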
\n", "