{ "cells": [ { "cell_type": "markdown", "id": "103c7433", "metadata": {}, "source": [ "# Cancer Prediction Based on Medical and Lifestyle Information" ] }, { "cell_type": "markdown", "id": "07550d69", "metadata": {}, "source": [ "Author: Tran Hoang Thuc Cao (67121302)\n", "\n", "Course Project, UC Irvine, Math 10, S24\n", "\n", "I would not like to post my notebook on the course’s website. \n" ] }, { "cell_type": "markdown", "id": "95795340", "metadata": {}, "source": [ "# Introduction" ] }, { "cell_type": "markdown", "id": "25ef51dd", "metadata": {}, "source": [ "Cancer is one of the most feared and prevalent diseases in modern society, posing a significant threat to public health worldwide. Despite advances in medical research and technology, a complete cure for cancer remains elusive. This project explores the relationship between cancer and various medical and lifestyle factors, such as age, Body Mass Index (BMI), genetic risk, physical activity, alcohol intake, and gender. By analyzing these variables, we seek to understand how they influence the likelihood of developing cancer. The insights from this analysis could support earlier diagnosis and better prevention strategies, contributing to more effective management of the disease. Through data analysis, we hope to identify patterns and correlations that could inform public health policies and individual lifestyle choices, thereby reducing the overall burden of cancer in society."
] }, { "cell_type": "code", "execution_count": 1, "id": "4c3af818", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "from sklearn.model_selection import train_test_split, cross_val_score\n", "from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor\n", "from sklearn.linear_model import LinearRegression, Lasso, Ridge\n", "from sklearn.tree import DecisionTreeRegressor\n", "from sklearn.neighbors import KNeighborsRegressor\n", "from sklearn.cluster import KMeans\n", "from sklearn.metrics import accuracy_score, classification_report, confusion_matrix\n", "from sklearn.metrics import mean_squared_error, make_scorer" ] }, { "cell_type": "markdown", "id": "41005a40", "metadata": {}, "source": [ "# Data Description" ] }, { "cell_type": "code", "execution_count": 3, "id": "d0154585", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
 | Age | Gender | BMI | Smoking | GeneticRisk | PhysicalActivity | AlcoholIntake | CancerHistory | Diagnosis |
---|---|---|---|---|---|---|---|---|---|
1361 | 48 | 0 | 22.598563 | 0 | 0 | 5.971166 | 4.932070 | 0 | 0 |
932 | 55 | 1 | 21.894189 | 1 | 1 | 1.941288 | 0.015104 | 0 | 1 |
333 | 76 | 0 | 36.833073 | 0 | 0 | 5.300523 | 2.286152 | 0 | 0 |
38 | 46 | 0 | 32.378614 | 0 | 0 | 7.595201 | 2.087241 | 0 | 0 |
244 | 61 | 1 | 27.805548 | 0 | 0 | 9.517024 | 2.144437 | 0 | 0 |
642 | 71 | 1 | 21.507171 | 1 | 0 | 7.901314 | 4.366957 | 1 | 1 |
146 | 60 | 1 | 24.188361 | 0 | 0 | 6.919274 | 4.679971 | 0 | 0 |
701 | 53 | 0 | 17.873325 | 1 | 0 | 4.078942 | 3.989012 | 1 | 1 |
270 | 21 | 1 | 30.616589 | 1 | 1 | 6.477201 | 4.529720 | 0 | 1 |
291 | 75 | 1 | 20.615390 | 0 | 0 | 2.862799 | 4.345495 | 0 | 0 |
885 | 30 | 0 | 22.210136 | 0 | 0 | 1.287544 | 4.312171 | 0 | 0 |
966 | 20 | 1 | 30.664813 | 0 | 0 | 5.042635 | 1.596584 | 0 | 0 |
877 | 51 | 1 | 37.502080 | 0 | 0 | 0.395575 | 1.546341 | 0 | 1 |
572 | 75 | 1 | 35.090088 | 0 | 0 | 1.132312 | 1.449937 | 0 | 1 |
374 | 68 | 0 | 30.489733 | 1 | 0 | 0.582378 | 0.892253 | 0 | 1 |
1359 | 41 | 0 | 20.653133 | 0 | 1 | 9.706680 | 3.143607 | 0 | 0 |
815 | 69 | 0 | 27.150012 | 0 | 1 | 0.777124 | 2.266853 | 0 | 0 |
1395 | 52 | 1 | 32.860604 | 0 | 1 | 6.447567 | 4.646305 | 1 | 1 |
787 | 61 | 1 | 24.595972 | 0 | 2 | 1.565292 | 2.312284 | 0 | 1 |
245 | 63 | 1 | 32.005694 | 0 | 0 | 3.197208 | 3.261320 | 0 | 1 |