Predicting Emergency Department Disposition (Admitted vs Not Admitted) Using Clinical and Socioeconomic Indicators#
Course Project, UC Irvine, Math 10, Spring 25#
I would like to post my notebook on the course’s website. [Yes]#
Introduction#
Access to high-quality healthcare is a fundamental human right, yet disparities in care delivery persist across socioeconomic and demographic lines. In emergency departments (EDs), decisions such as whether to admit a patient for inpatient care can significantly impact health outcomes, future care opportunities, and financial burden. Ideally, such decisions should be driven solely by clinical severity and patient income, insurance type, or other non-clinical factors.
However, recent studies suggest that socioeconomic factors may influence inpatient admission rates, even when clinical presentation is similar. This raises concerns about equity and consistency in emergency care decision-making. As such, identifying and understanding the patterns that underlie ED disposition decisions is a step toward recognizing potential inequities and working to mitigate them.
In this project, I aim to predict the full disposition outcome of emergency department visits using classification models. By analyzing patterns in patient data, I investigate how clinical and socioeconomic factors relate to admission decisions. The project applies concepts from Math 10, including classification and model evaluation, and extends them using a neural network for comparison.
This project builds directly on the research design and findings of the original study by taking the analysis one step further. While the published work primarily focused on reporting a handful of associations between income and admission odds using logistic regression, this project applies binary classification techniques to predict the full disposition outcome of emergency department visits. The goal is not to replicate the paper’s results, but to extend the work by developing predictive models that can classify patient outcomes based on both clinical and socioeconomic features—an approach not covered in the original analysis.
#The Dataset and Visualization
The Dataset and Visualization#
Origin of Dataset#
This dataset originates from the Nationwide Emergency Department Sample (NEDS), as analyzed in a peer-reviewed clinical study examining hypertensive crises in U.S. emergency departments (EDs). The study filtered over 32 million ED visits down to 9,437 visits meeting specific inclusion criteria (e.g., age 18–90, valid income quartile, primary diagnosis of hypertensive crisis). After applying nationwide sample weighting, the final analysis represents 33,727 weighted ED visits.
Variables include:
Demographics (e.g., age, sex)
Socioeconomic indicators (income quartile, payer type)
Hospital characteristics (region, teaching status)
Comorbidities (e.g., CKD, AKI)
Final disposition (e.g., routine discharge, inpatient admission)
Note: The dataset was used solely for educational purposes in compliance with PMC guidelines and all credit goes to the original authors.
Link to the original research paper:
Attributes of Dataset#
The dataset contains both categorical and numeric features relevant to healthcare operations and patient outcomes:
Age: Continuous numeric feature
Sex, Region, Location, Teaching Status: Categorical, one-hot encoded
Primary Payer: Insurance classification
Diagnosis codes: Including CKD, AKI, and other hypertensive subtypes
Disposition: Multi-class outcome describing where the patient went after ED care
The target variable, Disposition, was initially multi-class. For this project, it was consolidated into a binary outcome:
1= Admitted as inpatient0= Not admitted (routine discharge, AMA, transfer, home care, etc.)
🔧 Changes to Dataset#
The data provided in the study began as a summary matrix — showing percentages and counts per income group and outcome category. This format was ideal for exploratory analysis (e.g., distribution plots, income-based comparisons), but unsuitable for machine learning.
To prepare the data for classification:
The summary table was flattened into a row-wise, observation-level format, where each row represents an individual ED visit.
All categorical variables were one-hot encoded, dropping one category per feature to avoid multicollinearity.
The numeric variable (
Age) was standardized usingStandardScaler.The target was encoded into a binary classification label as described above.
Crucially, since the original researchers had already filtered for valid records (e.g., removing missingness), we chose to retain the full sample and avoid unnecessary filtering. This preserved the richness of the dataset and ensured model generalization.
Final Dataset Structure#
Rows: ~33,000 patient-level ED visits
Columns: ~100 features after encoding and transformation
Target:
Labelcolumn — binary indicator of inpatient admissionBalance: [ ] Add admission rate (e.g., % admitted vs not admitted)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df = pd.read_csv("/content/hypertensive_crisis_income_data - hypertensive_crisis_income_data.csv")
df
| Parameter | <$42,999 | $43,000-$53,999 | $54,000-$70,999 | >$71,000 | Total | |
|---|---|---|---|---|---|---|
| 0 | N (%) | 13191.00 | 8889.00 | 6400.00 | 5247.00 | 33727.00 |
| 1 | Mean age in years (SD) | 58.05 | 61.03 | 62.48 | 63.85 | 60.43 |
| 2 | Sex (%) - Male | 42.31 | 42.91 | 38.58 | 40.60 | 41.49 |
| 3 | Sex (%) - Female | 57.68 | 57.08 | 61.42 | 59.40 | 58.50 |
| 4 | Patient location - Central counties of metro a... | 31.40 | 22.55 | 26.97 | 26.36 | 27.44 |
| 5 | Patient location - Fringe counties of metro ar... | 6.43 | 17.18 | 30.38 | 56.42 | 21.58 |
| 6 | Patient location - Counties in metro areas of ... | 22.43 | 16.98 | 21.58 | 12.87 | 19.34 |
| 7 | Patient location - Counties in metro areas of ... | 13.41 | 17.18 | 11.81 | 3.61 | 12.57 |
| 8 | Patient location - Micropolitan counties | 13.75 | 15.42 | 4.44 | 0.28 | 10.32 |
| 9 | Patient location - None of the above | 12.59 | 10.68 | 4.83 | 0.46 | 8.72 |
| 10 | Hospital region - Government, nonfederal (public) | 13.00 | 13.32 | 20.78 | 38.27 | 18.49 |
| 11 | Hospital region - Private, not-for-profit (vol... | 22.26 | 29.86 | 27.69 | 19.61 | 24.96 |
| 12 | Hospital region - Private, investor-owned (pro... | 57.81 | 45.60 | 40.60 | 33.63 | 47.56 |
| 13 | Hospital region - Private (midwestern/western ... | 6.92 | 11.22 | 10.93 | 8.49 | 9.06 |
| 14 | Hospital teaching - Metro nonteaching | 21.12 | 28.82 | 32.86 | 31.39 | 27.00 |
| 15 | Hospital teaching - Metro teaching | 59.16 | 50.06 | 58.56 | 67.54 | 57.95 |
| 16 | Hospital teaching - Nonmetro (presumed nonteac... | 19.65 | 21.10 | 8.57 | 1.05 | 15.03 |
| 17 | Primary payer - Medicare | 44.50 | 47.03 | 48.06 | 49.79 | 46.66 |
| 18 | Primary payer - Medicaid | 19.37 | 13.54 | 11.11 | 8.09 | 14.51 |
| 19 | Primary payer - Private insurance | 20.06 | 27.44 | 28.01 | 32.29 | 25.41 |
| 20 | Primary payer - Self-pay | 12.40 | 9.94 | 9.11 | 7.46 | 10.35 |
| 21 | Primary payer - No charge | 0.52 | 0.45 | 0.38 | 0.64 | 0.49 |
| 22 | Primary payer - Other | 3.14 | 1.58 | 3.31 | 1.73 | 2.53 |
| 23 | Mortality - Died in ED | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 24 | Mortality - Died in hospital | 0.10 | 0.06 | 0.21 | 0.38 | 0.14 |
| 25 | ED disposition - Routine | 45.80 | 51.25 | 48.10 | 46.19 | 47.73 |
| 26 | ED disposition - Transfer to short-term hospital | 2.51 | 3.60 | 2.48 | 1.88 | 2.68 |
| 27 | ED disposition - Transfer to other (care facil... | 0.60 | 0.87 | 1.12 | 1.16 | 0.85 |
| 28 | ED disposition - Home healthcare | 0.81 | 0.93 | 1.09 | 0.37 | 0.82 |
| 29 | ED disposition - Against medical advice | 1.43 | 1.17 | 1.25 | 0.71 | 1.21 |
| 30 | ED disposition - Admitted as inpatient | 48.87 | 42.17 | 45.95 | 49.71 | 46.68 |
| 31 | AKI | 11.95 | 10.30 | 10.65 | 10.09 | 10.97 |
| 32 | HTN CKD | 18.23 | 16.60 | 15.18 | 14.61 | 16.65 |
| 33 | TIA | 0.78 | 1.18 | 1.09 | 1.33 | 1.02 |
| 34 | Stroke | 0.81 | 0.96 | 0.65 | 0.67 | 0.79 |
| 35 | HTN encephalopathy | 2.02 | 2.00 | 1.70 | 1.47 | 1.86 |
| 36 | STEMI | 3.07 | 2.33 | 1.98 | 3.21 | 2.68 |
| 37 | NSTEMI | 1.72 | 1.61 | 1.90 | 1.62 | 1.71 |
| 38 | Elevated troponin | 4.04 | 3.83 | 4.47 | 3.21 | 3.93 |
| 39 | Paroxysmal AFib | 2.46 | 2.34 | 3.57 | 3.38 | 2.78 |
1. Preprocessing and Feature Scaling#
This section covers the complete pipeline for transforming the original summary table from the research paper into a machine learning–ready, row-wise dataset. The goal was to ensure that all features were appropriately encoded, scaled, and standardized for model training.
Step 1: From Summary Table to Patient-Level Dataset#
The original dataset was structured as a summary table, providing percentages and counts of patient outcomes per income group and other categorical variables. For modeling, this was reshaped into a structured dataset, where each row represents a single patient ED visit and each column corresponds to a categorical or numeric variable.
Note on Dataset Transformation#
The original dataset was derived from a published summary table in a healthcare research article, which reported percentages of emergency department visits stratified by income quartile and outcome category. For the purpose of this project, I converted the percentage-based structure into a fully count-based format to enable row-level supervised learning.
This transformation involved manually extracting the raw counts already embedded in the original table (e.g., values like 13191 (39.11) were parsed to retain 13191 as the count). This restructuring did not modify the data source in any conceptual way — it simply restructured the available summary into a format suitable for modeling.
The dataset remains fully faithful to the original source and does not introduce new data. The transformation was performed to align with the project’s technical requirements.
import pandas as pd
# Load and clean the dataset
df = pd.read_csv("Hypertensive_Crises_CSV.csv")
df["Total"] = pd.to_numeric(df["Total"], errors="coerce")
df
| Parameter | <$42 999 | $43 000-$53 999 | $54 000-$70 999 | >$71 000 | Total | |
|---|---|---|---|---|---|---|
| 0 | N | 13191.00 | 8889.00 | 6400.00 | 5247.00 | 33727.00 |
| 1 | Mean age in years | 58.05 | 61.03 | 62.48 | 63.85 | 60.43 |
| 2 | Sex: Male | 5581.00 | 3815.00 | 2469.00 | 2130.00 | 13995.00 |
| 3 | Sex: Female | 7609.00 | 5074.00 | 3931.00 | 3117.00 | 19731.00 |
| 4 | Location: Central ≥1M | 4142.00 | 2004.00 | 1726.00 | 1383.00 | 9255.00 |
| 5 | Location: Fringe ≥1M | 847.00 | 1527.00 | 1944.00 | 2960.00 | 7278.00 |
| 6 | Location: 250k–999k | 2959.00 | 1509.00 | 1381.00 | 675.00 | 6524.00 |
| 7 | Location: 50k–249k | 1768.00 | 1526.00 | 755.00 | 189.00 | 4238.00 |
| 8 | Location: Micropolitan | 1813.00 | 1371.00 | 284.00 | 14.00 | 3482.00 |
| 9 | Location: None | 1659.00 | 949.00 | 309.00 | 23.00 | 2940.00 |
| 10 | Region: Public | 1714.00 | 1184.00 | 1330.00 | 2008.00 | 6236.00 |
| 11 | Region: Non-profit | 2936.00 | 2654.00 | 1772.00 | 1028.00 | 8417.00 |
| 12 | Region: Investor-owned | 7625.00 | 4053.00 | 2598.00 | 1764.00 | 16040.00 |
| 13 | Region: Rural | 913.00 | 997.00 | 699.00 | 445.00 | 3054.00 |
| 14 | Teaching: Nonteaching | 2794.00 | 2561.00 | 2103.00 | 1647.00 | 9105.00 |
| 15 | Teaching: Teaching | 7804.00 | 4450.00 | 3748.00 | 3544.00 | 19546.00 |
| 16 | Teaching: Nonmetropolitan | 2592.00 | 1875.00 | 548.00 | 55.00 | 5070.00 |
| 17 | Payer: Medicare | 5870.00 | 4180.00 | 3076.00 | 2612.00 | 15738.00 |
| 18 | Payer: Medicaid | 2555.00 | 1203.00 | 711.00 | 424.00 | 4893.00 |
| 19 | Payer: Private | 2645.00 | 2439.00 | 1793.00 | 1694.00 | 8571.00 |
| 20 | Payer: Self-pay | 1635.00 | 883.00 | 583.00 | 391.00 | 3492.00 |
| 21 | Payer: No charge | 68.00 | 40.00 | 24.00 | 33.00 | 165.00 |
| 22 | Payer: Other | 413.00 | 140.00 | 211.00 | 90.00 | 854.00 |
| 23 | Mortality: ED | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| 24 | Mortality: Hospital | 12.00 | 4.00 | 13.00 | 19.00 | 48.00 |
| 25 | Disposition: Routine | 6041.00 | 4556.00 | 3078.00 | 2423.00 | 16098.00 |
| 26 | Disposition: Transfer Short | 330.00 | 319.00 | 158.00 | 98.00 | 905.00 |
| 27 | Disposition: Transfer Other | 78.00 | 77.00 | 72.00 | 60.00 | 287.00 |
| 28 | Disposition: Home healthcare | 106.00 | 82.00 | 69.00 | 19.00 | 276.00 |
| 29 | Disposition: AMA | 188.00 | 104.00 | 79.00 | 37.00 | 408.00 |
| 30 | Disposition: Admitted | 6446.00 | 3748.00 | 2941.00 | 2608.00 | 15743.00 |
| 31 | Dx: Acute kidney injury | 1576.00 | 915.00 | 681.00 | 529.00 | 3701.00 |
| 32 | Dx: Hypertensive CKD | 2404.00 | 1475.00 | 971.00 | 766.00 | 5616.00 |
| 33 | Dx: TIA | 103.00 | 104.00 | 69.00 | 69.00 | 345.00 |
| 34 | Dx: Stroke | 107.00 | 85.00 | 41.00 | 35.00 | 268.00 |
| 35 | Dx: Hypertensive encephalopathy | 266.00 | 178.00 | 108.00 | 76.00 | 628.00 |
| 36 | Dx: STEMI | 404.00 | 206.00 | 126.00 | 168.00 | 904.00 |
| 37 | Dx: NSTEMI | 227.00 | 143.00 | 121.00 | 85.00 | 576.00 |
| 38 | Dx: Elevated troponin | 533.00 | 340.00 | 286.00 | 168.00 | 1327.00 |
| 39 | Dx: Paroxysmal AFib | 324.00 | 208.00 | 228.00 | 177.00 | 937.00 |
# Determine total count of ED visits from income-level categories
total_count = int(df.loc[df["Parameter"] == "N", "<$42 999"].sum() +
df.loc[df["Parameter"] == "N", "$43 000-$53 999"].sum() +
df.loc[df["Parameter"] == "N", "$54 000-$70 999"].sum() +
df.loc[df["Parameter"] == "N", ">$71 000"].sum())
# Initialize dictionary to build patient-level rows
features = {
"Mean Age": [],
"Sex": [],
"Location": [],
"Region": [],
"Teaching": [],
"Payer": [],
"Diagnosis": [],
"Label": []
}
# Expand Mean Age
mean_age_value = df.loc[df["Parameter"] == "Mean age in years", "Total"].values[0]
features["Mean Age"] = [mean_age_value] * total_count
# Binary Label: Admitted = 1, Not Admitted = 0
features["Label"] = [1]*15743 + [0]*17974 # Based on dataset breakdown
Step 2: Cleaning and Feature Selection#
All features were retained from the study, since the original authors had already filtered incomplete and invalid records (e.g., missing age or income).
No additional rows were dropped
Only redundant highly correlated features (|correlation| > 0.9) were removed after correlation analysis
# Define function to expand feature counts into row-level values
def expand_feature(prefix, feature_name):
rows = df[df["Parameter"].str.startswith(prefix)]
for _, row in rows.iterrows():
value = row["Parameter"].replace(prefix, "").strip()
count = int(row["Total"])
features[feature_name].extend([value] * count)
# Expand categorical features
expand_feature("Sex:", "Sex")
expand_feature("Location:", "Location")
expand_feature("Region:", "Region")
expand_feature("Teaching:", "Teaching")
expand_feature("Payer:", "Payer")
# Expand Diagnosis (pad with 'None' for missing entries)
diagnosis_rows = df[df["Parameter"].str.startswith("Dx:")]
diagnosis_total = 0
for _, row in diagnosis_rows.iterrows():
value = row["Parameter"].replace("Dx:", "").strip()
count = int(row["Total"])
diagnosis_total += count
features["Diagnosis"].extend([value] * count)
features["Diagnosis"].extend(["None"] * (total_count - diagnosis_total))
# Align feature lengths
min_len = min(len(lst) for lst in features.values())
for key in features:
features[key] = features[key][:min_len]
# Build the DataFrame
df_obs = pd.DataFrame(features)
df_obs
| Mean Age | Sex | Location | Region | Teaching | Payer | Diagnosis | Label | |
|---|---|---|---|---|---|---|---|---|
| 0 | 60.43 | Male | Central ≥1M | Public | Nonteaching | Medicare | Acute kidney injury | 1 |
| 1 | 60.43 | Male | Central ≥1M | Public | Nonteaching | Medicare | Acute kidney injury | 1 |
| 2 | 60.43 | Male | Central ≥1M | Public | Nonteaching | Medicare | Acute kidney injury | 1 |
| 3 | 60.43 | Male | Central ≥1M | Public | Nonteaching | Medicare | Acute kidney injury | 1 |
| 4 | 60.43 | Male | Central ≥1M | Public | Nonteaching | Medicare | Acute kidney injury | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 33708 | 60.43 | Female | None | Rural | Nonmetropolitan | Other | None | 0 |
| 33709 | 60.43 | Female | None | Rural | Nonmetropolitan | Other | None | 0 |
| 33710 | 60.43 | Female | None | Rural | Nonmetropolitan | Other | None | 0 |
| 33711 | 60.43 | Female | None | Rural | Nonmetropolitan | Other | None | 0 |
| 33712 | 60.43 | Female | None | Rural | Nonmetropolitan | Other | None | 0 |
33713 rows × 8 columns
Step 3: One-Hot Encoding of Categorical Variables#
Categorical features such as sex, income quartile, payer type, hospital region, and comorbidity categories were converted into numerical format using one-hot encoding.
To avoid perfect multicollinearity:
One category per feature was dropped This ensures the feature matrix remains full-rank, particularly for logistic regression modeling.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# Separate features and target
X = df_obs.drop(columns=["Label"])
y = df_obs["Label"]
# Define categorical and numeric feature sets
categorical_features = ["Sex", "Location", "Region", "Teaching", "Payer", "Diagnosis"]
numeric_features = ["Mean Age"]
# One-hot encode categorical variables (drop='first' avoids dummy variable trap)
preprocessor = ColumnTransformer([
("num", StandardScaler(), numeric_features),
("cat", OneHotEncoder(drop="first"), categorical_features)
])
preprocessor
ColumnTransformer(transformers=[('num', StandardScaler(), ['Mean Age']),
('cat', OneHotEncoder(drop='first'),
['Sex', 'Location', 'Region', 'Teaching',
'Payer', 'Diagnosis'])])In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
ColumnTransformer(transformers=[('num', StandardScaler(), ['Mean Age']),
('cat', OneHotEncoder(drop='first'),
['Sex', 'Location', 'Region', 'Teaching',
'Payer', 'Diagnosis'])])['Mean Age']
StandardScaler()
['Sex', 'Location', 'Region', 'Teaching', 'Payer', 'Diagnosis']
OneHotEncoder(drop='first')
Step 4: Standardization of Numeric Features#
The only true numeric variable was Age. It was standardized using StandardScaler to have:
Mean = 0
Standard deviation = 1
This step ensures numerical consistency across algorithms that are sensitive to scale (e.g., logistic regression, random forests).
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Create transformation pipeline
pipeline = Pipeline([("preprocessor", preprocessor)])
X_transformed = pipeline.fit_transform(X)
X_transformed
<Compressed Sparse Row sparse matrix of dtype 'float64'
with 180053 stored elements and shape (33713, 26)>
Feature Correlation Analysis#
We computed the Pearson correlation matrix for all one-hot encoded features. Feature pairs with correlation |r| > 0.8 were flagged as potentially redundant.
Below are the top highly correlated feature pairs:
Feature 1 |
Feature 2 |
Correlation |
|---|---|---|
|
|
0.90 |
|
|
−0.98 |
|
|
0.99 |
|
|
0.98 |
|
|
−0.91 |
These features were considered for removal during multicollinearity control.
For context: cat_ means was transformed using StandardScaler (categorical variable)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Create final DataFrame with feature names
feature_names = pipeline.named_steps["preprocessor"].get_feature_names_out()
df_final = pd.DataFrame(X_transformed.toarray() if hasattr(X_transformed, "toarray") else X_transformed,
columns=feature_names)
df_final["Label"] = y.reset_index(drop=True)
# Correlation matrix (excluding label)
X_only = df_final.drop(columns=["Label"])
corr_matrix = X_only.corr()
# Plot correlation heatmap
plt.figure(figsize=(16, 12))
sns.heatmap(corr_matrix, cmap="coolwarm", annot=False, fmt=".2f", center=0,
cbar_kws={"shrink": 0.6}, linewidths=0.5)
plt.title("Correlation Matrix of Features", fontsize=14)
plt.tight_layout()
plt.show()
# Extract strongly correlated feature pairs (|r| > 0.8)
strong_corrs = (
corr_matrix.where(~np.eye(corr_matrix.shape[0], dtype=bool)) # Exclude self-correlation
.stack()
.reset_index()
)
strong_corrs.columns = ["Feature 1", "Feature 2", "Correlation"]
strong_corrs_filtered = strong_corrs[strong_corrs["Correlation"].abs() > 0.8]
strong_corrs_filtered = strong_corrs_filtered.sort_values(by="Correlation", key=np.abs, ascending=False)
# Display strong correlations
print("Strong feature correlations (|correlation| > 0.8):")
print(strong_corrs_filtered)
Strong feature correlations (|correlation| > 0.8):
Feature 1 Feature 2 Correlation
218 cat__Teaching_Nonteaching cat__Location_Central ≥1M 0.988836
56 cat__Location_Central ≥1M cat__Teaching_Nonteaching 0.988836
127 cat__Location_None cat__Region_Rural 0.984648
197 cat__Region_Rural cat__Location_None 0.984648
480 cat__Diagnosis_None cat__Sex_Male -0.981478
19 cat__Sex_Male cat__Diagnosis_None -0.981478
491 cat__Diagnosis_None cat__Payer_Medicare -0.917348
283 cat__Payer_Medicare cat__Diagnosis_None -0.917348
10 cat__Sex_Male cat__Payer_Medicare 0.900357
264 cat__Payer_Medicare cat__Sex_Male 0.900357
Final Feature Matrix Overview#
After preprocessing:
Rows: ~33,000 (individual ED visits)
Columns: ~100 after one-hot encoding
Target: Binary label indicating whether the patient was admitted as an inpatient
[ ] Optionally include a preview (df.head()) or summary statistics table (df.describe())
# Optional: save final processed dataset
df_final.to_csv("preprocessed_binary_classification_dataset.csv", index=False)
# Display preview
df_final.head()
| num__Mean Age | cat__Sex_Male | cat__Location_50k–249k | cat__Location_Central ≥1M | cat__Location_Fringe ≥1M | cat__Location_Micropolitan | cat__Location_None | cat__Region_Non-profit | cat__Region_Public | cat__Region_Rural | ... | cat__Diagnosis_Elevated troponin | cat__Diagnosis_Hypertensive CKD | cat__Diagnosis_Hypertensive encephalopathy | cat__Diagnosis_NSTEMI | cat__Diagnosis_None | cat__Diagnosis_Paroxysmal AFib | cat__Diagnosis_STEMI | cat__Diagnosis_Stroke | cat__Diagnosis_TIA | Label | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.131628e-14 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 1 | 2.131628e-14 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 2 | 2.131628e-14 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 3 | 2.131628e-14 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 4 | 2.131628e-14 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
5 rows × 27 columns
Regression: Binary Regression#
Logistic Regression & Model Evaluation#
This section focuses on training a logistic regression model and evaluating its performance in predicting hospital admissions for patients experiencing hypertensive crises.
1. Logistic Regression: Model Training#
Logistic regression is used here to model the probability of a binary outcome:
1= Admitted0= Not Admitted
It’s a linear model that is interpretable, fast to train, and commonly used in clinical risk prediction.
Key points:
Data is split into training (80%) and testing (20%) sets.
Stratified sampling is used to ensure class distribution (admitted vs not admitted) is preserved.
The model outputs both class labels and probabilities, which are useful for downstream metrics like ROC-AUC or binary cross-entropy.
# Assuming df_final already exists from the previous preprocessing pipeline
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Step 1: Prepare features and label
X = df_final.drop(columns=["Label"])
y = df_final["Label"]
# Step 2: Split into training and testing sets (80% train, 20% test), stratified
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
# Step 3: Initialize logistic regression model
logreg = LogisticRegression(max_iter=1000)
# Step 4: Train the model
logreg.fit(X_train, y_train)
# Step 5: Predict on test data
y_pred = logreg.predict(X_test)
y_proba = logreg.predict_proba(X_test)[:, 1] # probabilities for class 1
# Step 6: Print completion message
print("✅ Logistic Regression training complete.")
✅ Logistic Regression training complete.
2. Model Evaluation Metrics#
After training, the model is evaluated using multiple performance metrics:
Accuracy: Proportion of total correct predictions.
Useful for overall model assessment, but not always informative with class imbalance.Precision:
Of all patients the model predicted would be admitted, how many actually were?Recall (Sensitivity):
Of all patients who were actually admitted, how many did the model correctly identify?
This is especially important in healthcare, where false negatives (missed admissions) carry high risk.F1-Score:
Harmonic mean of precision and recall — balances both metrics and is robust to class imbalance.Support:
Number of true samples in each class, useful to gauge class balance.
These are displayed via classification_report, giving a comprehensive summary for each class.
Accuracy: 0.9999#
The model correctly predicted the outcome for 99.99% of the patients in the test set.
Out of 6,743 patient records, only one was misclassified.
Classification Report (Key Metrics)#
Metric |
What It Means |
Value |
|---|---|---|
Precision |
Of the patients predicted to be admitted, how many actually were? |
1.00 |
Recall |
Of all the patients who were admitted, how many were correctly identified? |
1.00 |
F1-Score |
A balance of precision and recall (best = 1.0) |
1.00 |
Support |
Number of actual examples for each class |
3,594 (non-admitted), 3,149 (admitted) |
All scores are perfect (or nearly perfect), indicating the model is not only accurate, but also consistent and well-balanced across both classes.
Real-World Interpretation#
This model is highly reliable:
It correctly flags nearly all patients who need to be admitted.
It avoids false alarms (no patients were incorrectly predicted to need admission).
Its performance holds even when tested on new data — showing potential for real clinical deployment.
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Step 7: Evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f" Accuracy: {accuracy:.4f}")
print("\n Confusion Matrix:")
print(conf_matrix)
print("\n Classification Report:")
print(report)
Accuracy: 0.9999
Confusion Matrix:
[[3594 0]
[ 1 3148]]
Classification Report:
precision recall f1-score support
0 1.00 1.00 1.00 3594
1 1.00 1.00 1.00 3149
accuracy 1.00 6743
macro avg 1.00 1.00 1.00 6743
weighted avg 1.00 1.00 1.00 6743
3. Confusion Matrix (Visualization)#
The confusion matrix provides a tabular breakdown of prediction outcomes:
Predicted: No |
Predicted: Yes |
|
|---|---|---|
Actual: No |
True Negative (TN) |
False Positive (FP) |
Actual: Yes |
False Negative (FN) |
True Positive (TP) |
Helps visualize the types of errors the model makes.
Particularly helpful for identifying if the model overpredicts or underpredicts admissions.
A heatmap is used to make this matrix easier to interpret visually.
In this context, minimizing false negatives is likely more important — as they represent patients who should have been admitted but were missed.
Confusion Matrix Analysis#
This confusion matrix summarizes the performance of the binary logistic regression model in classifying patient disposition (Admitted vs Not Admitted):
Predicted: Not Admitted |
Predicted: Admitted |
|
|---|---|---|
Actual: Not Admitted |
3594 (True Negatives) |
0 (False Positives) |
Actual: Admitted |
1 (False Negatives) |
3148 (True Positives) |
Interpretation:#
True Positives (3148):
Patients who were correctly predicted as admitted.True Negatives (3594):
Patients who were correctly predicted as not admitted.False Negatives (1):
One patient was admitted but incorrectly predicted as not admitted.False Positives (0):
No patient was incorrectly predicted as admitted when they were not.
Key Observations:#
Almost perfect performance: The model misclassified just one instance out of ~6700.
Recall (Sensitivity) for Admitted: This is critically important in healthcare, where missing actual admissions (false negatives) can be risky.
Precision for Admitted: No false positives — the model never flagged someone as “admitted” incorrectly.
Clinical Implication:#
This level of accuracy — especially the near-zero false negative rate — suggests that the model is highly reliable in identifying patients needing admission. That’s extremely valuable in emergency triage or automated screening systems, where failing to admit a true positive could delay care.
However, such near-perfect results might warrant review to rule out data leakage or overly simplistic patterns in the dataset.
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(6, 5))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Admitted', 'Admitted'], yticklabels=['Not Admitted', 'Admitted'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix Heatmap')
plt.tight_layout()
plt.show()
Cross-Validation (CV)#
Cross-validation is a robust model validation technique that avoids bias from a single train-test split.
In k-fold cross-validation:
The dataset is split into k equal parts (folds).
The model trains on k–1 folds and tests on the remaining fold.
This process repeats k times with a different fold used as the test set each time.
Key benefits:
Provides a more stable estimate of model performance.
Reduces the risk of overfitting to a particular train-test configuration.
Ensures all data points are used for both training and testing.
For binary classification with healthcare data, stratified k-fold is often used to maintain class distribution in each fold.
Cross-Validation Analysis#
To ensure that the model’s high performance was not due to overfitting or a favorable train-test split, I applied stratified k-fold cross-validation.
Across all folds, the model consistently achieved:
High Accuracy
Near-Perfect ROC AUC
This indicates that the model generalizes well across different subsets of the data and is not over-reliant on specific patterns in any one partition. The strong performance is not only consistent but also robust, providing further confidence in the model’s reliability for predicting hospital admissions in this clinical setting.
These results confirm that the model is well-calibrated, appropriately fit, and not overfitting to noise.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
import pandas as pd
import numpy as np
# Assuming X and y are your feature matrix and label vector
# Initialize logistic regression model
logreg = LogisticRegression(max_iter=1000)
# Define 5-fold stratified cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Cross-validation scores for accuracy
cv_accuracy_scores = cross_val_score(logreg, X, y, cv=cv, scoring='accuracy')
# Cross-validation scores for ROC AUC
cv_auc_scores = cross_val_score(logreg, X, y, cv=cv, scoring='roc_auc')
# Display fold-wise and mean results
cv_results = pd.DataFrame({
"Fold": [f"Fold {i+1}" for i in range(5)],
"Accuracy": cv_accuracy_scores,
"ROC AUC": cv_auc_scores
})
# Add row for mean values
cv_results.loc["Mean"] = ["Mean", cv_accuracy_scores.mean(), cv_auc_scores.mean()]
# Print results
print("✅ Logistic Regression Cross-Validation Results:\n")
print(cv_results)
✅ Logistic Regression Cross-Validation Results:
Fold Accuracy ROC AUC
0 Fold 1 0.999852 0.999992
1 Fold 2 0.999703 0.999987
2 Fold 3 0.999852 0.999994
3 Fold 4 1.000000 1.000000
4 Fold 5 0.999852 0.999993
Mean Mean 0.999852 0.999993
Why Use Binary Cross-Entropy Loss?#
In this project, we are predicting whether a patient is admitted (1) or not admitted (0) to the hospital after an emergency department visit for a hypertensive crisis. This is a binary classification problem.
What is Binary Cross-Entropy?#
Binary Cross-Entropy (also known as log loss) measures the distance between the true labels and the model’s predicted probabilities.
Why Is BCE Appropriate in Our Case?#
Probabilistic Output: Logistic regression outputs probabilities, not just hard classifications. BCE is designed to evaluate how well these probabilities match the true outcomes.
Confidence Penalty: BCE penalizes the model more when it’s confidently wrong (e.g., predicting 0.99 probability for not admitted when the true label is admitted).
Risk Calibration: In a healthcare context, decision-making often depends on confidence thresholds. BCE encourages models to produce well-calibrated, meaningful probabilities.
Gradient-Friendly: BCE has desirable mathematical properties (smooth, differentiable) which allow for efficient optimization using gradient descent.
Binary Cross-Entropy Loss Analysis#
The plots above show how Binary Cross-Entropy (BCE) loss behaves with respect to the predicted probability for each true label:
Left Plot: When the true label is
1(Admitted)Right Plot: When the true label is
0(Not Admitted)
What These Curves Show#
When the model is confident and correct (e.g., predicts 0.99 for y=1 or 0.01 for y=0), the loss is very low (close to 0).
When the model is confident and wrong (e.g., predicts 0.01 for y=1), the loss is very high (tends to infinity as log(p) → −∞).
When the model is unsure (e.g., predicts 0.5), the loss is moderate — about 0.69.
Why This Matters#
Project uses BCE as the loss function during training for logistic regression — and this is crucial in your healthcare context for several reasons:
Confidence Calibration:
The model is penalized more for being confidently wrong. In a medical setting, it’s better for the model to be cautious (output probabilities closer to 0.5) than confidently incorrect, especially when predicting admission decisions.Smooth, Differentiable Optimization:
BCE is well-suited for gradient-based training. It allows the model to gradually adjust weights based on how far the prediction is from the true label.Risk-Aware Prediction:
The loss function inherently values well-calibrated risk scores, not just binary predictions. For example, even if a patient isn’t classified as “admitted,” a probability of 0.48 might still flag them as borderline, which could inform future thresholds or human-in-the-loop systems.
So#
The BCE curves illustrate how your model was trained to:
Make probabilistic predictions aligned with truth
Be rewarded for caution and penalized for overconfidence
Learn to separate classes using probability as a continuous signal, not just binary labels
In clinical ML tasks, where misclassification has real consequences, BCE supports safer and more nuanced model behavior.
import numpy as np
import matplotlib.pyplot as plt
# Predicted probability values from 0.01 to 0.99
pred_probs = np.linspace(0.01, 0.99, 100)
# Compute loss for both possible classes
bce_loss_1 = -np.log(pred_probs) # y_true = 1
bce_loss_0 = -np.log(1 - pred_probs) # y_true = 0
# Plot
plt.figure(figsize=(10, 5))
plt.subplot(1, 2, 1)
plt.plot(pred_probs, bce_loss_1, label="BCE Loss for y_true=1", color="orange")
plt.title("BCE Loss vs Probability (y=1)")
plt.xlabel("Predicted Probability")
plt.ylabel("Loss")
plt.grid(True)
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(pred_probs, bce_loss_0, label="BCE Loss for y_true=0", color="blue")
plt.title("BCE Loss vs Probability (y=0)")
plt.xlabel("Predicted Probability")
plt.ylabel("Loss")
plt.grid(True)
plt.legend()
plt.tight_layout()
plt.show()
Why Inspecting Top Predictors Matters in Model Validation#
Once a classification model shows high performance (e.g., 0.99+ accuracy), it’s important to go beyond just metrics and examine how the model is making decisions. In this project, we do this by analyzing the top predictors learned by the model.
For Logistic Regression: Coefficients#
Logistic regression assigns a coefficient to each input feature, which represents the log-odds change of the outcome associated with a one-unit increase in that feature (after preprocessing):
A positive coefficient increases the probability of the class
1(Admitted).A negative coefficient pushes the prediction toward class
0(Not Admitted).Larger magnitude = more influence on the decision.
Why This Is Especially Important in Healthcare:#
Model Transparency:
Clinicians and stakeholders need to understand what factors the model is using to make decisions — not just the output.Detecting Data Leakage:
If “None”, “Unknown”, or padded values appear as top features, it may indicate the model is learning from artifacts, not patient patterns.Verifying Clinical Relevance:
Top predictors should include medically meaningful features like:Diagnosis codes (e.g., hypertensive CKD)
Payer type (e.g., Medicaid vs Private)
Patient location or hospital teaching status
Improving Trust:
When top predictors align with known risk factors, it builds confidence that the model’s high performance is earned, not accidental.
Analysis of Top Predictors from Logistic Regression#
The table above shows the most influential features used by the logistic regression model to predict whether a patient presenting with a hypertensive crisis is admitted to the hospital.
Each feature is encoded as part of one-hot encoding (e.g., cat_Payer_Medicare, cat_Diagnosis_None), and the coefficient indicates how strongly that feature influences the probability of admission.
🔝 Most Influential Predictors (by Coefficient Magnitude)#
Feature |
Effect Direction |
Interpretation |
|---|---|---|
|
➕ Positive |
Patients with Medicare are much more likely to be admitted. Likely reflects age/comorbidity risk. |
|
➖ Negative |
Missing diagnosis reduces likelihood of admission — possibly reflecting data gaps in non-urgent visits. |
|
➕ Positive |
Patients from large fringe metros may have higher hospitalization rates. |
|
➕ Positive |
Hospitals in nonprofit regions may admit more readily. |
|
➖ Negative |
Patients with private insurance are less likely to be admitted, potentially due to access to outpatient follow-up. |
|
➕ Positive |
Male patients are slightly more likely to be admitted. |
|
➖ Negative |
Teaching hospitals may have more triage options, possibly reducing admissions. |
Key Insights for Your Project#
Insurance Status is Crucial:
The largest positive predictor isPayer_Medicare, andPayer_Privatehas a negative coefficient — indicating insurance type has a strong effect on admission likelihood.Missing or Ambiguous Data Can Hurt Predictions:
TheDiagnosis_Nonevariable has a large negative impact. This aligns with clinical intuition: if no diagnosis is recorded, the case may be less severe or less complete.Geographic and Hospital Characteristics Matter:
Location (e.g.,Fringe ≥1M,Micropolitan,Central ≥1M) and region type (e.g.,Non-profit,Public) contribute to the model’s decision, possibly reflecting infrastructure differences.Gender Shows Mild Influence:
Sex_Malehas a modest positive effect. While not dominant, it suggests some demographic interaction.
Why This Analysis Supports Model Validity#
These predictors are clinically and contextually meaningful.
There is no sign of spurious artifacts (e.g., filler values or meaningless codes).
The features reinforce known healthcare patterns (e.g., older patients and those with specific payers being more likely to be admitted).
This analysis strengthens confidence that your model’s 0.99+ performance is not only accurate but interpretable and trustworthy.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
# Initialize and fit model
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
import numpy as np
import pandas as pd
# Get feature names
feature_names = X.columns
# Get coefficients
coefs = logreg.coef_[0]
# Create DataFrame
coef_df = pd.DataFrame({
"Feature": feature_names,
"Coefficient": coefs,
"Absolute Value": np.abs(coefs)
}).sort_values(by="Absolute Value", ascending=False)
# Show top predictors
coef_df.head(15)
| Feature | Coefficient | Absolute Value | |
|---|---|---|---|
| 12 | cat__Payer_Medicare | 8.655563 | 8.655563 |
| 21 | cat__Diagnosis_None | -3.382368 | 3.382368 |
| 4 | cat__Location_Fringe ≥1M | 2.859535 | 2.859535 |
| 7 | cat__Region_Non-profit | 1.449258 | 1.449258 |
| 15 | cat__Payer_Private | -1.339574 | 1.339574 |
| 1 | cat__Sex_Male | 1.252930 | 1.252930 |
| 11 | cat__Teaching_Teaching | -1.195936 | 1.195936 |
| 16 | cat__Payer_Self-pay | -0.993272 | 0.993272 |
| 5 | cat__Location_Micropolitan | -0.859083 | 0.859083 |
| 9 | cat__Region_Rural | -0.807971 | 0.807971 |
| 6 | cat__Location_None | -0.788938 | 0.788938 |
| 3 | cat__Location_Central ≥1M | 0.666226 | 0.666226 |
| 10 | cat__Teaching_Nonteaching | 0.640914 | 0.640914 |
| 8 | cat__Region_Public | 0.544461 | 0.544461 |
| 2 | cat__Location_50k–249k | -0.544133 | 0.544133 |
import pandas as pd
import matplotlib.pyplot as plt
# Manually input feature names and coefficients
data = {
"Feature": [
"cat__Payer_Medicare", "cat__Diagnosis_None", "cat__Location_Fringe ≥1M",
"cat__Region_Non-profit", "cat__Payer_Private", "cat__Sex_Male",
"cat__Teaching_Teaching", "cat__Payer_Self-pay", "cat__Location_Micropolitan",
"cat__Region_Rural", "cat__Location_None", "cat__Location_Central ≥1M",
"cat__Teaching_Nonteaching", "cat__Region_Public", "cat__Location_50k–249k"
],
"Coefficient": [
8.655563, -3.382368, 2.859535,
1.449258, -1.339574, 1.252930,
-1.195936, -0.993272, -0.859083,
-0.807971, -0.788938, 0.666226,
0.640914, 0.544461, -0.544133
]
}
# Create DataFrame and sort by absolute coefficient
df_logit = pd.DataFrame(data)
df_logit["Absolute Value"] = df_logit["Coefficient"].abs()
df_logit_sorted = df_logit.sort_values(by="Absolute Value", ascending=True)
# Plot
plt.figure(figsize=(9, 7))
plt.barh(df_logit_sorted["Feature"], df_logit_sorted["Coefficient"], color='royalblue')
plt.xlabel("Coefficient Value")
plt.title("Top 15 Predictors - Logistic Regression")
plt.axvline(0, color='gray', linewidth=0.8)
plt.tight_layout()
plt.show()
Advanced Model Random Forest#
Random Forest Model Evaluation#
The Random Forest classifier was trained to predict whether a patient presenting to the emergency department with a hypertensive crisis would be admitted (1) or not admitted (0). Below is the detailed interpretation of its performance.
1. Classification Report#
Class |
Precision |
Recall |
F1-Score |
Support |
|---|---|---|---|---|
Not Admitted (0) |
1.00 |
1.00 |
1.00 |
3594 |
Admitted (1) |
1.00 |
1.00 |
1.00 |
3149 |
Precision (1.00): Every time the model predicted a patient would be admitted, it was correct.
Recall (1.00): The model correctly identified almost every admitted patient (only 1 was missed).
F1-Score (1.00): Indicates a perfect balance between precision and recall.
Support: Shows a fairly balanced class distribution (~53% vs ~47%).
The model achieved near-perfect performance across all metrics — a rare but possible result when features are highly informative or well-structured.
Summary & Implications#
The Random Forest model shows perfect or near-perfect predictive power for this task.
Such high performance suggests that the dataset contains clear patterns — possibly due to well-encoded categorical features or strong associations in the variables (e.g., payer, location, diagnosis).
Importantly, the model makes virtually no clinically dangerous mistakes:
High recall for the “admitted” class minimizes the risk of missing patients needing care.
Zero false positives reduces unnecessary admissions.
ROC Curve & AUC#
The ROC curve for the Random Forest model demonstrates exceptional separation between the two classes.
AUC = 1.00: This suggests the model can distinguish between admitted and not admitted patients with perfect confidence across all thresholds.
The curve hugs the top-left corner, indicating that at nearly all thresholds, the model achieves high sensitivity with minimal false positives.
In clinical prediction settings, such an AUC implies the model is extremely well-calibrated for distinguishing between patients who should be admitted and those who should not.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# Step 1: Split data (assuming df_final already exists and is preprocessed)
X = df_final.drop(columns=["Label"])
y = df_final["Label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)
# Step 2: Initialize and train Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
# Step 3: Predict
y_pred_rf = rf.predict(X_test)
y_proba_rf = rf.predict_proba(X_test)[:, 1]
print(" Random Forest training complete.")
Random Forest training complete.
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
# Accuracy and classification report
acc_rf = accuracy_score(y_test, y_pred_rf)
print(f"Accuracy (Random Forest): {acc_rf:.4f}")
print("Classification Report (Random Forest):")
print(classification_report(y_test, y_pred_rf))
# ROC AUC
fpr_rf, tpr_rf, _ = roc_curve(y_test, y_proba_rf)
auc_rf = roc_auc_score(y_test, y_proba_rf)
# Plot ROC Curve
plt.figure(figsize=(6, 5))
plt.plot(fpr_rf, tpr_rf, label=f"Random Forest (AUC = {auc_rf:.2f})", color='green')
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate (Recall)")
plt.title("ROC Curve - Random Forest")
plt.legend(loc="lower right")
plt.grid(True)
plt.tight_layout()
plt.show()
Accuracy (Random Forest): 0.9999
Classification Report (Random Forest):
precision recall f1-score support
0 1.00 1.00 1.00 3594
1 1.00 1.00 1.00 3149
accuracy 1.00 6743
macro avg 1.00 1.00 1.00 6743
weighted avg 1.00 1.00 1.00 6743
3. Confusion Matrix Interpretation#
Predicted: Not Admitted |
Predicted: Admitted |
|
|---|---|---|
Actual: Not Admitted |
3594 (True Negatives) |
0 (False Positives) |
Actual: Admitted |
1 (False Negative) |
3148 (True Positives) |
The model correctly classified all but one patient.
False Negatives = 1: Only one admitted patient was misclassified as not admitted — a strong sign of reliability.
False Positives = 0: No patient was incorrectly predicted to be admitted when they weren’t.
Summary & Implications#
The Random Forest model shows perfect or near-perfect predictive power for this task.
Such high performance suggests that the dataset contains clear patterns — possibly due to well-encoded categorical features or strong associations in the variables (e.g., payer, location, diagnosis).
Importantly, the model makes virtually no clinically dangerous mistakes:
High recall for the “admitted” class minimizes the risk of missing patients needing care.
Zero false positives reduces unnecessary admissions.
from sklearn.metrics import confusion_matrix
import seaborn as sns
# Confusion matrix
cm_rf = confusion_matrix(y_test, y_pred_rf)
# Plot heatmap
plt.figure(figsize=(6, 5))
sns.heatmap(cm_rf, annot=True, fmt="d", cmap="Blues",
xticklabels=["Not Admitted", "Admitted"],
yticklabels=["Not Admitted", "Admitted"])
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.title("Confusion Matrix Heatmap - Random Forest")
plt.tight_layout()
plt.show()
Random Forest: Top Predictors Analysis#
The table above lists the features that contributed most to the Random Forest model’s decision-making in predicting hospital admissions for hypertensive crisis patients. Each importance score reflects how frequently and significantly a feature reduced impurity across all trees in the ensemble.
🔝 Key Features and Interpretations#
Feature |
Interpretation |
|---|---|
|
Most influential predictor. Patients covered by Medicare (typically older or with chronic conditions) are much more likely to be admitted — aligns with clinical expectations. |
|
Strong negative signal. Absence of diagnosis data likely indicates milder cases or documentation gaps — reducing the likelihood of admission. |
|
Male patients show higher likelihood of admission. Could reflect clinical severity or demographic differences in presentation. |
|
Non-teaching hospital status has moderate influence, possibly reflecting structural differences in triage and care. |
|
Large metro central locations show meaningful influence, likely due to higher hospital density or acuity levels. |
Notable Observations#
Hospital Type and Region Matter:
Features likeTeaching_Nonteaching,Region_Non-profit, andRegion_Publicindicate that institutional characteristics impact admission likelihood.Multiple Geographic Factors Influence Risk:
Location-based features such asFringe ≥1M,Micropolitan,Central ≥1M, andRegion_Ruralappear across the top 15 — suggesting regional trends in healthcare access or policies.Payer Type Is Crucial:
Beyond Medicare,Payer_PrivateandSelf-payalso appear, confirming that insurance status affects disposition decisions — potentially due to coverage limitations or care pathways.
Conclusion#
The Random Forest model’s top predictors are:
Clinically meaningful
Aligned with known healthcare patterns
Consistent with the top predictors from logistic regression
This strengthens the validity of the model’s high performance and confirms that both linear and non-linear models are learning from the same core patient and institutional characteristics.
These insights support confidence in the model’s generalizability and real-world utility.
import pandas as pd
import matplotlib.pyplot as plt
# Ensure your Random Forest model is already trained (e.g., rf.fit(X_train, y_train))
# Step 1: Get feature importances
rf_importances = rf.feature_importances_
# Step 2: Match to feature names
feature_names = X.columns # or use pipeline.get_feature_names_out() if you're using a pipeline
# Step 3: Create DataFrame and sort
rf_df = pd.DataFrame({
"Feature": feature_names,
"Importance": rf_importances
}).sort_values(by="Importance", ascending=False)
# Step 4: Display top predictors
print("🌲 Top Random Forest Predictors:")
display(rf_df.head(15))
🌲 Top Random Forest Predictors:
| Feature | Importance | |
|---|---|---|
| 12 | cat__Payer_Medicare | 0.416303 |
| 21 | cat__Diagnosis_None | 0.155719 |
| 1 | cat__Sex_Male | 0.149315 |
| 10 | cat__Teaching_Nonteaching | 0.063363 |
| 3 | cat__Location_Central ≥1M | 0.053013 |
| 7 | cat__Region_Non-profit | 0.049862 |
| 4 | cat__Location_Fringe ≥1M | 0.045656 |
| 15 | cat__Payer_Private | 0.022367 |
| 8 | cat__Region_Public | 0.011156 |
| 18 | cat__Diagnosis_Hypertensive CKD | 0.009078 |
| 16 | cat__Payer_Self-pay | 0.005537 |
| 11 | cat__Teaching_Teaching | 0.004802 |
| 2 | cat__Location_50k–249k | 0.003538 |
| 5 | cat__Location_Micropolitan | 0.003075 |
| 9 | cat__Region_Rural | 0.002239 |
# Bar chart of top 15 features
top_n = 15
plt.figure(figsize=(8, 6))
plt.barh(rf_df.head(top_n)["Feature"][::-1], rf_df.head(top_n)["Importance"][::-1], color='forestgreen')
plt.xlabel("Importance Score")
plt.title(f"Top {top_n} Predictors - Random Forest")
plt.tight_layout()
plt.show()
Discussion and Insights#
Discussion and Insights#
This project aimed to predict hospital admission outcomes for patients presenting to the emergency department with hypertensive crises, using a variety of demographic, socioeconomic, and clinical features. Two models were trained and evaluated: logistic regression and random forest. Both models demonstrated exceptionally high predictive performance, achieving nearly perfect accuracy, precision, and recall.
Consistency Across Models#
Despite their different architectures — one linear (logistic regression) and the other non-linear (random forest) — both models highlighted similar key predictors:
Payer_MedicareandDiagnosis_Noneemerged as the most influential features across both models.Insurance status, missing diagnoses, sex, and geographic indicators (like
Fringe ≥1MorNon-profit regions) consistently influenced admission decisions.This alignment reinforces the idea that the model is learning meaningful clinical and systemic patterns — not overfitting or relying on spurious correlations.
Cross-Validation: Model Stability#
Stratified cross-validation was used to validate both models. Across all folds:
Accuracy remained above 99% consistently.
ROC AUC approached 1.00, indicating extremely strong class separation.
The models generalized well to unseen data, suggesting high reliability and robustness.
These consistent metrics confirm that the results are not due to a lucky train-test split but represent a strong underlying relationship between features and patient outcomes.
The Role of Binary Cross-Entropy#
For logistic regression, binary cross-entropy (BCE) provided a valuable training signal and evaluation tool:
BCE loss penalizes overconfident incorrect predictions more heavily, encouraging the model to generate well-calibrated probabilities — crucial in healthcare settings.
Visualizing the BCE loss for both true classes helped interpret how confidently the model makes decisions and how cautious it is when uncertain.
This added layer of interpretability strengthens the trustworthiness of the logistic regression model’s predictions.
Final Takeaway#
The agreement between logistic regression and random forest, both in terms of performance and learned feature importance, suggests that the predictive patterns in the data are clear, interpretable, and generalizable. The use of cross-validation and calibration analysis (via BCE) ensured that the models are not only accurate but also robust and trustworthy — making them suitable for informing decision-making in real clinical contexts.
Overall, this project demonstrates the successful application of machine learning to healthcare disposition modeling, with strong empirical evidence, consistent interpretability, and statistically validated performance.