Predicting Subscription to Term Deposit#

Author: Pinge Chen

Course Project, UC Irvine, Math 10, F24

I would like to post my notebook on the course’s website. Yes

1. Introduction#

This data comes from the Bank Marketing task in the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/222/bank+marketing). The dataset contains 45,211 entries with 16 features and is related to the direct marketing campaigns (phone calls) of a Portuguese banking institution.

The classification goal is to predict whether a client will subscribe to a term deposit, represented by the target variable y, which is binary:

y = 1 (Yes): The client subscribed to a term deposit.

y = 0 (No): The client did not subscribe to a term deposit.

The dataset includes information about:

Client-specific attributes: Age, job, marital status, education level, and financial details.

Campaign-related attributes: Number of contacts, previous campaign outcomes, and time-related variables (month, day of the week).

External attributes: Economic context features such as employment variation rate and consumer confidence index.

| Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
|---|---|---|---|---|---|---|
| age | Feature | Integer | Age | | | no |
| job | Feature | Categorical | Occupation | type of job (categorical: ‘admin.’, ‘blue-collar’, ‘entrepreneur’, ‘housemaid’, ‘management’, ‘retired’, ‘self-employed’, ‘services’, ‘student’, ‘technician’, ‘unemployed’, ‘unknown’) | | no |
| marital | Feature | Categorical | Marital Status | marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed) | | no |
| education | Feature | Categorical | Education Level | (categorical: ‘basic.4y’, ‘basic.6y’, ‘basic.9y’, ‘high.school’, ‘illiterate’, ‘professional.course’, ‘university.degree’, ‘unknown’) | | no |
| default | Feature | Binary | | has credit in default? | | no |
| balance | Feature | Integer | | average yearly balance | euros | no |
| housing | Feature | Binary | | has housing loan? | | no |
| loan | Feature | Binary | | has personal loan? | | no |
| contact | Feature | Categorical | | contact communication type (categorical: ‘cellular’, ‘telephone’) | | yes |
| day_of_week | Feature | Date | | last contact day of the week | | no |
| month | Feature | Date | | last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’) | | no |
| duration | Feature | Integer | | last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). However, the duration is not known before a call is performed. Thus, it should only be included for benchmark purposes and excluded in realistic predictive models. | seconds | no |
| campaign | Feature | Integer | | number of contacts performed during this campaign and for this client (numeric, includes last contact) | | no |
| pdays | Feature | Integer | | number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted) | | yes |
| previous | Feature | Integer | | number of contacts performed before this campaign and for this client | | no |
| poutcome | Feature | Categorical | | outcome of the previous marketing campaign (categorical: ‘failure’, ‘nonexistent’, ‘success’) | | yes |
| y | Target | Binary | | has the client subscribed to a term deposit? | | no |

from ucimlrepo import fetch_ucirepo 
import pandas as pd  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 
  
# metadata 
# print(bank_marketing.metadata) 
  
# variable information 
# print(bank_marketing.variables)

2. Data Cleaning#

I am going to approach this by:

  1. Dropping irrelevant rows and columns

  2. Encoding categorical features

X.head(5)
age job marital education default balance housing loan contact day_of_week month duration campaign pdays previous poutcome
0 58 management married tertiary no 2143 yes no NaN 5 may 261 1 -1 0 NaN
1 44 technician single secondary no 29 yes no NaN 5 may 151 1 -1 0 NaN
2 33 entrepreneur married secondary no 2 yes yes NaN 5 may 76 1 -1 0 NaN
3 47 blue-collar married NaN no 1506 yes no NaN 5 may 92 1 -1 0 NaN
4 33 NaN single NaN no 1 no no NaN 5 may 198 1 -1 0 NaN

Dropping irrelevant columns and rows with missing values#

I am dropping some of the columns that seem irrelevant to me:

  1. job: This feature includes 11 distinct values, which makes it hard to encode for the regression later. One-hot encoding would greatly increase the dimensionality, while ordinal encoding imposes an order that job types do not have, and it is hard for me to subjectively assign a value to each job type. Thus, I am dropping this feature.

  2. poutcome: Even though this feature should intuitively be very useful, 36959/45211 of its values are missing, which is too many. Keeping it would require dropping every row with a missing value, costing a lot of other information. Thus, I am dropping this feature.

  3. duration: The dataset’s authors recommend dropping this feature. Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’), yet the duration is not known before a call is performed, and after the call ends y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

  4. contact: I am dropping this feature because it seems intuitively irrelevant to me.

I am transforming these columns:

  1. day_of_week & month: Together, these features give the date of the last contact in the current campaign. It is more meaningful to combine them and represent them as the number of days between a reference date (assumed to be December 8th) and the contact date itself.

df = X.drop(columns=['job', 'poutcome', 'duration', 'contact'])

df.head()
age marital education default balance housing loan day_of_week month campaign pdays previous
0 58 married tertiary no 2143 yes no 5 may 1 -1 0
1 44 single secondary no 29 yes no 5 may 1 -1 0
2 33 married secondary no 2 yes yes 5 may 1 -1 0
3 47 married NaN no 1506 yes no 5 may 1 -1 0
4 33 single NaN no 1 no no 5 may 1 -1 0
month_map = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}

df['month'] = df['month'].map(month_map)

df['date'] = pd.to_datetime(
    dict(year=2024, month=df['month'], day=df['day_of_week'])
)

df = df.drop(columns=['month', 'day_of_week'])

reference_date = pd.Timestamp('2024-12-08')

df['days_since_contact'] = (reference_date - df['date']).dt.days

df = df.drop(columns=['date'])

The next step is to drop the rows with missing values, but before that, I am going to append y to df so that we don’t have to match the outcomes one by one.

df['y'] = y

Now I drop the rows with missing values.

df = df.dropna()

Encoding Categorical Features#

I am one-hot encoding the categorical features marital and education for easier use in regression models.

I am mapping housing, loan, default, and also y to 1 and 0.

df = pd.get_dummies(df, columns=['marital', 'education'], drop_first=True)

df['housing'] = df['housing'].map({'yes': 1, 'no': 0})
df['loan'] = df['loan'].map({'yes': 1, 'no': 0})
df['default'] = df['default'].map({'yes': 1, 'no': 0})

df['y'] = df['y'].map({'yes': 1, 'no': 0})

3. Exploratory Data Analysis (EDA)#

As data cleaning is now complete, we will move on to EDA and gain some understanding of the data.

df.describe()
age default balance housing loan campaign pdays previous days_since_contact y
count 43354.000000 43354.000000 43354.000000 43354.000000 43354.000000 43354.000000 43354.000000 43354.000000 43354.000000 43354.000000
mean 40.783111 0.018061 1355.226715 0.560733 0.164022 2.760184 40.340960 0.584260 170.662061 0.116183
std 10.518987 0.133172 3039.916830 0.496304 0.370300 3.065496 100.331955 2.329661 74.843025 0.320448
min 18.000000 0.000000 -8019.000000 0.000000 0.000000 1.000000 -1.000000 0.000000 -22.000000 0.000000
25% 33.000000 0.000000 71.000000 0.000000 0.000000 1.000000 -1.000000 0.000000 124.000000 0.000000
50% 39.000000 0.000000 443.000000 1.000000 0.000000 2.000000 -1.000000 0.000000 187.000000 0.000000
75% 48.000000 0.000000 1415.000000 1.000000 0.000000 3.000000 -1.000000 0.000000 213.000000 0.000000
max 95.000000 1.000000 102127.000000 1.000000 1.000000 58.000000 871.000000 275.000000 337.000000 1.000000
df.head(5)
age default balance housing loan campaign pdays previous days_since_contact y marital_married marital_single education_secondary education_tertiary
0 58 0 2143 1 0 1 -1 0 217 0 True False False True
1 44 0 29 1 0 1 -1 0 217 0 False True True False
2 33 0 2 1 1 1 -1 0 217 0 True False True False
5 35 0 231 1 0 1 -1 0 217 0 True False False True
6 28 0 447 1 1 1 -1 0 217 0 False True False True

With some basic information above, I now want to explore the distribution of the outcome variable y.

import matplotlib.pyplot as plt

y_counts = df['y'].value_counts()

plt.bar(y_counts.index, y_counts.values, tick_label=['No (0)', 'Yes (1)'])
plt.xlabel('Outcome (y)')
plt.ylabel('Count')
plt.title('Distribution of Outcome Variable (y)')
plt.show()
[Figure: Distribution of Outcome Variable (y)]
y_counts = df['y'].value_counts()
class_proportions = y_counts / len(df)

print(f"Class Counts:\n{y_counts}")
print(f"\nClass Proportions:\n{class_proportions}")
Class Counts:
y
0    38317
1     5037
Name: count, dtype: int64

Class Proportions:
y
0    0.883817
1    0.116183
Name: count, dtype: float64

We can see that there is a significant class imbalance for the outcome variable y.

The majority of clients did not subscribe to a term deposit (y=0), while only a small percentage subscribed (y=1).

This imbalance means my dataset is skewed toward the majority class, making it harder for a model to accurately predict the minority class (y=1).

This also means that the minority class (y=1) is underrepresented, and special care (resampling or reweighting) will be needed to ensure the model can handle this.

If a model always predicts the majority class (y=0), the baseline accuracy will equal the proportion of y=0, which in this case is 88%.
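As a quick sanity check of this baseline, here is a minimal sketch using scikit-learn’s DummyClassifier, which always predicts the most frequent class (the names X_base and y_base are mine, introduced just for this check):

from sklearn.dummy import DummyClassifier

# Majority-class baseline: always predict y=0.
X_base = df.drop(columns=['y'])
y_base = df['y']

dummy = DummyClassifier(strategy='most_frequent')
dummy.fit(X_base, y_base)
print(f"Baseline accuracy: {dummy.score(X_base, y_base):.3f}")  # about 0.884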

Next, I want to see if there is a linear relationship between age and balance, since I believe a higher age should be correlated with a higher average yearly balance. I will apply Linear Regression to test this.

from sklearn.linear_model import LinearRegression

X = df[['age']]
y = df['balance']

model = LinearRegression()
model.fit(X, y)

print(f"Intercept: {model.intercept_:.2f}")
print(f"Coefficient: {model.coef_[0]:.2f}")

plt.figure(figsize=(8, 6))
plt.scatter(X, y, alpha=0.3, label='Data points')
plt.plot(X, model.predict(X), color='red', label='Regression line')
plt.xlabel("Age")
plt.ylabel("Balance")
plt.title("Relationship Between Age and Balance")
plt.legend()
plt.show()
Intercept: 210.42
Coefficient: 28.07
[Figure: Relationship Between Age and Balance, scatter plot with fitted regression line]

The data points are widely dispersed, and there doesn’t appear to be a strong pattern or trend between age and balance. The red regression line is almost flat relative to the spread of the data: the slope of about 28 euros per year of age is tiny compared with the balance standard deviation of roughly 3,040 euros. This further suggests the lack of a meaningful linear relationship between the two variables.
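To put a number on this, here is a quick sketch (assuming scipy is available; model, X, and y are the objects fitted in the cell above):

from scipy.stats import pearsonr

r2 = model.score(X, y)                     # R^2 of the fitted line
r, p = pearsonr(df['age'], df['balance'])  # Pearson correlation and its p-value
print(f"R^2: {r2:.4f}, Pearson r: {r:.4f}")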

Now, let’s build a heat map to see how the features are related to each other.

import seaborn as sns
import matplotlib.pyplot as plt

correlation_matrix = df.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)

plt.title("Correlation Heatmap of Features", fontsize=16)
plt.show()
[Figure: Correlation Heatmap of Features]

This heat map does not reveal much useful information. Some takeaways are:

age and y (0.02): there is no significant linear relationship between age and the likelihood of subscribing to a term deposit. default and y (-0.02): default status has almost no relationship with the target variable. These variables are likely to be dropped if I apply Lasso regularization.

Multicollinearity might exist: pdays and previous (0.45) show a moderate positive correlation. This could indicate redundancy, and one of these features might be removed during feature selection.

The weak correlation between most features and the target variable (y) suggests that no single feature strongly predicts the likelihood of subscribing to a term deposit. The model will likely need to rely on combinations of features.
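To probe the multicollinearity concern more directly, one could compute variance inflation factors for the numeric features; a sketch, assuming statsmodels is installed:

from statsmodels.stats.outliers_influence import variance_inflation_factor

num_cols = ['age', 'balance', 'campaign', 'pdays', 'previous', 'days_since_contact']
exog = df[num_cols].assign(const=1.0).to_numpy(dtype=float)  # append an intercept column

# VIF for each feature; values well above ~5 would suggest redundancy.
for i, col in enumerate(num_cols):
    print(f"{col}: {variance_inflation_factor(exog, i):.2f}")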

Next, I am going to split the data into training and testing sets, and then use oversampling (SMOTE) on the training set to address the class imbalance.

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter

X = df.drop(columns=['y'])
y = df['y']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

print("Original training set class distribution:", Counter(y_train))
print("Balanced training set class distribution:", Counter(y_train_balanced))
Original training set class distribution: Counter({0: 30653, 1: 4030})
Balanced training set class distribution: Counter({0: 30653, 1: 30653})

4. Logistic Regression#

Since this is a binary classification problem, Logistic Regression is a natural model to use.

First, I will scale the numerical features.

from sklearn.preprocessing import StandardScaler
import numpy as np

numeric_features = ['age', 'balance', 'campaign', 'pdays', 'previous', 'days_since_contact']
categorical_features = ['default', 'housing', 'loan', 'marital_married', 'marital_single', 'education_secondary', 'education_tertiary']

scaler = StandardScaler()

X_train_scaled_numeric = scaler.fit_transform(X_train_balanced[numeric_features])
X_test_scaled_numeric = scaler.transform(X_test[numeric_features])

X_train_scaled = np.hstack([
    X_train_scaled_numeric, 
    X_train_balanced[categorical_features].values
])
X_test_scaled = np.hstack([
    X_test_scaled_numeric, 
    X_test[categorical_features].values
])

Next, I am training the Logistic Regression Model with the balanced and scaled X_train and the balanced y_train.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import pandas as pd

logistic_model = LogisticRegression(max_iter=1000, random_state=42)

logistic_model.fit(X_train_scaled, y_train_balanced)

coefficients = pd.DataFrame({
    'Feature': X_train_balanced.columns,
    'Coefficient': logistic_model.coef_[0]
}).sort_values(by='Coefficient', ascending=False)

Next, I am plotting the confusion matrix as a visualization of the result and calculating the accuracy score.

import seaborn as sns
import matplotlib.pyplot as plt

y_pred = logistic_model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Accuracy: 0.77
[Figure: Confusion Matrix, Logistic Regression]

True Negatives (TN = 6274): The model correctly predicted “No” (class 0) for 6274 clients who actually did not subscribe to a term deposit.

False Positives (FP = 1390): The model incorrectly predicted “Yes” (class 1) for 1390 clients who actually did not subscribe.

False Negatives (FN = 637): The model incorrectly predicted “No” (class 0) for 637 clients who actually subscribed.

True Positives (TP = 370): The model correctly predicted “Yes” (class 1) for 370 clients who actually subscribed.
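These four counts can also be read off programmatically from the matrix computed above; a short sketch:

tn, fp, fn, tp = cm.ravel()  # sklearn lays out the 2x2 matrix as [[TN, FP], [FN, TP]]
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Recall for the 'Yes' class: {tp / (tp + fn):.2f}")  # about 0.37 here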

I am aware that Lasso regularization is mainly used to prevent overfitting, but I would still like to see whether it improves my results. I am also implementing GridSearch for hyperparameter tuning.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

logistic_lasso = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000, random_state=42)

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100]
}

grid = GridSearchCV(logistic_lasso, param_grid, scoring='accuracy', cv=5)
grid.fit(X_train_scaled, y_train_balanced)

print("Best Regularization Strength (C):", grid.best_params_)
best_lasso_model = grid.best_estimator_

best_lasso_model.fit(X_train_scaled, y_train_balanced)

y_pred_lasso = best_lasso_model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred_lasso)
cm = confusion_matrix(y_test, y_pred_lasso)

print(f"Accuracy: {accuracy:.2f}")

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Lasso Regularization)')
plt.show()
Best Regularization Strength (C): {'C': 1}
Accuracy: 0.77
[Figure: Confusion Matrix (Lasso Regularization)]

Both Logistic Regression and Logistic Regression with Lasso achieve 77% accuracy. This is lower than the baseline accuracy but demonstrates that the model is attempting to balance predictions for both classes, rather than always predicting the majority class.

A model achieving 77% accuracy while predicting both classes is more meaningful than a naive model achieving 88% accuracy but ignoring one class entirely.

I need to focus on metrics beyond accuracy to measure the performance of my models. I am going to use Recall.

from sklearn.metrics import recall_score

y_pred_logistic = logistic_model.predict(X_test_scaled)

y_pred_lasso = best_lasso_model.predict(X_test_scaled)

recall_logistic = recall_score(y_test, y_pred_logistic)
print(f"Recall (Logistic Regression): {recall_logistic:.2f}")

recall_lasso = recall_score(y_test, y_pred_lasso)
print(f"Recall (Logistic Regression with Lasso): {recall_lasso:.2f}")
Recall (Logistic Regression): 0.37
Recall (Logistic Regression with Lasso): 0.37

Sadly, changing the metric doesn’t change the fact that the model performs poorly at predicting the Yes category. With a recall of 0.37, my model fails to capture most of the potential subscribers, which is problematic if the goal is to target clients likely to subscribe.
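Precision and F1 would round out the picture alongside recall; a quick sketch using the predictions already computed above:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the plain logistic model.
print(classification_report(y_test, y_pred_logistic, target_names=['No', 'Yes']))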

Next, I am going to use the K-Nearest Neighbors (KNN) model to see if this nonparametric model can produce a better result.

5. KNN Model#

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train_scaled, y_train_balanced)

y_pred_knn = knn.predict(X_test_scaled)

accuracy_knn = accuracy_score(y_test, y_pred_knn)
recall_knn = recall_score(y_test, y_pred_knn)
cm_knn = confusion_matrix(y_test, y_pred_knn)

print(f"KNN Accuracy: {accuracy_knn:.2f}")
print(f"KNN Recall: {recall_knn:.2f}")

plt.figure(figsize=(8, 6))
sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (KNN)')
plt.show()
KNN Accuracy: 0.78
KNN Recall: 0.41
[Figure: Confusion Matrix (KNN)]

Next, I will use GridSearch again to identify the best parameters.

param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}

knn = KNeighborsClassifier()

grid_knn = GridSearchCV(knn, param_grid, scoring='recall', cv=5)
grid_knn.fit(X_train_scaled, y_train_balanced)

best_knn = grid_knn.best_estimator_

y_pred_best_knn = best_knn.predict(X_test_scaled)
accuracy_best_knn = accuracy_score(y_test, y_pred_best_knn)
recall_best_knn = recall_score(y_test, y_pred_best_knn)

print(f"Best KNN Accuracy: {accuracy_best_knn:.2f}")
print(f"Best KNN Recall: {recall_best_knn:.2f}")
Best KNN Accuracy: 0.78
Best KNN Recall: 0.40

Accuracy = 0.78:

The KNN model achieves an accuracy of 78%, which is slightly better than the Logistic Regression model’s 77%.

Recall = 0.40:

The KNN model achieves a recall of 40%, which is better than the Logistic Regression model’s recall of 37%. This means that the KNN model correctly identifies 40% of the actual positive cases (y=1), making it slightly better at capturing the minority class (clients who subscribed to the term deposit).

The improvement in recall suggests that KNN is slightly better at identifying the minority class (y=1), even though the overall improvement is modest.

Next, I want to try clustering and see if it gives any surprising results.

6. Clustering with K-Means and PCA for Dimensionality Reduction#

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

kmeans = KMeans(n_clusters=2, random_state=42)
cluster_labels = kmeans.fit_predict(X_train_scaled)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)

plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_train_pca[:, 0], y=X_train_pca[:, 1], hue=cluster_labels, palette='Set1', alpha=0.7)
plt.title('K-Means Clusters (2D Projection)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend(title='Cluster')
plt.show()
[Figure: K-Means Clusters (2D PCA Projection)]

To be honest, I was surprised by the graph above, since I could not understand what it was telling me. So I asked ChatGPT to interpret the graph for me. Here are some of the takeaways:

There is a significant overlap between the clusters in the lower-left region of the plot, indicating that the features may not provide enough separation for distinct groups.

The points in Cluster 1 (blue) extend further along PCA Component 1, suggesting this cluster captures samples with distinct characteristics on the principal feature captured by the PCA.

Significant overlap in the clusters aligns with my classification models (Logistic Regression, KNN) struggling to achieve higher recall for the minority class.
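One way to quantify this overlap is to compare the cluster assignments with the true labels; a sketch using cluster_labels and y_train_balanced from the cells above:

from sklearn.metrics import adjusted_rand_score

# An ARI near 0 means the clusters are essentially unrelated to the true labels.
ari = adjusted_rand_score(y_train_balanced, cluster_labels)
print(f"Adjusted Rand Index vs. y: {ari:.3f}")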

In the K-Means code above, I specified that there are 2 clusters. For fun, I want to test how many clusters would minimize the loss if that information were not given.

The following code is copied directly from the Clustering file posted by the instructor.

# Range of k values
k_values = range(1, 10)
inertias = []

# Apply k-Means for different values of k and record the inertia
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)

# Plotting the results
fig, ax = plt.subplots(1, 1, figsize=(6, 4))

# Loss (inertia) plot
ax.plot(k_values, inertias, marker='o')
ax.set_title('Loss vs. Number of clusters')
ax.set_xlabel('Number of clusters (k)')
ax.set_ylabel('Loss')
ax.grid(True)

plt.tight_layout()
plt.show()
[Figure: Loss (inertia) vs. Number of Clusters]
cluster_numbers = [3, 4]

fig, axes = plt.subplots(1, len(cluster_numbers), figsize=(12, 5), sharex=True, sharey=True)

for idx, n_clusters in enumerate(cluster_numbers):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(X_train_scaled)
    
    pca = PCA(n_components=2)
    X_train_pca = pca.fit_transform(X_train_scaled)
    
    ax = axes[idx]
    sns.scatterplot(
        x=X_train_pca[:, 0], 
        y=X_train_pca[:, 1], 
        hue=cluster_labels, 
        palette='Set1', 
        alpha=0.7, 
        ax=ax
    )
    ax.set_title(f'K-Means Clusters (k={n_clusters})')
    ax.set_xlabel('PCA Component 1')
    ax.set_ylabel('PCA Component 2')
    ax.legend(title='Cluster')

plt.tight_layout()
plt.show()
[Figure: K-Means Clusters for k=3 and k=4 (2D PCA Projection)]

It is interesting that the elbow plot suggests k=3 or k=4 as the best choice, instead of k=2 as intended. This might mean that clustering is not a good way to solve this problem, or that the original dataset does not contain enough information to recover the intended clusters.
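A complementary check is the silhouette score for a few candidate values of k; a sketch on the scaled training features (the sample_size argument keeps the computation manageable):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in [2, 3, 4]:
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_train_scaled)
    score = silhouette_score(X_train_scaled, labels, sample_size=5000, random_state=42)
    print(f"k={k}: silhouette = {score:.3f}")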

Next, I am going to try Random Forest.

7. Random Forest#

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(random_state=42, n_estimators=100)

rf.fit(X_train_scaled, y_train_balanced)

y_pred_rf = rf.predict(X_test_scaled)
y_prob_rf = rf.predict_proba(X_test_scaled)[:, 1]

accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.2f}")

recall_rf = recall_score(y_test, y_pred_rf)
print(f"Random Forest Recall: {recall_rf:.2f}")

cm_rf = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:")
print(cm_rf)

plt.figure(figsize=(8, 6))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Random Forest)')
plt.show()
Random Forest Accuracy: 0.84
Random Forest Recall: 0.37
Confusion Matrix:
[[6879  785]
 [ 631  376]]
[Figure: Confusion Matrix (Random Forest)]

I have learned that with Random Forest, you can actually inspect the importance of each feature.

feature_importances = rf.feature_importances_
features = ['age', 'balance', 'campaign', 'pdays', 'previous', 'days_since_contact', 
            'default', 'housing', 'loan', 'marital_married', 'marital_single', 
            'education_secondary', 'education_tertiary']
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance (Random Forest)')
plt.show()
[Figure: Feature Importance (Random Forest)]

In the graph, we can see that days_since_contact and balance are the two most important features, which is surprising.
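Impurity-based importances can be biased toward features with many distinct values, so a permutation-importance cross-check on the test set would be worth trying; a sketch:

from sklearn.inspection import permutation_importance

# Average drop in recall when each feature is shuffled; bigger drops mean more important.
perm = permutation_importance(rf, X_test_scaled, y_test, scoring='recall',
                              n_repeats=5, random_state=42)
for name, imp in sorted(zip(features, perm.importances_mean), key=lambda t: -t[1])[:5]:
    print(f"{name}: {imp:.4f}")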

8. Conclusion#

In this project, my goal was to predict term-deposit subscription from the given features. I used multiple models, including Logistic Regression, Logistic Regression with Lasso, KNN, K-Means clustering, and Random Forest. Since the dataset has a skewed distribution between the majority and minority classes, accuracy cannot be an effective measure of the models’ performance; instead, we should look at the recall score. The best recall score, 0.41, was given by the KNN model with k=5. However, even that model is still not effective at predicting the correct classes. I have done everything I could think of based on my current knowledge. If I had the chance to learn more techniques, perhaps better ways to preprocess the data or new models, I might achieve better results.