Predicting Subscription to Term Deposit#
Author: Pinge Chen
Course Project, UC Irvine, Math 10, F24
I would like to post my notebook on the course’s website. Yes
1.Introduction#
This data comes from the Bank Marketing dataset in the UCI Machine Learning Repository (https://archive.ics.uci.edu/dataset/222/bank+marketing). Its 45,211 entries with 16 features relate to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).
The classification goal is to predict whether a client will subscribe to a term deposit, represented by the target variable y, which is binary:
y = 1 (Yes): The client subscribed to a term deposit.
y = 0 (No): The client did not subscribe to a term deposit.
The dataset includes information about:
Client-specific attributes: Age, job, marital status, and education level.
Financial attributes: Average yearly balance, credit default status, and housing and personal loan indicators.
Campaign-related attributes: Number of contacts, previous campaign outcome, and time-related variables (month, day_of_week).
| Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
|---|---|---|---|---|---|---|
| age | Feature | Integer | Age | | | no |
| job | Feature | Categorical | Occupation | type of job (categorical: 'admin.', 'blue-collar', 'entrepreneur', 'housemaid', 'management', 'retired', 'self-employed', 'services', 'student', 'technician', 'unemployed', 'unknown') | | no |
| marital | Feature | Categorical | Marital Status | marital status (categorical: 'divorced', 'married', 'single', 'unknown'; note: 'divorced' means divorced or widowed) | | no |
| education | Feature | Categorical | Education Level | (categorical: 'basic.4y', 'basic.6y', 'basic.9y', 'high.school', 'illiterate', 'professional.course', 'university.degree', 'unknown') | | no |
| default | Feature | Binary | | has credit in default? | | no |
| balance | Feature | Integer | | average yearly balance | euros | no |
| housing | Feature | Binary | | has housing loan? | | no |
| loan | Feature | Binary | | has personal loan? | | no |
| contact | Feature | Categorical | | contact communication type (categorical: 'cellular', 'telephone') | | yes |
| day_of_week | Feature | Date | | last contact day of the week | | no |
| month | Feature | Date | | last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec') | | no |
| duration | Feature | Integer | | last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). However, the duration is not known before a call is performed. Thus, it should only be included for benchmark purposes and excluded in realistic predictive models. | seconds | no |
| campaign | Feature | Integer | | number of contacts performed during this campaign and for this client (numeric, includes last contact) | | no |
| pdays | Feature | Integer | | number of days that passed by after the client was last contacted from a previous campaign (numeric; -1 means client was not previously contacted) | | yes |
| previous | Feature | Integer | | number of contacts performed before this campaign and for this client | | no |
| poutcome | Feature | Categorical | | outcome of the previous marketing campaign (categorical: 'failure', 'nonexistent', 'success') | | yes |
| y | Target | Binary | | has the client subscribed to a term deposit? | | no |
from ucimlrepo import fetch_ucirepo
import pandas as pd
# fetch dataset
bank_marketing = fetch_ucirepo(id=222)
# data (as pandas dataframes)
X = bank_marketing.data.features
y = bank_marketing.data.targets
# metadata
# print(bank_marketing.metadata)
# variable information
# print(bank_marketing.variables)
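As a quick sanity check (a minimal sketch; this cell was not part of the original run), we can confirm the dimensions stated in the introduction:
# expect 45,211 rows and 16 feature columns
print(X.shape)  # (45211, 16)
print(y.shape)  # (45211, 1)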
2.Data Cleaning#
I am going to approach this by:
Dropping irrelevant rows and columns
Encoding categorical features
X.head(5)
| | age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58 | management | married | tertiary | no | 2143 | yes | no | NaN | 5 | may | 261 | 1 | -1 | 0 | NaN |
1 | 44 | technician | single | secondary | no | 29 | yes | no | NaN | 5 | may | 151 | 1 | -1 | 0 | NaN |
2 | 33 | entrepreneur | married | secondary | no | 2 | yes | yes | NaN | 5 | may | 76 | 1 | -1 | 0 | NaN |
3 | 47 | blue-collar | married | NaN | no | 1506 | yes | no | NaN | 5 | may | 92 | 1 | -1 | 0 | NaN |
4 | 33 | NaN | single | NaN | no | 1 | no | no | NaN | 5 | may | 198 | 1 | -1 | 0 | NaN |
Dropping irrelevant columns and rows with missing values#
I am dropping some columns that I judge to be irrelevant:
job: This feature has 11 distinct values, which makes it hard to encode for regression. One-hot encoding would greatly increase dimensionality, while ordinal encoding makes no sense because job types have no inherent order, and I cannot justify subjectively assigning a numeric value to each job type. Thus, I am dropping this feature.
poutcome: Intuitively this feature should be very useful; however, 36,959 of the 45,211 values are missing, which is too many. Keeping this feature would require dropping all rows with missing values, costing a lot of other information. Thus, I am dropping this feature.
duration: The dataset's authors recommend dropping this feature. Important note: it highly affects the output target (e.g., if duration=0 then y='no'), yet the duration is not known before a call is performed, and after the call ends y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
contact: I am dropping this feature because, intuitively, it is irrelevant to me.
I am transforming these columns:
day_of_week & month: Together these features represent the last contact date of the current campaign. Combining them and expressing them as the number of days between a reference date (assumed to be December 8) and the contact date itself is more meaningful.
df = X.drop(columns=['job', 'poutcome', 'duration', 'contact'], axis=1)
df.head()
| | age | marital | education | default | balance | housing | loan | day_of_week | month | campaign | pdays | previous |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58 | married | tertiary | no | 2143 | yes | no | 5 | may | 1 | -1 | 0 |
1 | 44 | single | secondary | no | 29 | yes | no | 5 | may | 1 | -1 | 0 |
2 | 33 | married | secondary | no | 2 | yes | yes | 5 | may | 1 | -1 | 0 |
3 | 47 | married | NaN | no | 1506 | yes | no | 5 | may | 1 | -1 | 0 |
4 | 33 | single | NaN | no | 1 | no | no | 5 | may | 1 | -1 | 0 |
month_map = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
df['month'] = df['month'].map(month_map)
# build a full contact date by assuming a single year (2024) for every row
df['date'] = pd.to_datetime(
    dict(year=2024, month=df['month'], day=df['day_of_week'])
)
df = df.drop(columns=['month', 'day_of_week'])
reference_date = pd.Timestamp('2024-12-08')
df['days_since_contact'] = (reference_date - df['date']).dt.days
df = df.drop(columns=['date'])
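One caveat worth flagging (a minimal sketch, not in the original run): because every contact is pinned to the year 2024 while the reference date is December 8, any contact date after December 8 yields a negative difference, which the describe() output in the next section confirms (min = -22).
# count contacts that land after the assumed reference date
print((df['days_since_contact'] < 0).sum(), 'rows have a negative days_since_contact')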
The next step is to drop the rows with missing values, but before that, I am going to append y to df so that we don't have to match the outcomes one by one.
df['y'] = y
Now I drop the rows with missing values.
df = df.dropna()
Encoding Categorical Features#
I am one-hot encoding the categorical features marital and education for easier use in regression models.
I am mapping housing, loan, default, and y to 1 and 0.
df = pd.get_dummies(df, columns=['marital', 'education'], drop_first=True)
df['housing'] = df['housing'].map({'yes': 1, 'no': 0})
df['loan'] = df['loan'].map({'yes': 1, 'no': 0})
df['default'] = df['default'].map({'yes': 1, 'no': 0})
df['y'] = df['y'].map({'yes': 1, 'no': 0})
3.Exploratory Data Analysis (EDA)#
As data cleaning is now complete, we will move on to EDA and gain some understanding of the data.
df.describe()
| | age | default | balance | housing | loan | campaign | pdays | previous | days_since_contact | y |
|---|---|---|---|---|---|---|---|---|---|---|
count | 43354.000000 | 43354.000000 | 43354.000000 | 43354.000000 | 43354.000000 | 43354.000000 | 43354.000000 | 43354.000000 | 43354.000000 | 43354.000000 |
mean | 40.783111 | 0.018061 | 1355.226715 | 0.560733 | 0.164022 | 2.760184 | 40.340960 | 0.584260 | 170.662061 | 0.116183 |
std | 10.518987 | 0.133172 | 3039.916830 | 0.496304 | 0.370300 | 3.065496 | 100.331955 | 2.329661 | 74.843025 | 0.320448 |
min | 18.000000 | 0.000000 | -8019.000000 | 0.000000 | 0.000000 | 1.000000 | -1.000000 | 0.000000 | -22.000000 | 0.000000 |
25% | 33.000000 | 0.000000 | 71.000000 | 0.000000 | 0.000000 | 1.000000 | -1.000000 | 0.000000 | 124.000000 | 0.000000 |
50% | 39.000000 | 0.000000 | 443.000000 | 1.000000 | 0.000000 | 2.000000 | -1.000000 | 0.000000 | 187.000000 | 0.000000 |
75% | 48.000000 | 0.000000 | 1415.000000 | 1.000000 | 0.000000 | 3.000000 | -1.000000 | 0.000000 | 213.000000 | 0.000000 |
max | 95.000000 | 1.000000 | 102127.000000 | 1.000000 | 1.000000 | 58.000000 | 871.000000 | 275.000000 | 337.000000 | 1.000000 |
df.head(5)
| | age | default | balance | housing | loan | campaign | pdays | previous | days_since_contact | y | marital_married | marital_single | education_secondary | education_tertiary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 58 | 0 | 2143 | 1 | 0 | 1 | -1 | 0 | 217 | 0 | True | False | False | True |
1 | 44 | 0 | 29 | 1 | 0 | 1 | -1 | 0 | 217 | 0 | False | True | True | False |
2 | 33 | 0 | 2 | 1 | 1 | 1 | -1 | 0 | 217 | 0 | True | False | True | False |
5 | 35 | 0 | 231 | 1 | 0 | 1 | -1 | 0 | 217 | 0 | True | False | False | True |
6 | 28 | 0 | 447 | 1 | 1 | 1 | -1 | 0 | 217 | 0 | False | True | False | True |
With some basic information above, I now want to explore the distribution of the outcome variable y.
import matplotlib.pyplot as plt
y_counts = df['y'].value_counts()
plt.bar(y_counts.index, y_counts.values, tick_label=['No (0)', 'Yes (1)'])
plt.xlabel('Outcome (y)')
plt.ylabel('Count')
plt.title('Distribution of Outcome Variable (y)')
plt.show()
y_counts = df['y'].value_counts()
class_proportions = y_counts / len(df)
print(f"Class Counts:\n{y_counts}")
print(f"\nClass Proportions:\n{class_proportions}")
Class Counts:
y
0 38317
1 5037
Name: count, dtype: int64
Class Proportions:
y
0 0.883817
1 0.116183
Name: count, dtype: float64
We can see that there is a significant class imbalance for the outcome variable y.
The majority of clients did not subscribe to a term deposit (y=0), while only a small percentage subscribed (y=1).
This imbalance means my dataset is skewed toward the majority class, making it harder for a model to accurately predict the minority class (y=1).
This also means that the minority class (y=1) is underrepresented, and special care (resampling or reweighting) will be needed to ensure the model can handle this.
If a model always predicts the majority class (y=0), the baseline accuracy will equal the proportion of y=0, which in this case is 88%.
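As a quick check (a minimal sketch, not in the original run), this baseline can be computed directly:
# accuracy of a trivial model that always predicts the majority class (y=0)
baseline_accuracy = (df['y'] == 0).mean()
print(f'Baseline accuracy: {baseline_accuracy:.2%}')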
Next, I want to see whether there is a linear relationship between age and balance, as I believe that older clients should have a higher average yearly balance. I will fit a Linear Regression to test this.
from sklearn.linear_model import LinearRegression
X = df[['age']]
y = df['balance']
model = LinearRegression()
model.fit(X, y)
print(f"Intercept: {model.intercept_:.2f}")
print(f"Coefficient: {model.coef_[0]:.2f}")
plt.figure(figsize=(8, 6))
plt.scatter(X, y, alpha=0.3, label='Data points')
plt.plot(X, model.predict(X), color='red', label='Regression line')
plt.xlabel("Age")
plt.ylabel("Balance")
plt.title("Relationship Between Age and Balance")
plt.legend()
plt.show()
Intercept: 210.42
Coefficient: 28.07
The data points are widely dispersed, and there doesn't appear to be a strong trend between age and balance. The regression line looks almost flat: the slope of about 28 euros per year of age is tiny relative to the spread of balances (std ≈ 3,040). This suggests there is no meaningful linear relationship between the two variables.
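To quantify this impression (a sketch; this value was not computed in the original run, but a result near 0 would confirm the visual reading), we can check the coefficient of determination:
# R^2 near 0 means age explains almost none of the variance in balance
print(f'R^2: {model.score(X, y):.4f}')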
Now, let's build a heatmap to see how the features relate to each other.
import seaborn as sns
import matplotlib.pyplot as plt
correlation_matrix = df.corr()
plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", cbar=True)
plt.title("Correlation Heatmap of Features", fontsize=16)
plt.show()
This heatmap does not reveal much useful information. Some takeaways are:
age and y (0.02): there is no significant linear relationship between age and the likelihood of subscribing to a term deposit.
default and y (-0.02): default status has almost no relationship with the target variable. These two variables are likely to be dropped if I apply Lasso regularization.
pdays and previous (0.45): these features have a moderate positive correlation, so multicollinearity might exist. This could indicate redundancy, and one of them might be removed during feature selection.
The weak correlation between most features and the target variable y suggests that no single feature strongly predicts the likelihood of subscribing to a term deposit. The model will likely need to rely on combinations of features.
Next, I am going to split the data into training and testing data after I use oversampling to solve the class imbalance.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter
X = df.drop(columns=['y'])
y = df['y']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
print("Original training set class distribution:", Counter(y_train))
print("Balanced training set class distribution:", Counter(y_train_balanced))
Original training set class distribution: Counter({0: 30653, 1: 4030})
Balanced training set class distribution: Counter({0: 30653, 1: 30653})
4.Logistic Regression#
Since this is a binary classification problem, Logistic Regression is a reasonable choice.
First, I will scale the numerical features.
from sklearn.preprocessing import StandardScaler
import numpy as np
numeric_features = ['age', 'balance', 'campaign', 'pdays', 'previous', 'days_since_contact']
categorical_features = ['default', 'housing', 'loan', 'marital_married', 'marital_single', 'education_secondary', 'education_tertiary']
scaler = StandardScaler()
X_train_scaled_numeric = scaler.fit_transform(X_train_balanced[numeric_features])
X_test_scaled_numeric = scaler.transform(X_test[numeric_features])
X_train_scaled = np.hstack([
    X_train_scaled_numeric,
    X_train_balanced[categorical_features].values
])
X_test_scaled = np.hstack([
    X_test_scaled_numeric,
    X_test[categorical_features].values
])
Next, I am training the Logistic Regression Model with the balanced and scaled X_train and the balanced y_train.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
import pandas as pd
logistic_model = LogisticRegression(max_iter=1000, random_state=42)
logistic_model.fit(X_train_scaled, y_train_balanced)
coefficients = pd.DataFrame({
    # must match the column order used to build X_train_scaled (numeric first, then categorical);
    # X_train_balanced.columns is in a different order and would mislabel the coefficients
    'Feature': numeric_features + categorical_features,
    'Coefficient': logistic_model.coef_[0]
}).sort_values(by='Coefficient', ascending=False)
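The coefficients table is built but never displayed; printing it (a small addition, not in the original run) makes the fitted model easier to interpret:
# features with large positive coefficients push predictions toward 'yes'
print(coefficients.to_string(index=False))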
Next, I am plotting the confusion matrix as a visualization of the result and calculating the accuracy score.
import seaborn as sns
import matplotlib.pyplot as plt
y_pred = logistic_model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Accuracy: 0.77
True Negatives (TN = 6274): The model correctly predicted “No” (class 0) for 6274 clients who actually did not subscribe to a term deposit.
False Positives (FP = 1390): The model incorrectly predicted “Yes” (class 1) for 1390 clients who actually did not subscribe.
False Negatives (FN = 637): The model incorrectly predicted “No” (class 0) for 637 clients who actually subscribed.
True Positives (TP = 370): The model correctly predicted “Yes” (class 1) for 370 clients who actually subscribed.
I am aware that Lasso regularization is mainly used to prevent overfitting, but I'd still like to see whether it improves my results. I am also using GridSearchCV for hyperparameter tuning.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
logistic_lasso = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000, random_state=42)
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100]
}
grid = GridSearchCV(logistic_lasso, param_grid, scoring='accuracy', cv=5)
grid.fit(X_train_scaled, y_train_balanced)
print("Best Regularization Strength (C):", grid.best_params_)
best_lasso_model = grid.best_estimator_
best_lasso_model.fit(X_train_scaled, y_train_balanced)
y_pred_lasso = best_lasso_model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred_lasso)
cm = confusion_matrix(y_test, y_pred_lasso)
print(f"Accuracy: {accuracy:.2f}")
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Lasso Regularization)')
plt.show()
Best Regularization Strength (C): {'C': 1}
Accuracy: 0.77
Both Logistic Regression and Logistic Regression with Lasso achieve 77% accuracy. This is lower than the baseline accuracy but demonstrates that the model is attempting to balance predictions for both classes, rather than always predicting the majority class.
A model achieving 77% accuracy while predicting both classes is more meaningful than a naive model achieving 88% accuracy but ignoring one class entirely.
I need to focus on metrics beyond accuracy to measure the performance of my models. I am going to use Recall.
from sklearn.metrics import recall_score
y_pred_logistic = logistic_model.predict(X_test_scaled)
y_pred_lasso = best_lasso_model.predict(X_test_scaled)
recall_logistic = recall_score(y_test, y_pred_logistic)
print(f"Recall (Logistic Regression): {recall_logistic:.2f}")
recall_lasso = recall_score(y_test, y_pred_lasso)
print(f"Recall (Logistic Regression with Lasso): {recall_lasso:.2f}")
Recall (Logistic Regression): 0.37
Recall (Logistic Regression with Lasso): 0.37
Sadly, changing the metric doesn't change the fact that the model performs poorly at predicting the Yes category. With a recall of 0.37, my model fails to capture the majority of potential subscribers, which is problematic if the goal is to target clients likely to subscribe.
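For a fuller picture (a sketch, not part of the original run), sklearn's classification_report summarizes precision, recall, and F1 for both classes at once:
from sklearn.metrics import classification_report
# per-class precision/recall/F1 for the plain Logistic Regression model
print(classification_report(y_test, y_pred_logistic, target_names=['No', 'Yes']))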
Next, I am going to use the K-Nearest Neighbors (KNN) model to see if this nonparametric model can produce a better result.
5.KNN Model#
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train_balanced)
y_pred_knn = knn.predict(X_test_scaled)
accuracy_knn = accuracy_score(y_test, y_pred_knn)
recall_knn = recall_score(y_test, y_pred_knn)
cm_knn = confusion_matrix(y_test, y_pred_knn)
print(f"KNN Accuracy: {accuracy_knn:.2f}")
print(f"KNN Recall: {recall_knn:.2f}")
plt.figure(figsize=(8, 6))
sns.heatmap(cm_knn, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (KNN)')
plt.show()
KNN Accuracy: 0.78
KNN Recall: 0.41
Next, I will use GridSearchCV again to identify the best parameters.
param_grid = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'metric': ['euclidean', 'manhattan']
}
knn = KNeighborsClassifier()
grid_knn = GridSearchCV(knn, param_grid, scoring='recall', cv=5)
grid_knn.fit(X_train_scaled, y_train_balanced)
best_knn = grid_knn.best_estimator_
y_pred_best_knn = best_knn.predict(X_test_scaled)
accuracy_best_knn = accuracy_score(y_test, y_pred_best_knn)
recall_best_knn = recall_score(y_test, y_pred_best_knn)
print(f"Best KNN Accuracy: {accuracy_best_knn:.2f}")
print(f"Best KNN Recall: {recall_best_knn:.2f}")
Best KNN Accuracy: 0.78
Best KNN Recall: 0.40
Accuracy = 0.78: The KNN model achieves an accuracy of 78%, slightly better than the Logistic Regression model's 77%.
Recall = 0.40: The tuned KNN model correctly identifies 40% of the actual positive cases (y=1), better than the Logistic Regression model's recall of 37% and thus slightly better at capturing the minority class (clients who subscribed to the term deposit). Interestingly, the tuned model's test recall (0.40) is no higher than the default k=5 model's (0.41): cross-validated recall on the SMOTE-balanced training set does not necessarily track recall on the imbalanced test set.
Overall, the improvement in recall suggests that KNN is slightly better at identifying the minority class (y=1), even though the gain is modest.
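For reference (a small sketch, not shown in the original run), the hyperparameters selected by the grid search can be inspected directly:
# the combination chosen by 5-fold CV with recall as the scoring metric
print(grid_knn.best_params_)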
Next, I want to try clustering and see if it gives any surprising results.
6.Clustering with K-Means and PCA for Dimensionality Reduction#
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
kmeans = KMeans(n_clusters=2, random_state=42)
cluster_labels = kmeans.fit_predict(X_train_scaled)
pca = PCA(n_components=2)
X_train_pca = pca.fit_transform(X_train_scaled)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_train_pca[:, 0], y=X_train_pca[:, 1], hue=cluster_labels, palette='Set1', alpha=0.7)
plt.title('K-Means Clusters (2D Projection)')
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.legend(title='Cluster')
plt.show()
To be honest, I was surprised by the above graph: at first, I did not understand what it was telling me, so I asked ChatGPT to interpret it for me. Here are some takeaways:
There is a significant overlap between the clusters in the lower-left region of the plot, indicating that the features may not provide enough separation for distinct groups.
The points in Cluster 1 (blue) extend further along PCA Component 1, suggesting this cluster captures samples with distinct characteristics on the principal feature captured by the PCA.
Significant overlap in the clusters aligns with my classification models (Logistic Regression, KNN) struggling to achieve higher recall for the minority class.
In the code above, I specified that there are 2 clusters. For fun, I want to test how many clusters would minimize the loss if that information were not given.
The following code is adapted from the Clustering file posted by the instructor.
# Range of k values
k_values = range(1, 10)
inertias = []
# Apply k-Means for different values of k and record the inertia
# (note: this fits the raw feature matrix X; for consistency with the
# clustering above, X_train_scaled could be used instead)
for k in k_values:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X)
    inertias.append(kmeans.inertia_)
# Plotting the results
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
# Loss (inertia) plot
ax.plot(k_values, inertias, marker='o')
ax.set_title('Loss vs. Number of clusters')
ax.set_xlabel('Number of clusters (k)')
ax.set_ylabel('Loss')
ax.grid(True)
plt.tight_layout()
plt.show()
cluster_numbers = [3, 4]
fig, axes = plt.subplots(1, len(cluster_numbers), figsize=(12, 5), sharex=True, sharey=True)
for idx, n_clusters in enumerate(cluster_numbers):
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(X_train_scaled)
    pca = PCA(n_components=2)
    X_train_pca = pca.fit_transform(X_train_scaled)
    ax = axes[idx]
    sns.scatterplot(
        x=X_train_pca[:, 0],
        y=X_train_pca[:, 1],
        hue=cluster_labels,
        palette='Set1',
        alpha=0.7,
        ax=ax
    )
    ax.set_title(f'K-Means Clusters (k={n_clusters})')
    ax.set_xlabel('PCA Component 1')
    ax.set_ylabel('PCA Component 2')
    ax.legend(title='Cluster')
plt.tight_layout()
plt.show()
It is interesting that the elbow plot suggests k=3 or k=4 as the best choice, rather than k=2 as intended. This might mean that clustering is not a good way to approach this problem, or that the original dataset does not contain enough information to recover the two classes as clusters.
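As a complementary check (a sketch, not in the original run), the silhouette score rates how well separated the clusters are for each candidate k; higher is better:
from sklearn.metrics import silhouette_score
for k in [2, 3, 4]:
    labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_train_scaled)
    # sample_size keeps the pairwise-distance computation tractable
    score = silhouette_score(X_train_scaled, labels, sample_size=5000, random_state=42)
    print(f'k={k}: silhouette = {score:.3f}')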
Next, I am going to try Random Forest.
7.Random Forest#
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=42, n_estimators=100)
rf.fit(X_train_scaled, y_train_balanced)
y_pred_rf = rf.predict(X_test_scaled)
y_prob_rf = rf.predict_proba(X_test_scaled)[:, 1]
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.2f}")
recall_rf = recall_score(y_test, y_pred_rf)
print(f"Random Forest Recall: {recall_rf:.2f}")
cm_rf = confusion_matrix(y_test, y_pred_rf)
print("Confusion Matrix:")
print(cm_rf)
plt.figure(figsize=(8, 6))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Random Forest)')
plt.show()
Random Forest Accuracy: 0.84
Random Forest Recall: 0.37
Confusion Matrix:
[[6879 785]
[ 631 376]]
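The cell above computes y_prob_rf but never uses it. One idea it enables (a sketch with a hypothetical threshold of 0.3; the value is illustrative and would need tuning on a validation set) is lowering the default 0.5 decision threshold, trading some accuracy for recall on the minority class:
# classify as 'yes' whenever the predicted probability exceeds the threshold
threshold = 0.3  # hypothetical value, not tuned here
y_pred_rf_thresh = (y_prob_rf >= threshold).astype(int)
print(f'Accuracy at threshold {threshold}: {accuracy_score(y_test, y_pred_rf_thresh):.2f}')
print(f'Recall at threshold {threshold}: {recall_score(y_test, y_pred_rf_thresh):.2f}')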
I have learned that with Random Forest you can actually inspect the importance of each feature.
feature_importances = rf.feature_importances_
features = ['age', 'balance', 'campaign', 'pdays', 'previous', 'days_since_contact',
            'default', 'housing', 'loan', 'marital_married', 'marital_single',
            'education_secondary', 'education_tertiary']
importance_df = pd.DataFrame({'Feature': features, 'Importance': feature_importances})
importance_df = importance_df.sort_values(by='Importance', ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x='Importance', y='Feature', data=importance_df)
plt.title('Feature Importance (Random Forest)')
plt.show()
In the graph, we can see that days_since_contact and balance are the two most important features, which is surprising.
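To back up the visual ranking with exact numbers (a small sketch, not in the original run):
# the importance values behind the bar chart, sorted descending
print(importance_df.to_string(index=False))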
8.Conclusion#
In this project, my goal was to predict term-deposit subscription from the given features. I used multiple models: Logistic Regression, Logistic Regression with Lasso, KNN, K-Means clustering, and Random Forest. Since the data had a skewed distribution between the majority and minority classes, accuracy cannot be an effective measure of the models' performance; instead, we should look at the recall score. The best recall came from the KNN model with k=5, at 41%. Even so, the models remain unreliable at predicting the positive class. I have done everything I could think of based on my current knowledge; if I had the chance to learn more techniques, perhaps better ways to preprocess the data or new models, I might achieve better results.