Fraud Detection#

Author: Madeline Chu

Course Project, UC Irvine, Math 10, Spring 25

I would like to post my notebook on the course’s website. Yes

https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/data

Imports#

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import graphviz
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from collections import Counter
from IPython.display import display
from sklearn.tree import DecisionTreeClassifier, export_graphviz

Stratified Sample#

The original data contained 5 million samples of financial transactions. I will be using a sample size of 5000.

I am stratifying the data by whether the activity is fraudulent or not. This ensures that the sample contains enough cases of financial fraud for me to analyze the relationship between fraud and the features.

# This is all code that I used to stratify the sample.

# Load the data set
df = pd.read_csv('/Users/madelinechu/Downloads/financial_fraud_detection_dataset.csv')

#change the NaN values from fraudtype to 'Not Fraud' so they don't get dropped
df['fraud_type'] = df['fraud_type'].fillna('Not Fraud')

#cleaning and reindexing so I can create a stratified sample
df_clean = df.dropna()
df_clean = df_clean.reset_index(drop=True)

# name of the target column used for stratification
target_col = 'is_fraud'

# Stratified sampling, these lines are from Chatgpt.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=5000, random_state=8)

for _, sample_index in splitter.split(df_clean, df_clean[target_col]):
    stratified_sample = df_clean.loc[sample_index]
# Save the 5,000-row stratified sample to a CSV file
stratified_sample.to_csv('/Users/madelinechu/Downloads/stratified_fraud_sample_5000.csv', index=False)

# Show sample
stratified_sample.head()
|  | transaction_id | timestamp | sender_account | receiver_account | amount | transaction_type | merchant_category | location | device_used | is_fraud | fraud_type | time_since_last_transaction | spending_deviation_score | velocity_score | geo_anomaly_score | payment_channel | ip_address | device_hash |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2185037 | T3154711 | 2023-03-30T07:15:17.369205 | ACC247896 | ACC799578 | 224.51 | withdrawal | utilities | Dubai | mobile | False | Not Fraud | -3339.180043 | 0.79 | 8 | 0.49 | card | 188.113.95.154 | D2057361 |
| 2482596 | T3461026 | 2023-10-02T00:28:00.955486 | ACC460461 | ACC454022 | 16.83 | payment | grocery | Berlin | mobile | False | Not Fraud | 2234.889287 | 0.80 | 18 | 0.06 | card | 203.60.84.46 | D4574144 |
| 181111 | T737613 | 2023-10-10T18:10:04.452613 | ACC169410 | ACC432109 | 89.04 | payment | other | Toronto | web | False | Not Fraud | -794.250890 | -0.36 | 8 | 0.32 | card | 152.29.53.109 | D6762473 |
| 3571094 | T4564845 | 2023-09-22T20:53:07.315571 | ACC492144 | ACC410595 | 4.12 | transfer | grocery | Dubai | web | True | card_not_present | 3705.568076 | 0.28 | 11 | 0.18 | ACH | 230.161.113.129 | D8221144 |
| 972296 | T1842120 | 2023-04-30T08:12:58.062014 | ACC339542 | ACC130395 | 393.32 | withdrawal | travel | Sydney | mobile | False | Not Fraud | -4001.069331 | 1.89 | 15 | 0.03 | ACH | 13.91.200.205 | D9433689 |

Features#

Transaction ID - ID number for each transaction.

Timestamp - Time which the transaction took place. In ISO format.

Sender Account - Account number of the sender.

Receiver Account - Account number of the receiver.

Amount - Amount involved in the transaction.

Transaction Type - Type of transaction: deposit, withdrawal, transfer, or payment.

Merchant Category - Type of business involved in the payment (retail, travel, etc.).

Location - City where the transaction took place.

Device Used - Type of device. (mobile, POS, ATM, etc.)

Is Fraud - Boolean indicating whether the transaction was fraudulent.

Fraud Type - Type of fraud (hacking, money laundering, account takeover, etc.).

Time Since Last Transaction - Time in hours between transactions.

Spending Deviation Score - Deviation of spending from a Gaussian distribution.

Velocity Score - Number of transactions made within a short period of time (a rate, not a time).

Geo Anomaly Score - Measure of how geographically unusual the transaction was.

Payment Channel - Channel used: card, wire transfer, etc.

IP Address - IP address associated with the transaction.

Device Hash - A unique identifier for the device.

Data Cleaning#

I will start by taking all my categorical data and turning it into numerical data. After doing some research, I was choosing between one-hot encoding and target encoding; since I ultimately wanted to keep the number of columns to a minimum, I will be applying the target encoding method to my dataset.
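To illustrate what target encoding does, here is a minimal sketch on a made-up toy column (the values are hypothetical, not from the dataset): each category is replaced by the mean of the binary target among the rows with that category.

# Toy illustration of target encoding (hypothetical values, not from the dataset)
toy = pd.DataFrame({
    'payment_channel': ['card', 'ACH', 'card', 'UPI', 'ACH', 'card'],
    'is_fraud':        [0,      1,     0,      1,     0,     1]
})

# replace each category by the mean of the target for that category
channel_means = toy.groupby('payment_channel')['is_fraud'].mean()
toy['payment_channel_TE'] = toy['payment_channel'].map(channel_means)
print(toy)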

#the names of the categorical columns to be target encoded
encode_cols = ['transaction_type', 'merchant_category', 'location', 'device_used', 'payment_channel']

#Converting the timestamp column to datetime and extracting only the hour of day the transaction was made.
#I chose the hour of day because it seemed most likely to carry information about fraud vs. not-fraud.
stratified_sample['timestamp'] = pd.to_datetime(stratified_sample['timestamp'], format='ISO8601', utc=True) #code fixed by chatgpt
stratified_sample['hour'] = stratified_sample['timestamp'].dt.hour

#splitting the stratified sample into training and testing sets, stratified on the target
train_df, test_df = train_test_split(stratified_sample, test_size=0.2, stratify=stratified_sample['is_fraud'], random_state=8)

stratified_sample.head()
|  | transaction_id | timestamp | sender_account | receiver_account | amount | transaction_type | merchant_category | location | device_used | is_fraud | fraud_type | time_since_last_transaction | spending_deviation_score | velocity_score | geo_anomaly_score | payment_channel | ip_address | device_hash | hour |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2185037 | T3154711 | 2023-12-11 23:53:08.903327+00:00 | ACC247896 | ACC799578 | 224.51 | withdrawal | utilities | Dubai | mobile | False | Not Fraud | -3339.180043 | 0.79 | 8 | 0.49 | card | 188.113.95.154 | D2057361 | 23 |
| 2482596 | T3461026 | 2023-05-06 00:23:18.546703+00:00 | ACC460461 | ACC454022 | 16.83 | payment | grocery | Berlin | mobile | False | Not Fraud | 2234.889287 | 0.80 | 18 | 0.06 | card | 203.60.84.46 | D4574144 | 0 |
| 181111 | T737613 | 2023-03-29 01:48:24.296081+00:00 | ACC169410 | ACC432109 | 89.04 | payment | other | Toronto | web | False | Not Fraud | -794.250890 | -0.36 | 8 | 0.32 | card | 152.29.53.109 | D6762473 | 1 |
| 3571094 | T4564845 | 2023-11-28 09:11:16.099287+00:00 | ACC492144 | ACC410595 | 4.12 | transfer | grocery | Dubai | web | True | card_not_present | 3705.568076 | 0.28 | 11 | 0.18 | ACH | 230.161.113.129 | D8221144 | 9 |
| 972296 | T1842120 | 2023-04-05 05:37:45.332087+00:00 | ACC339542 | ACC130395 | 393.32 | withdrawal | travel | Sydney | mobile | False | Not Fraud | -4001.069331 | 1.89 | 15 | 0.03 | ACH | 13.91.200.205 | D9433689 | 5 |
#target column name and list of feature column names
target_col = 'is_fraud'
feature_cols = ['amount',
                'velocity_score',
                'geo_anomaly_score',
                'transaction_type_TE',
                'merchant_category_TE',
                'location_TE',
                'device_used_TE',
                'payment_channel_TE',
               'hour']

#target-encode each categorical column using the mean of the target computed on the training set
for col in encode_cols:
    target_means = train_df.groupby(col)[target_col].mean() # command from chatgpt
    train_df[f'{col}_TE'] = train_df[col].map(target_means) # command from chatgpt
    # categories unseen in training would map to NaN, so fall back to the overall training fraud rate
    test_df[f'{col}_TE'] = test_df[col].map(target_means).fillna(train_df[target_col].mean())

train_df.head()

#correlation matrix, code taken from Professor Ray's website
corr = train_df[feature_cols + ['is_fraud']].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt = ".2f")
print(corr)
                        amount  velocity_score  geo_anomaly_score  \
amount                1.000000       -0.024284           0.002632   
velocity_score       -0.024284        1.000000          -0.006721   
geo_anomaly_score     0.002632       -0.006721           1.000000   
transaction_type_TE   0.512677       -0.033048           0.008495   
merchant_category_TE  0.005506       -0.025558           0.011403   
location_TE          -0.007487       -0.003133           0.009528   
device_used_TE        0.043387        0.003006           0.001229   
payment_channel_TE    0.028426        0.005446           0.011014   
hour                  0.027196        0.004868          -0.019790   
is_fraud              0.009866       -0.003198           0.003820   

                      transaction_type_TE  merchant_category_TE  location_TE  \
amount                           0.512677              0.005506    -0.007487   
velocity_score                  -0.033048             -0.025558    -0.003133   
geo_anomaly_score                0.008495              0.011403     0.009528   
transaction_type_TE              1.000000              0.004880    -0.016420   
merchant_category_TE             0.004880              1.000000    -0.009343   
location_TE                     -0.016420             -0.009343     1.000000   
device_used_TE                   0.009352              0.027775     0.016785   
payment_channel_TE               0.022888              0.003020    -0.013950   
hour                             0.006624             -0.005921    -0.011389   
is_fraud                         0.013632              0.051202     0.042580   

                      device_used_TE  payment_channel_TE      hour  is_fraud  
amount                      0.043387            0.028426  0.027196  0.009866  
velocity_score              0.003006            0.005446  0.004868 -0.003198  
geo_anomaly_score           0.001229            0.011014 -0.019790  0.003820  
transaction_type_TE         0.009352            0.022888  0.006624  0.013632  
merchant_category_TE        0.027775            0.003020 -0.005921  0.051202  
location_TE                 0.016785           -0.013950 -0.011389  0.042580  
device_used_TE              1.000000            0.018502 -0.011701  0.018497  
payment_channel_TE          0.018502            1.000000 -0.006758  0.050734  
hour                       -0.011701           -0.006758  1.000000  0.002668  
is_fraud                    0.018497            0.050734  0.002668  1.000000  
[Figure: heatmap of the correlation matrix shown above]

We can see that whether a case is fraudulent does not have a strong correlation with any of the features. This makes sense, as most of the features record a single characteristic of the transaction, such as the device, the amount, or the location. Thinking about the real world, we would not expect any one characteristic of a transaction on its own to strongly determine whether it is fraud or not.

Addressing the minority class#

In a data set like this, the number of fraudulent cases is expected to be much lower, simply because most financial transactions are not fraudulent. Following a suggestion from ChatGPT, I began to research SMOTE. SMOTE is advantageous because it creates more instances of the minority class of the target feature. It is an interesting method because it borrows from the concept of k-nearest neighbors, interpolating between existing fraud cases to create new synthetic fraud cases.

code was learned from: https://www.geeksforgeeks.org/smote-for-imbalanced-classification-with-python/
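As a minimal sketch of the interpolation idea behind SMOTE (an illustration with made-up points, not the imblearn implementation itself): a synthetic minority sample is placed at a random point on the line segment between a fraud case and one of its nearest fraud neighbors.

# Illustration of SMOTE-style interpolation between two hypothetical fraud points
rng = np.random.default_rng(8)

x_i = np.array([120.0, 0.45])        # hypothetical fraud case, e.g. (amount, geo_anomaly_score)
x_neighbor = np.array([90.0, 0.60])  # one of its nearest fraud neighbors

lam = rng.random()                            # random lambda in [0, 1)
x_synthetic = x_i + lam * (x_neighbor - x_i)  # new synthetic fraud sample on the segment
print(x_synthetic)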

#all feature columns
feature_cols = ['amount',
                'velocity_score',
                'geo_anomaly_score',
                'transaction_type_TE',
                'merchant_category_TE',
                'location_TE',
                'device_used_TE',
                'payment_channel_TE',
               'hour']
#applying smote 
smote = SMOTE(random_state=8)

# separated the features from the target.
X_train = train_df[feature_cols]
y_train = train_df['is_fraud']

#resampling with SMOTE
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

#Separating my testing sets
X_test = test_df[feature_cols] 
y_test = test_df['is_fraud']
# Plotting before and after smote results

#converting the resampled target Series to a DataFrame so it can be passed to countplot
y_train_df = y_train_resampled.to_frame()


fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.countplot(x = 'is_fraud', data = train_df, ax=axes[0])
axes[0].set_title('Before SMOTE')
axes[0].set_xlabel('Is Fraud')
axes[0].set_ylabel('Count')

sns.countplot(x = 'is_fraud', data = y_train_df, ax=axes[1])
axes[1].set_title('After SMOTE')
axes[1].set_xlabel('Is Fraud')
axes[1].set_ylabel('Count')
plt.tight_layout()
[Figure: count of is_fraud classes before SMOTE (left) and after SMOTE (right)]

Logistic Regression#

Now I perform logistic regression, fitting on the SMOTE-resampled training set and then predicting on the test set. Keep in mind that my training and testing sets only contain the features I deem important; for example, I did not include the columns containing the IP address, the transaction ID, and so on.

model = LogisticRegression(max_iter=1000) # max_iter from chat gpt
# fitting the logistic regression
model.fit(X_train_resampled, y_train_resampled)

# predicting the amount of fraud from the testing set
Y_pred = model.predict(X_test)
Y_pred_series = pd.Series(Y_pred)

#resetting the index for the testing and predicted values.
Y_actual_new_idx = test_df['is_fraud'].reset_index(drop=True)
Y_pred_series = pd.Series(Y_pred).reset_index(drop=True)

#dataframe for compare test
df_compare_test = pd.DataFrame({ 'Actual' : Y_actual_new_idx, 'Predicted' : Y_pred_series})

#plotting a confusion matrix
confusion_mtrx = confusion_matrix(df_compare_test['Actual'], df_compare_test['Predicted'])
sns.heatmap(confusion_mtrx, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Fraud', 'Fraud'], yticklabels=['Not Fraud', 'Fraud'])
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
[Figure: confusion matrix for the logistic regression predictions on the test set]

It seems that a basic logistic regression did not accurately classify the non-fraudulent cases. This makes sense because logistic regression assumes a linear decision boundary, in this case a multidimensional plane, which in turn assumes a clear separation between fraudulent and non-fraudulent cases. That is unlikely here, since a fraudulent transaction can look very similar to a non-fraudulent one.
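To spell out that linearity assumption in the standard logistic regression notation (with \(w\) the learned weight vector and \(b\) the intercept), the model estimates

\[
P(\text{fraud} = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}},
\]

so the decision boundary, the set of points where \(w^\top x + b = 0\), is a flat hyperplane in feature space; the model can only separate fraud from non-fraud with a single flat cut.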

Now we can try K-nearest neighbors.#

We want to start by finding the optimal K.

#testing k values from 1 to 29 nearest neighbors
k_range = range(1, 30)

#comparing the score for each value of k neighbors.
k_scores = []

for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X_train_resampled, y_train_resampled, cv=5, scoring='accuracy')
    k_scores.append(scores.mean())
#Plotting the cross-validated accuracy against each value of k
plt.figure(figsize=(10, 6))
plt.plot(k_range, k_scores)
plt.title('Accuracy vs. Number of Neighbors (K)')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Cross-Validated Accuracy')

best_k = k_range[np.argmax(k_scores)]
best_score = max(k_scores)

plt.scatter(best_k, best_score, color='red', s=150, label=f'Best k = {best_k}')
plt.legend()

plt.show()

print(f'Best k value: {best_k}')
print(f'Best Cross-Val-Score: {best_score:.4f}')

model_knn = KNeighborsClassifier(n_neighbors=best_k)
model_knn.fit(X_train_resampled, y_train_resampled)
y_pred_knn = model_knn.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_knn)}')
print('Classification Report:')
print(classification_report(y_test, y_pred_knn, zero_division=0)) # code from chatgpt
[Figure: cross-validated accuracy vs. number of neighbors K, with the best k marked in red]
Best k value: 1
Best Cross-Val-Score: 0.8356
Accuracy: 0.744
Classification Report:
              precision    recall  f1-score   support

       False       0.96      0.76      0.85       956
        True       0.05      0.30      0.09        44

    accuracy                           0.74      1000
   macro avg       0.51      0.53      0.47      1000
weighted avg       0.92      0.74      0.82      1000
cm = confusion_matrix(y_test, y_pred_knn)
sns.heatmap(cm, annot=True, fmt='d', cmap='Reds',xticklabels=['Not Fraud', 'Fraud'], yticklabels=['Not Fraud', 'Fraud'])
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()
[Figure: confusion matrix for the KNN predictions on the test set]

As we can see, the KNN method is better at classifying cases than the logistic regression. For example, the logistic regression predicted that 455 of the non-fraud cases were fraudulent, whereas with 1 nearest neighbor only 225 non-fraudulent cases were classified incorrectly. So in the case of fraud detection, K-nearest neighbors appears more effective. This makes sense because rather than assuming a clear global relationship separating the fraud cases, we are classifying each transaction by the data points closest to it.
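To make that comparison concrete, here is a short sketch (assuming the fitted model and model_knn objects from the cells above are still in memory) that prints each classifier's recall on the fraud and non-fraud classes:

from sklearn.metrics import recall_score

# compare per-class recall of the two fitted classifiers on the same test set
for name, clf_fitted in [('Logistic Regression', model), (f'KNN (k={best_k})', model_knn)]:
    preds = clf_fitted.predict(X_test)
    fraud_recall = recall_score(y_test, preds, pos_label=True)
    nonfraud_recall = recall_score(y_test, preds, pos_label=False)
    print(f'{name}: fraud recall = {fraud_recall:.2f}, non-fraud recall = {nonfraud_recall:.2f}')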

Visualization of each feature and the target value.#

Now I am interested in whether there is a specific feature that is prominent across these fraud cases. We will do this by looking at the random forest feature-importance graph.

Code adopted from: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Code fixed by chatgpt.

# Train model
rf_model = RandomForestClassifier(random_state=8)
rf_model.fit(X_train_resampled, y_train_resampled)

# Plot feature importances
importances = rf_model.feature_importances_
feature_names = X_train_resampled.columns

# Create bar plot
plt.figure(figsize=(10, 6))
pd.Series(importances, index=feature_names).sort_values().plot(kind='barh')
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance Score')
plt.show()
[Figure: random forest feature importance bar plot]

Note that payment_channel_TE has the highest importance value of all our features. Even so, its importance score is only around 0.29, which is not large on a scale where the importances of all features sum to 1.
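As a quick check on that scale (a one-liner assuming the rf_model fitted above), scikit-learn normalizes the importances so that they sum to 1:

# feature importances from the fitted random forest are normalized to sum to 1
print(rf_model.feature_importances_.sum())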

Interpreting the decision tree.#

On the decision tree, the value field is given in the form [non-fraud count, fraud count].
Code from: https://scikit-learn.org/stable/modules/tree.html
Learned from: https://www.geeksforgeeks.org/decision-tree-implementation-python/

import graphviz
from sklearn.tree import DecisionTreeClassifier, export_graphviz


# Create a decision tree classifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)


# Fit the classifier on the dataset
clf.fit(X_train_resampled, y_train_resampled)

# Extract decision tree information
dot_data = export_graphviz(clf, out_file=None, feature_names=feature_cols)

# Create a graph object and render
graph = graphviz.Source(dot_data)
display(graph)
[Figure: the fitted decision tree (max depth 3) rendered with graphviz]

If you look at the final classification, at the second branch node from the right, the tree correctly determined that cases whose payment_channel_TE value satisfies \(0.039 \leq x \leq 0.062\) are fraud cases. So we can take a look at the relation between the payment channel and whether a case is fraud or not.

#We are looking at the training rows whose payment_channel_TE falls in this range, and at which payment channel corresponds to that value.
train_df.loc[(train_df['payment_channel_TE']>=0.039) & (train_df['payment_channel_TE']<=0.062)].head(3)
|  | transaction_id | timestamp | sender_account | receiver_account | amount | transaction_type | merchant_category | location | device_used | is_fraud | ... | geo_anomaly_score | payment_channel | ip_address | device_hash | hour | transaction_type_TE | merchant_category_TE | location_TE | device_used_TE | payment_channel_TE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2519793 | T3499115 | 2023-07-06 02:00:02.246527+00:00 | ACC536513 | ACC597424 | 46.06 | transfer | utilities | London | web | False | ... | 0.62 | UPI | 31.16.229.76 | D4775399 | 2 | 0.044834 | 0.033272 | 0.045455 | 0.040415 | 0.061866 |
| 3405053 | T4397515 | 2023-09-06 08:33:07.867799+00:00 | ACC598679 | ACC663765 | 16.46 | transfer | entertainment | New York | pos | False | ... | 0.69 | UPI | 100.127.10.209 | D8857882 | 8 | 0.044834 | 0.047170 | 0.053030 | 0.045498 | 0.061866 |
| 740399 | T1563167 | 2023-04-16 05:22:52.174555+00:00 | ACC804679 | ACC412353 | 1274.26 | deposit | online | Toronto | pos | False | ... | 0.16 | UPI | 138.129.139.178 | D5495797 | 5 | 0.046701 | 0.063877 | 0.035225 | 0.045498 | 0.061866 |

3 rows × 24 columns

#We want to count how many fraud cases there are per payment channel
fraud_count = stratified_sample[stratified_sample['is_fraud']==1].groupby('payment_channel').size()
nonfraud_count = stratified_sample[stratified_sample['is_fraud']==0].groupby('payment_channel').size()

#Then we find how many fraud cases there are vs. non-fraud cases
total_counts_fraud = stratified_sample.groupby('is_fraud').size()

#divide the fraud case per payment channel by total fraud case
fraud_rate = fraud_count/total_counts_fraud.iloc[1]

#divide the non-fraud case per payment channel by total non-fraud case
nonfraud_rate = nonfraud_count/total_counts_fraud.iloc[0]

#turn these values into a dataframe so they are easily plotted
rate_df = pd.DataFrame({'Fraud_Rate':fraud_rate,
                        'Non-Fraud_Rate': nonfraud_rate})

#plot the proportions next to each other
rate_df.plot(kind='bar', color=['crimson', 'steelblue'], figsize=(10,6))
plt.title('Fraud Rate by Payment Channel')
plt.xlabel('Payment Channel')
plt.ylabel('Proportion of fraud/non-fraud to total fraud/non-fraud cases')
plt.xticks(rotation=0)
plt.ylim(0, fraud_rate.max()+.05) 
plt.tight_layout()
plt.show()
[Figure: proportion of fraud and non-fraud cases by payment channel]

Conclusion#

In conclusion, although we have narrowed down where many of the fraudulent cases come from, the relationship between fraud and the features I included is inconclusive. Consider the payment channel 'UPI': while many of the fraud cases used this payment channel, there are still many non-fraud cases that make payments through UPI as well. So ultimately the data is inconclusive.