Fraud Detection#
Author: Madeline Chu
Course Project, UC Irvine, Math 10, Spring 25
I would like to post my notebook on the course’s website. Yes
https://www.kaggle.com/datasets/aryan208/financial-transactions-dataset-for-fraud-detection/data
Imports#
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import graphviz
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from collections import Counter
from IPython.display import display
from sklearn.tree import DecisionTreeClassifier, export_graphviz
Stratified Sample#
The original data contained 5 million samples of financial transactions. I will be using a sample size of 5000.
I am stratifying the data by whether the activity is fraud or not; this will ensure that I have enough cases of financial fraud to analyze the relationship between fraud and the other features.
# This is all code that I used to stratify the sample.
# Load the data set
df = pd.read_csv('/Users/madelinechu/Downloads/financial_fraud_detection_dataset.csv')
#change the NaN values in fraud_type to 'Not Fraud' so those rows don't get dropped
df['fraud_type'] = df['fraud_type'].fillna('Not Fraud')
#cleaning and reindexing so I can create a stratified sample
df_clean = df.dropna()
df_clean = df_clean.reset_index(drop=True)
#create an index for the column
target_col = 'is_fraud'
# Stratified sampling, these lines are from Chatgpt.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=5000, random_state=8)
for _, sample_index in splitter.split(df_clean, df_clean[target_col]):
stratified_sample = df_clean.loc[sample_index]
# Save the 5000-row stratified sample
stratified_sample.to_csv('/Users/madelinechu/Downloads/stratified_fraud_sample_5000.csv', index=False)
# Show sample
stratified_sample.head()
| | transaction_id | timestamp | sender_account | receiver_account | amount | transaction_type | merchant_category | location | device_used | is_fraud | fraud_type | time_since_last_transaction | spending_deviation_score | velocity_score | geo_anomaly_score | payment_channel | ip_address | device_hash |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2185037 | T3154711 | 2023-03-30T07:15:17.369205 | ACC247896 | ACC799578 | 224.51 | withdrawal | utilities | Dubai | mobile | False | Not Fraud | -3339.180043 | 0.79 | 8 | 0.49 | card | 188.113.95.154 | D2057361 |
| 2482596 | T3461026 | 2023-10-02T00:28:00.955486 | ACC460461 | ACC454022 | 16.83 | payment | grocery | Berlin | mobile | False | Not Fraud | 2234.889287 | 0.80 | 18 | 0.06 | card | 203.60.84.46 | D4574144 |
| 181111 | T737613 | 2023-10-10T18:10:04.452613 | ACC169410 | ACC432109 | 89.04 | payment | other | Toronto | web | False | Not Fraud | -794.250890 | -0.36 | 8 | 0.32 | card | 152.29.53.109 | D6762473 |
| 3571094 | T4564845 | 2023-09-22T20:53:07.315571 | ACC492144 | ACC410595 | 4.12 | transfer | grocery | Dubai | web | True | card_not_present | 3705.568076 | 0.28 | 11 | 0.18 | ACH | 230.161.113.129 | D8221144 |
| 972296 | T1842120 | 2023-04-30T08:12:58.062014 | ACC339542 | ACC130395 | 393.32 | withdrawal | travel | Sydney | mobile | False | Not Fraud | -4001.069331 | 1.89 | 15 | 0.03 | ACH | 13.91.200.205 | D9433689 |
Features#
Transaction ID - ID number for each transaction.
Timestamp - Time at which the transaction took place, in ISO format.
Sender Account - Account number of the sender.
Receiver Account - Account number of the receiver.
Amount - Amount involved in the transaction.
Transaction Type - Type of transaction: deposit, withdrawal, transfer, or payment.
Merchant Category - Type of business the payment was involved in (retail, travel, etc.).
Location - City where the transaction happened.
Device Used - Type of device (mobile, POS, ATM, etc.).
Is Fraud - Whether the transaction was fraudulent, as a boolean.
Fraud Type - Type of fraud (hacking, money laundering, account takeover, etc.).
Time Since Last Transaction - Time in hours between transactions.
Spending Deviation Score - How far the spending deviates from the account's usual pattern (modeled as a Gaussian distribution).
Velocity Score - Number of transactions made within a short period of time (not a time measurement itself).
Geo Anomaly Score - Measure of how geographically unusual the transaction was.
Payment Channel - Channel used: card, wire transfer, etc.
IP Address - IP address associated with the transaction.
Device Hash - A unique identifier for the device.
Data Cleaning#
I will start by converting all of my categorical data into numerical data. After doing some research, I was choosing between one-hot encoding and target encoding; however, I ultimately wanted to keep the number of columns to a minimum, so I will be applying the target-encoding method to my dataset.
#the names of the categorical columns to encode
encode_cols = ['transaction_type', 'merchant_category', 'location', 'device_used', 'payment_channel']
#Converting the timestamp column to datetime and extracting only the hour of the day the transaction was made.
#I felt the hour of the day would most likely hold the most information with regard to being fraud/not-fraud.
stratified_sample['timestamp'] = pd.to_datetime(stratified_sample['timestamp'], format='ISO8601', utc=True) #code fixed by chatgpt
stratified_sample['hour'] = stratified_sample['timestamp'].dt.hour
#splitting the stratified sample into training and testing sets, stratified by the fraud label
train_df, test_df = train_test_split(stratified_sample, test_size=0.2, stratify=stratified_sample['is_fraud'], random_state=8)
stratified_sample.head()
| | transaction_id | timestamp | sender_account | receiver_account | amount | transaction_type | merchant_category | location | device_used | is_fraud | fraud_type | time_since_last_transaction | spending_deviation_score | velocity_score | geo_anomaly_score | payment_channel | ip_address | device_hash | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2185037 | T3154711 | 2023-12-11 23:53:08.903327+00:00 | ACC247896 | ACC799578 | 224.51 | withdrawal | utilities | Dubai | mobile | False | Not Fraud | -3339.180043 | 0.79 | 8 | 0.49 | card | 188.113.95.154 | D2057361 | 23 |
| 2482596 | T3461026 | 2023-05-06 00:23:18.546703+00:00 | ACC460461 | ACC454022 | 16.83 | payment | grocery | Berlin | mobile | False | Not Fraud | 2234.889287 | 0.80 | 18 | 0.06 | card | 203.60.84.46 | D4574144 | 0 |
| 181111 | T737613 | 2023-03-29 01:48:24.296081+00:00 | ACC169410 | ACC432109 | 89.04 | payment | other | Toronto | web | False | Not Fraud | -794.250890 | -0.36 | 8 | 0.32 | card | 152.29.53.109 | D6762473 | 1 |
| 3571094 | T4564845 | 2023-11-28 09:11:16.099287+00:00 | ACC492144 | ACC410595 | 4.12 | transfer | grocery | Dubai | web | True | card_not_present | 3705.568076 | 0.28 | 11 | 0.18 | ACH | 230.161.113.129 | D8221144 | 9 |
| 972296 | T1842120 | 2023-04-05 05:37:45.332087+00:00 | ACC339542 | ACC130395 | 393.32 | withdrawal | travel | Sydney | mobile | False | Not Fraud | -4001.069331 | 1.89 | 15 | 0.03 | ACH | 13.91.200.205 | D9433689 | 5 |
#target column name and list of feature column names
target_col = 'is_fraud'
feature_cols = ['amount',
'velocity_score',
'geo_anomaly_score',
'transaction_type_TE',
'merchant_category_TE',
'location_TE',
'device_used_TE',
'payment_channel_TE',
'hour']
#target-encode each categorical column using the mean fraud rate computed on the training set
for col in encode_cols:
    target_means = train_df.groupby(col)[target_col].mean() # command from chatgpt
    train_df[f'{col}_TE'] = train_df[col].map(target_means) # command from chatgpt
    test_df[f'{col}_TE'] = test_df[col].map(target_means) # command from chatgpt
corr = train_df[feature_cols + [target_col]].corr()
train_df.head()
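One caveat with this mapping approach (my own note, not part of the original notebook): a category that appears in the test set but never in the training set gets mapped to NaN. A minimal sketch of a fallback, assuming we simply substitute the overall training fraud rate:
#fill any unseen categories in the test set with the global fraud rate from the training data
global_rate = train_df[target_col].mean()
for col in encode_cols:
    test_df[f'{col}_TE'] = test_df[f'{col}_TE'].fillna(global_rate)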
#correlation matrix, code taken from Professor Ray's website
corr = train_df[feature_cols + ['is_fraud']].corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt = ".2f")
print(corr)
| | amount | velocity_score | geo_anomaly_score | transaction_type_TE | merchant_category_TE | location_TE | device_used_TE | payment_channel_TE | hour | is_fraud |
|---|---|---|---|---|---|---|---|---|---|---|
| amount | 1.000000 | -0.024284 | 0.002632 | 0.512677 | 0.005506 | -0.007487 | 0.043387 | 0.028426 | 0.027196 | 0.009866 |
| velocity_score | -0.024284 | 1.000000 | -0.006721 | -0.033048 | -0.025558 | -0.003133 | 0.003006 | 0.005446 | 0.004868 | -0.003198 |
| geo_anomaly_score | 0.002632 | -0.006721 | 1.000000 | 0.008495 | 0.011403 | 0.009528 | 0.001229 | 0.011014 | -0.019790 | 0.003820 |
| transaction_type_TE | 0.512677 | -0.033048 | 0.008495 | 1.000000 | 0.004880 | -0.016420 | 0.009352 | 0.022888 | 0.006624 | 0.013632 |
| merchant_category_TE | 0.005506 | -0.025558 | 0.011403 | 0.004880 | 1.000000 | -0.009343 | 0.027775 | 0.003020 | -0.005921 | 0.051202 |
| location_TE | -0.007487 | -0.003133 | 0.009528 | -0.016420 | -0.009343 | 1.000000 | 0.016785 | -0.013950 | -0.011389 | 0.042580 |
| device_used_TE | 0.043387 | 0.003006 | 0.001229 | 0.009352 | 0.027775 | 0.016785 | 1.000000 | 0.018502 | -0.011701 | 0.018497 |
| payment_channel_TE | 0.028426 | 0.005446 | 0.011014 | 0.022888 | 0.003020 | -0.013950 | 0.018502 | 1.000000 | -0.006758 | 0.050734 |
| hour | 0.027196 | 0.004868 | -0.019790 | 0.006624 | -0.005921 | -0.011389 | -0.011701 | -0.006758 | 1.000000 | 0.002668 |
| is_fraud | 0.009866 | -0.003198 | 0.003820 | 0.013632 | 0.051202 | 0.042580 | 0.018497 | 0.050734 | 0.002668 | 1.000000 |
We can see that whether a case is fraudulent does not have a strong correlation with any individual feature. However, this makes sense, since most of the features describe circumstances of a single transaction, such as the device used, the amount, or other characteristics. Thinking about the real world, we would not expect any one characteristic of a transaction to strongly determine whether it is fraud or not.
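One quick way to check this (a small sketch of mine, not in the original notebook) is to pull out just the is_fraud column of the correlation matrix and sort it by absolute value:
#correlations of each feature with the target, sorted from strongest to weakest
print(corr['is_fraud'].drop('is_fraud').abs().sort_values(ascending=False))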
Addressing the minority class#
In a data set like this, the number of fraudulent cases is expected to be much lower, simply because most financial transactions are not fraudulent. Following a suggestion from ChatGPT, I began to research SMOTE (Synthetic Minority Over-sampling Technique). SMOTE is advantageous because it creates more instances of the minority class of the target feature. It is an interesting method because it borrows from the concept of k-nearest neighbors, interpolating between existing cases of fraud to create new, synthetic cases of fraud.
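To illustrate the interpolation idea, here is a toy sketch of mine (the feature values are made up, not from the dataset): SMOTE takes an existing fraud case, picks one of its nearest fraudulent neighbors, and places a synthetic point a random fraction of the way along the line between them.
#toy illustration of SMOTE-style interpolation between two minority-class points (hypothetical values)
x_i = np.array([200.0, 8.0, 0.5])   # an existing fraud case (amount, velocity_score, geo_anomaly_score)
x_nn = np.array([260.0, 11.0, 0.3]) # one of its nearest fraudulent neighbors
lam = np.random.rand()              # random interpolation factor in [0, 1]
x_new = x_i + lam * (x_nn - x_i)    # synthetic fraud case on the segment between the two points
print(x_new)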
code was learned from: https://www.geeksforgeeks.org/smote-for-imbalanced-classification-with-python/
#all feature columns
feature_cols = ['amount',
'velocity_score',
'geo_anomaly_score',
'transaction_type_TE',
'merchant_category_TE',
'location_TE',
'device_used_TE',
'payment_channel_TE',
'hour']
#applying smote
smote = SMOTE(random_state=8)
# separated the features from the target.
X_train = train_df[feature_cols]
y_train = train_df['is_fraud']
#resampling with SMOTE
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
#Separating my testing sets
X_test = test_df[feature_cols]
y_test = test_df['is_fraud']
# Plotting before and after smote results
#converting the resampled target Series to a DataFrame so it can be passed to sns.countplot
y_train_df = y_train_resampled.to_frame()
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
sns.countplot(x = 'is_fraud', data = train_df, ax=axes[0])
axes[0].set_title('Before SMOTE')
axes[0].set_xlabel('Is Fraud')
axes[0].set_ylabel('Count')
sns.countplot(x = 'is_fraud', data = y_train_df, ax=axes[1])
axes[1].set_title('After SMOTE')
axes[1].set_xlabel('Is Fraud')
axes[1].set_ylabel('Count')
plt.tight_layout()
Logistic Regression#
Now I perform logistic regression by fitting on the training set and then predicting on the test set. The logistic regression is fit with the SMOTE-resampled set. Keep in mind that my training and testing sets only contain the features that I deem important; for example, I did not include the columns containing the IP address, the transaction ID, etc.
model = LogisticRegression(max_iter=1000) # max_iter from chat gpt
# fitting the logistic regression
model.fit(X_train_resampled, y_train_resampled)
# predicting the amount of fraud from the testing set
Y_pred = model.predict(X_test)
Y_pred_series = pd.Series(Y_pred)
#resetting the index for the testing and predicted values.
Y_actual_new_idx = test_df['is_fraud'].reset_index(drop=True)
Y_pred_series = pd.Series(Y_pred).reset_index(drop=True)
#dataframe for compare test
df_compare_test = pd.DataFrame({ 'Actual' : Y_actual_new_idx, 'Predicted' : Y_pred_series})
#plotting a confusion matrix
confusion_mtrx = confusion_matrix(df_compare_test['Actual'], df_compare_test['Predicted'])
sns.heatmap(confusion_mtrx, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Fraud', 'Fraud'], yticklabels=['Not Fraud', 'Fraud'])
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
It seems like a basic logistic regression did not accurately classify non-fraudulent cases. This makes sense because logistic regression assumes a linear decision boundary, or in this case, a multidimensional plane. That in turn assumes there is a clear separation between fraudulent and non-fraudulent cases, which is unlikely, because a fraudulent transaction can look very similar to a non-fraudulent one.
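For reference, the model behind this reasoning estimates the probability of fraud by applying a sigmoid to a linear function of the features, so the set of points where the model is undecided, \(w \cdot x + b = 0\), is a flat hyperplane in feature space:
\[ P(\text{fraud} \mid x) = \frac{1}{1 + e^{-(w \cdot x + b)}} \]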
Now we can try K nearest neighbors.#
We want to start by finding the optimal K.
#testing k from 1 to 29 nearest neighbors
k_range = range(1, 30)
#comparing the score for each value of k neighbors.
k_scores = []
for k in k_range:
knn = KNeighborsClassifier(n_neighbors=k)
scores = cross_val_score(knn, X_train_resampled, y_train_resampled, cv=5, scoring='accuracy')
k_scores.append(scores.mean())
#Plotting the cross-validated accuracy against the number of neighbors k
plt.figure(figsize=(10, 6))
plt.plot(k_range, k_scores)
plt.title('Accuracy vs. Number of Neighbors (K)')
plt.xlabel('Number of Neighbors (K)')
plt.ylabel('Cross-Validated Accuracy')
best_k = k_range[np.argmax(k_scores)]
best_score = max(k_scores)
plt.scatter(best_k, best_score, color='red', s=150, label=f'Best k = {best_k}')
plt.legend()
plt.show()
print(f'Best k value: {best_k}')
print(f'Best Cross-Val-Score: {best_score:.4f}')
model_knn = KNeighborsClassifier(n_neighbors=best_k)
model_knn.fit(X_train_resampled, y_train_resampled)
y_pred_knn = model_knn.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred_knn)}')
print('Classification Report:')
print(classification_report(y_test, y_pred_knn, zero_division=0)) # code from chatgpt
Best k value: 1
Best Cross-Val-Score: 0.8356
Accuracy: 0.744
Classification Report:
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| False | 0.96 | 0.76 | 0.85 | 956 |
| True | 0.05 | 0.30 | 0.09 | 44 |
| accuracy | | | 0.74 | 1000 |
| macro avg | 0.51 | 0.53 | 0.47 | 1000 |
| weighted avg | 0.92 | 0.74 | 0.82 | 1000 |
cm = confusion_matrix(y_test, y_pred_knn)
sns.heatmap(cm, annot=True, fmt='d', cmap='Reds', xticklabels=['Not Fraud', 'Fraud'], yticklabels=['Not Fraud', 'Fraud'])
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('Actual Label')
plt.show()
As we can see, the KNN method seems better at classifying cases than the logistic regression. For example, logistic regression predicted that 455 of the non-fraud cases were fraudulent, whereas with the choice of 1 nearest neighbor, only 225 non-fraudulent cases were classified incorrectly. So in the case of fraud detection, k-nearest neighbors seems more effective. This makes sense because rather than assuming a clear boundary between fraud and non-fraud cases, we classify each transaction by looking at the data points closest to it.
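To make "looking at the closest data points" concrete, here is a small sketch of mine (not part of the original analysis) that inspects the training points nearest to one test transaction, using scikit-learn's kneighbors method:
#find the 5 training points closest to the first test transaction
distances, neighbor_idx = model_knn.kneighbors(X_test.iloc[[0]], n_neighbors=5)
#look up those neighbors and their (resampled) fraud labels; the indices are positions in X_train_resampled
print(X_train_resampled.iloc[neighbor_idx[0]])
print(y_train_resampled.iloc[neighbor_idx[0]])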
Visualization of each feature and the target value.#
Now, I am interested in finding out whether there is a specific feature that is prominent across these fraud cases. We will do this by looking at the random forest feature-importance graph.
Code adopted from: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Code fixed by chatgpt.
# Train model
rf_model = RandomForestClassifier(random_state=8)
rf_model.fit(X_train_resampled, y_train_resampled)
# Plot feature importances
importances = rf_model.feature_importances_
feature_names = X_train_resampled.columns
# Create bar plot
plt.figure(figsize=(10, 6))
pd.Series(importances, index=feature_names).sort_values().plot(kind='barh')
plt.title('Feature Importance (Random Forest)')
plt.xlabel('Importance Score')
plt.show()
Note that payment_channel_TE has the highest importance value of all our features. However, its importance score is only around 0.29, which is not large compared to a maximum possible importance value of 1.
Interpreting the decision tree.#
On the decision tree, the value is given in the form [non-fraud case, fraud case].
Code from: https://scikit-learn.org/stable/modules/tree.html
Learned from: https://www.geeksforgeeks.org/decision-tree-implementation-python/
import graphviz
from sklearn.tree import DecisionTreeClassifier, export_graphviz
# Create a decision tree classifier
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=42)
# Fit the classifier on the dataset
clf.fit(X_train_resampled, y_train_resampled)
# Extract decision tree information
dot_data = export_graphviz(clf, out_file=None, feature_names=feature_cols)
# Create a graph object and render
graph = graphviz.Source(dot_data)
display(graph)
If you look at the final classification and at the second branch node to the right, the tree correctly determined that if a transaction's payment_channel_TE value satisfies \(0.039 \leq x \leq 0.062\), then the case is classified as fraud. So we can take a look at the relation between the payment channel and whether the case is fraud or not.
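The same rules can also be read off in text form with scikit-learn's export_text utility (a small addition of mine, not used in the original notebook), which makes it easier to double-check the threshold band:
#print the fitted tree as text rules
from sklearn.tree import export_text
print(export_text(clf, feature_names=feature_cols))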
#We are looking at the rows whose payment_channel_TE falls in this range, and more specifically at the payment channel that corresponds to this value.
train_df.loc[(train_df['payment_channel_TE']>=0.039) & (train_df['payment_channel_TE']<=0.062)].head(3)
| | transaction_id | timestamp | sender_account | receiver_account | amount | transaction_type | merchant_category | location | device_used | is_fraud | ... | geo_anomaly_score | payment_channel | ip_address | device_hash | hour | transaction_type_TE | merchant_category_TE | location_TE | device_used_TE | payment_channel_TE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2519793 | T3499115 | 2023-07-06 02:00:02.246527+00:00 | ACC536513 | ACC597424 | 46.06 | transfer | utilities | London | web | False | ... | 0.62 | UPI | 31.16.229.76 | D4775399 | 2 | 0.044834 | 0.033272 | 0.045455 | 0.040415 | 0.061866 |
| 3405053 | T4397515 | 2023-09-06 08:33:07.867799+00:00 | ACC598679 | ACC663765 | 16.46 | transfer | entertainment | New York | pos | False | ... | 0.69 | UPI | 100.127.10.209 | D8857882 | 8 | 0.044834 | 0.047170 | 0.053030 | 0.045498 | 0.061866 |
| 740399 | T1563167 | 2023-04-16 05:22:52.174555+00:00 | ACC804679 | ACC412353 | 1274.26 | deposit | online | Toronto | pos | False | ... | 0.16 | UPI | 138.129.139.178 | D5495797 | 5 | 0.046701 | 0.063877 | 0.035225 | 0.045498 | 0.061866 |
3 rows × 24 columns
#We want to count how many fraud cases there are per payment channel
fraud_count = stratified_sample[stratified_sample['is_fraud']==1].groupby('payment_channel').size()
nonfraud_count = stratified_sample[stratified_sample['is_fraud']==0].groupby('payment_channel').size()
#Then we find how many fraud cases there are vs. non-fraud cases
total_counts_fraud = stratified_sample.groupby('is_fraud').size()
#divide the fraud case per payment channel by total fraud case
fraud_rate = fraud_count/total_counts_fraud.iloc[1]
#divide the non-fraud case per payment channel by total non-fraud case
nonfraud_rate = nonfraud_count/total_counts_fraud.iloc[0]
#turn these values into a dataframe so they are easily plotted
rate_df = pd.DataFrame({'Fraud_Rate':fraud_rate,
'Non-Fraud_Rate': nonfraud_rate})
#plot the proportions next to each other
rate_df.plot(kind='bar', color=['crimson', 'steelblue'], figsize=(10,6))
plt.title('Fraud Rate by Payment Channel')
plt.xlabel('Payment Channel')
plt.ylabel('Proportion of fraud/non-fraud to total fraud/non-fraud cases')
plt.xticks(rotation=0)
plt.ylim(0, fraud_rate.max()+.05)
plt.tight_layout()
plt.show()
Conclusion#
In conclusion, although we have narrowed down where many of the fraudulent cases come from, the relationship between fraud and the features I included remains inconclusive. Even for the payment channel 'UPI', where a large share of the fraud cases occurred, there are still many non-fraudulent transactions that are also made through UPI. So ultimately the data is inconclusive.