The dependent variable is fraud. "Fraud," in this context, refers to
transactions carried out by fraudulent actors inside the simulation.
More specifically, the fraudulent agents attempt to profit by seizing
control of client accounts and laundering the funds by transferring
them to another account; the funds are then cashed out of the system.
Fraud was coded as 1 = fraud and 0 = no fraud, as represented in
equation 1.
\(y = \begin{cases} 1, & \text{fraud} \\ 0, & \text{no fraud} \end{cases}\) eq. 1
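As a minimal sketch, the binary target in equation 1 could be constructed in Python with pandas; the file name and the raw label column "isFraud" are illustrative assumptions suggested by the PaySim-style feature names below, not confirmed by the text:

```python
import pandas as pd

# Hypothetical file name; "isFraud" is the assumed raw label column
df = pd.read_csv("transactions.csv")

# Dependent variable per equation 1: 1 = fraud, 0 = no fraud
y = df["isFraud"].astype(int)
```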
Data Cleaning and Preprocessing
Some features were redundant and had to be dropped before model
building. The features "nameOrig" and "nameDest" are account
identifiers with no longitude or latitude associated with them that
could locate the origins or destinations, so they were no longer
relevant and were removed. The feature "isFlaggedFraud" showed no
variation and was also dropped from the model.
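A sketch of this cleaning step with pandas, carrying over the hypothetical dataframe name df from the snippet above:

```python
# Drop the account identifiers (no geographic information to locate them)
# and the zero-variance flag before model building
df = df.drop(columns=["nameOrig", "nameDest", "isFlaggedFraud"])
```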
Feature Scaling
One of the most important aspects of preprocessing data for ML is
feature scaling. Scaling is especially necessary when the features
are on different scales and span a wide range of values. ML models are
highly sensitive to features with different scales; if scaling is not
handled properly, the model can be thrown off, leading to sub-optimal
performance or even incorrect predictions. There are several ways to
scale features, but the most common is min-max scaling, which maps all
values to the range between 0 and 1. Another method is standardization,
which rescales values to have a mean of 0 and a standard deviation
of 1. The data for this project contains features with different
scales and ranges. Because the data was not normally distributed,
MinMaxScaler was used to normalize it. The formula used to normalize
the data is shown in equation 5.
\(X_{norm} = \dfrac{X - X_{min}}{X_{max} - X_{min}},\quad X_{norm} \in [0, 1]\) eq. 5
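Equation 5 corresponds to scikit-learn's MinMaxScaler. A minimal sketch, assuming a train/test split and fitting the scaler on the training portion only to avoid leakage (the variable names are illustrative):

```python
from sklearn.preprocessing import MinMaxScaler

# Maps each feature column to [0, 1] per equation 5
scaler = MinMaxScaler(feature_range=(0, 1))
X_train_scaled = scaler.fit_transform(X_train)  # learn per-feature min/max
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max
```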
SMOTE-ENN for Imbalanced Data
The dataset used for this study was highly imbalanced: roughly 99% of
observations were no-fraud and only 1% were fraud. One common approach
when working with imbalanced datasets is to upsample the minority class.
Upsampling can be done in various ways, but one popular method is the
Synthetic Minority Oversampling Technique (SMOTE) combined with Edited
Nearest Neighbours (ENN). The SMOTE-ENN method combines the SMOTE and ENN
algorithms to improve the performance of ML classifiers46,47. SMOTE
creates synthetic minority examples by interpolating between existing
minority examples48. ENN then cleans the resulting oversampled dataset
by removing outliers49. The SMOTE-ENN method has been shown to be more
effective than either algorithm alone46,47,49. Because SMOTE-ENN is
particularly effective at handling imbalanced datasets, which are common
in real-world applications, it is often used in fields such as credit
scoring and fraud detection47,49.
Figure 1 shows the data before SMOTE-ENN resampling. Because of the
imbalanced nature of the data, the no-fraud observations are scattered
along the dotted red line. ML classifier modelling on imbalanced data
leads to biased results, with the algorithms learning only the
no-fraud observations19,40. Oversampling with SMOTE-ENN creates new,
synthetic data points similar to the existing minority class.
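A minimal sketch of this resampling step using imbalanced-learn's SMOTEENN class, applied to the training split only; the variable names and random seed are illustrative assumptions:

```python
from imblearn.combine import SMOTEENN

# Oversample the fraud class with SMOTE, then clean with ENN
sampler = SMOTEENN(random_state=42)
X_resampled, y_resampled = sampler.fit_resample(X_train_scaled, y_train)
```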