The dependent variable is fraud. "Fraud," in this context, refers to transactions carried out by fraudulent actors inside the simulation. Specifically, the fraudulent agents attempt to profit by seizing control of client accounts and laundering the money by transferring it to another account, from which the funds are then cashed out of the system. Fraud was coded as 1 = fraud and 0 = no fraud, as represented in equation 1.
\(y = \begin{cases} 1, & \text{fraud} \\ 0, & \text{no fraud} \end{cases}\) eq. 1
Data Cleaning and Preprocessing
Some features were redundant and had to be dropped before model building. The features "nameOrig" and "nameDest" are account identifiers with no associated longitude or latitude that could be used to locate the destinations, so they were removed. The feature "isFlaggedFraud" showed no variation and was also dropped from the model.
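A minimal sketch of these drops is shown below; the frame mimics a PaySim-style schema, and the values are illustrative only, not taken from the study data.

```python
import pandas as pd

# Toy frame mimicking the PaySim-style columns referenced above
# (values are illustrative placeholders).
df = pd.DataFrame({
    "amount": [181.0, 1864.28],
    "nameOrig": ["C1231006815", "C1666544295"],
    "nameDest": ["M1979787155", "C2048537720"],
    "isFlaggedFraud": [0, 0],  # constant column: no variation
    "isFraud": [1, 0],
})

# Drop the identifier columns (no geolocation to exploit) and the
# zero-variance flag before model building.
df = df.drop(columns=["nameOrig", "nameDest", "isFlaggedFraud"])
print(df.columns.tolist())
```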
Feature Scaling
One of the most important steps in preprocessing data for ML is feature scaling. Scaling is especially necessary when the features are on different scales and span a wide range of values. Many ML models are highly sensitive to features with different scales; if not handled properly, such features can throw off a model and lead to sub-optimal performance or even incorrect predictions. There are several ways to scale features. The most common is min-max scaling, which rescales all values to lie between 0 and 1. Another method is standardization, which rescales values to have a mean of 0 and a standard deviation of 1. The data for this project contain features with different scales and ranges. Because the data were not normally distributed, MinMaxScaler was used to normalize them. The normalization formula is shown in equation 5.
\(X_{scaled} = \dfrac{X - X_{min}}{X_{max} - X_{min}} \in [0,\, 1]\) eq. 5
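The sketch below applies equation 5 with scikit-learn's MinMaxScaler; the toy matrix stands in for the transaction features and its values are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrix with widely differing scales, standing in for
# transaction features such as amounts and balances.
X = np.array([[181.0, 170136.0],
              [1864.28, 21249.0],
              [9839.64, 160296.36]])

# MinMaxScaler applies equation 5 column-wise, mapping each feature
# to the [0, 1] range.
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled)
```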
SMOTE-ENN for Imbalanced Data
The dataset used for this study was highly imbalanced: 99% of observations were no fraud and 1% were fraud. One common approach when working with imbalanced datasets is to upsample the minority class. Upsampling can be done in various ways, but one popular method combines the Synthetic Minority Oversampling Technique (SMOTE) with Edited Nearest Neighbours (ENN). The SMOTE-ENN method combines the two algorithms to improve the performance of ML classifiers46,47. SMOTE creates synthetic minority examples by interpolating between existing minority examples48; ENN then cleans the resulting oversampled dataset by removing outliers49. SMOTE-ENN has been shown to be more effective than either algorithm alone46,47,49 and is particularly effective on the imbalanced datasets often encountered in real-world applications. As a result, it is frequently used in fields such as credit scoring and fraud detection47,49.
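A minimal sketch of this resampling step follows, using the SMOTEENN class from the imbalanced-learn library; the synthetic 99:1 dataset is an assumption standing in for the simulated transaction data described above.

```python
from collections import Counter

from imblearn.combine import SMOTEENN
from sklearn.datasets import make_classification

# Synthetic stand-in for the 99% no-fraud / 1% fraud data.
X, y = make_classification(n_samples=10_000, weights=[0.99, 0.01],
                           random_state=42)
print("before:", Counter(y))

# SMOTE oversamples the minority (fraud) class by interpolating between
# existing minority examples; ENN then removes noisy or ambiguous
# observations near the class boundary.
resampler = SMOTEENN(random_state=42)
X_res, y_res = resampler.fit_resample(X, y)
print("after:", Counter(y_res))
```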
Figure 1 shows the data before SMOTE-ENN resampling. Because of the imbalanced nature of the data, the no-fraud observations are scattered along the dotted red line. Fitting ML classifiers on imbalanced data leads to biased results, with the algorithms learning almost exclusively from the no-fraud observations19,40. Oversampling with SMOTE-ENN creates new, synthetic data points similar to the existing minority class.