Classification Measures
When dealing with imbalanced datasets, it is important to use measures
of performance that are more robust than simple accuracy 37,40. Accuracy
collapses all correct and incorrect predictions into a single number, so
it does not provide enough information to properly assess the model's
performance on the rare class. In addition, we need to consider the TN
and TP rates, because these measures provide a more complete picture of
the model's performance. For example, a high TN rate means the model
correctly identifies negative instances, whereas a low TN rate means the
model incorrectly classifies negative instances as positive. As a
result, the TN and TP rates are significant for assessing performance on
imbalanced datasets 15.
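To make the distinction concrete, the short sketch below (a minimal illustration with made-up labels, not the study's data) shows how a naive classifier on a heavily imbalanced set can report high accuracy while its TP rate is zero:

```python
# Minimal sketch (illustrative labels, not the study's data): why accuracy
# alone misleads on imbalanced classes, and why the TP and TN rates matter.
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 990 legitimate (0) and 10 laundered (1) transactions.
y_true = np.array([0] * 990 + [1] * 10)
# A naive model that predicts "legitimate" for everything.
y_pred = np.zeros_like(y_true)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)  # true positive rate (recall on the positive class)
tnr = tn / (tn + fp)  # true negative rate

print(f"accuracy = {accuracy_score(y_true, y_pred):.3f}")  # 0.990, looks excellent
print(f"TPR = {tpr:.3f}, TNR = {tnr:.3f}")                 # TPR = 0.000: every laundered case is missed
```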
When choosing a machine learning model for fraud detection, it is
important to consider how well the model will perform in terms of
precision, recall, and the F1-score 40,49. These
measures are better indicators of how well a model predicts laundered
transactions than accuracy alone. However, it is important to remember
that all of these measures are affected by the choice of classification
threshold. A high threshold results in fewer FPs and TPs, while a low
threshold has the opposite effect. Therefore, tuning the threshold
according to the application's needs is important, and in some cases it
may even be necessary to use multiple thresholds to achieve the desired
level of performance 20,42. To offset the trade-off between precision
and recall, the classification threshold was tuned so that precision and
recall were balanced before the classifiers were compared.
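As one way to implement that balancing step, the following sketch (synthetic data and scikit-learn, not the paper's dataset or exact procedure) tunes the decision threshold until precision and recall are approximately equal:

```python
# Minimal sketch, assuming a probabilistic binary classifier; the data here is
# synthetic and only stands in for an imbalanced transaction set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, precision_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
probs = clf.predict_proba(X_val)[:, 1]              # scores for the positive (laundered) class

precision, recall, thresholds = precision_recall_curve(y_val, probs)
gap = np.abs(precision[:-1] - recall[:-1])          # last PR point has no threshold
threshold = thresholds[np.argmin(gap)]              # cut-off where precision and recall roughly meet

y_pred = (probs >= threshold).astype(int)           # apply the tuned cut-off
print(f"threshold={threshold:.2f}  "
      f"precision={precision_score(y_val, y_pred):.2f}  "
      f"recall={recall_score(y_val, y_pred):.2f}")
```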
Table 7 shows the results of the classifiers’ performance. Note that the
decision tree classifier had the highest precision score (1.0), followed
by the random forest model (.96). In contrast, the logistic regression
and gradient descent classifiers had the highest recall scores (.96).
The random forest (.87), followed by logistic regression (.85), had the
highest F1-scores, while the gradient descent and decision tree
classifiers had the lowest (.84). These results indicate that the random
forest is the best-performing classifier overall because it achieved
high scores across all three measures. Contextually, these results can
be interpreted to mean that the random forest classifier correctly
identifies a high proportion of positives and negatives and achieves a
high degree of overall accuracy. The logistic
regression and gradient descent classifiers also performed well,
achieving high scores in recall and F1. However, they did not achieve
the same high precision score as the random forest classifier. These
results suggest that the logistic regression and gradient descent
classifiers, given their lower precision, may flag more legitimate
transactions as laundered (more false positives) than the random forest
classifier.
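For readers who want to reproduce a comparison of this kind, the sketch below computes precision, recall, and F1 for the four classifier families on synthetic data; it assumes the "gradient descent" classifier corresponds to an SGD-trained logistic model, and its numbers will not match Table 7:

```python
# Minimal sketch (synthetic data, not the paper's results): assembling a
# precision/recall/F1 comparison for the four classifier families.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "Decision tree": DecisionTreeClassifier(random_state=0),
    "Random forest": RandomForestClassifier(random_state=0),
    "Logistic regression": LogisticRegression(max_iter=1000),
    # Assumption: "gradient descent" classifier modelled as SGD with logistic loss
    # ("log_loss" in scikit-learn >= 1.1, formerly "log").
    "Gradient descent (SGD)": SGDClassifier(loss="log_loss", random_state=0),
}

rows = []
for name, model in models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({"Classifier": name,
                 "Precision": precision_score(y_te, y_pred),
                 "Recall": recall_score(y_te, y_pred),
                 "F1": f1_score(y_te, y_pred)})

print(pd.DataFrame(rows).round(2))   # one row per classifier, as in Table 7
```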
Table 7: Classification Metrics of the Evaluated Algorithms