Classification Measures
When dealing with imbalanced datasets, it is important to use performance measures that are more robust than simple accuracy 37,40. Accuracy alone does not distinguish between the types of errors a model makes (FPs and FNs) and therefore does not provide enough information to properly assess the model’s performance. In addition, we need to consider the TN and TP rates because these measures provide a more complete picture of the model’s performance. For example, a high TN rate means the model correctly identifies negative instances, whereas a low TN rate means the model misclassifies negative instances as positive. As a result, the TN and TP rates are significant for assessing performance on imbalanced datasets 15.
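The following minimal sketch (not the study’s code; the labels are invented for illustration and scikit-learn is assumed to be available) shows why accuracy can be misleading on imbalanced data, and how the confusion-matrix quantities and the derived precision, recall, and F1-score expose the weakness that accuracy hides.

```python
# Illustrative example: a naive model that predicts "legitimate" for every
# transaction still achieves 95% accuracy on data where only 5% of cases are
# laundered, yet it detects none of them.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, f1_score)

# 1 = laundered (positive class), 0 = legitimate; positives are rare.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # the naive "always legitimate" prediction

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
print("accuracy :", accuracy_score(y_true, y_pred))                    # 0.95
print("precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("recall   :", recall_score(y_true, y_pred, zero_division=0))     # 0.0
print("F1       :", f1_score(y_true, y_pred, zero_division=0))         # 0.0
```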
When choosing a machine learning model for fraud detection, it is important to consider how well the model will perform in terms of precision, recall, and the F1-score 40,49. These measures are better indicators than accuracy alone of how well a model predicts laundered transactions. However, it is important to remember that all of these measures are affected by the choice of classification threshold. A high threshold will result in fewer FPs and TPs, while a low threshold will have the opposite effect. Therefore, tuning the threshold according to the application’s needs is important. In some cases, it may even be necessary to use multiple thresholds to achieve the desired level of performance 20,42. To offset the trade-off between precision and recall, the threshold was tuned so that precision and recall were approximately equal before running the classifiers.
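As a hedged sketch of this kind of threshold tuning (an assumed workflow, not the study’s exact procedure; the synthetic data and the random forest stand in for the real transaction features and fitted classifiers), one can sweep the precision–recall curve and pick the probability cut-off where precision and recall are closest to equal:

```python
# Illustrative threshold tuning: choose the cut-off where precision ≈ recall,
# then use it in place of the default 0.5 decision threshold.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the transaction features.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

precision, recall, thresholds = precision_recall_curve(y_te, scores)
# precision and recall have one more element than thresholds; drop the last point.
gap = np.abs(precision[:-1] - recall[:-1])
best = np.argmin(gap)
print(f"threshold={thresholds[best]:.3f} "
      f"precision={precision[best]:.3f} recall={recall[best]:.3f}")

# Apply the tuned threshold instead of the default 0.5 cut-off.
y_pred = (scores >= thresholds[best]).astype(int)
```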
Table 7 shows the classifiers’ performance results. Note that the decision tree classifier had the highest precision score (1.0), followed by the random forest model (.96). In contrast, the logistic regression and gradient descent classifiers had the highest recall scores (.96). The random forest (.87), followed by logistic regression (.85), had the highest F1-scores, while the gradient descent and decision tree classifiers had the lowest F1-scores (.84 each). These results indicate that the random forest is the best-performing classifier overall because it achieved high scores across all three measures. Contextually, these results can be interpreted to mean that the random forest classifier correctly identifies a high proportion of positives and negatives and achieves a high degree of overall accuracy. The logistic regression and gradient descent classifiers also performed well, achieving high recall and F1-scores. However, they did not achieve the same high precision score as the random forest classifier. These results suggest that the logistic regression and gradient descent classifiers may misclassify more cases than the random forest classifier.
Table 7: Algorithms Classification Metrics