Figure 4: Psychological model of explanation. Yellow boxes illustrate the underlying process, green boxes illustrate measurement opportunities, and white boxes illustrate potential outcomes.

Evaluation

Evaluation was originally envisioned to be based on a common set of problems within the data analytics and autonomy domains. However, it quickly became clear that it would be more valuable to explore a variety of approaches across a breadth of problem domains. To evaluate performance in the final year of the program, the evaluation team, led by LT Eric Vorm, PhD, of the U.S. Naval Research Laboratory (NRL), developed an explanation scoring system (ESS). The ESS provided a quantitative mechanism for assessing the designs of XAI user studies in terms of technical and methodological appropriateness and robustness. It enabled the assessment of multiple elements of each user study, including the task, domain, explanations, explanation interface, users, hypothesis, data collection, and analysis, to ensure that each study met the high standards of human subjects research. XAI evaluation measures are shown in Figure 5 and include functional measures, learning performance measures, and explanation effectiveness measures.

The DARPA XAI program demonstrated definitively the importance of carefully designing user studies in order to accurately evaluate the effectiveness of explanations in ways that directly enhance appropriate use and trust by human users and appropriately support human-machine teaming. Often, multiple types of measures (i.e., functional, learning performance, and explanation effectiveness measures) are necessary to evaluate the performance of an XAI algorithm. Designing XAI user studies can be difficult, and the DARPA XAI program found that the most effective research teams were those with cross-disciplinary expertise (e.g., computer science combined with human-computer interaction and/or experimental psychology).
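To make the structure of such an assessment concrete, the Python sketch below shows one way a rubric in the spirit of the ESS could be represented. It is purely illustrative: the element names mirror the study components listed above and the measure categories mirror Figure 5, but the 0-5 scale, the StudyAssessment class, and the unweighted aggregation are assumptions rather than the actual ESS scoring rules.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical rubric sketch. The element names mirror the study components
# described in the text; the 0-5 scale and unweighted aggregation are
# assumptions, not the actual ESS scoring rules.
ESS_ELEMENTS = (
    "task",
    "domain",
    "explanations",
    "explanation_interface",
    "users",
    "hypothesis",
    "data_collection",
    "analysis",
)

# The three measure categories shown in Figure 5.
MEASURE_TYPES = {"functional", "learning_performance", "explanation_effectiveness"}


@dataclass
class StudyAssessment:
    """Scores (0-5) for each element of a proposed XAI user study design."""
    element_scores: dict[str, int]
    measure_types: set[str] = field(default_factory=set)

    def overall_score(self) -> float:
        """Unweighted mean over all rubric elements (assumed aggregation)."""
        missing = [e for e in ESS_ELEMENTS if e not in self.element_scores]
        if missing:
            raise ValueError(f"unscored elements: {missing}")
        unknown = self.measure_types - MEASURE_TYPES
        if unknown:
            raise ValueError(f"unknown measure types: {unknown}")
        return mean(self.element_scores[e] for e in ESS_ELEMENTS)


if __name__ == "__main__":
    # Example: a study scoring 4/5 on most elements, 3/5 on its hypothesis,
    # and collecting two of the three measure types.
    study = StudyAssessment(
        element_scores={e: 4 for e in ESS_ELEMENTS} | {"hypothesis": 3},
        measure_types={"functional", "explanation_effectiveness"},
    )
    print(f"Overall ESS score: {study.overall_score():.2f}")
```

A set-based check on measure types is used here simply to reflect the observation that multiple kinds of measures are typically needed; the real ESS review was conducted by human evaluators, not automated scoring.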