XAI Results, Lessons Learned

Three major evaluations were conducted during the program: one during Phase 1 and two during Phase 2. To evaluate the effectiveness of XAI techniques, researchers on the program designed and executed user studies, which remain the gold standard for assessing explanations. Approximately 12,700 participants took part in user studies carried out by XAI researchers: approximately 1,900 supervised participants, who were guided through the experiment by the research team (e.g., in person or on Zoom), and approximately 10,800 unsupervised participants, who self-guided through the experiment without active involvement from the research team (e.g., on Amazon Mechanical Turk). In accordance with policy for all US Department of Defense (DoD) funded human subjects research, each research protocol was reviewed by a local institutional review board (IRB), and a DoD human research protection office then reviewed the protocol and the local IRB findings.
Those user studies yielded several key takeaways.
As mentioned earlier, there seemed to be a natural tension between learning performance and explainability. However, over the course of the program, we found evidence that explainability can improve performance (Kim et al. 2021; Watkins et al. 2021). Intuitively, training a system to produce explanations adds supervision, via additional loss functions, training data, or other mechanisms, that encourages the system to learn more effective representations of the world. This may not hold in all cases, and significant work remains to characterize when explainable techniques outperform their non-explainable counterparts, but the evidence offers hope that future XAI systems can be more performant than current systems while meeting user needs for explanations.
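To make the auxiliary-supervision intuition concrete, the following is a minimal sketch, assuming a PyTorch setup in which a shared encoder feeds both a task head and an explanation head trained against explanation targets (e.g., human-annotated attributes). The module, dimensions, and loss weight (ExplainableClassifier, expl_weight, etc.) are hypothetical and illustrative, not drawn from any program system.

```python
# Sketch of joint training with an explanation loss as auxiliary supervision.
# All names (ExplainableClassifier, expl_weight, dimensions) are hypothetical.
import torch
import torch.nn as nn

class ExplainableClassifier(nn.Module):
    """Shared encoder feeding a task head and an explanation head."""
    def __init__(self, in_dim=32, hidden_dim=64, num_classes=10, expl_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.task_head = nn.Linear(hidden_dim, num_classes)  # predicts the class label
        self.expl_head = nn.Linear(hidden_dim, expl_dim)     # predicts explanation targets

    def forward(self, x):
        h = self.encoder(x)
        return self.task_head(h), self.expl_head(h)

model = ExplainableClassifier()
task_loss_fn = nn.CrossEntropyLoss()
expl_loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
expl_weight = 0.5  # hypothetical weight on the explanation term

# Dummy batch: inputs, class labels, and explanation targets
# (e.g., human-annotated attribute vectors).
x = torch.randn(8, 32)
y = torch.randint(0, 10, (8,))
e = torch.randn(8, 16)

logits, expl = model(x)
# The explanation loss supplies extra supervision to the shared encoder.
loss = task_loss_fn(logits, y) + expl_weight * expl_loss_fn(expl, e)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because both heads share the encoder, gradients from the explanation loss shape the same representation the task head uses, which is one way that training a system to explain itself can act as additional supervision rather than a pure cost.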