XAI Results, Lessons Learned
Three major evaluations were conducted during the program: one during
Phase 1 and two during Phase 2. To evaluate the effectiveness
of XAI techniques, researchers on the program designed and executed user
studies. User studies are still the gold standard for assessing
explanations. Approximately 12,700 participants took part in user
studies carried out by XAI researchers: approximately 1,900
supervised participants, who were guided through the
experiment by the research team (e.g., in person or on Zoom), and 10,800
unsupervised participants, who worked through the
experiment on their own without active guidance from the research team (e.g., via Amazon
Mechanical Turk). In accordance with policy for all US Department of
Defense (DoD)-funded human subjects research, each research protocol was
reviewed by a local institutional review board (IRB); a DoD
human research protection office then reviewed the protocol and the local IRB
findings.
In the course of those user studies, several key takeaways were
identified.
- Users prefer systems that provide decisions with explanations over
systems that provide only decisions. Explanations provide
the most value in tasks where a user needs to understand the inner
workings of how an AI system makes decisions. [Supported by 11
experiments across performer teams]
- For explanations to improve user task performance, the task
must be difficult enough that the user benefits from the AI explanation. [PARC, UT
Dallas]
- The cognitive load required to interpret explanations can hinder user
performance. Combined with the previous point, this means that explanations and task
difficulty need to be calibrated to improve user performance.
[UCLA, Oregon State]
- Explanations are more helpful when an AI is incorrect and are
particularly valuable for edge cases. [UCLA, Rutgers]
- Measures of explanation effectiveness can change over time.
[Raytheon, BBN]
- Advisability can improve user trust significantly over explanations
alone. [UC Berkeley]
- XAI is useful for measuring and aligning mental models for users and
XAI systems. [Rutgers, SRI]
- Lastly, because the final year of XAI coincided with the
COVID-19 pandemic, our performer teams developed
best practices for designing web interfaces to conduct XAI user
studies when in-person studies were not possible. [OSU, UCLA]
Dikkala 2021
As mentioned earlier, there seemed to be a natural tension between
learning performance and explainability. However, throughout the course
of the program, we found evidence that explainability can improve
performance (Kim et al. 2021, Watkins et al. 2021). From an intuitive
perspective, training a system to produce explanations provides
additional supervision, via additional loss functions, training data, or
other mechanisms, that encourages a system to learn more effective
representations of the world. While this may not be true in all cases
and significant work remains to characterize when explainable techniques
will be more performant, it provides hope that future XAI systems can be
more performant than current systems while meeting user needs for
explanations.
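The intuition that explanation training adds supervision through an extra loss term can be illustrated with a minimal sketch. This is not the method of Kim et al. or Watkins et al.; the function names, the mean-squared-error explanation term, the reference attribution values, and the `weight` parameter are all assumptions chosen for illustration.

```python
import math


def cross_entropy(probs, label):
    """Task loss: negative log-likelihood of the true class."""
    return -math.log(probs[label])


def explanation_loss(predicted_attr, reference_attr):
    """Hypothetical explanation loss: mean squared error between the model's
    attribution scores and reference (e.g., human-annotated) scores."""
    n = len(predicted_attr)
    return sum((p - r) ** 2 for p, r in zip(predicted_attr, reference_attr)) / n


def joint_loss(probs, label, predicted_attr, reference_attr, weight=0.5):
    """Task loss plus a weighted explanation term. The second term is the
    additional supervision; `weight` is an assumed trade-off hyperparameter."""
    task = cross_entropy(probs, label)
    expl = explanation_loss(predicted_attr, reference_attr)
    return task + weight * expl


# Toy example: a 3-class prediction with per-feature attribution scores.
probs = [0.7, 0.2, 0.1]          # predicted class probabilities
predicted_attr = [0.9, 0.1, 0.0]  # model's attribution over 3 features
reference_attr = [1.0, 0.0, 0.0]  # assumed reference attribution
loss = joint_loss(probs, 0, predicted_attr, reference_attr, weight=0.5)
```

Setting `weight` to zero recovers the task-only objective, so the explanation term acts purely as an added training signal. Whether it helps task performance in a given setting remains an empirical question, as the text notes.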