2.4 | Estimating accuracy
A key attribute of experimental methods for determining macromolecular
structure is that they return a generally reliable estimate of accuracy
at the level of individual amino acid residues. To be taken
seriously, calculated structures must also provide this information.
CASP has long assessed both self-accuracy estimates and estimates from
third parties who have developed methods for this purpose.
In CASP14 (2020), the new deep learning methods, especially AlphaFold2,
provided models with very reliable per-residue accuracy estimates,
expressed as plDDT, the predicted lDDT. (lDDT is a metric reflecting the
accuracy of a residue’s environment in terms of the difference between
experimental and calculated inter-atom distances (35).) In CASP14 it
also became clear that third-party accuracy estimates are now generally
less reliable than self-estimates and vary in reliability depending on
the method used to build a model. Therefore, in this CASP, assessment of
self-accuracy estimates for tertiary structures is included, but
assessment of third-party methods has been dropped. The assessor for this
category showed that, for single structures, self-estimates of accuracy
remain highly reliable overall, although more analysis is needed to
determine how the reliability varies with circumstances (12).
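The idea behind lDDT can be illustrated with a minimal sketch. The version below is a deliberate simplification and not the reference implementation of (35): it uses only one atom (for example Cα) per residue, and the function name `ca_lddt`, the 15 Å inclusion radius, the 0.5, 1, 2 and 4 Å tolerances, and the strict inequalities are assumptions made for illustration. For each residue, it reports the fraction of reference distances to neighbouring residues that are reproduced in the model, averaged over the tolerances.

```python
import math

def ca_lddt(ref, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified per-residue lDDT-like score using one atom per residue.

    ref, model: lists of (x, y, z) coordinates in the same residue order.
    Hypothetical helper for illustration; the full lDDT works on all atoms.
    """
    n = len(ref)
    scores = []
    for i in range(n):
        preserved = 0
        total = 0
        for j in range(n):
            if i == j:
                continue
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= radius:
                continue  # only the local environment in the reference counts
            d_mod = math.dist(model[i], model[j])
            diff = abs(d_ref - d_mod)
            total += len(thresholds)
            preserved += sum(diff < t for t in thresholds)
        # A residue with no neighbours inside the radius gets 0.0 by convention here.
        scores.append(preserved / total if total else 0.0)
    return scores

# Toy example: three residues on a line; the third is displaced by 1.5 A in the model.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
displaced = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (9.1, 0.0, 0.0)]
print(ca_lddt(reference, displaced))  # [0.75, 0.75, 0.5]
```

Because the score depends only on distances within each structure, no superposition of model and reference is required, which is what makes it a measure of a residue's local environment.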
For multimers, third-party selection of models appears impressive, with
about two-thirds of targets having a loss of accuracy of less than 0.1 in
TM-score (36) when the models estimated to be most accurate are selected.
Over half have a loss of less than 0.05. However, some of the methods use
consensus over many models to estimate accuracy, which is rarely possible
in practice. Amongst the best is a ‘naïve’ control method, suggesting that
the sophistication of other consensus methods is not adding much overall.
But some publicly available methods requiring at most a few additional
models rank highly and may be valuable for users, for example (32), (37)
and (38) in this issue. Less satisfying is that, judging by the Pearson
correlation between estimated and actual accuracy, ranking of accuracy
across the full range of models is sometimes poor.
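The two views of third-party performance discussed above, selection loss and correlation, can be sketched as follows. This is an illustration under stated assumptions rather than the assessors' procedure: "loss" is taken here to mean the TM-score of the best available model minus that of the model ranked first by the estimator, and the function names and the toy scores are hypothetical.

```python
import math

def selection_loss(estimated, actual):
    """Actual score of the best model minus that of the top-ranked model."""
    top = max(range(len(estimated)), key=lambda i: estimated[i])
    return max(actual) - actual[top]

def pearson(x, y):
    """Pearson correlation coefficient; assumes neither input is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy data for one target: estimated accuracy and actual TM-score of four models.
estimated = [0.9, 0.8, 0.3, 0.2]
actual_tm = [0.80, 0.85, 0.40, 0.60]

print(round(selection_loss(estimated, actual_tm), 3))  # 0.05
print(round(pearson(estimated, actual_tm), 3))
```

Selection loss depends only on which model is ranked first, whereas the correlation reflects the ordering of every model, so a method can pick a near-best model and still rank the remainder poorly.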
Self-accuracy estimates for multimers are only available in the form of
submitted per-residue plDDT values. The assessor reports the average
differences between these and the actual lDDT values for core, surface,
and interface residues. Overall, average differences are small for the
best methods: less than 0.1 for core residues, but somewhat higher (0.16)
for interface residues.
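A per-class comparison of this kind can be sketched as below. The sketch assumes that the "average difference" is a mean absolute difference and that plDDT and lDDT are both on a 0 to 1 scale; neither detail is stated here, and the function name, class labels, and toy values are hypothetical.

```python
from collections import defaultdict

def mean_abs_diff_by_class(predicted, actual, classes):
    """Mean |predicted - actual| per residue class (e.g. core, surface, interface)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for p, a, c in zip(predicted, actual, classes):
        sums[c] += abs(p - a)
        counts[c] += 1
    return {c: sums[c] / counts[c] for c in sums}

# Toy per-residue values for a single model (illustrative numbers only).
plddt = [0.90, 0.80, 0.70, 0.60]
lddt = [0.85, 0.90, 0.50, 0.70]
labels = ["core", "core", "interface", "surface"]

for label, diff in mean_abs_diff_by_class(plddt, lddt, labels).items():
    print(label, round(diff, 3))
```

If signed rather than absolute differences were averaged, over- and under-estimates would cancel, so the choice affects how the reported values should be read.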