2.4 | Estimating accuracy
A key attribute of experimental methods for determining macromolecular structure is that they return a generally reliable estimate of accuracy at the level of individual amino acids. To be taken seriously, calculated structures must also provide this information. CASP has long assessed both self-estimates of accuracy and estimates from third-party methods developed for this purpose.
In CASP14 (2020), the new deep learning methods, especially AlphaFold2, provided models with very reliable per-residue accuracy estimates, expressed as plDDT, the predicted lDDT. (lDDT is a metric reflecting the accuracy of a residue’s environment in terms of the difference between experimental and calculated inter-atom distances (35).) CASP14 also made clear that third-party accuracy estimates are now generally less reliable than self-estimates and that their reliability varies with the method used to build a model. So, in this CASP, assessment of self-accuracy estimates for tertiary structures is included, but assessment of third-party methods has been dropped. The assessor for this category showed that, for single structures, self-estimates of accuracy continue to be highly reliable overall, although more analysis is needed to determine how the reliability varies with circumstances (12).
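To make the idea behind lDDT concrete, the following is a minimal sketch of a per-residue score, not the reference implementation of (35). It assumes one representative atom per residue (for example Cα), whereas the full metric uses all atoms. The 15 Å inclusion radius and the 0.5, 1, 2 and 4 Å tolerances are the values commonly quoted for lDDT and are treated here as assumptions. The function name `per_residue_lddt` is hypothetical.

```python
import math

# Assumed parameters (commonly quoted for lDDT; not stated in this text).
INCLUSION_RADIUS = 15.0              # angstroms
THRESHOLDS = (0.5, 1.0, 2.0, 4.0)    # distance tolerances in angstroms


def per_residue_lddt(ref, model, radius=INCLUSION_RADIUS, thresholds=THRESHOLDS):
    """Simplified per-residue lDDT-style score.

    ref, model: lists of (x, y, z) tuples, one representative atom per
    residue, in the same residue order. Returns one score in [0, 1] per
    residue.
    """
    if len(ref) != len(model):
        raise ValueError("ref and model must have the same number of residues")

    scores = []
    for i in range(len(ref)):
        preserved = 0
        total = 0
        for j in range(len(ref)):
            if i == j:
                continue
            d_ref = math.dist(ref[i], ref[j])
            if d_ref >= radius:
                continue  # only the local environment in the reference counts
            d_model = math.dist(model[i], model[j])
            diff = abs(d_ref - d_model)
            # Each neighbour is tested against every tolerance.
            total += len(thresholds)
            preserved += sum(diff <= t for t in thresholds)
        # With no neighbours the score is undefined; 0.0 is a placeholder.
        scores.append(preserved / total if total else 0.0)
    return scores


if __name__ == "__main__":
    reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
    shifted = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (10.6, 0.0, 0.0)]
    print(per_residue_lddt(reference, shifted))
```

Because the score depends only on distances within each structure, it needs no superposition of model onto reference, which is why it suits a per-residue estimate such as plDDT.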
For multimers, third-party selection of models appears impressive: for about two-thirds of targets, selecting the model estimated to be most accurate results in a loss of accuracy of less than 0.1 in TM score (36), and for over half the loss is less than 0.05. However, some of the methods estimate accuracy using consensus over many models, which is rarely possible in practice. Amongst the best is a ‘naïve’ control method, suggesting that the sophistication of the other consensus methods is not adding much overall. But some publicly available methods that require at most a few additional models rank highly and may be valuable for users, for example (32), (37) and (38) in this issue. Less satisfying is that, judging by the Pearson correlation between estimated and actual accuracy, ranking of accuracy across the full range of models is sometimes poor.
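The two measures discussed above, loss on model selection and Pearson correlation, can be sketched for a single target as follows. This is an illustration under stated assumptions, not the assessor's procedure: "loss" is taken to mean the TM score of the best available model minus the TM score of the model ranked first by the estimator. The function names are hypothetical and the numbers are made up.

```python
import math


def selection_loss(estimated, actual):
    """Actual score of the best model minus that of the top-ranked model.

    estimated: estimated accuracies, one per model.
    actual: actual accuracies (e.g. TM scores) in the same model order.
    """
    picked = max(range(len(estimated)), key=lambda i: estimated[i])
    return max(actual) - actual[picked]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Illustrative values only: three models of one hypothetical target.
    estimated = [0.90, 0.70, 0.80]
    actual_tm = [0.80, 0.85, 0.60]
    print("loss:", selection_loss(estimated, actual_tm))
    print("pearson:", pearson(estimated, actual_tm))
```

The two measures answer different questions. Loss looks only at the top-ranked model, so a method can select well while still ranking the remaining models poorly, which is the pattern described in the paragraph above.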
Self-accuracy estimates for multimers are available only in the form of submitted per-residue plDDT values. The assessor reports the average differences between these and the actual lDDT values for core, surface, and interface residues. Overall, average differences are small for the best methods: less than 0.1 for core residues, but somewhat higher (0.16) for interface residues.
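A per-class comparison of this kind could be computed along the following lines. This is a minimal sketch with two assumptions that the text does not confirm: that "average difference" means the mean absolute difference, and that plDDT and lDDT are both on a 0 to 1 scale. The function name and the residue labels in the example are hypothetical.

```python
from collections import defaultdict


def mean_abs_difference_by_class(predicted, actual, classes):
    """Mean |predicted - actual| per residue class.

    predicted: per-residue plDDT values (assumed 0-1 scale).
    actual: per-residue lDDT values in the same residue order.
    classes: per-residue labels such as "core", "surface", "interface".
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for p, a, label in zip(predicted, actual, classes):
        sums[label] += abs(p - a)
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


if __name__ == "__main__":
    # Illustrative values only.
    plddt = [0.9, 0.8, 0.7, 0.6]
    lddt = [0.8, 0.8, 0.5, 0.5]
    labels = ["core", "core", "interface", "interface"]
    print(mean_abs_difference_by_class(plddt, lddt, labels))
```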