Figure 3: CASP15 performance for protein tertiary
structure, compared with earlier CASPs. The Y axis shows backbone
agreement with experiment in GDT_TS units (19). On this scale, a random
model scores approximately 20 to 30%, a correctly folded model around
50%, and a model within experimental accuracy around 90%. Open circles
show CASP15 best-model results for each target; trend lines show
performance for each CASP. Overall best performance of CASP15 (black
line) is similar to that in CASP14 (blue line). The dotted line shows
best server performance in CASP15. AlphaFold2-based methods dominate
best performance. However, performance with standard AlphaFold2
protocols available at the time of the experiment is lower (dashed black
lines). The X-axis difficulty scale represents the extent to which
homology-based modeling could be utilized. Targets in each CASP are ordered by
difficulty calculated as a cumulative rank of the sequence identity and
the coverage of the target by the best homologous structure available at
the time of each experiment. The templates are found by running Foldseek
(20) and LGA (21) against experimental structures deposited in the PDB.
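As a rough illustration of this difficulty measure, the cumulative rank of template sequence identity and coverage could be computed along the following lines (Python). The target names and template statistics below are placeholders, and details such as ranking direction and tie handling are assumptions rather than the assessors' exact procedure.

    # Sketch of a cumulative-rank difficulty ordering from best-template
    # sequence identity and coverage. Ranking conventions are assumptions.

    def rank(values, reverse=False):
        """Return 1-based ranks of values (ties broken by position)."""
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=reverse)
        ranks = [0] * len(values)
        for r, i in enumerate(order, start=1):
            ranks[i] = r
        return ranks

    # Placeholder per-target best-template statistics: (name, identity %, coverage %)
    targets = [("target_A", 35.0, 90.0), ("target_B", 12.0, 20.0), ("target_C", 18.0, 55.0)]
    names = [t[0] for t in targets]
    identity = [t[1] for t in targets]
    coverage = [t[2] for t in targets]

    # Higher identity and coverage mean an easier target, so they get lower ranks.
    id_rank = rank(identity, reverse=True)
    cov_rank = rank(coverage, reverse=True)
    cumulative = [i + c for i, c in zip(id_rank, cov_rank)]

    # Order targets from easiest (lowest cumulative rank) to hardest.
    for name, score in sorted(zip(names, cumulative), key=lambda x: x[1]):
        print(name, score)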
All the most successful CASP15 methods are based on the CASP14 DeepMind
AlphaFold2 method, AF2, and subsequent DeepMind updates. But strikingly,
two versions of standard AF2 procedures (dashed black lines), one using
models from the ColabFold server (22) and one using a local installation
of AF2 with default parameters (23), have substantially worse
performance, and that performance falls off markedly as homology
information weakens. We
consulted DeepMind about this finding, and they undertook to run the
CASP targets internally. The results from that process are not shown
since they are not official CASP outcomes, but overall are very similar
to the CASP15 best performance line. The primary reason for the
difference between best CASP15 and standard AF2 performance appears to be
that the most successful methods all used greater sampling of possible
structures than the AF2 defaults, including different combinations of a
larger number of seeds, more recycles, and network dropout. Most also
used customized and sometimes enhanced multiple sequence alignments.
There are two important conclusions. First, as of mid-2022, AF2-based
methods were clearly more accurate than others. The next best
performance was from RoseTTAFold (24), a method developed by the Baker
group following AF2 principles (note this is RoseTTAFold version 1;
version 2 has since been released (25)). Participating deep learning
Large Language Models (LLMs) did not perform well. Second, to get the
best results from AF2 it is often necessary to sample more extensively
than the default parameters allow and to carefully choose and adjust the
multiple sequence alignment. For about one-third of the targets, there is
a gain in GDT_TS of 10% or more from using the improved methods with
enhanced sampling. For a few targets the difference in GDT_TS is more
than two-fold, but nearly all of those cases are domains of large
proteins. As discussed below, these targets are generally more
challenging.
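The enhanced-sampling strategy described above (more seeds, more recycles, dropout at inference, followed by selection of the most confident model) can be sketched as follows. The prediction function here is a hypothetical stand-in for an AF2 or ColabFold call, and the parameter names and the use of mean pLDDT for model selection are assumptions for illustration, not a description of any group's exact protocol.

    import random

    def predict_structure(sequence, seed, num_recycles=6, use_dropout=True):
        # Hypothetical stand-in for an AF2/ColabFold prediction; a real run would
        # pass the seed, recycle count, and dropout setting to the actual pipeline.
        random.seed(seed)
        coords = None                        # placeholder for predicted coordinates
        mean_plddt = random.uniform(40, 95)  # placeholder confidence score
        return coords, mean_plddt

    def sample_predictions(sequence, num_seeds=20, num_recycles=6, use_dropout=True):
        # Run many independently seeded predictions and keep the most confident one.
        best = None
        for seed in range(num_seeds):
            coords, plddt = predict_structure(sequence, seed, num_recycles, use_dropout)
            if best is None or plddt > best[1]:
                best = (coords, plddt, seed)
        return best

    coords, plddt, seed = sample_predictions("MKT...LE", num_seeds=20)
    print(f"best seed: {seed}, mean pLDDT: {plddt:.1f}")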
Figure 3 shows that while many targets achieve a GDT_TS score of at
least 90%, there are also some low-scoring targets. The lowest-scoring
target is T1131. This protein is a member of a rapidly evolving family
of proteins in aphids, likely involved in creating a gall on the plant
host (26). There were no detectable sequence relatives outside that
family, and the family sequences were only available on a specialized
web site. Thus, this appears to be a clear example of a single sequence
being insufficient to produce a good model, a deficiency also seen in
DeepMind’s benchmarking (27) and in the previous CASP (4). It was hoped
that Large Language Models would remove this limitation (for example
(28)), but in CASP15 that was not yet the case. T1122, the next-lowest-performing
target, has a best GDT_TS of just over 40%. This protein
also has a shallow sequence alignment, but the crystal structure may be
highly influenced by the very tight intermolecular interactions (21%
crystal solvent content). Supplementary Figure 1 shows that in CASP15
there is a tendency for performance to fall off with decreasing
alignment depth, and this is much more pronounced when the ratio of
alignment depth to target length is below 0.1 (fewer than 10
appropriately diverse sequences per 100 residues). As in the previous
CASP, there are also targets with shallow alignments which nevertheless
have high agreement with experiment. That is, consistent with DeepMind’s
benchmarking (27), shallow alignments sometimes result in poor models
but in other instances are fine. There are nine other targets with a
best GDT_TS score between 70 and 80. One of these is an NMR target
(T1155) containing a large flexible loop (personal communication,
Luciano Abriata), so it may have multiple conformations. As discussed
later, the new CASP ensembles category is beginning to shed light on
this type of issue. The others are all domains of large targets
(~1200 to over 4000 residues), and in general,
performance seems to be a little worse for domains of very large
proteins. Unlike in the previous CASP, there is no fall-off in
performance with experimental resolution.
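For concreteness, the normalized alignment depth criterion mentioned above (Supplementary Figure 1) amounts to the simple calculation sketched below. A real assessment would count effective, appropriately diverse sequences after filtering; the raw count and example numbers here are assumptions for illustration only.

    def normalized_depth(num_sequences, target_length):
        # Alignment depth divided by target length; below 0.1 (fewer than 10
        # sequences per 100 residues) model accuracy tends to fall off.
        return num_sequences / target_length

    # Hypothetical example: a 250-residue target with 12 usable aligned sequences.
    ratio = normalized_depth(num_sequences=12, target_length=250)
    print(f"depth/length = {ratio:.3f}",
          "shallow" if ratio < 0.1 else "adequate")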