Figure 3: CASP15 performance for protein tertiary structure, compared with earlier CASPs. The Y axis shows backbone agreement with experiment in GDT_TS units (19). On this scale, a random model scores approximately 20 to 30%, a correctly folded model around 50, and a model within experimental accuracy around 90. Open circles show the best model results for each CASP15 target; trend lines show performance for each CASP. Overall best performance in CASP15 (black line) is similar to that in CASP14 (blue line). The dotted line shows the best server performance in CASP15. AlphaFold2-based methods dominate best performance. However, performance with the standard AlphaFold2 protocols available at the time of the experiment is lower (dashed black lines). The X-axis difficulty scale represents the extent to which homology-based modeling could be utilized. Targets in each CASP are ordered by difficulty, calculated as a cumulative rank of the sequence identity and the coverage of the target by the best homologous structure available at the time of each experiment. Templates were found by running Foldseek (20) and LGA (21) against experimental structures deposited in the PDB.
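As a rough illustration of the cumulative-rank difficulty ordering described in the caption, the sketch below ranks each target separately by the sequence identity and the coverage of its best template and sums the two ranks. It is a simplified stand-in, not the official CASP difficulty code; the target identifiers and values in the example are invented.

```python
# Illustrative sketch of a template-based difficulty ranking in the spirit of
# the Figure 3 X axis: targets are ranked separately by best-template sequence
# identity and by coverage, and the two ranks are summed (lower sum = easier).
# This is a simplified stand-in, not the official CASP implementation.

def difficulty_order(targets):
    """targets: dict mapping target id -> (seq_identity, coverage), both as
    fractions of the target. Returns ids ordered from easiest to hardest."""
    ids = list(targets)

    def ranks(values):
        # Rank 0 = highest value (most usable homology); ties broken arbitrarily.
        order = sorted(range(len(values)), key=lambda i: -values[i])
        r = [0] * len(values)
        for rank, idx in enumerate(order):
            r[idx] = rank
        return r

    identity_rank = ranks([targets[t][0] for t in ids])
    coverage_rank = ranks([targets[t][1] for t in ids])
    cumulative = {t: identity_rank[i] + coverage_rank[i] for i, t in enumerate(ids)}
    return sorted(ids, key=lambda t: cumulative[t])

# Example with made-up identities/coverages for three hypothetical targets:
print(difficulty_order({"A": (0.42, 0.95), "B": (0.10, 0.30), "C": (0.18, 0.60)}))
# -> ['A', 'C', 'B']
```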
All the most successful CASP15 methods are based on the CASP14 DeepMind AlphaFold2 (AF2) method and subsequent DeepMind updates. But strikingly, two versions of standard AF2 procedures (dashed black lines), one using models from the ColabFold server (22) and the other a local installation of AF2 with default parameters (23), have substantially worse performance, and that performance falls off markedly as homology information weakens. We consulted DeepMind about this finding, and they undertook to run the CASP targets internally. The results from that process are not shown, since they are not official CASP outcomes, but overall they are very similar to the CASP15 best performance line. The primary reason for the difference between the best CASP15 and standard AF2 performance appears to be that the most successful methods all sampled possible structures more extensively than the AF2 defaults, using different combinations of a larger number of seeds, more recycles, and network dropout. Most also used customized and sometimes enhanced multiple sequence alignments.
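A hedged sketch of this kind of expanded sampling is shown below. The function and parameter names (run_af2_model, random_seed, num_recycles, dropout) are placeholders rather than the actual AlphaFold2 or ColabFold interface; the point is only the shape of the procedure: many seeds, extra recycles, optional dropout, and selection of the highest-confidence model.

```python
# Hedged sketch of expanded sampling as described above: many random seeds,
# extra recycles, and dropout at inference time, keeping the highest-confidence
# model. `run_af2_model` is a placeholder callable, not the real AF2/ColabFold
# API; real parameter names and defaults differ by implementation.
from itertools import product

def sample_structures(sequence, msa, run_af2_model,
                      num_seeds=20, num_recycles=12, use_dropout=(False, True)):
    best = None
    for seed, dropout in product(range(num_seeds), use_dropout):
        model = run_af2_model(sequence, msa,
                              random_seed=seed,
                              num_recycles=num_recycles,  # more than typical defaults
                              dropout=dropout)            # stochastic network sampling
        # Rank candidates by a predicted confidence score (mean pLDDT here).
        if best is None or model["mean_plddt"] > best["mean_plddt"]:
            best = model
    return best
```

Selecting by mean pLDDT is only one plausible ranking criterion for the sampled candidates; the sketch does not imply it is what any particular group used.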
There are two important conclusions. First, as of mid-2022, AF2-based methods were clearly more accurate than others. The next best performance was from RoseTTAFold (24), a method developed by the Baker group following AF2 principles (note this is RoseTTAFold version 1; version 2 has since been released (25)). Participating methods based on deep learning large language models (LLMs) did not perform well. Second, to get the best results from AF2 it is often necessary to sample more extensively than the default parameters allow and to carefully choose and adjust the multiple sequence alignment. For about one-third of the targets, there is a gain in GDT_TS of 10% or more from using the improved methods with enhanced sampling. For a few targets the difference in GDT_TS is more than two-fold, but nearly all of those cases are domains of large proteins. As discussed below, these targets are generally more challenging.
Figure 3 shows that while many targets achieve a GDT_TS score of at least 90%, there are also some low-scoring targets. The lowest-scoring target is T1131. This protein is a member of a rapidly evolving family of proteins in aphids, likely involved in creating a gall on the plant host (26). There were no detectable sequence relatives outside that family, and the family sequences were only available on a specialized web site. Thus, this appears to be a clear example of a single sequence being insufficient to produce a good model, a deficiency also seen in DeepMind's benchmarking (27) and in the previous CASP (4). It was hoped that large language models would remove this limitation (for example (28)), but in CASP15 that was not yet the case. T1122, the next-lowest-performing target, has a best GDT_TS of just over 40%. This protein also has a shallow sequence alignment, but the crystal structure may be highly influenced by the very tight intermolecular interactions (21% crystal solvent content). Supplementary Figure 1 shows that in CASP15 there is a tendency for performance to fall off with decreasing alignment depth, and this is much more pronounced when the ratio of alignment depth to target length is below 0.1 (fewer than 10 appropriately diverse sequences per 100 residues). As in the previous CASP, there are also targets with shallow alignments that nevertheless have high agreement with experiment. That is, consistent with DeepMind's benchmarking (27), shallow alignments sometimes result in poor models but in other instances are fine. There are nine other targets with a best GDT_TS score between 70 and 80. One of these is an NMR target (T1155) containing a large flexible loop (personal communication, Luciano Abriata), so it may have multiple conformations. As discussed later, the new CASP ensembles category is beginning to shed light on this type of issue. The others are all domains of large targets (~1200 to over 4000 residues), and in general, performance seems to be a little worse for domains of very large proteins. Unlike in the previous CASP, there is no fall-off in performance with experimental resolution.
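For concreteness, a minimal sketch of the alignment-depth criterion just described is given below. It simply counts sequences in the MSA; the assessment refers to appropriately diverse sequences, which a real implementation would estimate with a redundancy filter (for example, an effective sequence count) rather than a raw count. The function name and example values are illustrative only.

```python
# Minimal sketch of the alignment-depth criterion discussed above: flag targets
# whose ratio of alignment depth to target length falls below 0.1 (fewer than
# 10 sequences per 100 residues). "Depth" here is a raw sequence count; a real
# implementation would first filter the MSA for appropriately diverse sequences.

def is_shallow_alignment(num_aligned_sequences, target_length, threshold=0.1):
    """Return True if the MSA is shallow relative to the target length."""
    depth_ratio = num_aligned_sequences / target_length
    return depth_ratio < threshold

# Example: 25 sequences for a 400-residue target -> ratio 0.0625, flagged shallow.
print(is_shallow_alignment(25, 400))  # True
```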