Brian Naughton | Sat 13 December 2025 | ai | biotech ai

People are very excited about Anthropic's new Opus 4.5 model, and I am too. It is arguably the first coding model that can code continuously for hours without hitting a wall or entering a doom loop (continually producing the same bugs over and over).

Opus 4.5 has crossed a threshold where it has led to what appears to be a permanent change in how I work, so I wanted to write up a short article on this, with a real-world example.

For software engineers, it's obvious how coding agents help: they write code for you. For computational scientists, writing code is one step of many: you read papers, download tools and data, log the steps and parameters of the experiment, plot results and write it all up. This is where agents like Claude Code shine.

Claude Code

There are two main ways to use Opus 4.5: in the Claude chat interface, just like ChatGPT etc., or as an agent in Claude Code. The difference is that an agent is a program running on your computer: it doesn't just produce text, it can run arbitrary commands in the terminal on your behalf.

With Opus 4.5, Claude Code is good enough that it is starting to become my primary interface to the terminal, not just my primary interface to code. This is a little hard to explain, but I will show a real-life example from my own work that hopefully illustrates the point.

You can categorize the eras kind of like the levels of self-driving cars. The first era, with zero AI, ended just a few years ago, and I now feel like I am on era four. Things are progressing quickly!

  1. Manual: I write code; I run code; I review output (most of my career!)
  2. Copilot: I co-write code with AI (e.g., in an IDE like Cursor); I run code; I review output
  3. Human-in-the-loop: Claude Code (CC) writes code; I read code to check it; I run code; I review output
  4. Agent: CC writes code; CC runs code; CC and I review output
  5. Teams of agents: A team of CC agents write code, run code and review output over multiple hours; they contact me when they need input
  6. Autonomous: A team of CC agents work collaboratively and never need my input??

Adding ipSAE to af2rank

The task here is to add the ipSAE statistic to my af2rank modal app in the biomodals repo. The details don't matter too much, but ipSAE is a popular method of scoring protein–protein interactions—for example it is used in the recent Nipah protein design competition from Adaptyv—and there is a reference implementation on github.
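For the curious, the gist of ipSAE is a pTM-style transform of the PAE matrix, restricted to interchain residue pairs under a PAE cutoff. Below is a very rough sketch of the idea, not the reference implementation (that lives in the IPSAE repo and also handles pDockQ, pDockQ2, and LIS); here `pae` is an NxN PAE matrix in angstroms and `chain_ids` is a per-residue array of chain labels.

```python
import numpy as np

def ptm_func(pae, d0):
    """pTM-style transform: close to 1 when PAE is well below d0."""
    return 1.0 / (1.0 + (pae / d0) ** 2)

def calc_d0(n_res):
    """TM-score-style d0 normalization (floored at 1.0 in this sketch)."""
    return max(1.0, 1.24 * (max(n_res, 19) - 15) ** (1.0 / 3.0) - 1.8)

def ipsae_like(pae, chain_ids, chain_a="A", chain_b="B", pae_cutoff=10.0):
    """Simplified A->B score: for each residue in chain A, average the pTM-like
    values over chain-B residues whose PAE is below the cutoff (d0 set by the
    count of those residues), then report the best-scoring residue."""
    a_idx = np.where(chain_ids == chain_a)[0]
    b_idx = np.where(chain_ids == chain_b)[0]
    best = 0.0
    for i in a_idx:
        row = pae[i, b_idx]
        sel = row < pae_cutoff
        if sel.any():
            best = max(best, float(ptm_func(row[sel], calc_d0(int(sel.sum()))).mean()))
    return best
```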

Below are my two(!) prompts to Claude, abridged responses from Claude, and the final result. The full transcript is long so I have included it at the end of the post. There you can see more of the warts and debugging, and how much the model does besides just write code.

[Me] >clone the https://github.com/DunbrackLab/IPSAE repo, extract the ipsae scoring function, and add it to the af2rank modal app

[Claude clones the IPSAE github repo, finds the relevant files, reads the code, finds my modal_af2rank.py file, reads it, and updates it]

[Me] >ok test the repo to make sure it works using a test pdb file as in the docstring; then run ipsae and compare the results to make sure the calculation is correct

[Claude downloads a test pdb file, runs the modal app, learns how to run the reference IPSAE code, debugs several issues, and finally produces a working implementation, which it validates by checking the results against the reference. Finally it summarizes all it did.]

[Claude]
⏺ Now all scores match:

  Final Comparison:

  | Metric        | Our Implementation | Original IPSAE |
  |---------------|--------------------|----------------|
  | A→B ipSAE     | 0.2760             | 0.2760 ✅      |
  | B→A ipSAE     | 0.2862             | 0.2862 ✅      |
  | A-B_max ipSAE | 0.2862             | 0.2862 ✅      |
  | A→B LIS       | 0.2998             | 0.2998 ✅      |
  | B→A LIS       | 0.2206             | 0.2206 ✅      |
  | A-B_max LIS   | 0.2602             | 0.2602 ✅      |
  | pDockQ        | 0.1476             | 0.1476 ✅      |
  | A→B pDockQ2   | 0.1612             | 0.1612 ✅      |
  | B→A pDockQ2   | 0.1289             | 0.1289 ✅      |

⏺ All IPSAE scores now match exactly between our implementation and the original IPSAE script.

  Summary of changes made to modal_af2rank.py:

  1. Added IPSAE scoring functions (lines 66-260):
    - ptm_func() / ptm_func_vec() - PTM-like scoring
    - calc_d0() - TM-score d0 normalization
    - calc_ipsae_scores() - Computes ipSAE, pDockQ, pDockQ2, and LIS for all chain pairs
  2. Modified _get_score() to compute IPSAE scores for multimer models
  3. Added CLI parameters --pae-cutoff (default 10.0) and --dist-cutoff (default 8.0)
  4. Added PAE export for external comparison (af2_scores.json and *_predicted.pdb)

Full disclosure: I have asked Claude Code to do this exact task before, just in a different codebase, and that time it actually went more smoothly. I posted that on twitter, and then thought it might make for an interesting post on the blog.

This time it was not quite as slick, but this represents a more realistic example of my typical Claude Code interactions. The path was circuitous, but it got there in the end, with zero intervention on my part. The final result is functional but overwrought.

Conclusions

I don't test all the models all the time, but Opus 4.5 is the first model I've used that can keep context and make linear progress like this, and others seem to agree. I should also note I pay $200 per month for Claude Code access, but there are lower tiers. OpenAI and Gemini both have tools similar to Claude Code. There are also cheaper but workable Chinese models like Qwen. If these are not as good as Opus 4.5 today, they will be very soon.

People are often concerned with AIs writing "slop" code, meaning poor quality or unmaintainable code. Honestly I'm not totally sure I should care about that any more? If it is maintainable by Claude, that might be all that matters.

This workflow is powerful for software, but in software you probably have to care more about maintainability and consistency. Specifically, the crux may be whether a team of people is working on the same code. There are definitely examples online of software teams trying out agents, but then reverting to no AI for this reason. Even fully AI-pilled experts like Karpathy find full vibe-coding a poor fit for many kinds of code.

I would argue these agents are a better fit for scientific code because it's usually not the code that matters but the insight. If you can test ten hypotheses in a day instead of one, or if a scientist with no deep learning experience can now finetune a model themselves, the maintainability of the code, or even the language the code is written in, matters much less.

The relationship here is more like PI–student, and in general a PI would not look at code, but would ask questions to make sure it's working as intended. You can talk to Claude in the same way ("did you check this?", "did you read this paper?").

I have started doing projects from start to finish like this, never leaving Claude Code, and that includes writing up all the steps (which is tedious work) and detailing the results. It's not always faster or better, it screws up frequently, and you have to learn how to handle "context" (i.e., not overloading Claude with too much information), but it is powerful in a very compelling way. Even though it's only been a couple of weeks, this is how I now kickstart any new project and I can't see myself going back.

Claude Code transcript

>clone the https://github.com/DunbrackLab/IPSAE repo, extract the ipsae scoring function, and add it to the af2rank modal app
⏺ Bash(git clone https://github.com/DunbrackLab/IPSAE.git)
⏺ Read(modal_af2rank.py)
⏺ Bash(ls -la IPSAE/)
⏺ Read(IPSAE/ipsae.py)
⏺ Now I understand both files. The IPSAE script is a comprehensive scoring function for protein-protein interactions that calculates ipSAE, pDockQ, pDockQ2, and LIS scores. I need to extract the core scoring logic and integrate it into the af2rank modal app.
⏺ Read(IPSAE/README.md)
⏺ Now I understand the IPSAE scoring. For AF2Rank, I need to extract the key calculations:
  - ipSAE: Interface predicted scoring using PAE with cutoff
  - pDockQ: Based on pLDDT and contact counts
  - pDockQ2: Based on PAE values at interface
  - LIS: Local Interaction Score
  The key insight is that AF2Rank already calculates PAE and pLDDT, so I can add IPSAE scoring using those values. Let me create a clean implementation that works with the ColabDesign output.
⏺ Update(modal_af2rank.py)
⏺ Now I need to update the _get_score method and the run_af2rank function to compute and include IPSAE scores. Let me first look at how to extract the necessary information from the ColabDesign model:
⏺ Read(modal_af2rank.py)
⏺ I see that the model has access to all_atom_positions which includes CA (index 1) and CB (index 3) positions. I also see that score["pae"] is scaled by 31.0 from the model output. To compute IPSAE scores, I need to:
  1. Get the PAE matrix (unscaled, as angstroms)
  2. Get pLDDT per residue
  3. Get CB coordinates
  4. Get chain assignments
  The challenge is getting chain information. For multimer models, this should be stored in the model. Let me modify _get_score to compute IPSAE when using multimer:

⏺ Now I need to update the predict method to pass through pae_cutoff and dist_cutoff parameters:
⏺ Now I need to update the call to _get_score() in the predict method to pass the cutoffs:
⏺ Now I need to update the run_af2rank function to add the pae_cutoff and dist_cutoff parameters and pass them through:
⏺ Now let me also update the local entrypoint to add these parameters:
⏺ Now I need to check if the pLDDT is stored as per-residue array in the model. Let me look at how pLDDT is handled in the af2rank class - I see it converts score["plddt"] to a scalar. For IPSAE we need the per-residue array. Let me check the ColabDesign aux output more carefully:
⏺ Now let me also add IPSAE scores to the print output:
⏺ The JSON serialization may have issues with numpy arrays in the score dictionary. Let me check the results saving and ensure the ipsae scores are properly serialized
⏺ I need to add a custom JSON encoder for numpy types:
⏺ Here's a summary of the changes made to modal_af2rank.py:

  Added IPSAE scoring to AF2Rank:

  1. New scoring functions (lines 66-260):
    - ptm_func() / ptm_func_vec() - PTM-like scoring function
    - calc_d0() - TM-score d0 normalization
    - calc_ipsae_scores() - Main function computing ipSAE, pDockQ, pDockQ2, and LIS scores for all chain pairs
  2. Modified _get_score() (lines 437-510):
    - Added pae_cutoff and dist_cutoff parameters
    - For multimer models, extracts per-residue pLDDT, PAE matrix, CB coordinates, and chain assignments
    - Calls calc_ipsae_scores() and adds results to score dict
  3. Modified predict() (lines 507-628):
    - Added pae_cutoff and dist_cutoff parameters
    - Prints IPSAE scores when verbose
  4. Modified run_af2rank() and main():
    - Added pae_cutoff (default 10.0) and dist_cutoff (default 8.0) CLI parameters
    - Added JSON serialization helper for numpy types

  Usage with multimer:
  uv run modal run modal_af2rank.py --input-pdb complex.pdb --model-name "model_1_multimer_v3" --chains "A,B"

  The IPSAE scores will be computed automatically for multimer models and included in the results.json output.

> ok test the repo to make sure it works using a test pdb file as in the docstring; then run ipsae and compare the results to make sure the calculation is correct
⏺ Bash(wget -q https://files.rcsb.org/download/4KRL.pdb -O 4KRL.pdb && head -50 4KRL.pdb)
⏺ Bash(uv run modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chains "A,B" --run-name test_ipsae 2>&1) timeout: 10m 0s
⏺ It ran successfully! The IPSAE scores show 0.0 which seems suspicious - all scores are zero. Let me check the results and also run the original IPSAE script to compare:
⏺ The PAE is 3.38 which is good (low is better), but the IPSAE scores are all zeros. The problem is likely that I'm computing IPSAE from the PAE matrix incorrectly - the PAE matrix values from ColabDesign might already be scaled differently, or the chain assignments might be wrong.
⏺ Read(modal_af2rank.py)
⏺ Let me look at the _get_score method where the IPSAE calculation happens:
⏺ Read(modal_af2rank.py)
⏺ The issue is that self.model.aux["pae"] might be a scalar (the mean PAE), not the full matrix. Let me add some debug output to understand what's happening:
⏺ Bash(uv run --with modal==1.2.1 modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chai…) timeout: 10m 0s
⏺ The PAE matrix values are way too high: min=23.25, max=893.24. This is way above any reasonable PAE cutoff. The issue is that I'm multiplying by 31 but the values are already in some other scale. The
  score["pae"] in the log shows 3.379 which is the mean - so the raw matrix values must already be in angstroms or a different scale.

  Looking at the original code in _get_score:
  score["pae"] = 31.0 * score["pae"]  # This converts the scalar log value

  But the raw self.model.aux["pae"] matrix might not need the 31x scaling. Let me check:

⏺ Bash(uv run --with modal==1.2.1 modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chai…) timeout: 10m 0s
⏺ Now we get non-zero IPSAE scores:
  - A-B: ipSAE=0.2760
  - B-A: ipSAE=0.2862
  - A-B_max: ipSAE=0.2862

  The PAE matrix now has reasonable values: min=0.75, max=28.81 angstroms. Let me remove the debug output and then run the original IPSAE script to compare:
⏺ Update(modal_af2rank.py)
⏺ Now let me run the original IPSAE script. First I need to save the PAE matrix from af2rank to a JSON file that IPSAE can read. But wait - the IPSAE script expects AlphaFold output files (PAE JSON + PDB). We're running AF2Rank which uses ColabDesign internally.

  Let me check what we can do - we need to either:
  1. Export the PAE matrix and pLDDT from af2rank to files IPSAE can read
  2. Or run an actual AlphaFold prediction on 4KRL and then compare

  Actually, the simplest approach is to save the PAE matrix and pLDDT from af2rank and run the IPSAE calculation on those. Let me modify the run_af2rank function to also save the PAE and pLDDT data:
⏺ Read(modal_af2rank.py)
⏺ Let me add PAE matrix export to the run_af2rank function so we can compare with the original IPSAE script:
⏺ Update(modal_af2rank.py)
⏺ Bash(uv run --with modal==1.2.1 modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chai…) timeout: 10m 0s
⏺ Bash(ls -la out/af2rank/test_ipsae_final/)
⏺ Now let me run the original IPSAE script with the exported data:
⏺ Bash(python IPSAE/ipsae.py out/af2rank/test_ipsae_final/af2_scores.json out/af2rank/test_ipsae_final/4KRL_af2rank.pdb 10 10 2>&1) timeout: 1m 0s
⏺ There's a mismatch in the number of residues. The PDB file has 331 residues but the distance matrix has 333. This is likely because the original 4KRL.pdb has different residues than what ColabDesign outputs. Let me check:
⏺ Bash(grep "^ATOM" 4KRL.pdb | grep " CA " | wc -l && grep "^ATOM" out/af2rank/test_ipsae_final/4KRL_af2rank.pdb | grep " CA " | wc -l)
⏺ Both have 327 CA atoms which matches the PAE matrix size (327x327). Let me check what's happening in the IPSAE script:
⏺ Bash(python -c "...)
⏺ The data looks correct. The issue is that the IPSAE script is reading something else from the PDB. Let me check the original 4KRL.pdb vs the af2rank output:
⏺ Bash(grep "^ATOM" 4KRL.pdb | head -5 && echo "---" && grep "^ATOM" out/af2rank/test_ipsae_final/4KRL_af2rank.pdb | head -5)
⏺ The PDB files are identical. Let me check how IPSAE reads the PDB file - it might be counting something differently:
⏺ Bash(python -c "...)
⏺ Everything matches. The issue must be in how IPSAE is reading the file. Looking at the error more closely - it says shapes (331,) and (333,). Let me check what IPSAE is actually reading:
⏺ Bash(cd out/af2rank/test_ipsae_final && python ../../../IPSAE/ipsae.py af2_scores.json 4KRL_af2rank.pdb 10 10 2>&1 | head -50)
⏺ Let me add some debug output to the IPSAE script to understand what's happening:
⏺ Bash(python -c "...)
⏺ Bash(pwd && ls out/af2rank/)
⏺ Bash(ls out/af2rank/test_ipsae_final/)
⏺ It looks like IPSAE already ran successfully! There are output files. Let me check the IPSAE output:
⏺ Bash(cat out/af2rank/test_ipsae_final/4KRL_af2rank_10_10.txt)
⏺ Read(out/af2rank/test_ipsae_final/4KRL_af2rank_10_10.txt)
⏺ The file is empty. Let me run IPSAE fresh:
⏺ Bash(python IPSAE/ipsae.py out/af2rank/test_ipsae_final/af2_scores.json out/af2rank/test_ipsae_final/4KRL_af2rank.pdb 10 10 2>&1)
⏺ Let me look at what's happening in IPSAE around line 612. The issue is the mismatch between chains (331) and distances (333). Let me add debug output to understand:
⏺ Read(IPSAE/ipsae.py)
⏺ I see - chains is built from CA atoms (line 379, residues), but distances is built from coordinates which comes from cb_residues (line 378). The issue is that cb_residues has a different count because GLY residues add their CA to cb_residues too. Let me check the PDB more carefully:
⏺ Bash(python -c "...)
⏺ So our PDB should be fine (327 CA, 327 CB+GLY). The error mentions 331 and 333 which doesn't match our data. Let me check if there's something else going on - maybe the IPSAE script is reading a different file or the arrays are being built incorrectly for AF2 format:
⏺ Bash(python -c "...)
⏺ Now I see the problem! The PDB has 331 CA residues and 333 CB residues, but our PAE matrix is 327x327. The issue is that the original 4KRL.pdb has more residues than what ColabDesign processed. The 4KRL_af2rank.pdb file is just the original input - it wasn't updated by af2rank. Let me check if we're copying the wrong file:
⏺ Read(modal_af2rank.py)
⏺ Yes - the PDB being saved is pdb_str which is the original input, not the processed structure from ColabDesign. I need to save the predicted structure instead. Let me also save the output PDB from af2rank. But actually, for IPSAE comparison we need the original structure with the same residues as the PAE matrix.  The issue is that ColabDesign processes the input and may remove some residues. We should save the structure that ColabDesign actually used. Let me modify to save the model's output PDB:
⏺ Update(modal_af2rank.py)
⏺ Bash(uv run --with modal==1.2.1 modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chai…) timeout: 10m 0s
⏺ Bash(ls -la out/af2rank/test_ipsae_compare/)
⏺ Now let me check the predicted PDB size and run IPSAE on it:
⏺ Bash(grep "^ATOM" out/af2rank/test_ipsae_compare/4KRL_predicted.pdb | grep " CA " | wc -l)
⏺ 327 CA atoms - matches the PAE matrix. Now let me run IPSAE:
⏺ Bash(python IPSAE/ipsae.py out/af2rank/test_ipsae_compare/af2_scores.json out/af2rank/test_ipsae_compare/4KRL_predicted.pdb 10 10 2>&1)
⏺ Bash(cat out/af2rank/test_ipsae_compare/4KRL_predicted_10_10.txt)
⏺ IPSAE ran successfully. Now let me compare the results:
⏺ Comparison of Results:

  | Metric        | Our Implementation (af2rank) | Original IPSAE  |
  |---------------|------------------------------|-----------------|
  | A→B ipSAE     | 0.2760                       | 0.2760          |
  | B→A ipSAE     | 0.2862                       | 0.2862          |
  | A-B_max ipSAE | 0.2862                       | 0.2862          |
  | A→B LIS       | 0.2998                       | 0.2998          |
  | B→A LIS       | 0.2206                       | 0.2206          |
  | A-B_max LIS   | 0.2602                       | 0.2602          |
  | pDockQ        | 0.0183                       | 0.1476          |
  | pDockQ2       | 0.0074                       | 0.1612 / 0.1289 |

  The ipSAE and LIS scores match exactly! However, there's a significant difference in pDockQ and pDockQ2. Let me investigate the pDockQ calculation:
⏺ Read(modal_af2rank.py)
⏺ I see the issue. The pDockQ calculation uses dist_cutoff which we pass as a parameter (8.0 by default), but the original IPSAE uses a hardcoded 8.0 Å for pDockQ. More importantly, the pLDDT values are in
  different scales:
  - Our implementation uses ColabDesign's pLDDT which is 0-1 scale
  - IPSAE expects pLDDT in 0-100 scale

  Let me check the IPSAE code for pDockQ:
⏺ I see - IPSAE uses cb_plddt which comes from the JSON file. Let me check how it's loaded:
⏺ Read(IPSAE/ipsae.py)
⏺ The IPSAE script uses data['plddt'] directly. In our export, we multiply by 100:
  "plddt": (plddt_per_res * 100).tolist(),  # IPSAE expects 0-100 scale

  But in our internal calculation, we use the raw 0-1 pLDDT from ColabDesign. Let me fix the pDockQ calculation to use the 0-100 scale:
⏺ Update(modal_af2rank.py)
⏺ Bash(eval "$(uv run --with modal==1.2.1 modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chai…) timeout: 10m 0s
⏺ Now all scores match:

  Final Comparison:

  | Metric        | Our Implementation | Original IPSAE |
  |---------------|--------------------|----------------|
  | A→B ipSAE     | 0.2760             | 0.2760 ✅      |
  | B→A ipSAE     | 0.2862             | 0.2862 ✅      |
  | A-B_max ipSAE | 0.2862             | 0.2862 ✅      |
  | A→B LIS       | 0.2998             | 0.2998 ✅      |
  | B→A LIS       | 0.2206             | 0.2206 ✅      |
  | A-B_max LIS   | 0.2602             | 0.2602 ✅      |
  | pDockQ        | 0.1476             | 0.1476 ✅      |
  | A→B pDockQ2   | 0.1612             | 0.1612 ✅      |
  | B→A pDockQ2   | 0.1289             | 0.1289 ✅      |

⏺ All IPSAE scores now match exactly between our implementation and the original IPSAE script.

  Summary of changes made to modal_af2rank.py:

  1. Added IPSAE scoring functions (lines 66-260):
    - ptm_func() / ptm_func_vec() - PTM-like scoring
    - calc_d0() - TM-score d0 normalization
    - calc_ipsae_scores() - Computes ipSAE, pDockQ, pDockQ2, and LIS for all chain pairs
  2. Modified _get_score() to compute IPSAE scores for multimer models
  3. Added CLI parameters --pae-cutoff (default 10.0) and --dist-cutoff (default 8.0)
  4. Added PAE export for external comparison (af2_scores.json and *_predicted.pdb)
Brian Naughton | Mon 30 December 2024 | ai | ai biotech proteindesign

This article is a deeper look at Adaptyv's binder design competition, and some thoughts on what we learned. If you are unfamiliar with the competition, there is background information on the Adaptyv blog and my previous article.

The data

Adaptyv did a really nice job of packaging up the data from the competition (both round 1 and round 2). They also did a comprehensive analysis of which metrics predicted successful binding in this blogpost.

The data from round 2 is more comprehensive than round 1 — it even includes Alphafolded structures — so I downloaded the round 2 csv and did some analysis.

Regressions

Unlike the Adaptyv blogpost, which does a deep dive on each metric in turn, I just wanted to see how well I could predict binding affinity (Kd) using the following features provided in the csv: pae_interaction, esm_pll, iptm, plddt, design_models (converted to one-hot), seq_len (inferred from sequence). Three of these metrics (pae_interaction, esm_pll, iptm) were used to determine each entry's rank in the competition's virtual leaderboard, which was used to prioritize entries going into the binding assay.

I also added one more feature, prodigy_kd, which I generated from the PDB files provided using prodigy. Prodigy is an old-ish tool for predicting binding affinity that identifies all the major contacts (polar–polar, charged–charged, etc.) and reports a predicted Kd (prodigy_Kd).

I used the typical regression tools: Random Forest, Kaggle favorite XGBoost, SVR, linear regression, as well as just using the mean Kd as a baseline. There is not a ton of data here for cross-validation, especially if you split by submitter, which I think is fairest. If you do not split by submitter, then you can end up with very similar proteins in different folds.

# get data and script
git clone https://github.com/adaptyvbio/egfr_competition_2
cd egfr_competition_2/results
wget https://gist.githubusercontent.com/hgbrian/1262066e680fc82dcb98e60449899ff9/raw/regress_adaptyv_round_2.py
# run prodigy on all pdbs, munge into a tsv
find structure_predictions -name "*.pdb" | xargs -I{} uv run --with prodigy-prot prodigy {} > prodigy_kds.txt
(echo -e "name\tprodigy_kd"; rg "Read.+\.pdb|25.0˚C" prodigy_kds.txt | sed 's/.*\///' | sed 's/.*25.0˚C:  //' | paste - - | sed 's/\.pdb//') > prodigy_kds.tsv
# run regressions
uv run --with scikit-learn --with polars --with matplotlib --with seaborn --with pyarrow --with xgboost regress_adaptyv_round_2.py
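For reference, here is a minimal sketch of the split-by-submitter cross-validation idea described above. The filename and column names are illustrative, and the design_models one-hot encoding is omitted; the real logic is in regress_adaptyv_round_2.py.

```python
import numpy as np
import polars as pl
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold

# Illustrative filename and column names (design_models one-hot omitted for brevity)
df = pl.read_csv("results.csv").drop_nulls(subset=["kd"])
X = df.select(["pae_interaction", "esm_pll", "iptm", "plddt", "seq_len", "prodigy_kd"]).to_numpy()
y = np.log10(df["kd"].to_numpy())    # work in log(Kd) space
groups = df["submitter"].to_numpy()  # keep each submitter's designs in a single fold

rmses = []
for train, test in GroupKFold(n_splits=5).split(X, y, groups):
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X[train], y[train])
    rmses.append(mean_squared_error(y[test], model.predict(X[test])) ** 0.5)
print(f"RMSE (log units): {np.mean(rmses):.3f}")
```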

The results are not great! There are a few ways to slice the data (including replicates or not; including similarity_check or not; including non-binders or not). There is a little signal, but I think it's fair to say nothing was strongly predictive.


| Model                    |        | RMSE (log units) | Median Fold Error |
|--------------------------|--------|------------------|-------------------|
| Linear Regression        | 0.150  | 0.729            | 1.8x              |
| Random Forest Regression | 0.188  | 0.712            | 1.4x              |
| SVM Regression           | 0.022  | 0.781            | 1.2x              |
| XGBoost                  | 0.061  | 0.766            | 1.2x              |
| Mean Kd only             | -0.009 | 0.794            | 1.9x              |

XGBoost performance looks ok here but is not much more predictive than just taking the mean Kd

Surprisingly, no one feature dominates in terms of predictive power

Virtual leaderboard rank vs competition rank

If there really is no predictive power in these computational metrics, there should be no correlation between rank in the virtual leaderboard and rank in the competition. In fact, there is a weak but significant correlation (Spearman correlation ~= 0.2). However, if you constrain to the top 200 (of 400 total), there is no correlation. My interpretation is that these metrics can discriminate no-hope-of-binding from some-hope-of-binding, but not more than that.
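The check itself is just a rank correlation over the two orderings; a sketch, again with an illustrative filename and column names:

```python
import polars as pl
from scipy.stats import spearmanr

# Illustrative: the real leaderboard and competition ranks come from the round-2 data
df = pl.read_csv("results.csv").drop_nulls(subset=["virtual_rank", "competition_rank"])
rho_all, _ = spearmanr(df["virtual_rank"].to_numpy(), df["competition_rank"].to_numpy())

top = df.filter(pl.col("virtual_rank") <= 200)
rho_top, _ = spearmanr(top["virtual_rank"].to_numpy(), top["competition_rank"].to_numpy())
print(f"all entries: rho={rho_all:.2f}; top 200 only: rho={rho_top:.2f}")
```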

It may be too much to ask one set of metrics to work for antibodies (poor PLL, poor PAE?), de novo binders (poor PLL), and EGF/TNFa-derived binders (natural, so excellent PLL). However, since I include design_models as a covariate, the regression models above can use different strategies for different design types, so at the very least we know there is not a trivial separation that can be made.

BindCraft's scoring heuristics

So how can BindCraft work if it's mostly using these same metrics as heuristics? I asked this on twitter and got an interesting response.

It is possible that PyRosetta's InterfaceAnalyzer is adding a lot of information. However, if this were the case, you might expect Prodigy's Kd prediction to also help, which it does not. It is also possible that, because it uses AlphaFold2, the structures produced by BindCraft are inherently biased towards natural binding modes; perhaps part of the binding heuristics is implicit in the weights of the model?

What did we learn?

I learned a couple of things:

  • Some tools, specifically BindCraft, can consistently generate decent binders, at least against targets and binding pockets present in its training set (PDB). (The BindCraft paper also shows success with at least one de novo protein not present in the PDB.)
  • We do not have a way to predict if a given protein will bind a given target.

I think this is pretty interesting, and a bit counterintuitive. More evidence that we cannot predict binding comes from the Dickinson lab's Prediction Challenges, where the goal is to match the binder to the target. Apparently no approach can (yet).

The Adaptyv blogpost ends by stating that binder design has not been solved yet. This is clearly true. So what comes next?

  • We could find computational metrics that work, based on the current sequence and structure data. For example, BindCraft includes "number of unsatisfied hydrogen bonds at the interface" in its heuristics. I am skeptical that we can do a lot better with this approach. For one thing, Adaptyv has already iterated once on its ranking metrics, with negligible improvement in prediction.
  • We could get better at Molecular Dynamics, which probably contains some useful information today (at exorbitant computational cost), and could soon be much better with deep learning approaches.
  • We could develop an "AlphaFold for Kd prediction". There are certainly attempts at this, e.g., ProAffinity-GNN and the PPB-Affinity dataset to pick two recent examples, but I don't know if anything works that well. The big problem here, as with many biology problems, is a lack of data; PDBbind is not that big (currently ~2800 protein–protein affinities).

Luckily, progress in this field is bewilderingly fast so I'm sure we'll see a ton of developments in 2025. Kudos to Adaptyv for helping push things forward.

Brian Naughton | Sat 30 November 2024 | ai | ai biotech proteindesign

AlphaFold 3 (AF3) came out in May 2024 and includes several major advances over AlphaFold 2 (AF2). In this post I will give a brief review of AlphaFold 3 and compare the various open and less-open AF3-inspired models that have come out over the past six months. Finally, I will show some results from folding antibody complexes.

Alphafold 3

AF3 has many new capabilities compared to AF2: it can work with small molecules, nucleic acids, ions, and modified residues. It also arguably has a more streamlined architecture than AF2 (pairformer instead of evoformer, no rotation invariance).

310.ai did a nice review and small benchmark of AlphaFold3 that is worth reading.

The AF3 paper hardly shows any data comparing AF3 to AF2, and is mainly focused on its new capabilities with non-protein components. In all cases tested, AF3 matched or exceeded the state of the art. For most regular protein folding problems, AF3 and AF2 work comparably well (more specifically, AlphaFold-Multimer (AF2-M), the AF2 revision that allows multiple protein chains), though for antibodies there is a jump in performance.

Still, despite being an excellent model, AF3 gets relatively little discussion. This is because the parameters are not available so nobody outside DeepMind/Isomorphic Labs really uses it. The open source AF2-M still dominates, especially when used via the amazing colabfold project.

Alphafold-alikes

As soon as AF3 was published, the race was on to reimplement the core ideas. The chronology so far:

| Date    | Software    | Code available?                | Parameters available?       | Lines of Python code |
|---------|-------------|--------------------------------|-----------------------------|----------------------|
| 2024-05 | AlphaFold 3 | ❌ (CC-BY-NC-SA 4.0)           | ❌ (you must request access) | 32k                  |
| 2024-08 | HelixFold3  | ❌ (CC-BY-NC-SA 4.0)           | ❌ (CC-BY-NC-SA 4.0)         | 17k                  |
| 2024-10 | Chai-1      | ❌ (Apache 2.0, inference only) | ✅ (Apache 2.0)             | 10k                  |
| 2024-11 | Protenix    | ❌ (CC-BY-NC-SA 4.0)           | ❌ (CC-BY-NC-SA 4.0)         | 36k                  |
| 2024-11 | Boltz       | ✅ (MIT)                       | ✅ (MIT)                     | 17k                  |

There are a few other models that are not yet of interest: Ligo's AF3 implementation is not finished and is perhaps not under active development; LucidRains' AF3 implementation is not finished but is still under active development.

It's been pretty incredible to see so many reimplementation attempts within the span of a few months, even if most are not usable due to license issues.

Code and parameter availability

As a scientist who works in industry, it's always annoying to try to figure out which tools are ok to use or not. It causes a lot of friction and wastes a lot of time. For example, I started using ChimeraX a while back, only to find out after sinking many hours into it that this was not allowed.

There are many definitions of "open" software. When I say open I really mean you can use it without checking with a lawyer. For example, even if you are in academia, if the license says the code is not free for commercial use, then what happens if you start a collaboration with someone in industry? What if you later want to commercialize? These are common occurrences.

In some cases (AF3, HelixFold3, Protenix, and Chai-1), they make a server available, which is nice for perfunctory testing but precludes testing anything proprietary or folding more than a few structures. If you have the code and the training set, it would cost around $100k to train one of these models (specifically, the Chai-1 and Protenix papers give numbers in this range, though that is just the final run). So in theory there is no huge blocker to retraining. In practice it does not seem to happen, perhaps due to license issues.

The specific license matters. Before today, I thought MIT was just a more open Apache 2.0, but apparently there is an advantage to Apache 2.0 around patents! My non-expert conclusion is that the Unlicense, MIT, and Apache are usable, while GPL and CC-BY-NC-SA are not.

Which model to choose?

There are a few key considerations: availability; extensibility / support; performance.

1. Availability

In terms of availability, I think only Chai-1 and Boltz are in contention. The other models are not viable for any commercial work, and would only be worth considering if their capabilities were truly differentiated. As far as I know, they are not.

2. Extensibility and support

I think this one is maybe under-appreciated. If an open source project is truly open and gains enough mindshare, it can attract high quality bug reports, documentation, and improvements. Over time, this effect can compound. I think currently Boltz is the only model that can make this claim.

A big difference between Boltz and Chai-1 is that Boltz includes the training code and neural network architecture, whereas Chai-1 only includes inference code and uses pre-compiled models. I only realized this when I noticed the Chai-1 codebase is half the size of the Boltz codebase. Most users will not retrain or finetune the model, but the ability for others to improve the code is important.

To be clear, I am grateful to Chai for making their code and weights available for commercial purposes, and I intend to use the code, but from my perspective Boltz should be able to advance much quicker. There is maybe an analogy to Linux or Blender vs proprietary software.

3. Performance

It's quite hard to tell from the literature who has the edge in performance. You can squint at the graphs in each paper, but fundamentally all of these models are AF3-derivatives trained on the same data, so it's not surprising that performance is generally very similar.

Chai-1 and AF3 perform almost identically

Boltz and Chai-1 perform almost identically

Protenix and AF3 perform almost identically

Benchmarking performance

I decided to do my own mini-benchmark, by taking 10 recent (i.e., not in any training data) antibody-containing PDB entries and folding them using Boltz and Chai-1.

Both models took around 10 minutes per antibody fold on a single A100 (80GB for Boltz, 40GB for Chai-1). Chai-1 is a little faster, which is expected since it uses ESM embeddings instead of multiple sequence alignments (MSAs). (Note, I did not test Chai-1 in MSA mode, giving it a small disadvantage compared to Boltz.)

Tangentially, I was surprised I could not find a "pdb to fasta" tool that would output protein, nucleic acids, and ligands. Maybe we need a new file format? You can get protein and RNA/DNA sequences from the PDB entry, but that gives you the complete sequence of the protein, not the sequence actually present in the PDB file (which may or may not be what you want). Extracting ligands from PDB files is actually very painful since the necessary bond information is absent! The best code I know of to do this is a pretty buried old Pat Walters gist.
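For the ligand part, the workaround I am aware of (sketched here with an illustrative filename, residue name, and SMILES, and not necessarily exactly what that gist does) is to let RDKit reassign bond orders from a SMILES template:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Pull the HETATM records for one ligand (residue name "LIG" is illustrative)
pdb_block = "".join(
    line for line in open("complex.pdb")
    if line.startswith("HETATM") and line[17:20].strip() == "LIG"
)

template = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")     # the ligand's known SMILES (illustrative)
mol = Chem.MolFromPDBBlock(pdb_block)                      # bonds perceived, but no bond orders
mol = AllChem.AssignBondOrdersFromTemplate(template, mol)  # transfer bond orders from the template
print(Chem.MolToSmiles(mol))
```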

Most of the PDBs I tested were protein-only, one had RNA, and I skipped one glycoprotein. I evaluated performance using USalign, using either the average "local" subunit-by-subunit alignment (USalign -mm 1) or one "global" all-subunit alignment (USalign -mm 2). Both models do extremely well when judged on local subunit accuracy, but much worse on global accuracy, which is sadly quite relevant for an antibody model! It appears that these models understand well how antibodies fold, but not how they bind.

Conclusions

On my antibody benchmark, Boltz and Chai-1 perform eerily similarly, with a couple of cases where Boltz wins out. That, combined with all the data from the literature, makes the conclusion straightforward, at least for me. Boltz performs as well as or better than any of the models, has a clean, complete codebase with relatively little code, is hackable, and is by far the most open model. I am excited to see how Boltz progresses in 2025!

Technical details

I ran Boltz and Chai-1 on modal using my biomodals repo.

modal run modal_boltz.py --input-faa 8zre.fasta --run-name 8zre
modal run modal_chai1.py --input-faa 8zre.fasta --run-name 8zre

Here is a folder with all the pdb files and images shown below.

Addendum

On BlueSky, Diego del Alamo notes that Chai-1 outperformed Boltz in a head-to-head of antibody–antigen modeling.

On linkedin, Joshua Meier (co-founder of Chai Discovery) recommended running Chai-1 with msa_server turned on, to make for a fairer comparison. I reran the benchmark with Chai-1 using MSAs, and it showed improvements for 8ZRE (matching Boltz) and 9E6K (exceeding Boltz).

I think it is still fair to say that the results are very close.



| Complex | Boltz local (TM-score / RMSD) | Boltz global (TM-score / RMSD) | Chai-1 local (TM-score / RMSD) | Chai-1 global (TM-score / RMSD) |
|---|---|---|---|---|
| 9CIA: T cell receptor complex | 0.9449 / 1.5783 | 0.3928 / 6.6600 | 0.9411 / 1.3858 | 0.3980 / 7.3400 |
| 8ZRE: HBcAg-D4 Fab complex | 0.9216 / 1.4688 | 0.3468 / 6.6200 | 0.9070 / 1.4062 | 0.2856 / 6.1000 |
| 9DF0: PDCoV S RBD bound to PD41 Fab (local refinement) | 0.8733 / 1.1500 | 0.7020 / 2.7900 | 0.2957 / 2.2400 | 0.7022 / 2.7100 |
| 9CLP: Structure of ecarin from the venom of Kenyan saw-scaled viper in complex with the Fab of neutralizing antibody H11 | 0.9762 / 0.9667 | 0.6545 / 2.3700 | 0.9607 / 1.2233 | 0.6675 / 3.2100 |
| 9C45: SARS-CoV-2 S + S2L20 (local refinement of NTD and S2L20 Fab variable region) | 0.9903 / 1.3600 | 0.5288 / 4.0900 | 0.9912 / 2.7033 | 0.5141 / 4.3500 |
| 9E6K: Fully human monoclonal antibody targeting the cysteine-rich substrate-interacting region of ADAM17 on cancer cells | 0.7462 / 2.4400 | 0.7732 / 4.2200 | 0.9676 / 1.3633 | 0.8015 / 2.7300 |
| 9CMI: Cryo-EM structure of human claudin-4 complex with Clostridium perfringens enterotoxin, sFab COP-1, and Nanobody | 0.9307 / 2.1680 | 0.4448 / 5.6900 | 0.9307 / 2.3560 | 0.4464 / 4.4400 |
| 9CX3: Structure of SH3 domain of Src in complex with beta-arrestin 1 | 0.8978 / 1.3867 | 0.5045 / 2.3200 | 0.8916 / 1.2617 | 0.4487 / 2.6700 |
| 9DX6: Crystal structure of Plasmodium vivax (Palo Alto) PvAMA1 in complex with human Fab 826827 | 0.7870 / 3.1400 | 0.5551 / 5.6100 | 0.2757 / 2.3067 | 0.5861 / 4.7600 |
| 9DN4: Crystal structure of a SARS-CoV-2 20-mer RNA in complex with FAB BL3-6S97N | 0.9726 / 0.8500 | 0.9850 / 0.9300 | 0.9938 / 0.4550 | 0.9957 / 0.4900 |