This article is a deeper look at Adaptyv's binder design competition, and some thoughts on what we learned. If you are unfamiliar with the competition, there is background information on the Adaptyv blog and my previous article.
The data
Adaptyv did a really nice job of packaging up the data from the competition (both round 1 and round 2). They also did a comprehensive analysis of which metrics predicted successful binding in this blogpost.
The data from round 2 is more comprehensive than round 1 — it even includes AlphaFolded structures — so I downloaded the round 2 csv and did some analysis.
Regressions
Unlike the Adaptyv blogpost, which does a deep dive on each metric in turn, I just wanted to see how well I could predict binding affinity (Kd) using the following features provided in the csv: `pae_interaction`, `esm_pll`, `iptm`, `plddt`, `design_models` (converted to one-hot), and `seq_len` (inferred from sequence).
Three of these metrics (`pae_interaction`, `esm_pll`, `iptm`) were used to determine each entry's rank in the competition's virtual leaderboard, which was used to prioritize entries going into the binding assay.
I also added one more feature, `prodigy_kd`, which I generated from the provided PDB files using Prodigy. Prodigy is an old-ish tool for predicting binding affinity that identifies all the major contacts (polar–polar, charged–charged, etc.) and reports a predicted Kd (`prodigy_kd`).
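For concreteness, here is a minimal sketch of how this feature table can be assembled with polars (the csv filename and the `name`, `sequence`, and `kd` column names are assumptions; the actual script linked below may differ):

```python
import polars as pl

# load the round 2 results and the Prodigy predictions (filenames and columns are assumptions)
df = pl.read_csv("results.csv")
prodigy = pl.read_csv("prodigy_kds.tsv", separator="\t")
df = df.join(prodigy, on="name", how="left")

# derive seq_len from the sequence and one-hot encode design_models
df = df.with_columns(pl.col("sequence").str.len_chars().alias("seq_len"))
df = df.to_dummies(columns=["design_models"])

# feature matrix and target; work in log10(Kd) so errors are in log units
feature_cols = [c for c in df.columns
                if c.startswith("design_models_")
                or c in ("pae_interaction", "esm_pll", "iptm", "plddt", "seq_len", "prodigy_kd")]
X = df.select(feature_cols).to_numpy()
y = df.select(pl.col("kd").log10()).to_numpy().ravel()
```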
I used the typical regression tools: Random Forest, Kaggle favorite XGBoost, SVR, and linear regression, as well as just using the mean Kd as a baseline.
There is not a ton of data here for cross-validation, especially if you split by submitter, which I think is fairest. If you do not split by submitter, then you can end up with very similar proteins in different folds.
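For illustration, here is a minimal sketch of that cross-validation setup, continuing from the sketch above and grouping folds by submitter so that near-identical designs stay in the same fold (the `submitter` column name is an assumption):

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.dummy import DummyRegressor
from xgboost import XGBRegressor

groups = df["submitter"].to_numpy()  # assumed column name; X, y come from the sketch above

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regression": RandomForestRegressor(random_state=0),
    "SVM Regression": SVR(),
    "XGBoost": XGBRegressor(random_state=0),
    "Mean Kd only": DummyRegressor(strategy="mean"),
}

# hold out entire submitters in each fold so near-duplicate designs cannot leak across folds
cv = GroupKFold(n_splits=5)
for name, model in models.items():
    r2 = cross_val_score(model, X, y, cv=cv, groups=groups, scoring="r2")
    print(f"{name}: mean R² = {np.mean(r2):.3f}")
```

The actual data and script can be fetched and run as follows: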
```bash
# get data and script
git clone https://github.com/adaptyvbio/egfr_competition_2
cd egfr_competition_2/results
wget https://gist.githubusercontent.com/hgbrian/1262066e680fc82dcb98e60449899ff9/raw/regress_adaptyv_round_2.py

# run prodigy on all pdbs, munge into a tsv
find structure_predictions -name "*.pdb" | xargs -I{} uv run --with prodigy-prot prodigy {} > prodigy_kds.txt
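# pair each pdb name with its predicted Kd (reported at 25.0˚C) and write a two-column tsv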
(echo -e "name\tprodigy_kd"; rg "Read.+\.pdb|25.0˚C" prodigy_kds.txt | sed 's/.*\///' | sed 's/.*25.0˚C: //' | paste - - | sed 's/\.pdb//') > prodigy_kds.tsv

# run regressions
uv run --with scikit-learn --with polars --with matplotlib --with seaborn --with pyarrow --with xgboost regress_adaptyv_round_2.py
```
The results are not great!
There are a few ways to slice the data (including replicates or not; including `similarity_check` or not; including non-binders or not).
There is a little signal, but I think it's fair to say nothing was strongly predictive.
| Model | R² | RMSE (log units) | Median Fold Error |
|---|---|---|---|
| Linear Regression | 0.150 | 0.729 | 1.8x |
| Random Forest Regression | 0.188 | 0.712 | 1.4x |
| SVM Regression | 0.022 | 0.781 | 1.2x |
| XGBoost | 0.061 | 0.766 | 1.2x |
| Mean Kd only | -0.009 | 0.794 | 1.9x |
XGBoost performance looks ok here, but it is not much more predictive than just taking the mean Kd.
Surprisingly, no single feature dominates in terms of predictive power.
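For reference, here is one way the error metrics in the table could be computed from predictions made in log10(Kd) space (a sketch with a hypothetical `summarize` helper; the fold-error definition in particular is an assumption and may differ from the actual script):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

def summarize(y_true_log10, y_pred_log10):
    """Score Kd predictions made in log10 space."""
    r2 = r2_score(y_true_log10, y_pred_log10)
    rmse = np.sqrt(mean_squared_error(y_true_log10, y_pred_log10))  # in log units
    # fold error: how many-fold off each prediction is, back in linear Kd space
    fold_errors = 10 ** np.abs(np.asarray(y_true_log10) - np.asarray(y_pred_log10))
    return r2, rmse, float(np.median(fold_errors))
```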
Virtual leaderboard rank vs competition rank
If there really is no predictive power in these computational metrics, there should be no correlation between rank in the virtual leaderboard and rank in the competition. In fact, there is a weak but significant correlation (Spearman correlation ~= 0.2). However, if you constrain to the top 200 (of 400 total), there is no correlation. My interpretation is that these metrics can discriminate no-hope-of-binding from some-hope-of-binding, but not more than that.
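A minimal sketch of that check, assuming the ranks live in columns named `virtual_rank` and `competition_rank` (both names are assumptions) on the table assembled earlier:

```python
from scipy.stats import spearmanr

virtual_rank = df["virtual_rank"].to_numpy()           # assumed column name
competition_rank = df["competition_rank"].to_numpy()   # assumed column name

# full field: weak but significant correlation
rho_all, p_all = spearmanr(virtual_rank, competition_rank)

# top 200 of the virtual leaderboard only: the correlation disappears
mask = virtual_rank <= 200
rho_top, p_top = spearmanr(virtual_rank[mask], competition_rank[mask])

print(f"all: rho={rho_all:.2f} (p={p_all:.3g}); top 200: rho={rho_top:.2f} (p={p_top:.3g})")
```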
It may be too much to ask one set of metrics to work for antibodies (poor PLL, poor PAE?), de novo binders (poor PLL), and EGF/TNFa-derived binders (natural, so excellent PLL). However, since I include `design_models` as a covariate, the regression models above can use different strategies for different design types, so at the very least we know there is not a trivial separation that can be made.
BindCraft's scoring heuristics
So how can BindCraft work if it's mostly using these same metrics as heuristics? I asked this on twitter and got an interesting response.
It is possible that PyRosetta's InterfaceAnalyzer is adding a lot of information. However, if that were the case, you might expect Prodigy's Kd prediction to also help, which it does not. It is also possible that, by using AlphaFold2, the structures produced by BindCraft are inherently biased towards natural binding modes; in that case, part of the binding heuristics would be implicit in the weights of the model.
What did we learn?
I learned a couple of things:
- Some tools, specifically BindCraft, can consistently generate decent binders, at least against targets and binding pockets present in its training set (PDB). (The BindCraft paper also shows success with at least one de novo protein not present in the PDB.)
- We do not have a way to predict if a given protein will bind a given target.
I think this is pretty interesting, and a bit counterintuitive.
More evidence that we cannot predict binding comes from the Dickinson lab's Prediction Challenges, where the goal is to match the binder to the target. Apparently no approach can (yet).
The Adaptyv blogpost ends by stating that binder design has not been solved yet. This is clearly true. So what comes next?
- We could find computational metrics that work, based on the current sequence and structure data. For example, BindCraft includes "number of unsatisfied hydrogen bonds at the interface" in its heuristics. I am skeptical that we can do a lot better with this approach. For one thing, Adaptyv has already iterated once on its ranking metrics, with negligible improvement in prediction.
- We could get better at Molecular Dynamics, which probably contains some useful information today (at exorbitant computational cost), and could soon be much better with deep learning approaches.
- We could develop an "AlphaFold for Kd prediction". There are certainly attempts at this, e.g., ProAffinity-GNN and the PPB-Affinity dataset to pick two recent examples, but I don't know if anything works that well. The big problem here, as with many biology problems, is a lack of data; PDBbind is not that big (currently ~2800 protein–protein affinities).
Luckily, progress in this field is bewilderingly fast, so I'm sure we'll see a ton of developments in 2025. Kudos to Adaptyv for helping push things forward.