Brian Naughton | Sat 13 December 2025 | ai | biotech ai

People are very excited about Anthropic's new Opus 4.5 model, and I am too. It is arguably the first coding model that can code continuously for hours without hitting a wall or entering a doom loop (continually producing the same bugs over and over).

Opus 4.5 has crossed a threshold where it has led to what appears to be a permanent change in how I work, so I wanted to write up a short article on this, with a real-world example.

For software engineers, it's obvious how coding agents help: they write code for you. For computational scientists, writing code is one step of many: you read papers, download tools and data, log the steps and parameters of the experiment, plot results and write it all up. This is where agents like Claude Code shine.

Claude Code

There are two main ways to use Opus 4.5: in the Claude chat interface, just like ChatGPT etc., or as an agent in Claude Code. The difference is that an agent is a program running on your computer: it doesn't just produce text, it can run arbitrary commands in the terminal on your behalf.

With Opus 4.5, Claude Code is good enough that it is starting to become my primary interface to the terminal, not just my primary interface to code. This is a little hard to explain, but I will show a real-life example from my own work that hopefully illustrates the point.

You can categorize the eras kind of like the levels of self-driving cars. The first era, with zero AI, ended just a few years ago, and I now feel like I am on era four. Things are progressing quickly!

  1. Manual: I write code; I run code; I review output (most of my career!)
  2. Copilot: I co-write code with AI (e.g., in an IDE like Cursor); I run code; I review output
  3. Human-in-the-loop: Claude Code (CC) writes code; I read code to check it; I run code; I review output
  4. Agent: CC writes code; CC runs code; CC and I review output
  5. Teams of agents: A team of CC agents write code, run code and review output over multiple hours; they contact me when they need input
  6. Autonomous: A team of CC agents work collaboratively and never need my input??

Adding ipSAE to af2rank

The task here is to add the ipSAE statistic to my af2rank modal app in the biomodals repo. The details don't matter too much, but ipSAE is a popular method of scoring protein–protein interactions—for example it is used in the recent Nipah protein design competition from Adaptyv—and there is a reference implementation on github.
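For the curious, the gist of ipSAE is a pTM-style transform of the PAE matrix, restricted to interchain residue pairs under a PAE cutoff. Below is a very rough sketch of the idea, not the reference implementation (that lives in the IPSAE repo and also handles pDockQ, pDockQ2, and LIS); here `pae` is an NxN PAE matrix in angstroms and `chain_ids` is a per-residue array of chain labels.

```python
import numpy as np

def ptm_func(pae, d0):
    """pTM-style transform: close to 1 when PAE is well below d0."""
    return 1.0 / (1.0 + (pae / d0) ** 2)

def calc_d0(n_res):
    """TM-score-style d0 normalization (floored at 1.0 in this sketch)."""
    return max(1.0, 1.24 * (max(n_res, 19) - 15) ** (1.0 / 3.0) - 1.8)

def ipsae_like(pae, chain_ids, chain_a="A", chain_b="B", pae_cutoff=10.0):
    """Simplified A->B score: for each residue in chain A, average the pTM-like
    values over chain-B residues whose PAE is below the cutoff (d0 set by the
    count of those residues), then report the best-scoring residue."""
    a_idx = np.where(chain_ids == chain_a)[0]
    b_idx = np.where(chain_ids == chain_b)[0]
    best = 0.0
    for i in a_idx:
        row = pae[i, b_idx]
        sel = row < pae_cutoff
        if sel.any():
            best = max(best, float(ptm_func(row[sel], calc_d0(int(sel.sum()))).mean()))
    return best
```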

Below are my two(!) prompts to Claude, abridged responses from Claude, and the final result. The full transcript is long so I have included it at the end of the post. There you can see more of the warts and debugging, and how much the model does besides just write code.

[Me] >clone the https://github.com/DunbrackLab/IPSAE repo, extract the ipsae scoring function, and add it to the af2rank modal app

[Claude clones the IPSAE github repo, finds the relevant files, reads the code, finds my modal_af2rank.py file, reads it, and updates it]

[Me] >ok test the repo to make sure it works using a test pdb file as in the docstring; then run ipsae and compare the results to make sure the calculation is correct

[Claude downloads a test pdb file, runs the modal app, learns how to run the reference IPSAE code, debugs several issues, and finally produces a working implementation, which it validates by checking the results against the reference. Finally it summarizes all it did.]

[Claude]
⏺ Now all scores match:

  Final Comparison:

  | Metric        | Our Implementation | Original IPSAE |
  |---------------|--------------------|----------------|
  | A→B ipSAE     | 0.2760             | 0.2760 ✅      |
  | B→A ipSAE     | 0.2862             | 0.2862 ✅      |
  | A-B_max ipSAE | 0.2862             | 0.2862 ✅      |
  | A→B LIS       | 0.2998             | 0.2998 ✅      |
  | B→A LIS       | 0.2206             | 0.2206 ✅      |
  | A-B_max LIS   | 0.2602             | 0.2602 ✅      |
  | pDockQ        | 0.1476             | 0.1476 ✅      |
  | A→B pDockQ2   | 0.1612             | 0.1612 ✅      |
  | B→A pDockQ2   | 0.1289             | 0.1289 ✅      |

⏺ All IPSAE scores now match exactly between our implementation and the original IPSAE script.

  Summary of changes made to modal_af2rank.py:

  1. Added IPSAE scoring functions (lines 66-260):
    - ptm_func() / ptm_func_vec() - PTM-like scoring
    - calc_d0() - TM-score d0 normalization
    - calc_ipsae_scores() - Computes ipSAE, pDockQ, pDockQ2, and LIS for all chain pairs
  2. Modified _get_score() to compute IPSAE scores for multimer models
  3. Added CLI parameters --pae-cutoff (default 10.0) and --dist-cutoff (default 8.0)
  4. Added PAE export for external comparison (af2_scores.json and *_predicted.pdb)

Full disclosure: I have asked Claude Code to do this exact task before, just in a different codebase, and that time it actually went more smoothly. I posted that on twitter, and then thought it might make for an interesting post on the blog.

This time it was not quite as slick, but this represents a more realistic example of my typical Claude Code interactions. The path was circuitous, but it got there in the end, with zero intervention on my part. The final result is functional but overwrought.

Conclusions

I don't test all the models all the time, but Opus 4.5 is the first model I've used that can keep context and make linear progress like this, and others seem to agree. I should also note I pay $200 per month for Claude Code access, but there are lower tiers. OpenAI and Gemini both have tools similar to Claude Code. There are also cheaper but workable Chinese models like Qwen. If these are not as good as Opus 4.5 today, they will be very soon.

People are often concerned with AIs writing "slop" code, meaning poor quality or unmaintainable code. Honestly I'm not totally sure I should care about that any more? If it is maintainable by Claude, that might be all that matters.

This workflow is powerful for software, but in software you probably have to care more about maintainability and consistency. Specifically, the crux may be whether a team of people is working on the same code. There are definitely examples online of software teams trying out agents, but then reverting to no AI for this reason. Even fully AI-pilled experts like Karpathy find full vibe-coding a poor fit for many kinds of code.

I would argue these agents are a better fit for scientific code because it's usually not the code that matters but the insight. If you can test ten hypotheses in a day instead of one, or if a scientist with no deep learning experience can now finetune a model themselves, the maintainability of the code, or even the language the code is written in, matters much less.

The relationship here is more like PI–student, and in general a PI would not look at code, but would ask questions to make sure it's working as intended. You can talk to Claude in the same way ("did you check this?", "did you read this paper?").

I have started doing projects from start to finish like this, never leaving Claude Code, and that includes writing up all the steps (which is tedious work) and detailing the results. It's not always faster or better, it screws up frequently, and you have to learn how to handle "context" (i.e., not overloading Claude with too much information), but it is powerful in a very compelling way. Even though it's only been a couple of weeks, this is how I now kickstart any new project and I can't see myself going back.

Claude Code transcript

>clone the https://github.com/DunbrackLab/IPSAE repo, extract the ipsae scoring function, and add it to the af2rank modal app
⏺ Bash(git clone https://github.com/DunbrackLab/IPSAE.git)
⏺ Read(modal_af2rank.py)
⏺ Bash(ls -la IPSAE/)
⏺ Read(IPSAE/ipsae.py)
⏺ Now I understand both files. The IPSAE script is a comprehensive scoring function for protein-protein interactions that calculates ipSAE, pDockQ, pDockQ2, and LIS scores. I need to extract the core scoring logic and integrate it into the af2rank modal app.
⏺ Read(IPSAE/README.md)
⏺ Now I understand the IPSAE scoring. For AF2Rank, I need to extract the key calculations:
  - ipSAE: Interface predicted scoring using PAE with cutoff
  - pDockQ: Based on pLDDT and contact counts
  - pDockQ2: Based on PAE values at interface
  - LIS: Local Interaction Score
  The key insight is that AF2Rank already calculates PAE and pLDDT, so I can add IPSAE scoring using those values. Let me create a clean implementation that works with the ColabDesign output.
⏺ Update(modal_af2rank.py)
⏺ Now I need to update the _get_score method and the run_af2rank function to compute and include IPSAE scores. Let me first look at how to extract the necessary information from the ColabDesign model:
⏺ Read(modal_af2rank.py)
⏺ I see that the model has access to all_atom_positions which includes CA (index 1) and CB (index 3) positions. I also see that score["pae"] is scaled by 31.0 from the model output. To compute IPSAE scores, I need to:
  1. Get the PAE matrix (unscaled, as angstroms)
  2. Get pLDDT per residue
  3. Get CB coordinates
  4. Get chain assignments
  The challenge is getting chain information. For multimer models, this should be stored in the model. Let me modify _get_score to compute IPSAE when using multimer:

⏺ Now I need to update the predict method to pass through pae_cutoff and dist_cutoff parameters:
⏺ Now I need to update the call to _get_score() in the predict method to pass the cutoffs:
⏺ Now I need to update the run_af2rank function to add the pae_cutoff and dist_cutoff parameters and pass them through:
⏺ Now let me also update the local entrypoint to add these parameters:
⏺ Now I need to check if the pLDDT is stored as per-residue array in the model. Let me look at how pLDDT is handled in the af2rank class - I see it converts score["plddt"] to a scalar. For IPSAE we need the per-residue array. Let me check the ColabDesign aux output more carefully:
⏺ Now let me also add IPSAE scores to the print output:
⏺ The JSON serialization may have issues with numpy arrays in the score dictionary. Let me check the results saving and ensure the ipsae scores are properly serialized
⏺ I need to add a custom JSON encoder for numpy types:
⏺ Here's a summary of the changes made to modal_af2rank.py:

  Added IPSAE scoring to AF2Rank:

  1. New scoring functions (lines 66-260):
    - ptm_func() / ptm_func_vec() - PTM-like scoring function
    - calc_d0() - TM-score d0 normalization
    - calc_ipsae_scores() - Main function computing ipSAE, pDockQ, pDockQ2, and LIS scores for all chain pairs
  2. Modified _get_score() (lines 437-510):
    - Added pae_cutoff and dist_cutoff parameters
    - For multimer models, extracts per-residue pLDDT, PAE matrix, CB coordinates, and chain assignments
    - Calls calc_ipsae_scores() and adds results to score dict
  3. Modified predict() (lines 507-628):
    - Added pae_cutoff and dist_cutoff parameters
    - Prints IPSAE scores when verbose
  4. Modified run_af2rank() and main():
    - Added pae_cutoff (default 10.0) and dist_cutoff (default 8.0) CLI parameters
    - Added JSON serialization helper for numpy types

  Usage with multimer:
  uv run modal run modal_af2rank.py --input-pdb complex.pdb --model-name "model_1_multimer_v3" --chains "A,B"

  The IPSAE scores will be computed automatically for multimer models and included in the results.json output.

> ok test the repo to make sure it works using a test pdb file as in the docstring; then run ipsae and compare the results to make sure the calculation is correct
⏺ Bash(wget -q https://files.rcsb.org/download/4KRL.pdb -O 4KRL.pdb && head -50 4KRL.pdb)
⏺ Bash(uv run modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chains "A,B" --run-name test_ipsae 2>&1) timeout: 10m 0s
⏺ It ran successfully! The IPSAE scores show 0.0 which seems suspicious - all scores are zero. Let me check the results and also run the original IPSAE script to compare:
⏺ The PAE is 3.38 which is good (low is better), but the IPSAE scores are all zeros. The problem is likely that I'm computing IPSAE from the PAE matrix incorrectly - the PAE matrix values from ColabDesign might already be scaled differently, or the chain assignments might be wrong.
⏺ Read(modal_af2rank.py)
⏺ Let me look at the _get_score method where the IPSAE calculation happens:
⏺ Read(modal_af2rank.py)
⏺ The issue is that self.model.aux["pae"] might be a scalar (the mean PAE), not the full matrix. Let me add some debug output to understand what's happening:
⏺ Bash(uv run --with modal==1.2.1 modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chai…) timeout: 10m 0s
⏺ The PAE matrix values are way too high: min=23.25, max=893.24. This is way above any reasonable PAE cutoff. The issue is that I'm multiplying by 31 but the values are already in some other scale. The
  score["pae"] in the log shows 3.379 which is the mean - so the raw matrix values must already be in angstroms or a different scale.

  Looking at the original code in _get_score:
  score["pae"] = 31.0 * score["pae"]  # This converts the scalar log value

  But the raw self.model.aux["pae"] matrix might not need the 31x scaling. Let me check:

⏺ Bash(uv run --with modal==1.2.1 modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chai…) timeout: 10m 0s
⏺ Now we get non-zero IPSAE scores:
  - A-B: ipSAE=0.2760
  - B-A: ipSAE=0.2862
  - A-B_max: ipSAE=0.2862

  The PAE matrix now has reasonable values: min=0.75, max=28.81 angstroms. Let me remove the debug output and then run the original IPSAE script to compare:
⏺ Update(modal_af2rank.py)
⏺ Now let me run the original IPSAE script. First I need to save the PAE matrix from af2rank to a JSON file that IPSAE can read. But wait - the IPSAE script expects AlphaFold output files (PAE JSON + PDB). We're running AF2Rank which uses ColabDesign internally.

  Let me check what we can do - we need to either:
  1. Export the PAE matrix and pLDDT from af2rank to files IPSAE can read
  2. Or run an actual AlphaFold prediction on 4KRL and then compare

  Actually, the simplest approach is to save the PAE matrix and pLDDT from af2rank and run the IPSAE calculation on those. Let me modify the run_af2rank function to also save the PAE and pLDDT data:
⏺ Read(modal_af2rank.py)
⏺ Let me add PAE matrix export to the run_af2rank function so we can compare with the original IPSAE script:
⏺ Update(modal_af2rank.py)
⏺ Bash(uv run --with modal==1.2.1 modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chai…) timeout: 10m 0s
⏺ Bash(ls -la out/af2rank/test_ipsae_final/)
⏺ Now let me run the original IPSAE script with the exported data:
⏺ Bash(python IPSAE/ipsae.py out/af2rank/test_ipsae_final/af2_scores.json out/af2rank/test_ipsae_final/4KRL_af2rank.pdb 10 10 2>&1) timeout: 1m 0s
⏺ There's a mismatch in the number of residues. The PDB file has 331 residues but the distance matrix has 333. This is likely because the original 4KRL.pdb has different residues than what ColabDesign outputs. Let me check:
⏺ Bash(grep "^ATOM" 4KRL.pdb | grep " CA " | wc -l && grep "^ATOM" out/af2rank/test_ipsae_final/4KRL_af2rank.pdb | grep " CA " | wc -l)
⏺ Both have 327 CA atoms which matches the PAE matrix size (327x327). Let me check what's happening in the IPSAE script:
⏺ Bash(python -c "...)
⏺ The data looks correct. The issue is that the IPSAE script is reading something else from the PDB. Let me check the original 4KRL.pdb vs the af2rank output:
⏺ Bash(grep "^ATOM" 4KRL.pdb | head -5 && echo "---" && grep "^ATOM" out/af2rank/test_ipsae_final/4KRL_af2rank.pdb | head -5)
⏺ The PDB files are identical. Let me check how IPSAE reads the PDB file - it might be counting something differently:
⏺ Bash(python -c "...)
⏺ Everything matches. The issue must be in how IPSAE is reading the file. Looking at the error more closely - it says shapes (331,) and (333,). Let me check what IPSAE is actually reading:
⏺ Bash(cd out/af2rank/test_ipsae_final && python ../../../IPSAE/ipsae.py af2_scores.json 4KRL_af2rank.pdb 10 10 2>&1 | head -50)
⏺ Let me add some debug output to the IPSAE script to understand what's happening:
⏺ Bash(python -c "...)
⏺ Bash(pwd && ls out/af2rank/)
⏺ Bash(ls out/af2rank/test_ipsae_final/)
⏺ It looks like IPSAE already ran successfully! There are output files. Let me check the IPSAE output:
⏺ Bash(cat out/af2rank/test_ipsae_final/4KRL_af2rank_10_10.txt)
⏺ Read(out/af2rank/test_ipsae_final/4KRL_af2rank_10_10.txt)
⏺ The file is empty. Let me run IPSAE fresh:
⏺ Bash(python IPSAE/ipsae.py out/af2rank/test_ipsae_final/af2_scores.json out/af2rank/test_ipsae_final/4KRL_af2rank.pdb 10 10 2>&1)
⏺ Let me look at what's happening in IPSAE around line 612. The issue is the mismatch between chains (331) and distances (333). Let me add debug output to understand:
⏺ Read(IPSAE/ipsae.py)
⏺ I see - chains is built from CA atoms (line 379, residues), but distances is built from coordinates which comes from cb_residues (line 378). The issue is that cb_residues has a different count because GLY residues add their CA to cb_residues too. Let me check the PDB more carefully:
⏺ Bash(python -c "...)
⏺ So our PDB should be fine (327 CA, 327 CB+GLY). The error mentions 331 and 333 which doesn't match our data. Let me check if there's something else going on - maybe the IPSAE script is reading a different file or the arrays are being built incorrectly for AF2 format:
⏺ Bash(python -c "...)
⏺ Now I see the problem! The PDB has 331 CA residues and 333 CB residues, but our PAE matrix is 327x327. The issue is that the original 4KRL.pdb has more residues than what ColabDesign processed. The 4KRL_af2rank.pdb file is just the original input - it wasn't updated by af2rank. Let me check if we're copying the wrong file:
⏺ Read(modal_af2rank.py)
⏺ Yes - the PDB being saved is pdb_str which is the original input, not the processed structure from ColabDesign. I need to save the predicted structure instead. Let me also save the output PDB from af2rank. But actually, for IPSAE comparison we need the original structure with the same residues as the PAE matrix.  The issue is that ColabDesign processes the input and may remove some residues. We should save the structure that ColabDesign actually used. Let me modify to save the model's output PDB:
⏺ Update(modal_af2rank.py)
⏺ Bash(uv run --with modal==1.2.1 modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chai…) timeout: 10m 0s
⏺ Bash(ls -la out/af2rank/test_ipsae_compare/)
⏺ Now let me check the predicted PDB size and run IPSAE on it:
⏺ Bash(grep "^ATOM" out/af2rank/test_ipsae_compare/4KRL_predicted.pdb | grep " CA " | wc -l)
⏺ 327 CA atoms - matches the PAE matrix. Now let me run IPSAE:
⏺ Bash(python IPSAE/ipsae.py out/af2rank/test_ipsae_compare/af2_scores.json out/af2rank/test_ipsae_compare/4KRL_predicted.pdb 10 10 2>&1)
⏺ Bash(cat out/af2rank/test_ipsae_compare/4KRL_predicted_10_10.txt)
⏺ IPSAE ran successfully. Now let me compare the results:
⏺ Comparison of Results:

  | Metric        | Our Implementation (af2rank) | Original IPSAE  |
  |---------------|------------------------------|-----------------|
  | A→B ipSAE     | 0.2760                       | 0.2760          |
  | B→A ipSAE     | 0.2862                       | 0.2862          |
  | A-B_max ipSAE | 0.2862                       | 0.2862          |
  | A→B LIS       | 0.2998                       | 0.2998          |
  | B→A LIS       | 0.2206                       | 0.2206          |
  | A-B_max LIS   | 0.2602                       | 0.2602          |
  | pDockQ        | 0.0183                       | 0.1476          |
  | pDockQ2       | 0.0074                       | 0.1612 / 0.1289 |

  The ipSAE and LIS scores match exactly! However, there's a significant difference in pDockQ and pDockQ2. Let me investigate the pDockQ calculation:
⏺ Read(modal_af2rank.py)
⏺ I see the issue. The pDockQ calculation uses dist_cutoff which we pass as a parameter (8.0 by default), but the original IPSAE uses a hardcoded 8.0 Å for pDockQ. More importantly, the pLDDT values are in
  different scales:
  - Our implementation uses ColabDesign's pLDDT which is 0-1 scale
  - IPSAE expects pLDDT in 0-100 scale

  Let me check the IPSAE code for pDockQ:
⏺ I see - IPSAE uses cb_plddt which comes from the JSON file. Let me check how it's loaded:
⏺ Read(IPSAE/ipsae.py)
⏺ The IPSAE script uses data['plddt'] directly. In our export, we multiply by 100:
  "plddt": (plddt_per_res * 100).tolist(),  # IPSAE expects 0-100 scale

  But in our internal calculation, we use the raw 0-1 pLDDT from ColabDesign. Let me fix the pDockQ calculation to use the 0-100 scale:
⏺ Update(modal_af2rank.py)
⏺ Bash(eval "$(uv run --with modal==1.2.1 modal run modal_af2rank.py --input-pdb 4KRL.pdb --model-name "model_1_multimer_v3" --chai…) timeout: 10m 0s
⏺ Now all scores match:

  Final Comparison:

  | Metric        | Our Implementation | Original IPSAE |
  |---------------|--------------------|----------------|
  | A→B ipSAE     | 0.2760             | 0.2760 ✅      |
  | B→A ipSAE     | 0.2862             | 0.2862 ✅      |
  | A-B_max ipSAE | 0.2862             | 0.2862 ✅      |
  | A→B LIS       | 0.2998             | 0.2998 ✅      |
  | B→A LIS       | 0.2206             | 0.2206 ✅      |
  | A-B_max LIS   | 0.2602             | 0.2602 ✅      |
  | pDockQ        | 0.1476             | 0.1476 ✅      |
  | A→B pDockQ2   | 0.1612             | 0.1612 ✅      |
  | B→A pDockQ2   | 0.1289             | 0.1289 ✅      |

⏺ All IPSAE scores now match exactly between our implementation and the original IPSAE script.

  Summary of changes made to modal_af2rank.py:

  1. Added IPSAE scoring functions (lines 66-260):
    - ptm_func() / ptm_func_vec() - PTM-like scoring
    - calc_d0() - TM-score d0 normalization
    - calc_ipsae_scores() - Computes ipSAE, pDockQ, pDockQ2, and LIS for all chain pairs
  2. Modified _get_score() to compute IPSAE scores for multimer models
  3. Added CLI parameters --pae-cutoff (default 10.0) and --dist-cutoff (default 8.0)
  4. Added PAE export for external comparison (af2_scores.json and *_predicted.pdb)
Brian Naughton | Mon 30 December 2024 | ai | ai biotech proteindesign

This article is a deeper look at Adaptyv's binder design competition, and some thoughts on what we learned. If you are unfamiliar with the competition, there is background information on the Adaptyv blog and my previous article.

The data

Adaptyv did a really nice job of packaging up the data from the competition (both round 1 and round 2). They also did a comprehensive analysis of which metrics predicted successful binding in this blogpost.

The data from round 2 is more comprehensive than round 1 — it even includes Alphafolded structures — so I downloaded the round 2 csv and did some analysis.

Regressions

Unlike the Adaptyv blogpost, which does a deep dive on each metric in turn, I just wanted to see how well I could predict binding affinity (Kd) using the following features provided in the csv: pae_interaction, esm_pll, iptm, plddt, design_models (converted to one-hot), seq_len (inferred from sequence). Three of these metrics (pae_interaction, esm_pll, iptm) were used to determine each entry's rank in the competition's virtual leaderboard, which was used to prioritize entries going into the binding assay.

I also added one more feature, prodigy_kd, which I generated from the PDB files provided using prodigy. Prodigy is an old-ish tool for predicting binding affinity that identifies all the major contacts (polar–polar, charged–charged, etc.) and reports a predicted Kd (prodigy_Kd).

I used the typical regression tools: Random Forest, Kaggle favorite XGBoost, SVR, linear regression, as well as just using the mean Kd as a baseline. There is not a ton of data here for cross-validation, especially if you split by submitter, which I think is fairest. If you do not split by submitter, then you can end up with very similar proteins in different folds.

# get data and script
git clone https://github.com/adaptyvbio/egfr_competition_2
cd egfr_competition_2/results
wget https://gist.githubusercontent.com/hgbrian/1262066e680fc82dcb98e60449899ff9/raw/regress_adaptyv_round_2.py
# run prodigy on all pdbs, munge into a tsv
find structure_predictions -name "*.pdb" | xargs -I{} uv run --with prodigy-prot prodigy {} > prodigy_kds.txt
(echo -e "name\tprodigy_kd"; rg "Read.+\.pdb|25.0˚C" prodigy_kds.txt | sed 's/.*\///' | sed 's/.*25.0˚C:  //' | paste - - | sed 's/\.pdb//') > prodigy_kds.tsv
# run regressions
uv run --with scikit-learn --with polars --with matplotlib --with seaborn --with pyarrow --with xgboost regress_adaptyv_round_2.py
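For reference, here is a minimal sketch of the split-by-submitter cross-validation idea described above. The filename and column names are illustrative, and the design_models one-hot encoding is omitted; the real logic is in regress_adaptyv_round_2.py.

```python
import numpy as np
import polars as pl
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GroupKFold

# Illustrative filename and column names (design_models one-hot omitted for brevity)
df = pl.read_csv("results.csv").drop_nulls(subset=["kd"])
X = df.select(["pae_interaction", "esm_pll", "iptm", "plddt", "seq_len", "prodigy_kd"]).to_numpy()
y = np.log10(df["kd"].to_numpy())    # work in log(Kd) space
groups = df["submitter"].to_numpy()  # keep each submitter's designs in a single fold

rmses = []
for train, test in GroupKFold(n_splits=5).split(X, y, groups):
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X[train], y[train])
    rmses.append(mean_squared_error(y[test], model.predict(X[test])) ** 0.5)
print(f"RMSE (log units): {np.mean(rmses):.3f}")
```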

The results are not great! There are a few ways to slice the data (including replicates or not; including similarity_check or not; including non-binders or not). There is a little signal, but I think it's fair to say nothing was strongly predictive.


| Model                    |        | RMSE (log units) | Median Fold Error |
|--------------------------|--------|------------------|-------------------|
| Linear Regression        | 0.150  | 0.729            | 1.8x              |
| Random Forest Regression | 0.188  | 0.712            | 1.4x              |
| SVM Regression           | 0.022  | 0.781            | 1.2x              |
| XGBoost                  | 0.061  | 0.766            | 1.2x              |
| Mean Kd only             | -0.009 | 0.794            | 1.9x              |

XGBoost performance looks ok here but is not much more predictive than just taking the mean Kd

Surprisingly, no one feature dominates in terms of predictive power

Virtual leaderboard rank vs competition rank

If there really is no predictive power in these computational metrics, there should be no correlation between rank in the virtual leaderboard and rank in the competition. In fact, there is a weak but significant correlation (Spearman correlation ~= 0.2). However, if you constrain to the top 200 (of 400 total), there is no correlation. My interpretation is that these metrics can discriminate no-hope-of-binding from some-hope-of-binding, but not more than that.
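The check itself is just a rank correlation over the two orderings; a sketch, again with an illustrative filename and column names:

```python
import polars as pl
from scipy.stats import spearmanr

# Illustrative: the real leaderboard and competition ranks come from the round-2 data
df = pl.read_csv("results.csv").drop_nulls(subset=["virtual_rank", "competition_rank"])
rho_all, _ = spearmanr(df["virtual_rank"].to_numpy(), df["competition_rank"].to_numpy())

top = df.filter(pl.col("virtual_rank") <= 200)
rho_top, _ = spearmanr(top["virtual_rank"].to_numpy(), top["competition_rank"].to_numpy())
print(f"all entries: rho={rho_all:.2f}; top 200 only: rho={rho_top:.2f}")
```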

It may be too much to ask one set of metrics to work for antibodies (poor PLL, poor PAE?), de novo binders (poor PLL), and EGF/TNFa-derived binders (natural, so excellent PLL). However, since I include design_models as a covariate, the regression models above can use different strategies for different design types, so at the very least we know there is not a trivial separation that can be made.

BindCraft's scoring heuristics

So how can BindCraft work if it's mostly using these same metrics as heuristics? I asked this on twitter and got an interesting response.

It is possible that PyRosetta's InterfaceAnalyzer is adding a lot of information. However, if this were the case, you might expect Prodigy's Kd prediction to also help, which it does not. It is also possible that, because it uses AlphaFold2, the structures produced by BindCraft are inherently biased towards natural binding modes; perhaps part of the binding heuristics is implicit in the weights of the model?

What did we learn?

I learned a couple of things:

  • Some tools, specifically BindCraft, can consistently generate decent binders, at least against targets and binding pockets present in its training set (PDB). (The BindCraft paper also shows success with at least one de novo protein not present in the PDB.)
  • We do not have a way to predict if a given protein will bind a given target.

I think this is pretty interesting, and a bit counterintuitive. More evidence that we cannot predict binding comes from the Dickinson lab's Prediction Challenges, where the goal is to match the binder to the target. Apparently no approach can (yet).

The Adaptyv blogpost ends by stating that binder design has not been solved yet. This is clearly true. So what comes next?

  • We could find computational metrics that work, based on the current sequence and structure data. For example, BindCraft includes "number of unsatisfied hydrogen bonds at the interface" in its heuristics. I am skeptical that we can do a lot better with this approach. For one thing, Adaptyv has already iterated once on its ranking metrics, with negligible improvement in prediction.
  • We could get better at Molecular Dynamics, which probably contains some useful information today (at exorbitant computational cost), and could soon be much better with deep learning approaches.
  • We could develop an "AlphaFold for Kd prediction". There are certainly attempts at this, e.g., ProAffinity-GNN and the PPB-Affinity dataset to pick two recent examples, but I don't know if anything works that well. The big problem here, as with many biology problems, is a lack of data; PDBbind is not that big (currently ~2800 protein–protein affinities).

Luckily, progress in this field is bewilderingly fast so I'm sure we'll see a ton of developments in 2025. Kudos to Adaptyv for helping push things forward.

Brian Naughton | Sat 30 November 2024 | ai | ai biotech proteindesign

AlphaFold 3 (AF3) came out in May 2024 and includes several major advances over AlphaFold 2 (AF2). In this post I will give a brief review of AlphaFold 3 and compare the various open and less-open AF3-inspired models that have come out over the past six months. Finally, I will show some results from folding antibody complexes.

Alphafold 3

AF3 has many new capabilities compared to AF2: it can work with small molecules, nucleic acids, ions, and modified residues. It also arguably has a more streamlined architecture than AF2 (pairformer instead of evoformer, no rotation invariance).

310.ai did a nice review and small benchmark of AlphaFold3 that is worth reading.

The AF3 paper hardly shows any data comparing AF3 to AF2, and is mainly focused on its new capabilities with non-protein components. In all cases tested, AF3 matched or exceeded the state of the art. For most regular protein folding problems, AF3 and AF2 work comparably well (more specifically, AlphaFold-Multimer (AF2-M), the AF2 revision that allows multiple protein chains), though for antibodies there is a jump in performance.

Still, despite being an excellent model, AF3 gets relatively little discussion. This is because the parameters are not available so nobody outside DeepMind/Isomorphic Labs really uses it. The open source AF2-M still dominates, especially when used via the amazing colabfold project.

Alphafold-alikes

As soon as AF3 was published, the race was on to reimplement the core ideas. The chronology so far:

| Date    | Software    | Code available?                | Parameters available?       | Lines of Python code |
|---------|-------------|--------------------------------|-----------------------------|----------------------|
| 2024-05 | AlphaFold 3 | ❌ (CC-BY-NC-SA 4.0)           | ❌ (you must request access) | 32k                  |
| 2024-08 | HelixFold3  | ❌ (CC-BY-NC-SA 4.0)           | ❌ (CC-BY-NC-SA 4.0)         | 17k                  |
| 2024-10 | Chai-1      | ❌ (Apache 2.0, inference only) | ✅ (Apache 2.0)             | 10k                  |
| 2024-11 | Protenix    | ❌ (CC-BY-NC-SA 4.0)           | ❌ (CC-BY-NC-SA 4.0)         | 36k                  |
| 2024-11 | Boltz       | ✅ (MIT)                       | ✅ (MIT)                     | 17k                  |

There are a few other models that are not yet of interest: Ligo's AF3 implementation is not finished and is perhaps not under active development; LucidRains' AF3 implementation is not finished but is still under active development.

It's been pretty incredible to see so many reimplementation attempts within the span of a few months, even if most are not usable due to license issues.

Code and parameter availability

As a scientist who works in industry, it's always annoying to try to figure out which tools are ok to use or not. It causes a lot of friction and wastes a lot of time. For example, I started using ChimeraX a while back, only to find out after sinking many hours into it that this was not allowed.

There are many definitions of "open" software. When I say open I really mean you can use it without checking with a lawyer. For example, even if you are in academia, if the license says the code is not free for commercial use, then what happens if you start a collaboration with someone in industry? What if you later want to commercialize? These are common occurrences.

In some cases (AF3, HelixFold3, Protenix, and Chai-1), they make a server available, which is nice for perfunctory testing but precludes testing anything proprietary or folding more than a few structures. If you have the code and the training set, it would cost around $100k to train one of these models (specifically, the Chai-1 and Protenix papers give numbers in this range, though that is just the final run). So in theory there is no huge blocker to retraining. In practice it does not seem to happen, perhaps due to license issues.

The specific license matters. Before today, I thought MIT was just a more open Apache 2.0, but apparently there is an advantage to Apache 2.0 around patents! My non-expert conclusion is that the Unlicense, MIT, and Apache are usable, while GPL and CC-BY-NC-SA are not.

Which model to choose?

There are a few key considerations: availability; extensibility / support; performance.

1. Availability

In terms of availability, I think only Chai-1 and Boltz are in contention. The other models are not viable for any commercial work, and would only be worth considering if their capabilities were truly differentiated. As far as I know, they are not.

2. Extensibility and support

I think this one is maybe under-appreciated. If an open source project is truly open and gains enough mindshare, it can attract high quality bug reports, documentation, and improvements. Over time, this effect can compound. I think currently Boltz is the only model that can make this claim.

A big difference between Boltz and Chai-1 is that Boltz includes the training code and neural network architecture, whereas Chai-1 only includes inference code and uses pre-compiled models. I only realized this when I noticed the Chai-1 codebase is half the size of the Boltz codebase. Most users will not retrain or finetune the model, but the ability for others to improve the code is important.

To be clear, I am grateful to Chai for making their code and weights available for commercial purposes, and I intend to use the code, but from my perspective Boltz should be able to advance much quicker. There is maybe an analogy to Linux or Blender vs proprietary software.

3. Performance

It's quite hard to tell from the literature who has the edge in performance. You can squint at the graphs in each paper, but fundamentally all of these models are AF3-derivatives trained on the same data, so it's not surprising that performance is generally very similar.

Chai-1 and AF3 perform almost identically

Boltz and Chai-1 perform almost identically

Protenix and AF3 perform almost identically

Benchmarking performance

I decided to do my own mini-benchmark, by taking 10 recent (i.e., not in any training data) antibody-containing PDB entries and folding them using Boltz and Chai-1.

Both models took around 10 minutes per antibody fold on a single A100 (80GB for Boltz, 40GB for Chai-1). Chai-1 is a little faster, which is expected since it uses ESM embeddings instead of multiple sequence alignments (MSAs). (Note, I did not test Chai-1 in MSA mode, giving it a small disadvantage compared to Boltz.)

Tangentially, I was surprised I could not find a "pdb to fasta" tool that would output protein, nucleic acids, and ligands. Maybe we need a new file format? You can get protein and RNA/DNA sequences from the PDB entry, but that gives you the complete sequence of the protein, not the sequence actually present in the PDB file (which may or may not be what you want). Extracting ligands from PDB files is actually very painful since the necessary bond information is absent! The best code I know of to do this is a pretty buried old Pat Walters gist.
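For the ligand part, the workaround I am aware of (sketched here with an illustrative filename, residue name, and SMILES, and not necessarily exactly what that gist does) is to let RDKit reassign bond orders from a SMILES template:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Pull the HETATM records for one ligand (residue name "LIG" is illustrative)
pdb_block = "".join(
    line for line in open("complex.pdb")
    if line.startswith("HETATM") and line[17:20].strip() == "LIG"
)

template = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")     # the ligand's known SMILES (illustrative)
mol = Chem.MolFromPDBBlock(pdb_block)                      # bonds perceived, but no bond orders
mol = AllChem.AssignBondOrdersFromTemplate(template, mol)  # transfer bond orders from the template
print(Chem.MolToSmiles(mol))
```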

Most of the PDBs I tested were protein-only, one had RNA, and I skipped one glycoprotein. I evaluated performance using USalign, using either the average "local" subunit-by-subunit alignment (USalign -mm 1) or one "global" all-subunit alignment (USalign -mm 2). Both models do extremely well when judged on local subunit accuracy, but much worse on global accuracy, which is sadly quite relevant for an antibody model! It appears that these models understand well how antibodies fold, but not how they bind.

Conclusions

On my antibody benchmark, Boltz and Chai-1 perform eerily similarly, with a couple of cases where Boltz wins out. That, combined with all the data from the literature, makes the conclusion straightforward, at least for me. Boltz performs as well as or better than any of the models, has a clean, complete codebase with relatively little code, is hackable, and is by far the most open model. I am excited to see how Boltz progresses in 2025!

Technical details

I ran Boltz and Chai-1 on modal using my biomodals repo.

modal run modal_boltz.py --input-faa 8zre.fasta --run-name 8zre
modal run modal_chai1.py --input-faa 8zre.fasta --run-name 8zre

Here is a folder with all the pdb files and images shown below.

Addendum

On BlueSky, Diego del Alamo notes that Chai-1 outperformed Boltz in a head-to-head of antibody–antigen modeling.

On linkedin, Joshua Meier (co-founder of Chai Discovery) recommended running Chai-1 with msa_server turned on, to make for a fairer comparison. I reran the benchmark with Chai-1 using MSAs, and it showed improvements for 8ZRE (matching Boltz) and 9E6K (exceeding Boltz).

I think it is still fair to say that the results are very close.



| Complex | Boltz local (TM-score / RMSD) | Boltz global (TM-score / RMSD) | Chai-1 local (TM-score / RMSD) | Chai-1 global (TM-score / RMSD) |
|---|---|---|---|---|
| 9CIA: T cell receptor complex | 0.9449 / 1.5783 | 0.3928 / 6.6600 | 0.9411 / 1.3858 | 0.3980 / 7.3400 |
| 8ZRE: HBcAg-D4 Fab complex | 0.9216 / 1.4688 | 0.3468 / 6.6200 | 0.9070 / 1.4062 | 0.2856 / 6.1000 |
| 9DF0: PDCoV S RBD bound to PD41 Fab (local refinement) | 0.8733 / 1.1500 | 0.7020 / 2.7900 | 0.2957 / 2.2400 | 0.7022 / 2.7100 |
| 9CLP: Structure of ecarin from the venom of Kenyan saw-scaled viper in complex with the Fab of neutralizing antibody H11 | 0.9762 / 0.9667 | 0.6545 / 2.3700 | 0.9607 / 1.2233 | 0.6675 / 3.2100 |
| 9C45: SARS-CoV-2 S + S2L20 (local refinement of NTD and S2L20 Fab variable region) | 0.9903 / 1.3600 | 0.5288 / 4.0900 | 0.9912 / 2.7033 | 0.5141 / 4.3500 |
| 9E6K: Fully human monoclonal antibody targeting the cysteine-rich substrate-interacting region of ADAM17 on cancer cells | 0.7462 / 2.4400 | 0.7732 / 4.2200 | 0.9676 / 1.3633 | 0.8015 / 2.7300 |
| 9CMI: Cryo-EM structure of human claudin-4 complex with Clostridium perfringens enterotoxin, sFab COP-1, and Nanobody | 0.9307 / 2.1680 | 0.4448 / 5.6900 | 0.9307 / 2.3560 | 0.4464 / 4.4400 |
| 9CX3: Structure of SH3 domain of Src in complex with beta-arrestin 1 | 0.8978 / 1.3867 | 0.5045 / 2.3200 | 0.8916 / 1.2617 | 0.4487 / 2.6700 |
| 9DX6: Crystal structure of Plasmodium vivax (Palo Alto) PvAMA1 in complex with human Fab 826827 | 0.7870 / 3.1400 | 0.5551 / 5.6100 | 0.2757 / 2.3067 | 0.5861 / 4.7600 |
| 9DN4: Crystal structure of a SARS-CoV-2 20-mer RNA in complex with FAB BL3-6S97N | 0.9726 / 0.8500 | 0.9850 / 0.9300 | 0.9938 / 0.4550 | 0.9957 / 0.4900 |