Brian Naughton | Mon 05 May 2025 | biotech | biotech ai ip

The new class of protein AI design tools is amazing and could revolutionize many areas of science, including therapeutics, diagnostics, and biosensors. Surprisingly, one important area I haven't seen discussed much is how these tools could impact patents. I am not a lawyer, so obviously this post is just my basic understanding, and I'd be happy to hear corrections. If there is a more expert critique, I did not find it.

Patents are wordy and convoluted by design. Because a string of amino acids defines a protein, protein patents share some common elements: they often include the sequence(s) being patented and a threshold for how similar another sequence can be before it infringes. That means there is a target to hit, and AI is really good at hitting targets.

There are two major categories of protein patents: biologics (usually meaning antibodies) and enzymes.

Antibodies

According to the European Patent Office, there are two main ways to patent an antibody:

  • "functional" claims, usually meaning the antibody's associated antigen or epitope;
  • "structural" claims, usually meaning a sequence and sequence identity threshold, along with the epitope or some other support.

Over the past few years, the "functional" claim has been going away. In the US it was killed off by the 2023 Amgen v. Sanofi ruling, which essentially said you can't patent the concept of an antibody against PCSK9. That means antibodies are now almost exclusively patented based on their structure (more specifically, a sequence plus some supporting functional information like the epitope or binding affinity).

For antibody sequences, it used to be common for claims to cover any sequence 80%+ identical in the heavy or light chains. These days it seems like you have to be more specific, with claims only covering 100% identity to all 6 CDRs.

To take some real examples:

  • Zanidatamab, a HER2 bispecific approved in 2024, claims sequences with 100% sequence identity to its CDRs;
  • Epcoritamab, a CD3/CD20 bispecific approved in 2024, also claims sequences with 100% sequence identity to its CDRs;
  • Trastuzumab, the famous HER2 antibody approved in 1998 (filed in 2013), claims sequences with 85%+ sequence identity to the heavy and light chains, and does not mention CDRs at all.

The EPO says: "the slightest modification of the CDRs can affect the recognition of the target." There is a nice breakdown of the differences between the USPTO vs EPO approach to antibody patents here.

Enzymes

For enzymes, the patent landscape is more complicated, or at least more varied. Unlike antibodies, where the patents are pretty uniformly focused on the sequence that binds an epitope, enzymes can perform any number of functions. Patented enzymes include enzyme replacement therapies, industrial enzymes like detergent proteases, and molecular biology tools like CRISPR-Cas9. It is still typical for these patents to include a sequence and supporting information.

Some examples:

  • this detergent patent, granted in 2018, claims sequences with 60%+ sequence identity to the reference;
  • this proteinase patent, granted in 2022, claims sequences with 90%+ sequence identity to the reference;
  • this novel Taq polymerase patent, granted in 2025, claims sequences with 95%+ sequence identity to the reference.

Cas9

The Cas9 patents are unusually diverse: there are hundreds of them and they mostly cover the many applications of the invention rather than the sequences. Since the 2013 ruling against Myriad Genetics, sequences from naturally occurring enzymes like Cas9 cannot be patented. Engineered sequences can be patented with other supporting functional information. You cannot take one of the thousands of unique Cas9 sequences in GenBank and use that to circumvent the CRISPR-Cas9 patents.

There are hundreds of Cas9 patents covering everything anyone could think of

AI

Given that the amino acid sequence is so important in protein patents, I am surprised that it is not bigger news that AI has effectively broken the direct connection between sequence and function.

For patents where protein sequence identity is protected, it is now relatively straightforward to generate new sequences that fold to the same structure but have 50% or lower sequence identity.
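
To make the threshold concrete, here is a minimal sketch of checking a redesign against a patent-style identity claim. Position-wise identity is a fair approximation for fixed-backbone redesigns, which keep the original length; the sequences below are toy examples, not real patented sequences.

def percent_identity(a: str, b: str) -> float:
    # position-wise identity; for unequal lengths, align first (e.g., with Biopython)
    assert len(a) == len(b)
    return 100 * sum(x == y for x, y in zip(a, b)) / len(a)

wt = "MVLSPADKTNVKAAWGKVGA"        # toy fragment, not a real patented sequence
redesign = "MILSEEDKANVKATWGKVGA"
print(f"{percent_identity(wt, redesign):.0f}% identical")  # 75%, below an 80%+ claim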

For antibody patents where the CDR sequence is protected, I believe it is also relatively straightforward to introduce a mutation that does not disrupt binding. To be honest, I am not even sure AI is required here, since a mutation scan could perform the same function. Perhaps for this reason, a recent paper called for "comprehensive CDR scanning" to protect a panel of CDR sequences instead of just one.
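
A mutation scan of a CDR is trivial to enumerate. Below is a toy sketch of the single-mutant scan idea; the CDR-H3 sequence is hypothetical.

AAS = "ACDEFGHIKLMNPQRSTVWY"

def cdr_single_mutants(cdr: str):
    # yield every single-point variant of a CDR
    for i, wt in enumerate(cdr):
        for aa in AAS:
            if aa != wt:
                yield f"{wt}{i + 1}{aa}", cdr[:i] + aa + cdr[i + 1:]

cdrh3 = "ARDRGYYFDY"  # hypothetical CDR-H3
print(len(list(cdr_single_mutants(cdrh3))))  # 10 positions x 19 substitutions = 190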

ProteinMPNN, published in 2022 by the Baker lab, is the most prominent tool for producing a new sequence that folds to a known structure. ProteinMPNN is widely used as a step in many protein design workflows. For example, methods like RFdiffusion generate backbone coordinates only, and ProteinMPNN turns those into an amino acid sequence.

In a follow-up ProteinMPNN paper, the authors demonstrated that they could make a myoglobin and TEV protease with comparable or better function and greater stability than the natural versions, with sequence identities as low as 40%. This is below the sequence identity threshold in any patent I have seen.

ProteinMPNN can be used to produce a new sequence for a protein while maintaining its function

Sequence vs Structure

If this ability for AI to circumvent sequence-based patents is an issue, maybe the obvious change here would be to base patent protection on structure. This is a bit more complex than sequence identity, but one way to do this would be with TM-align or a similar tool. TM-align has >3k citations so it is arguably the standard in the field. A TM-score of above 0.8 indicates "the same topology"—in other words a very close structure. I think this would work well for many proteins, though it might need to be constrained to subdomains (akin to CDRs) in some cases.
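
As a sketch of how this could work in practice, the tmtools package wraps TM-align and makes the comparison a few lines of Python (the PDB filenames here are placeholders):

from tmtools import tm_align
from tmtools.io import get_structure, get_residue_data

chain1 = next(get_structure("patented.pdb").get_chains())    # hypothetical files
chain2 = next(get_structure("candidate.pdb").get_chains())
coords1, seq1 = get_residue_data(chain1)
coords2, seq2 = get_residue_data(chain2)

res = tm_align(coords1, coords2, seq1, seq2)
print(res.tm_norm_chain1, res.tm_norm_chain2)  # >0.8 would indicate "the same topology"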

Interestingly, the only literature I found on patenting 3D structure is from 20 years ago. Maybe this has been debated already and rejected for some reason. I suspect it was just easier to use sequence though.

OpenCRISPR-1

OpenCRISPR-1 was published in 2024 by the protein AI company Profluent. This is a de novo Cas9 enzyme that is substantially different in sequence from any known Cas9 (according to the abstract, "400 mutations away in sequence [from SpCas9]"; specifically 403/1380, or 71% identity).

Cas9 is a bilobed enzyme, with a REC lobe (nucleotide recognition) and a NUC lobe (DNA cleavage and PAM recognition.) Broadly speaking, the REC lobe is the first half of the enzyme (amino acids 50–700), and the NUC lobe is the second (1–50 and 700–1350.) These two lobes are connected by a "bridge helix".

Cartoon representation of Cas9 from addgene.

The OpenCRISPR-1 enzyme is not as novel as it might seem. In fact, I found it is actually 98% identical to a sequence constructed from three Cas9s spliced together from Streptococcus cristatus, Streptococcus pyogenes and Streptococcus sanguinis (24 amino acids are unique to OpenCRISPR-1).

This raises an interesting question, which is whether you could create a "novel" Cas9 by simply stitching together the REC lobe from one species' Cas9 and the NUC lobe from another. I believe this enzyme would work, and this sequence would meet any sequence identity threshold requirements.
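
As a toy illustration, assuming the two parents are pre-aligned to the same residue numbering (real Cas9s differ in length, and the junctions would need care):

def chimeric_cas9(nuc_parent: str, rec_parent: str, start: int = 50, end: int = 700) -> str:
    # splice the REC lobe (~residues 50-700, per the boundaries above) of one Cas9
    # into the NUC-lobe scaffold of another
    return nuc_parent[:start] + rec_parent[start:end] + nuc_parent[end:]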

The Profluent paper says the OpenCRISPR-1 enzyme was released for "research and commercial applications", but there is a big caveat here. Since CRISPR-Cas9 patents post-date the Myriad decision, almost all are functional / method of use, and naturally the most protected part is the use of Cas9 in "commercial applications" like therapeutics and diagnostics.

It is commendable that Profluent tried to broaden the availability of Cas9, so I appreciate the work behind this, but as I understand it, OpenCRISPR-1 is not really more available for commercial use than any Cas9.

There is actually another "royalty-free" Cas, a "Class 2 Type V" Cas nuclease called MAD7, released by Inscripta for commercial use in 2023. I do not know how this enzyme intersects with the many Cas9 patents.

Conclusion

One upshot of all this AI work is that me-too and biosimilar antibodies will be easier to make. That saves some time and money, but does not necessarily save on the major clinical trial costs, although the probability of success could go up a lot if the antibody is functionally identical.

While many enzyme patents will be affected, patents like CRISPR-Cas9 that rely on functional or method of use claims do not seem to be impacted as much. I don't know how many enzyme patents rely on sequence identity claims vs other claims these days. It would be interesting to (get an AI to) do a proper survey.

For internal research use, it's unclear to me that using AI to reproduce a patented protein does a whole lot, since at least in drug development, the research exemption seems to allow for the use of patented material quite broadly.

Brian Naughton | Sat 08 March 2025 | biotech | biotech ai

I have written about protein binder design a few times now (the Adaptyv competition; a follow up). Corin Wagen recently wrote a great piece about protein–ligand binding. The purpose of this post is to review how well protein binder design is working today, and to point out some interesting differences in model performance that I do not understand.

Protein design

There are two major types of protein design:

  1. Design a sequence to perform some task: e.g., produce a sequence that improves upon some property of the protein
  2. Design a structure to perform some task: e.g., produce a protein structure that binds another protein

There is spillover between these two classes, but I think it's useful to split them this way.

Sequence models

Sequence models include open-source models like the original ESM2, ProSST, SaProt, and semi-open or fully proprietary models from EvolutionaryScale (ESM3), OpenProtein (PoET-2), and Cradle Bio. The ProteinGym benchmark puts ProSST, PoET-2 and SaProt up near the top.

Many of the recent sequence-based models now also include structure information, represented as a parallel sequence, with one "structure token" per amino acid. This addition seems to improve performance quite a lot, allows sequence models to make use of the PDB, and — analogously to Vision Transformers — blurs the line between sequence and structure models.

SaProt uses a FoldSeek-derived alphabet to encode structural information
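
As a toy example of the idea, SaProt-style tokens pair each amino acid with the FoldSeek 3Di state at the same position (the 3Di string below is made up):

aa_seq = "MKTAYIA"
di_seq = "dvvlcap"  # in practice this comes from FoldSeek's 3Di alphabet
print([a + d for a, d in zip(aa_seq, di_seq)])  # ['Md', 'Kv', 'Tv', 'Al', 'Yc', 'Ia', 'Ap']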

The most basic use-case for sequence models is probably improving the stability of a protein. You can take a protein sequence, make whatever edits your model deems high likelihood, and this should produce a sequence that retains the same fold, but is more "canonical", and so may have improved stability too.
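
A minimal sketch of this with the fair-esm package: score each position with ESM2's language-model head (wildtype-marginal scoring, no masking) and print substitutions the model strongly prefers over wildtype. The sequence is a toy, and the 2x preference cutoff is arbitrary.

import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence
_, _, tokens = batch_converter([("wt", seq)])

with torch.no_grad():
    probs = model(tokens)["logits"][0].softmax(dim=-1)

for i, wt_aa in enumerate(seq, start=1):  # token 0 is BOS
    best = int(probs[i].argmax())
    best_aa = alphabet.get_tok(best)
    # suggest an edit if the model prefers another residue 2x over wildtype
    if best_aa != wt_aa and probs[i, best] > 2 * probs[i, alphabet.get_idx(wt_aa)]:
        print(f"{wt_aa}{i}{best_aa}  p={probs[i, best]:.2f}")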

An elaboration of this experiment is to find some data, e.g., thermostability for a few thousand proteins, and fine-tune the original language model to be able to predict that property. SaProtHub makes this essentially push-button.

A further elaboration is doing active learning, where you propose edits using your model, generate empirical data for these edits (e.g., binding affinity), and go back and forth, hopefully improving performance each iteration. For example, EVOLVEpro, Nabla Bio's JAM (which also uses structure), and Prescient's Lab-in-the-loop. These systems can be complex, but can also be as simple as running regressions on the output of the sequence models.
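
To show how simple the regression version can be, here is a toy active-learning loop; the embeddings and the assay are simulated stand-ins for a sequence-model embedder and wet-lab measurements.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_variants, dim = 1000, 64
X = rng.normal(size=(n_variants, dim))        # stand-in for sequence-model embeddings
true_affinity = X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=n_variants)
assay = lambda idx: true_affinity[idx]        # stand-in for a wet-lab assay

measured = list(rng.choice(n_variants, size=32, replace=False))
for round_num in range(4):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[measured], assay(measured))
    preds = model.predict(X)
    preds[measured] = -np.inf                 # never re-propose measured variants
    batch = np.argsort(preds)[-16:]           # propose the top 16 predicted binders
    measured.extend(batch)
    print(f"round {round_num}: best affinity so far = {assay(measured).max():.2f}")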

EvolvePro's learning loop

Sequence-based models are a natural fit to these kinds of problems, since you can easily edit the sequence but maintain the same fold and function. Profluent and other companies make use of this ability by producing patent-unencumbered sequences like OpenCRISPR.

This is especially enabling for the biosimilars industry. Many biologics patents protect the sequence by setting amino acid identity thresholds. For example, the Herceptin/trastuzumab patent protects any sequence >=85% identical to the heavy (SEQ ID NO: 1) or light chain (SEQ ID NO: 2).

Excerpt from the main trastuzumab patent

Patent attorneys will layer as many other protections on top of this as they can think of, but the sequence of the antibody is the primary IP. (Tangentially, it is insane how patents always enumerate examples of "numbers greater than X" one by one. Hopefully, the AIs that will soon be writing patents won't do this.)

For binder design, sequence models appear to have limits. Naively, since a sequence model does not know the positions of the atoms, you would assume binder design should be difficult unless you are aping known interaction motifs.

Diego del Alamo points out apparent limits in the performance of sequence models for antibody design

Structural models

Structural models include the original RFdiffusion and the recently released antibody variant RFantibody from the Baker lab, RSO from the ColabDesign team, BindCraft, EvoBind2, foldingdiff from Microsoft, and models from startups like Generate Biomedicines (Chroma), Chai Discovery, and Diffuse Bio. (Some of these tools are available on my biomodals repo).

Structural models are trained on both sequence data (e.g., UniRef) and structure data (PDB), but they deal in atom co-ordinates instead of amino acid strings. That difference means diffusion-style models dominate here over the discrete-token–focused transformers.

There are two major classes of structural models:

  • Diffusion models like RFdiffusion and RFantibody
  • AlphaFold2-based models like BindCraft, RSO, and EvoBind2

The success rates of RFdiffusion and RFantibody are not great. For some targets they achieve a >1% success rate (if we define success as finding a <1µM binder), but in other cases they nominate thousands of designs and find no strong binder.

An example from the RFantibody paper showing a low success rate

BindCraft and RSO are two similar methods that produce minibinders (small-ish non-antibody–based proteins) and rely on inverting AlphaFold2 to turn structure into sequence. EvoBind2 produces cyclic or linear peptides, and also relies heavily on an AlphaFold confidence metric (pLDDT) as part of its loss.

BindCraft (top) and EvoBind2 (bottom) have similar loss functions that rely on AF2's pLDDT and intermolecular contacts
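
As a cartoon of what such a composite loss looks like (these are not BindCraft's or EvoBind2's actual terms or weights, just the general shape):

import numpy as np

def toy_binder_loss(plddt: np.ndarray, pae_interaction: float, n_contacts: int) -> float:
    # lower is better; inputs would come from an AF2-style model's outputs
    return (
        -1.0 * plddt.mean() / 100.0        # reward confident folding (pLDDT is 0-100)
        + 0.5 * pae_interaction / 30.0     # penalize cross-chain predicted error
        - 0.1 * min(n_contacts, 20)        # reward interface contacts, capped
    )

print(toy_binder_loss(np.full(80, 90.0), pae_interaction=8.0, n_contacts=12))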

Even though these AF2-based models work very well, one non-obvious catch is that you cannot take a binding pose and get AlphaFold2 to evaluate it. These models can generate binders, but not discriminate binders from non-binders. In the EvoBind2 paper, they found that "No in silico metric separates true from false binders", which means the problem is a bit more complex than just "ask AF2 if it looks good".

According to the AF2Rank paper, AF2 has learned a good model of the physics of protein folding, but may not find the global minimum; the MSA's job is to help focus that search. This was surprising to me! The protein folding/binding problem is more of a search problem than I realized, which means more compute should straightforwardly improve performance by simply doing more searching. This is also evidenced by the AlphaFold 3 paper, where re-folding antibodies 1000 times led to improved prediction quality.

Excerpt from the AF2Rank paper (top), and a tweet from Sergey Ovchinnikov (bottom) explaining the primacy of sequence data in structure prediction

RFdiffusion/RFantibody vs BindCraft/EvoBind2

The main comparison I wanted to make in this post is between RFdiffusion/RFantibody vs BindCraft and EvoBind2.

These are all recently released, state-of-the-art models from top labs. However, the difference in claimed performance is pretty striking.

While the RFdiffusion and RFantibody papers caution that you may need to test hundreds or even thousands of proteins to find one good binder, the BindCraft and EvoBind2 papers appear to show very high success rates, perhaps even as high as 50%. (EvoBind2 only shows results for one ribonuclease target but BindCraft includes multiple).

Words of caution from the RFantibody github repo (top) and BindCraft's impressive results for 10 targets (bottom)

There is no true benchmark to reference here, but I think under reasonable assumptions, BindCraft (and arguably EvoBind2) achieve a >10X greater success rate than RFdiffusion or RFantibody. The Baker lab is the leading and best resourced lab in this domain, so what accounts for this large difference in performance? I can think of a few possibilities:

  • RoseTTAFold2 was not the best filter for RFantibody to use, and switching to AlphaFold3 would improve performance. This is plausible, but it is hard to believe it accounts for a 10X improvement.
  • Antibodies are just harder than minibinders or cyclic peptides. Hypervariable regions are known to be difficult to fold, since they do not have the advantage of evolutionary conservation. However, RFdiffusion also produces minibinders, so this is not a satisfactory explanation.
  • BindCraft and EvoBind2 are testing on easier targets. There is likely some truth to this. Most (but not all) examples in the BindCraft paper are for proteins with known binders; EvoBind2 is only tested against a target with a known peptide binder. However, most of RFantibody's targets also have known antibodies in PDB.
  • Diffusion currently just does not work as well as AlphaFold-based methods. AlphaFold2 (and its descendants, AF3, Boltz, Chai-1, etc.) have learned enough physics to recognize binding, and by leaning on this ability heavily, and filtering carefully, you get much better performance.

What comes next?

RFdiffusion and RFantibody are arguably the first examples of successful de novo binder design and antibody design, and for that reason are important papers. BindCraft and EvoBind2 have proven they can produce one-shot nanomolar binders under certain circumstances, which is technically extremely impressive.

However, if we could get another 10X improvement in performance, I think these tools would be used in every biotech and research lab. Some ideas for future directions:

  • More compute: One of the interesting things about BindCraft and EvoBind2 is how long they take to produce anything. In BindCraft's case, it generates a lot of candidates, but has a long list of criteria that must be met. One BindCraft run will screen hundreds or thousands of candidates and can easily cost $10+. Similarly, EvoBind2 can run for 5+ hours before producing anything, again easily costing $10+. This approach of throwing compute at the problem may be generally applicable, and may be analogous to the recently successful LLM reasoning approaches.
  • Combine diffusion and AlphaFold-based methods: I have no specific idea here, but since they are quite different approaches, maybe integrating some ideas from RFdiffusion into EvoBind2 or BindCraft could help.
  • Combine sequence models and structure models: There is already a lot of work happening here, both from the sequence side and structure side. In the simplest case, the output of a sequence model like ESM2 could be an independent contributor to the loss of a structure model. At the very least, this could help filter out structures that do not fold.
  • Neural Network Potentials: Neural Network Potentials are an exciting new tool for molecular dynamics (see Duignan, 2024 or Barnett, 2024). Achira just got funded to work on this, and has several of the pioneers of the field on board. Semi-open source models like orb-v2 from Orbital Materials are being actively developed too. The amount of compute required is prohibitive right now, but even a short trajectory could plausibly help with rank ordering binders, and would be independent of the AF2 metrics.

Tweet from Tim Duignan at Orbital Materials

Brian Naughton | Mon 30 December 2024 | ai | ai biotech proteindesign

This article is a deeper look at Adaptyv's binder design competition, and some thoughts on what we learned. If you are unfamiliar with the competition, there is background information on the Adaptyv blog and my previous article.

The data

Adaptyv did a really nice job of packaging up the data from the competition (both round 1 and round 2). They also did a comprehensive analysis of which metrics predicted successful binding in this blogpost.

The data from round 2 is more comprehensive than round 1 — it even includes Alphafolded structures — so I downloaded the round 2 csv and did some analysis.

Regressions

Unlike the Adaptyv blogpost, which does a deep dive on each metric in turn, I just wanted to see how well I could predict binding affinity (Kd) using the following features provided in the csv: pae_interaction, esm_pll, iptm, plddt, design_models (converted to one-hot), seq_len (inferred from sequence). Three of these metrics (pae_interaction, esm_pll, iptm) were used to determine each entry's rank in the competition's virtual leaderboard, which was used to prioritize entries going into the binding assay.

I also added one more feature, prodigy_kd, which I generated from the provided PDB files using prodigy. Prodigy is an old-ish tool for predicting binding affinity that identifies all the major contacts (polar–polar, charged–charged, etc.) and reports a predicted Kd (prodigy_kd).

I used the typical regression tools: Random Forest, Kaggle favorite XGBoost, SVR, linear regression, as well as just using the mean Kd as a baseline. There is not a ton of data here for cross-validation, especially if you split by submitter, which I think is fairest. If you do not split by submitter, then you can end up with very similar proteins in different folds.
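
To make the submitter split concrete, here is a minimal sketch using scikit-learn's GroupKFold; the column names are my guesses at the csv schema, not the exact ones. The full analysis lives in the script fetched below.

import numpy as np
import polars as pl
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GroupKFold

df = pl.read_csv("results.csv")  # hypothetical path and column names
X = df.select(["pae_interaction", "esm_pll", "iptm", "plddt", "seq_len"]).to_numpy()
y = np.log10(df["kd"].to_numpy())            # regress on log Kd
groups = df["submitter"].to_numpy()          # keep each submitter in a single fold

scores = []
for train, test in GroupKFold(n_splits=5).split(X, y, groups):
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[train], y[train])
    scores.append(r2_score(y[test], rf.predict(X[test])))
print(f"mean R2 across folds: {np.mean(scores):.2f}")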

# get data and script
git clone https://github.com/adaptyvbio/egfr_competition_2
cd egfr_competition_2/results
wget https://gist.githubusercontent.com/hgbrian/1262066e680fc82dcb98e60449899ff9/raw/regress_adaptyv_round_2.py
# run prodigy on all pdbs, munge into a tsv
find structure_predictions -name "*.pdb" | xargs -I{} uv run --with prodigy-prot prodigy {} > prodigy_kds.txt
(echo -e "name\tprodigy_kd"; rg "Read.+\.pdb|25.0˚C" prodigy_kds.txt | sed 's/.*\///' | sed 's/.*25.0˚C:  //' | paste - - | sed 's/\.pdb//') > prodigy_kds.tsv
# run regressions
uv run --with scikit-learn --with polars --with matplotlib --with seaborn --with pyarrow --with xgboost regress_adaptyv_round_2.py

The results are not great! There are a few ways to slice the data (including replicates or not; including similarity_check or not; including non-binders or not). There is a little signal, but I think it's fair to say nothing was strongly predictive.


Model                     R²      RMSE (log units)  Median Fold Error
Linear Regression         0.150   0.729             1.8x
Random Forest Regression  0.188   0.712             1.4x
SVM Regression            0.022   0.781             1.2x
XGBoost                   0.061   0.766             1.2x
Mean Kd only              -0.009  0.794             1.9x

XGBoost performance looks ok here but is not much more predictive than just taking the mean Kd

Surprisingly, no one feature dominates in terms of predictive power

Virtual leaderboard rank vs competition rank

If there really is no predictive power in these computational metrics, there should be no correlation between rank in the virtual leaderboard and rank in the competition. In fact, there is a weak but significant correlation (Spearman correlation ~= 0.2). However, if you constrain to the top 200 (of 400 total), there is no correlation. My interpretation is that these metrics can discriminate no-hope-of-binding from some-hope-of-binding, but not more than that.
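
The check itself is a few lines with scipy (the column names are again my guesses at the schema):

import polars as pl
from scipy.stats import spearmanr

df = pl.read_csv("results.csv")  # hypothetical path and column names
rho, p = spearmanr(df["leaderboard_rank"].to_numpy(), df["competition_rank"].to_numpy())
print(f"all 400 entries: rho={rho:.2f}, p={p:.3g}")

top = df.filter(pl.col("leaderboard_rank") <= 200)
rho_t, p_t = spearmanr(top["leaderboard_rank"].to_numpy(), top["competition_rank"].to_numpy())
print(f"top 200: rho={rho_t:.2f}, p={p_t:.3g}")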

It may be too much to ask one set of metrics to work for antibodies (poor PLL, poor PAE?), de novo binders (poor PLL), and EGF/TNFa-derived binders (natural, so excellent PLL). However, since I include design_models as a covariate, the regression models above can use different strategies for different design types, so at the very least we know there is not a trivial separation that can be made.

BindCraft's scoring heuristics

So how can BindCraft work if it's mostly using these same metrics as heuristics? I asked this on twitter and got an interesting response.

It is possible that PyRosetta's InterfaceAnalyzer is adding a lot of information. However, if that were the case, you might expect Prodigy's Kd prediction to also help, which it does not. It is also possible that, by using AlphaFold2, the structures produced by BindCraft are inherently biased towards natural binding modes; in that case, part of the binding heuristics would be implicit in the weights of the model.

What did we learn?

I learned a couple of things:

  • Some tools, specifically BindCraft, can consistently generate decent binders, at least against targets and binding pockets present in its training set (PDB). (The BindCraft paper also shows success with at least one de novo protein not present in the PDB.)
  • We do not have a way to predict if a given protein will bind a given target.

I think this is pretty interesting, and a bit counterintuitive. More evidence that we cannot predict binding comes from the Dickinson lab's Prediction Challenges, where the goal is to match the binder to the target. Apparently no approach can (yet).

The Adaptyv blogpost ends by stating that binder design has not been solved yet. This is clearly true. So what comes next?

  • We could find computational metrics that work, based on the current sequence and structure data. For example, BindCraft includes "number of unsatisfied hydrogen bonds at the interface" in its heuristics. I am skeptical that we can do a lot better with this approach. For one thing, Adaptyv has already iterated once on its ranking metrics, with negligible improvement in prediction.
  • We could get better at Molecular Dynamics, which probably contains some useful information today (at exorbitant computational cost), and could soon be much better with deep learning approaches.
  • We could develop an "AlphaFold for Kd prediction". There are certainly attempts at this, e.g., ProAffinity-GNN and the PPB-Affinity dataset to pick two recent examples, but I don't know if anything works that well. The big problem here, as with many biology problems, is a lack of data; PDBbind is not that big (currently ~2800 protein–protein affinities.)

Luckily, progress in this field is bewilderingly fast so I'm sure we'll see a ton of developments in 2025. Kudos to Adaptyv for helping push things forward.

Brian Naughton | Sat 30 November 2024 | ai | ai biotech proteindesign

Comparing Alphafold 3, Boltz and Chai-1

Brian Naughton | Sat 07 September 2024 | biotech | biotech ai llm

Some notes on the Adaptyv binder design competition


Using LLMs to search PubMed and summarize information on longevity drugs.

Brian Naughton | Mon 04 September 2023 | biotech | biotech machine learning ai

Molecular dynamics code for protein–ligand interactions


Using colab to chain computational drug design tools

Brian Naughton | Sat 25 February 2023 | biotech | biotech machine learning ai

Using GPT-3 as a knowledge-base for a biotech


Computational tools for drug development

