Brian Naughton | Sat 07 September 2024 | biotech | biotech ai llm

Adaptyv is a newish startup that sells high-throughput protein assays. The major innovations are (a) they tell you the price (a big innovation for biotech services!) and (b) you only have to upload protein sequences, and you get results in a couple of weeks.

A typical Adaptyv workflow might look like the following:

  • Design N protein binders for a target of interest (Adaptyv has 50-100 pre-specified targets)
  • Submit your binder sequences to Adaptyv
  • Adaptyv synthesizes DNA, then protein, using your sequences
  • In ~3 weeks you get affinity measurements for each design at a cost of $149 per data-point

This is an exciting development since it decouples "design" and "evaluation" in a way that enables computation-only startups to get one more step towards a drug (or sensor, or tool). There are plenty of steps after this one, but it's still great progress!

The Adaptyv binder design competition

A couple of months ago, Adaptyv launched a binder design competition, where the goal was to design an EGFR binder. There was quite a lot of excitement about the competition on Twitter, and about 100 people ended up entering. At around the same time, Leash Bio launched a small molecule competition on Kaggle, so there was something in the air.

PAE and iPAE

For this competition, Adaptyv ranked designs based on the "PAE interaction" (iPAE) of the binder with EGFR.

PAE (Predicted Aligned Error) "indicates the expected positional error at residue x if the predicted and actual structures are aligned on residue y". iPAE is the average PAE for residues in the binder vs target. In other words, how accurate do we expect the relative positioning of binder and target to be? This is a metric that David Baker's lab seems to use quite a bit, at least for thresholding binders worth screening. It is straightforward to calculate using the PAE outputs from AlphaFold.
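For concreteness, here is a minimal sketch of that calculation, assuming the binder is the first chain and you have the full PAE matrix as a NumPy array (the reference implementation mentioned below may differ in details):

import json
import numpy as np

def calc_ipae(pae: np.ndarray, binder_len: int) -> float:
    """Mean PAE over binder<->target residue pairs (both off-diagonal blocks)."""
    binder_vs_target = pae[:binder_len, binder_len:]
    target_vs_binder = pae[binder_len:, :binder_len]
    return float(np.mean(np.concatenate([binder_vs_target.ravel(), target_vs_binder.ravel()])))

# pae = np.array(json.load(open("scores.json"))["pae"])  # key and file name vary by AlphaFold pipeline
# print(calc_ipae(pae, binder_len=53))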

Unusually, compared to, say, a Kaggle competition, in this competition there are no held-out data that your model is evaluated on. Instead, if you can calculate iPAE, you know your expected position on the leaderboard before submitting.

The original paper Adaptyv references is Improving de novo protein binder design with deep learning, and the associated github repo has an implementation of iPAE that I use (and, I assume, the one Adaptyv uses).

Confusingly, there is also a metric called "iPAE" mentioned in the paper Systematic discovery of protein interaction interfaces using AlphaFold and experimental validation. It is different, but could actually be a more appropriate metric for binders?

At the end of last month (August 2024), there was a new Baker lab paper on Ras binders that also used iPAE, in combination with a few other metrics like pLDDT.

Experiments

A week or so after the competition ended, I found some time to try a few experiments.

Throughout these experiments, I include modal commands to run the relevant software. If you clone the biomodals repo it should just work(?)

iPAE vs Kd

The consensus seems to be that <10 represents a decent iPAE, but in order for iPAE to be useful, it should correlate with some physical measurement. As a small experiment, I took 55 PDB entries from PDBbind (out of ~100 binders that were <100 aas long, had an associated Kd, and only two chains), ran AlphaFold, calculated iPAE, and correlated this with the known Kd. I don't know that I really expected iPAE to correlate strongly with Kd, and indeed the correlation is pretty weak.

PDBbind Kd vs iPAE correlation

# download the PDBbind protein-protein dataset in a more convenient format and run AlphaFold on one example
wget https://gist.githubusercontent.com/hgbrian/413dbb33bd98d75cc5ee6054a9561c54/raw -O pdbbind_pp.tsv
tail -1 pdbbind_pp.tsv
wget https://www.rcsb.org/fasta/entry/6har/display -O 6har.fasta
echo ">6HAR\nYVDYKDDDDKEFEVCSEQAETGPCRACFSRWYFDVTEGKCAPFCYGGCGGNRNNFDTEEYCMAVCGSAIPRHHHHHHAAA:IVGGYTCEENSLPYQVSLNSGSHFCGGSLISEQWVVSAAHCYKTRIQVRLGEHNIKVLEGNEQFINAAKIIRHPKYNRDTLDNDIMLIKLSSPAVINARVSTISLPTAPPAAGTECLISGWGNTLSFGADYPDELKCLDAPVLTQAECKASYPGKITNSMFCVGFLEGGKDSCQRDAGGPVVCNGQLQGVVSWGHGCAWKNRPGVYTKVYNYVDWIKDTIAANS" > 6har_m.fasta
modal run modal_alphafold.py --input-fasta 6har_m.fasta --binder-len 80
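The correlation itself is a one-liner once the per-entry iPAEs are collected; a sketch, with a hypothetical results file and column names:

import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# hypothetical TSV: one row per PDBbind entry with its measured Kd (in M) and computed iPAE
df = pd.read_csv("pdbbind_ipae.tsv", sep="\t")  # columns: pdb_id, kd, ipae
rho, pval = spearmanr(np.log10(df["kd"]), df["ipae"])
print(f"Spearman rho = {rho:.2f} (p = {pval:.3g}, n = {len(df)})")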

Greedy search

This is about the simplest approach possible.

  • Start with EGF (53 amino acids)
  • Mask every amino acid, and have ESM propose the most likely amino acid
  • Fold and calculate iPAE for the top 30 options
  • Take the best scoring iPAE and iterate

Each round takes around 5-10 minutes and costs around $4 on an A10G on modal.

# predict one masked position in EGF using esm2
echo ">EGF\nNSDSECPLSHDGYCL<mask>DGVCMYIEALDKYACNCVVGYIGERCQYRDLKWWELR" > esm_masked.fasta
modal run modal_esm2_predict_masked.py --input-fasta esm_masked.fasta
# run AlphaFold on the EGF/EGFR complex and calculate iPAE
echo ">EGF\nNSDSECPLSHDGYCLHDGVCMYIEALDKYACNCVVGYIGERCQYRDLKWWELR:LEEKKVCQGTSNKLTQLGTFEDHFLSLQRMFNNCEVVLGNLEITYVQRNYDLSFLKTIQEVAGYVLIALNTVERIPLENLQIIRGNMYYENSYALAVLSNYDANKTGLKELPMRNLQEILHGAVRFSNNPALCNVESIQWRDIVSSDFLSNMSMDFQNHLGSCQKCDPSCPNGSCWGAGEENCQKLTKIICAQQCSGRCRGKSPSDCCHNQCAAGCTGPRESDCLVCRKFRDEATCKDTCPPLMLYNPTTYQMDVNPEGKYSFGATCVKKCPRNYVVTDHGSCVRACGADSYEMEEDGVRKCKKCEGPCRKVCNGIGIGEFKDSLSINATNIKHFKNCTSISGDLHILPVAFRGDSFTHTPPLDPQELDILKTVKEITGFLLIQAWPENRTDLHAFENLEIIRGRTKQHGQFSLAVVSLNITSLGLRSLKEISDGDVIISGNKNLCYANTINWKKLFGTSGQKTKIISNRGENSCKATGQVCHALCSPEGCWGPEPRDCVSCRNVSRGRECVDKCNLLEGEPREFVENSECIQCHPECLPQAMNITCTGRGPDNCIQCAHYIDGPHCVKTCPAGVMGENNTLVWKYADAGHVCHLCHPNCTYGCTGPGLEGCPTNGPKIPSI" > egf_01.fasta
modal run modal_alphafold.py --input-fasta egf_01.fasta --binder-len 53
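The masking step is easy to script. A minimal sketch that writes one masked FASTA per position of EGF, each of which then goes through modal_esm2_predict_masked.py as above, with the top proposals folded and scored via modal_alphafold.py:

from pathlib import Path

EGF = "NSDSECPLSHDGYCLHDGVCMYIEALDKYACNCVVGYIGERCQYRDLKWWELR"
outdir = Path("masked")
outdir.mkdir(exist_ok=True)
for i in range(len(EGF)):
    masked = EGF[:i] + "<mask>" + EGF[i + 1:]
    (outdir / f"egf_mask_{i:02d}.fasta").write_text(f">EGF_mask_{i:02d}\n{masked}\n")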

One of the stipulations of the competition is that your design must be at least 10 amino acids different to any known binder, so you must run the loop above 10 or more times. Of course, there is no guarantee that there is a single amino acid change that will improve the score, so you can easily get stuck.

After 12 iterations (at a cost of around $50 in AlphaFold compute), the best score I got was 7.89, which would have been good enough to make the top 5. (I can't be sure, but I think my iPAE calculation is identical!) Still, this is really just brute-forcing EGF tweaks. I think the score was asymptoting, but there were also jumps in iPAE with certain substitutions, so who knows?

Unfortunately, though the spirit of the competition was to find novel binders, the way iPAE works means that the best scores are very likely to come from EGF-like sequences (or other sequences in AlphaFold's training set).

Adaptyv are attempting to mitigate this issue by (a) testing the top 200 and (b) taking the design process into account. It is a bit of an impossible situation, since the true wet lab evaluation happens only after the ranking step.

Bayesian optimization

Given an expensive black box like AlphaFold + iPAE, some samples, and a desire to find better samples, one appropriate method is Bayesian optimization.

Basically, this method allows you, in a principled way, to control how much "exploration" of new space is appropriate (looking for global minima) vs "exploitation" of variations on the current best solutions (optimizing local minima).

Bayesian optimization of a 1D function

The input to a Bayesian optimization is of course not amino acids, but numbers, so I thought reusing the ESM embeddings would be a decent, or at least convenient, idea here.

I tried both the Bayesian Optimization package and a numpyro Thompson sampling implementation. I saw some decent results at first (i.e., the first suggestions seemed reasonable and scored well), but I got stuck either proposing the same sequences over and over, or proposing sequences so diverged that testing them would be a waste of time. The total search space is gigantic, so testing random sequences will not help. I think probably the ESM embeddings were not helping me here, since there were a lot of near-zeros in there.

This is an interesting approach, and not too difficult to get started with, but I think it would work better with much deeper sampling of a smaller number of amino acids, or perhaps a cruder, less expensive, evaluation function.
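To make the idea concrete, here is the shape of the loop I had in mind, sketched with a scikit-learn Gaussian process and expected improvement rather than the packages above, and run on random stand-in data (in the real loop the X's would be ESM embeddings of sequences and the y's their iPAEs):

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X_scored = rng.normal(size=(20, 32))          # stand-in for embeddings of already-scored sequences
y_ipae = rng.uniform(5, 25, size=20)          # stand-in for their iPAEs (lower is better)
X_candidates = rng.normal(size=(500, 32))     # stand-in for embeddings of unscored candidates

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X_scored, y_ipae)

# expected improvement for a minimization problem
mu, sigma = gp.predict(X_candidates, return_std=True)
best = y_ipae.min()
z = (best - mu) / np.maximum(sigma, 1e-9)
ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
next_idx = int(np.argmax(ei))                 # fold this candidate next, then refit and repeat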

ProteinMPNN

ProteinMPNN (now part of the LigandMPNN package), maps structure to sequence (i.e., the inverse of AlphaFold). For example, you can input an EGF PDB file, and it will return a sequence that should produce the same fold.

I found that for this task ProteinMPNN generally produced sequences with low confidence (as reported by ProteinMPNN), and as you'd expect, these resulted in poor (high) iPAEs. Some folds are difficult for ProteinMPNN, and I think EGF falls into this category. To run ProteinMPNN, I would recommend Simon Duerr's huggingface space, since it has a friendly interface and includes an AlphaFold validation step.

ProteinMPNN running on huggingface


# download an EGF/EGFR crystal structure and try to infer a new sequence that folds to chain C (EGF)
wget https://files.rcsb.org/download/1IVO.pdb
modal run modal_ligandmpnn.py --input-pdb 1IVO.pdb --extract-chains AC --params-str '--seed 1 --checkpoint_protein_mpnn "/LigandMPNN/model_params/proteinmpnn_v_48_020.pt" --chains_to_design "C" --save_stats 1 --batch_size 5 --number_of_batches 100'

RFdiffusion

RFdiffusion was the first protein diffusion method that showed really compelling results in generating de novo binders. I would recommend ColabDesign as a convenient interface to this and other protein design tools.

The input to RFdiffusion can be a protein fold to copy, or a target protein to bind to, and the output is a PDB file with the correct backbone co-ordinates, but with every amino acid labeled as Glycine. To turn this output into a sequence, this PDB file must then be fed into ProteinMPNN or similar. Finally, that ProteinMPNN output is typically folded with AlphaFold to see if the fold matches.

Although RFdiffusion massively enriches for binders over random peptides, you still have to screen many samples to find the really strong binders. So, it's probably optimistic to think that a few RFdiffusion-derived binders will show strong binding, even if you can somehow get a decent iPAE.

In my brief tests with RFdiffusion here, I could not generate anything that looked reasonable. I think in practice, the process of using RFdiffusion successfully is quite a bit more elaborate and heuristic-driven than anything I was going to attempt.

Figure 1 from De novo design of Ras isoform selective binders, showing multiple methods for running RFdiffusion

# Run RFdiffusion on the EGF/EGFR crystal structure, and diffuse a 50-mer binder against chain A (EGFR)
modal run modal_rfdiffusion.py --contigs='A:50' --pdb="1IVO"
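Chaining the three steps (diffuse a backbone, assign a sequence, re-fold and score) looks roughly like this; the intermediate file names and the designed-chain letter are assumptions about what each modal script reads and writes:

# 1. diffuse a 50-mer binder backbone against EGFR (chain A of 1IVO)
modal run modal_rfdiffusion.py --contigs='A:50' --pdb="1IVO"
# 2. assign a sequence to the diffused (all-glycine) binder chain with LigandMPNN/ProteinMPNN
modal run modal_ligandmpnn.py --input-pdb diffused_design_0.pdb --params-str '--chains_to_design "B" --batch_size 5 --number_of_batches 10'
# 3. re-fold the proposed binder:EGFR pair with AlphaFold and compute iPAE
modal run modal_alphafold.py --input-fasta proposed_binder_egfr.fasta --binder-len 50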

Other things

A few other strategies I thought might be interesting:

  • Search FoldSeek for folds similar to EGF. The idea here is that you might find a protein in another organism that wants to bind EGFR. I did find some interesting human-parasitic nematode proteins in here, but decided they were unlikely to be EGFR binders.
  • Search NCBI for EGF-like sequences with blastp. You can find mouse, rat, chimp, etc. but nothing too interesting. The iPAEs are worse (higher) than human EGF's, as expected.
  • Search the patent literature for EGFR binders. I did find some antibody-based binders, but as expected for folds that AlphaFold cannot solve, the iPAE was poor (high).
  • Delete regions of the protein that contribute poorly to the iPAE (i.e., residues with high PAE values) to improve the average score. I really thought this would work for at least one or two amino acids, but it did not seem to. I did not do this comprehensively, but perhaps there are no truly redundant parts of this small binder?

Conclusion

All the top spots on the leaderboard went to Alex Naka, who helpfully detailed his methods in this thread. (A lot of this is similar to what I did above, including using modal!) Anthony Gitter also published an interesting thread on his attempts. I find these kinds of threads are very useful since they give a sense of the tools people are using in practice, including some I had never heard of, like pepmlm and Protrek.

Finally, I made a tree of the 200 designs that Adaptyv is screening (with iPAE <10 in green, <20 in orange, and >20 in red). All the top scoring sequences are EGF-like and cluster together. (Thanks to Andrew White for pointing me at the sequence data). We can look forward to seeing the wet lab results published in a couple of weeks.

Tree of Adaptyv binder designs
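For anyone who wants to reproduce this kind of tree, pairwise distances plus hierarchical clustering is enough; a quick sketch (designs.fasta is a hypothetical export of the submitted sequences):

from difflib import SequenceMatcher
from itertools import combinations
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

def read_fasta(path):
    seqs, name = {}, None
    for line in open(path):
        if line.startswith(">"):
            name = line[1:].strip()
            seqs[name] = ""
        else:
            seqs[name] += line.strip()
    return seqs

seqs = read_fasta("designs.fasta")
names = list(seqs)
dist = np.zeros((len(names), len(names)))
for (i, a), (j, b) in combinations(enumerate(names), 2):
    dist[i, j] = dist[j, i] = 1 - SequenceMatcher(None, seqs[a], seqs[b]).ratio()

dendrogram(linkage(squareform(dist), method="average"), labels=names)  # needs matplotlib to render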


There is an interesting project currently running at Ora Biomedical, a longevity-focused biotech, called the Million Molecule Challenge. The idea is to test as many molecules as possible in a C. elegans longevity assay to look for something better than rapamycin. Ora claims to have already found a better mTOR inhibitor in their initial screens.

Individuals can sponsor testing one of 2720 FDA-approved compounds for $100.

The Million Molecule Challenge

This seemed like a good application for the LLM code from my previous blogpost. The idea is as follows: for each of the 2720 compounds, search PubMed for abstracts around longevity and aging; assess the evidence with an LLM; give each a score from 0-10.

LLM

I decided to try Claude this time instead of GPT-4, having heard good things about Claude Haiku. Haiku is extremely inexpensive at 25c per million input tokens; by comparison, GPT-4 is $10 per million tokens. Anthropic also has the more advanced Sonnet ($3/million) and Opus ($15/million) models.

I wanted to try using Haiku, but also run enough compounds through Sonnet and Opus to validate they all gave approximately the same quality answers. In practice, I ran out of Haiku tokens too quickly (my limit was 1M tokens per day, or 25c of processing!) so I ended up using a combination of Haiku and Sonnet.

Results

For example, here I search PubMed for "Acarbose" and "longevity", have the LLM read the returned abstracts, and get a result that looks like this:

{
"Summary": "The evidence on whether the compound acarbose has a role in longevity is mixed. There are some promising results from cell-based and animal studies, but the overall evidence is not yet conclusive. More research, especially in human subjects, would be needed to establish a clear role for acarbose in longevity.",
"Compound": "Acarbose",
"Evidence": "Acarbose was identified as one of several longevity-enhancing interventions in a proteomic study of mouse tissues (Burns et al., Geroscience, 2023). Cell-based assays also found that acarbose shared clinically relevant activities with other longevity-enhancing compounds like rapamycin (Venn-Watson et al., Nutrients, 2023). However, the review paper by Elliehausen et al. (BMC Biology, 2023) suggests the relationship between acarbose and exercise for longevity is complex and requires further study.",
"Score": 7
}

To get results formatted as json, Anthropic recommends just asking for json format. In general, all models did a decent job of outputting valid json: Haiku was the worst, Sonnet a bit better, and Opus about the same as Sonnet. As usual, I had to do a bit of semi-manual munging with sed to fix parsing issues.

(As in previous experiments, the first "Summary" entry is a chance for the LLM to think aloud and can be discarded.)
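In place of sed, a small helper that grabs the outermost json object and patches the most common slips handles most of the cleanup (a sketch, not the exact munging I did):

import json
import re

def extract_json(text: str) -> dict | None:
    """Best-effort extraction of the first {...} object from an LLM response."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    snippet = text[start:end + 1]
    snippet = re.sub(r",\s*}", "}", snippet)  # trailing commas
    snippet = snippet.replace('";', '",')     # ';' where a ',' should be
    try:
        return json.loads(snippet)
    except json.JSONDecodeError:
        return None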

Comparing Haiku and Sonnet

Violin plot and histogram of differences between Sonnet and Haiku for scores > 0

Haiku and Sonnet produced similar scores across the board, with Haiku scores being a little higher. I considered doing a correction to calibrate the scores, or allowing Opus to refine results for the highest scorers, but decided the scores were close enough and there is no gold standard anyway.
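The comparison itself is just a join on compound name across the two output directories used in the script below (a sketch that assumes each output file parses cleanly):

import json
from pathlib import Path

def load_scores(outdir):
    scores = {}
    for f in Path(outdir).glob("*.txt"):
        txt = f.read_text().strip()
        if not txt:
            continue
        try:
            d = json.loads(txt)
            scores[d["Compound"]] = d["Score"]
        except (json.JSONDecodeError, KeyError):
            pass
    return scores

haiku, sonnet = load_scores("out_haiku"), load_scores("out_sonnet")
shared = [c for c in haiku.keys() & sonnet.keys() if haiku[c] > 0 or sonnet[c] > 0]
diffs = [haiku[c] - sonnet[c] for c in shared]
print(f"n = {len(diffs)}, mean(Haiku - Sonnet) = {sum(diffs) / len(diffs):+.2f}")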

Results

I put all 2720 results in a public Google Sheet. 11 compounds score 9/10, 97 score 8/10, 987 score 1-7/10, and 1625 score 0/10 — generally those with no identified literature evidence. Arguably the most interesting stuff will be in the unknown, but also it may make sense to test the drugs with the most evidence as longevity drugs first.

Compound | Score | Evidence
Ramipril | 9 | 1) A study in long-lived male mice found that the combination of Ramipril (an ACE inhibitor) and simvastatin (a statin) significantly increased mean and median lifespan by 9%, while the individual therapies were ineffective (Spindler et al., Age, 2016). 2) In a mouse model of Alport syndrome, Ramipril therapy tripled the lifespan when combined with a neutralizing antibody against IL-11, compared to untreated Alport mice (Madison et al., J Pathol, 2024). 3) Another study in Alport mice showed that Ramipril alone can delay the onset of renal failure and reduce renal fibrosis (Gross et al., Kidney Int, 2003).
Sarcosine | 9 | Glycine N-methyltransferase (GNMT), which produces sarcosine from glycine, has been shown to extend lifespan in Drosophila by regulating S-adenosylmethionine (SAM) levels (Obata and Miura, Nat Commun, 2015). Sarcosine levels decline with age in rodents and humans, while dietary restriction increases sarcosine levels (Walters et al., Cell Rep, 2018). Sarcosine has been found to induce autophagy, a cellular process associated with longevity (Walters et al., Cell Rep, 2018; Johnson and Cuellar, Ageing Res Rev, 2023).
Rilmenidine Phosphate | 9 | Rilmenidine extends lifespan and healthspan in Caenorhabditis elegans via a nischarin I1-imidazoline receptor (Bennett et al., Aging Cell, 2023)
Fulvestrant | 9 | Periodic fasting or a fasting-mimicking diet enhances the activity of the endocrine therapeutic fulvestrant by lowering circulating IGF1, insulin and leptin and inhibiting AKT-mTOR signaling in mouse models of hormone-receptor-positive breast cancer. When combined with palbociclib, adding fasting-mimicking diet cycles promotes long-lasting tumor regression and reverts acquired drug resistance (Caffa et al., Nature, 2020).
Betulinic acid | 9 | Betulinic acid increased lifespan of Drosophila via Sir2 and FoxO activation (Lee and Min, Nutrients, 2024). It mitigated oxidative stress and apoptosis, enhancing longevity in yeast (Sudharshan et al., Free Radic Res, 2022). In C. elegans, it increased lifespan and stress resistance via the insulin/IGF-1 signaling pathway (Chen et al., Front Nutr, 2022).
Trametinib (GSK1120212) | 9 | Trametinib, an allosteric inhibitor of MEK in the Ras/MAPK pathway, robustly extends lifespan and reduces age-related gut pathology in Drosophila females by decreasing Pol III activity in intestinal stem cells (Ureña et al., Proc Natl Acad Sci USA, 2024). Trametinib, along with rapamycin and lithium, acts additively to increase longevity in Drosophila by targeting components of the nutrient-sensing network (Castillo-Quan et al., Proc Natl Acad Sci USA, 2019). Trametinib reduces translation errors, and its anti-aging effects are mediated through increased translational fidelity, a mechanism shared with rapamycin and Torin1 (Martinez-Miguel et al., Cell Metab, 2021). The Ras-Erk-ETS signaling pathway is identified as a drug target for longevity, with Trametinib extending lifespan in Drosophila (Slack et al., Cell, 2015).
GDC-0941 | 9 | Ortega-Molina et al. (Cell Metab, 2015) showed that pharmacological inhibition of PI3K with GDC-0941 reduced adiposity and metabolic syndrome in obese mice and rhesus monkeys. Bharill et al. (Front Genet, 2013) demonstrated that GDC-0941 extended lifespan and increased stress tolerance in C. elegans, potentially by inhibiting the insulin-like signaling pathway.
Hydralazine HCl | 9 | Hydralazine targets cAMP-dependent protein kinase leading to SIRT1/SIRT5 activation and lifespan extension in C. elegans (Dehghan et al., Nature Communications, 2019). Hydralazine induces stress resistance and extends C. elegans lifespan by activating the NRF2/SKN-1 signaling pathway (Dehghan et al., Nature Communications, 2017). Hydralazine might produce anti-aging activity via AMPK activation and help stabilize genomic integrity to prolong life expectancy (Thanapairoje et al., Journal of Basic and Clinical Physiology and Pharmacology, 2023).
Geniposide | 9 | 1) Treatment with Gardenia jasminoides Ellis fruit extract (GFE) rich in Geniposide increased the lifespan of C. elegans by 27.1% and improved healthspan markers like pharyngeal pumping, muscle quality, and stress resistance (Choi et al., J Gerontol A Biol Sci Med Sci, 2023). 2) Treatment with 10 mM Geniposide alone increased lifespan by 18.55% and improved healthspan markers in C. elegans (Choi et al., J Gerontol A Biol Sci Med Sci, 2023). 3) Eucommia ulmoides male flower extract (EUFE) containing Geniposide, aucubin, and asperuloside increased C. elegans lifespan by 18.61% and improved healthspan indicators like pharyngeal pumping, mobility, muscle morphology, and stress resistance (Chen et al., Food Funct, 2023).
Ivacaftor (VX-770) | 9 | Eur Respir J. (2024) reported that Ivacaftor, in combination with other CFTR modulators, significantly improved lung function (FEV1) and increased the number of adult cystic fibrosis patients, indicating improved longevity. Curr Opin Pulm Med. (2021) highlighted the potential of the triple combination of Ivacaftor, tezacaftor, and elexacaftor to improve lung function, weight, and quality of life in a significant portion of the cystic fibrosis population. J Acad Nutr Diet. (2021) found that Ivacaftor improved weight and BMI in specific CFTR mutation groups.
Hesperidin | 9 | Two papers in the Journal of Biomedical Sciences (Yeh et al., 2022; Shen et al., 2024) reported that hesperidin activates the CISD2 gene, which is involved in longevity and healthspan. In naturally aged mice, hesperetin treatment late in life prolonged healthspan and lifespan, ameliorated age-related metabolic decline, and rejuvenated the transcriptome. In human keratinocytes and mouse skin, hesperetin activated CISD2 and attenuated senescence in a CISD2-dependent manner. The 2024 paper in J Biomed Sci also showed that hesperetin can rejuvenate naturally aged skin in mice.

All results scoring 9/10

There are also some baffling and wrong results. For example, the evidence for Dimethyl Fumarate, which scored 8/10 by Sonnet, clearly points in the wrong direction:

"The paper (Harrison et al., Geroscience, 2024) states: "fisetin, SG1002 (hydrogen sulfide donor), dimethyl fumarate, mycophenolic acid, and 4-phenylbutyrate do not significantly affect lifespan in either sex at the doses and schedules used." It also notes that the amounts of dimethyl fumarate in the diet averaged 35% of the target dose, which may explain the absence of lifespan effects."

Still, overall, the results are good, and the super-cheap Haiku results are indistinguishable from Sonnet. The whole project, even using Sonnet for half, cost less than $10. Some obvious improvements would be widening the scope of the PubMed search beyond search results that directly mention "longevity", or allowing the LLM to read the full text of the papers.
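The sub-$10 total is consistent with a back-of-envelope estimate (the per-call token counts here are rough assumptions; only compounds with enough PubMed text actually trigger an LLM call):

calls = 2720 - 1625                       # compounds with enough literature to be sent to the LLM
input_tokens, output_tokens = 2_500, 400  # ~10k characters of abstracts in, a short json object out
price_per_m = {"haiku": (0.25, 1.25), "sonnet": (3.00, 15.00)}  # $(input, output) per million tokens

for name, (cin, cout) in price_per_m.items():
    dollars = calls * (input_tokens * cin + output_tokens * cout) / 1e6
    print(f"all {name}: ${dollars:.2f}")  # roughly $1 all-Haiku, ~$15 all-Sonnet, so ~$8 for a 50/50 split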

I didn't decide on an intervention to sponsor yet, if any, but I love the crowd-funded aspect of this project, and I hope many more such projects spring up.

Script

The script is a lightly edited variant of the code in a previous blogpost.

import json
import re
import requests

from pathlib import Path
from textwrap import dedent
from time import sleep
from urllib.parse import quote

from tqdm.auto import tqdm

import anthropic
client = anthropic.Anthropic()
# os.environ["ANTHROPIC_API_KEY"] = "sk-xxx"
model, gpt_out_dir = "claude-3-opus-20240229", "out_opus"
model, gpt_out_dir = "claude-3-haiku-20240307", "out_haiku"
model, gpt_out_dir = "claude-3-sonnet-20240229", "out_sonnet"

SEARCH_PUBMED_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term={params}&retmax={max_abstracts}{api_key_param}"
GET_ABSTRACT_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={pmid}&retmode=text&rettype=abstract{api_key_param}"
# https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/
BIOC_PMCID_URL = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmcid}/unicode"

DEFAULT_MAX_ABSTRACTS = 10

# NCBI recommends that users post no more than three URL requests per second and limit large jobs
# Failure to comply with this policy may result in an IP address being blocked from accessing NCBI.
NCBI_API_KEY = None

def remove_ref_blank_entries_and_offsets(obj):
    """
    Recursively traverse through the JSON object (dict or list) and:
    1. Remove any node if it or any of its nested structures contains a dict with 'section_type': 'REF'.
    2. Remove any key-value pairs where the value is an empty list or empty dictionary.
    3. Remove any key-value pairs where the key is 'offset'.
    """
    if isinstance(obj, dict):
        # Check if any value of this dict is a nested dict with 'section_type': 'REF'
        if any(isinstance(v, dict) and v.get('section_type') == 'REF' for v in obj.values()):
            return None
        else:
            # No nested dict with 'section_type': 'REF', recursively process each key-value pair
            return {k: remove_ref_blank_entries_and_offsets(v) for k, v in obj.items() if k != 'offset' and
                       remove_ref_blank_entries_and_offsets(v) is not None and v != [] and v != {}}
    elif isinstance(obj, list):
        # Recursively process each item in the list
        return [remove_ref_blank_entries_and_offsets(item) for item in obj
                if remove_ref_blank_entries_and_offsets(item) is not None and item != [] and item != {}]
    else:
        # Return the item as is if it's not a dict or list
        return obj


def find_keywords(abstract):
    if "longevity" in abstract.lower():
        return -10
    elif "age" in abstract.lower() or "aging" in abstract.lower():
        return -5
    else:
        return 0


def main(compounds, role, max_abstracts=DEFAULT_MAX_ABSTRACTS, ncbi_api_key=NCBI_API_KEY):
    """Download abstracts and fulltext from pubmed, generate a prompt, query OpenAI.
    """
    Path("downloads").mkdir(exist_ok=True)
    Path(gpt_out_dir).mkdir(exist_ok=True)

    api_key_param = "&api_key={ncbi_api_key}" if ncbi_api_key is not None else ""

    for compound in tqdm(compounds, desc="Compounds", leave=True):
        abstracts = []
        fulltexts = []

        params = quote(f'({compound}) AND ("{role}")')

        pmtxt = requests.get(SEARCH_PUBMED_URL.format(params=params, max_abstracts=max_abstracts, api_key_param=api_key_param)).text
        pmids = re.findall(r"<Id>(\d+)</Id>", pmtxt)
        sleep(0.3)

        for pmid in tqdm(pmids, desc="pmids", leave=False):
            if Path(f"downloads/abstract.pmid_{pmid}.txt").exists():
                abstracts.append(open(f"downloads/abstract.pmid_{pmid}.txt").read())
                if Path(f"downloads/fulltext.pmid_{pmid}.json").exists():
                    fulltexts.append(json.load(open(f"downloads/fulltext.pmid_{pmid}.json")))
                continue

            abstract = requests.get(GET_ABSTRACT_URL.format(pmid=pmid, api_key_param=api_key_param)).text
            open(f"downloads/abstract.pmid_{pmid}.txt", 'w').write(abstract)
            abstracts.append(abstract)
            sleep(0.3)

            pmcid = re.findall(r"^PMCID:\s+(PMC\S+)$", abstract, re.MULTILINE)
            assert len(pmcid) <= 1, "there should be max one PMCID per paper"
            pmcid = pmcid[0] if len(pmcid) == 1 else None

            if pmcid is not None:
                fulltext_request = requests.get(BIOC_PMCID_URL.format(pmcid=pmcid))
                if fulltext_request.status_code == 200:
                    fulltext_json = remove_ref_blank_entries_and_offsets(fulltext_request.json())
                    json.dump(fulltext_json, open(f"downloads/fulltext.pmid_{pmid}.json", 'w'), indent=2)
                    fulltexts.append(fulltext_json)
                sleep(0.3)

        # If there are only a couple of papers then we can include the fulltext
        # 250k characters should be well below 100k tokens; 500k characters is too many
        abstracts.sort(key=find_keywords)

        abstract_len = len("\n".join(abstracts))
        fulltext_len = len("\n".join(json.dumps(j) for j in fulltexts))
        if abstract_len + fulltext_len < 250_000:
            abstracts_txt = "\n".join(abstracts) + "\n" + "\n".join(json.dumps(j) for j in fulltexts)
        else:
            abstracts_txt = "\n".join(abstracts)

        # clip data to the first few abstracts
        abstracts_txt = abstracts_txt[:10_000]

        gpt_out = f"{gpt_out_dir}/" + f"{compound}_{role}".replace(' ','_').replace('/','_').replace('\\','_') + ".txt"
        if len(abstracts_txt) > 1_000 and not Path(gpt_out).exists():
            completion = client.messages.create(
                model=model,
                max_tokens=1024,
                system = dedent("""
                You are an autoregressive language model that has been fine-tuned with instruction-tuning and RLHF.
                You carefully provide accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning. 
                If you think there might not be a correct answer, you say so. 
                Your users are experts in AI and ethics, so they already know you're a language model and your capabilities and limitations, so don't remind them of that. 
                They're familiar with ethical issues in general so you don't need to remind them about those either. 
                Don't be verbose in your answers, but do provide details and examples where it might help the explanation.

                Your users are also experts in science, and especially biology, medicine, statistics. 
                Do NOT add any details about how science or research works, tell me to ask my doctor or consult with a health professional.
                Do NOT add any details that such an expert would already know.
                """),
                messages=[
                {"role": "user", 
                "content": dedent(f"""
                Help me research whether this compound has the role of {role}: {compound}. 
                Here are the top abstracts from pubmed, and any full text that will fit in context:

                {abstracts_txt}

                Read and synthesize all the evidence from these papers.
                Prefer evidence from good journals and highly cited papers, and papers that are recent.
                If there is one standout result in a top journal like Nature, Science or Cell, focus on that.
                There should usually be one primary result and most of the evidence should depend on that.

                Include a score out of 10, where 1 would be a single paper with weak evidence,
                5 would be a mid-tier journal with a single paper with believable evidence,
                and 10 would be multiple Nature, Science or Cell papers with very strong evidence.
                A score above 7 is exceptional, implying at least the result has been replicated and is trustworthy.

                Make sure the results are valid json with fields exactly as in the examples below.
                Do NOT change the fields, change the order, or add any other fields.
                The "Compound" json entry MUST match the query compound. Here, {compound}.                
                Include AT LEAST ONE reference for each entry on evidence (unless there is no evidence), e.g., (Smith et al., Nature, 2021). ALWAYS include the journal name.
                To help you think step-by-step about your response, use the FIRST "Summary" entry to summarize the question
                and the relevant available evidence in your own words, in one hundred words or fewer.
                Do NOT output markdown, just raw json. The first character should be {{ and the last character should be }}.

                Here are some examples of json you might output:

                {{
                "Summary": "Aspirin has no evidence of {role}. The experiments are not convincing. There is some evidence for a related compound, which may indicate something."; // string
                "Compound": "Aspirin", // string
                "Evidence": "No evidence as {role}. No relevant papers found.", // string
                "Score": 0 // integer from 0-10
                }}

                {{
                "Summary": "There is some evidence for Abacavir, which may indicate something. The evidence is medium, e.g. a paper in a mid-tier journal" // string
                "Compound": "Abacavir", // string
                "Evidence": "A C. elegans screen identified Abacavir as {role} with a small effect (Jones et al., Nature Cancer, 2001)", // string
                "Score": 5 // integer from 0-10
                }}

                {{
                "Summary": "Rapamycin has been cited as {role} many times. The experiments are convincing. The evidence is strong: many papers including a recent Nature paper and a Science paper" // string
                "Compound": "Rapamycin" // string
                "Evidence": "A cell-based assay identified Rapamycin as {role} in longevity (Smith et al., Science, 2022, Thomson et al., Nature, 2019)", // string
                "Score": 10 // integer from 0-10
                }}
                """)
                }
            ]
            )
            open(gpt_out, 'w').write(completion.content[0].text)
            sleep(.1)
        elif not Path(gpt_out).exists():
            # just log null results
            open(gpt_out, 'w').write("")


if __name__ == "__main__":
    compounds = [x.strip() for x in open("compound_list_fixed.txt").readlines()]
    main(compounds = compounds, role = "longevity")
Brian Naughton | Sun 14 January 2024 | datascience | datascience ai llm

I wrote about querying research papers with LLMs last February. Since then things have progressed a lot, with significant new LLM results seemingly every week. However, I think it's fair to say there is still no turnkey way to search PubMed, load a corpus of PDFs into an LLM, query, and get good quality results. (If I've missed something great, let me know...)

There are several options:

  • The tools I used in my previous article, Paper QA and LlamaIndex, can work well if you already have a corpus of papers. Indexing papers with GPT-4 can get expensive though.

  • New RAG (Retrieval Augmented Generation) tools like RAGatouille (which uses the new ColBERT model, now built into LangChain)

  • All-in-one local LLM UIs like gpt4all

  • Web-based scientific services like elicit.org and scite.ai. I can't tell exactly which papers these tools have access to. scite.ai seems to be focused on "citation statements" (what?) and elicit.org on abstracts. I imagine it is difficult for these services to get access to all full text articles.

As of January 2024, GPT-4 still dominates all open-source models in terms of reasoning performance, but it is expensive for larger projects, so for now I am still exploring and comparing approaches. (For various reasons, I could get neither RAGatouille nor gpt4all to run on my mac!)

Long context

One really useful recent development in LLMs is long context (>100,000 tokens, or roughly 75,000 words). A few models now have long context: GPT-4 Turbo, Claude, and some open-source models like Anima. By comparison, GPT-3 launched with a context length of only 2,048 tokens — barely enough to hold a few abstracts.

If you are trying to use an LLM to read and analyze documents, you have two main choices: you can fine-tune the model with additional training on a comparatively small set of documents, or you can keep the default model and provide the document text directly in context. Perhaps surprisingly, it seems like the latter actually works fine, often better! Long context means you can directly load hundreds of pages of text into the model's memory. Intuitively, it does seem cleaner to keep the reasoning engine (LLM) separate from the domain knowledge (papers).


A recent post on twitter by @DeanCarignan

Notably, smaller contexts can work well with RAG, I assume because RAG pulls out only relevant passages, resulting in a better signal-to-noise ratio.

Text sources

For biomedical research, there are a few sources of text:

  • abstracts, available free online via PubMed and Google Scholar
  • full text papers (PDFs, HTML or JSON), some of which are available free via PubMed Central (more and more thanks to the NIH Public Access Policy)
  • patents, available free via USPTO or Google Patents
  • internet forums and discords. I have not actually looked into using this, but there is definitely a lot of high quality discourse on some discords.

BioC

BioC is a very useful format that is new to me. Essentially it's an NCBI-generated structured JSON file with annotated sections for "intro", "methods", etc.

Example BioC JSON

One unfortunate caveat is that I found not all PMIDs are available in BioC, even if the paper is available, so I have to use PMCIDs (PubMed Central publications) instead. No idea why, but I did email them about it.
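Fetching one of these is a single GET against the endpoint used in the script below; the PMCID here is just a placeholder:

import requests

BIOC_PMCID_URL = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmcid}/unicode"

pmcid = "PMC1234567"  # placeholder; replace with a real open-access PMCID
doc = requests.get(BIOC_PMCID_URL.format(pmcid=pmcid)).json()
collection = doc[0] if isinstance(doc, list) else doc  # the API wraps the collection in a list
for passage in collection["documents"][0]["passages"][:5]:
    print(passage["infons"].get("section_type"), "|", passage["text"][:80])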

Script

So here is a simple script that uses the PubMed and OpenAI APIs to rank a set of genes for a given "role" (e.g., ["MAP2K1", "F5"] and "cancer driver"). It searches PubMed, reads the abstracts to figure out if the gene has the specified role, summarizes the results in a JSON object, including a score from 0-10 that can be used for ranking. I found this script worked pretty well for me for ranking a few hundred genes — a task that would have been way too laborious for me to do — and it's a nice short script. The LLM prompt is far from optimized though, so buyer beware!

The long context model (GPT-4 Turbo) means I can simply include all the abstracts (I include the first 30) in the text of my query, and not worry about fine-tuning or RAG. The equivalent Paper QA script would have to additionally include an indexing and retrieval step, but could include the full text where available.

It feels strange to write these LLM scripts because most of the "code" is in English. The prompt engineering part took inspiration from Jeremy Howard from FastAI and this article from finedataproducts, which covers a lot of the same topics as this one, but in more depth.

Note that GPT-4 Turbo is still quite expensive, so although the results are pretty good, if you were attempting to search for all 20,000 genes in the human proteome, it would get expensive. Also, since you pay per token, the longer the context you provide, the more expensive it gets — proper RAG is probably a better idea for bigger projects!

Example results

Here are some results from running the script for the role "cancer driver". Note, the Summary field is an opportunity for the LLM to think "step-by-step". Surprisingly, the phrase "cancer driver" is not that commonly found with specific genes so some genes return no result.

Gene | Cancer type | Evidence | Summary | Score
MAT2A | Leukemia, Glioma, Gastric Cancer | MAT2A acts as an oncogenic driver by supporting methionine metabolism, crucial for MLLr leukemia (Fitzel et al. Neoplasia 2023), represents a vulnerability in H3K27M mutant glioma (Golbourn et al. Nat Cancer 2022), and protects cells from ferroptosis in gastric cancer via the MAT2A-ACSL3 pathway (Ma et al. Free Radic Biol Med 2022). Additionally, androgen-regulated alternative mRNA isoforms of MAT2A are linked with prostate cancer (Munkley et al. F1000Res 2018). | MAT2A is implicated in various cancers through its roles in methionine metabolism and impact on epigenetic regulation. The evidence is derived from several recent studies focusing on the gene’s role in promoting oncogenesis and tumor survival, with one study published in Nat Cancer revealing MAT2A as a vulnerability in H3K27M gliomas. | 7
TOP1 | MYC-driven cancers | A genome-wide CRISPR knockout screen in isogenic pairs of breast cancer cell lines reveals that TOP1 is a synthetic lethal vulnerability in MYC-driven cancers (Lin et al. Cancer Res 2023). Inhibition of TOP1 leads to accumulation of R-loops and reduced fitness of MYC-transformed tumors in vivo, and drug response to TOP1 inhibitors significantly correlates with MYC levels across several cancer cell lines and patient-derived organoids. | There is strong evidence that TOP1 has a role as a cancer driver gene in MYC-driven cancers. The experiment used a genome-wide CRISPR knockout screen to identify synthetic lethal targets for MYC, finding that TOP1 is critical for the survival of cancers with high MYC activity. | 7
MAP2K1 | Lung adenocarcinoma, head and neck squamous cancer, pilocytic astrocytoma | MAP2K1 mutations are associated with resistance to osimertinib in EGFR-mutated lung adenocarcinoma (Ito et al. 2023), and its expression is integrated into the APMHO prognostic score for head and neck squamous cancer (Zeng et al. 2023). MAP2K1 fusion has also been reported to activate the MAPK pathway in pilocytic astrocytoma (Yde et al. 2016). | MAP2K1 has been implicated as playing a role in cancer, with evidence pointing to its involvement in drug resistance, tumor progression, and potentially as a predictive marker. The evidence is substantial but not overwhelming. | 6
BIRC2 | Cervical cancer | Chr11q BFB event leading to YAP1/BIRC2/BIRC3 gene amplification is associated with earlier age of diagnosis in cervical cancer and is more common in African American women, suggesting potential for targeted therapy (Rodriguez et al., medRxiv, 2023) | BIRC2 is implicated in cervical cancer, related to BFB cycles resulting in the gene's amplification. The study from a preprint provides insight into a specific amplification event on chromosome 11, which includes the BIRC2 gene. | 4
FANCA | Potentially implicated in general carcinogenesis | FANCA was identified as one of the genes with BaP-induced mutations predicted to impact protein function in post-stasis HMEC. The mutated FANCA gene is cited as a high-confidence cancer driver gene. This was observed in human mammary epithelial cells exposed to the carcinogen BaP, which is known to produce mutations typical of human lung cancers in smokers (Severson et al., Mutat Res Genet Toxicol Environ Mutagen, 2014) | FANCA has evidence of a role as a cancer driver gene from a single study that analyzed molecular changes in human mammary epithelial cells after carcinogen exposure. This is an indirect functional association, not a direct clinical demonstration | 4
MYB | Breast cancer | No evidence from the provided study links MYB as a cancer driver in breast cancer (Ping et al., Scientific Reports, 2016) | MYB was not identified as a cancer driver gene in the provided study. The study focused on breast cancer and MYB was not listed among the significantly associated genetic abnormalities | 1

Here are some other example use-cases for this script; in my experience, pretty common types of tasks in biotech:

  • Analyze a set of genes and rank them by their importance in cancer metabolism, immune response, lupus, etc.
  • Analyze a set of compounds and rank them by their toxicity, potency, etc.
  • Analyze a set of bacterial strains and rank them by their presence in human microbiome, potential use as probiotics, etc.

With these kinds of queries, you'll likely get reasonable results, with the obvious candidates at the top (and perhaps a few surprises), comparable to what you might get from an inexperienced intern with unlimited time.



import json
import re
import requests

from pathlib import Path
from textwrap import dedent
from time import sleep
from urllib.parse import quote

from openai import OpenAI
client = OpenAI()

# Requires OpenAI API Key: https://openai.com/blog/openai-api
# os.environ["OPENAI_API_KEY"] = "sk-xxx"

SEARCH_PUBMED_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term={params}&retmax={max_abstracts}{api_key_param}"
GET_ABSTRACT_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id={pmid}&retmode=text&rettype=abstract{api_key_param}"
# https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/
BIOC_PMCID_URL = "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_json/{pmcid}/unicode"

DEFAULT_MAX_ABSTRACTS = 30

# NCBI recommends that users post no more than three URL requests per second and limit large jobs
# Failure to comply with this policy may result in an IP address being blocked from accessing NCBI.
# Add an API key if you want to download faster
NCBI_API_KEY = None

def remove_ref_blank_entries_and_offsets(obj):
    """
    Recursively traverse through the JSON object (dict or list) and:
    1. Remove any node if it or any of its nested structures contains a dict with 'section_type': 'REF'.
    2. Remove any key-value pairs where the value is an empty list or empty dictionary.
    3. Remove any key-value pairs where the key is 'offset'.
    """
    if isinstance(obj, dict):
        # Check if any value of this dict is a nested dict with 'section_type': 'REF'
        if any(isinstance(v, dict) and v.get('section_type') == 'REF' for v in obj.values()):
            return None
        else:
            # No nested dict with 'section_type': 'REF', recursively process each key-value pair
            return {k: remove_ref_blank_entries_and_offsets(v) for k, v in obj.items() if k != 'offset' and
                       remove_ref_blank_entries_and_offsets(v) is not None and v != [] and v != {}}
    elif isinstance(obj, list):
        # Recursively process each item in the list
        return [remove_ref_blank_entries_and_offsets(item) for item in obj
                if remove_ref_blank_entries_and_offsets(item) is not None and item != [] and item != {}]
    else:
        # Return the item as is if it's not a dict or list
        return obj


def main(genes, role, max_abstracts=DEFAULT_MAX_ABSTRACTS, ncbi_api_key=NCBI_API_KEY):
    """Download abstracts and fulltext from pubmed, generate a prompt, query OpenAI.
    """
    Path("downloads").mkdir(exist_ok=True)
    Path("gpt_output").mkdir(exist_ok=True)

    api_key_param = "&api_key={ncbi_api_key}" if ncbi_api_key is not None else ""

    for gene in genes:
        abstracts = []
        fulltexts = []

        params = quote(f'({gene}) AND ("{role}")')

        pmtxt = requests.get(SEARCH_PUBMED_URL.format(params=params, max_abstracts=max_abstracts, api_key_param=api_key_param)).text
        pmids = re.findall(r"<Id>(\d+)</Id>", pmtxt)
        sleep(0.3)

        for pmid in pmids:
            if Path(f"downloads/abstract.pmid_{pmid}.txt").exists():
                abstracts.append(open(f"downloads/abstract.pmid_{pmid}.txt").read())
                if Path(f"downloads/fulltext.pmid_{pmid}.json").exists():
                    fulltexts.append(json.load(open(f"downloads/fulltext.pmid_{pmid}.json")))
                continue

            abstract = requests.get(GET_ABSTRACT_URL.format(pmid=pmid, api_key_param=api_key_param)).text
            open(f"downloads/abstract.pmid_{pmid}.txt", 'w').write(abstract)
            abstracts.append(abstract)
            sleep(0.3)

            pmcid = re.findall(r"^PMCID:\s+(PMC\S+)$", abstract, re.MULTILINE)
            assert len(pmcid) <= 1, "there should be max one PMCID per paper"
            pmcid = pmcid[0] if len(pmcid) == 1 else None

            if pmcid is not None:
                fulltext_request = requests.get(BIOC_PMCID_URL.format(pmcid=pmcid))
                if fulltext_request.status_code == 200:
                    fulltext_json = remove_ref_blank_entries_and_offsets(fulltext_request.json())
                    json.dump(fulltext_json, open(f"downloads/fulltext.pmid_{pmid}.json", 'w'), indent=2)
                    fulltexts.append(fulltext_json)
                sleep(0.3)

        # If there are only a couple of papers then we can include the fulltext
        # 250k characters should be well below 100k tokens; 500k characters is too many
        abstract_len = len("\n".join(abstracts))
        fulltext_len = len("\n".join(json.dumps(j) for j in fulltexts))
        if abstract_len + fulltext_len < 250_000:
            abstracts_txt = "\n".join(abstracts) + "\n" + "\n".join(json.dumps(j) for j in fulltexts)
        else:
            abstracts_txt = "\n".join(abstracts)

        gpt_out = f"gpt_output/{gene}_{role.replace(' ','_')}.txt"
        if len(abstracts_txt) > 1_000 and not Path(gpt_out).exists():
            completion = client.chat.completions.create(
                model="gpt-4-1106-preview",
                messages=[
                {"role": "system",
                "content": dedent("""
                You are an autoregressive language model that has been fine-tuned with instruction-tuning and RLHF.
                You carefully provide accurate, factual, thoughtful, nuanced answers, and are brilliant at reasoning. 
                If you think there might not be a correct answer, you say so. 
                Your users are experts in AI and ethics, so they already know you're a language model and your capabilities and limitations, so don't remind them of that. 
                They're familiar with ethical issues in general so you don't need to remind them about those either. 
                Don't be verbose in your answers, but do provide details and examples where it might help the explanation.

                Your users are also experts in science, and especially biology, medicine, statistics. 
                Do NOT add any details about how science or research works, tell me to ask my doctor or consult with a health professional.
                Do NOT add any details that such an expert would already know.
                """)},
                {"role": "user", 
                "content": dedent(f"""
                Help me research whether this gene has the role of {role}: {gene}. 
                Here are the top abstracts from pubmed, and any full text that will fit in context:

                {abstracts_txt}

                Read and synthesize all the evidence from these papers.
                Prefer evidence from good journals and highly cited papers, and papers that are recent.
                If there is one standout result in a top journal like Nature, Science or Cell, focus on that.
                There should usually be one primary result and most of the evidence should depend on that.

                Include a score out of 10, where 1 would be a single paper with weak evidence,
                5 would be a mid-tier journal with a single paper with believable evidence,
                and 10 would be multiple Nature, Science or Cell papers with very strong evidence.
                A score above 7 is exceptional, implying at least the result has been replicated and is trustworthy.

                Make sure the results are valid json with fields exactly as in the examples below.
                Do NOT change the fields, change the order, or add any other fields.
                The "Gene" json entry MUST match the query gene. Here, {gene}.                
                Include AT LEAST ONE reference for each entry on evidence (unless there is no evidence), e.g., (Smith et al., Nature, 2021). ALWAYS include the journal name.
                To help you think step-by-step about your response, use the FIRST "Summary" entry to summarize the question
                and the relevant available evidence in your own words, in one hundred words or fewer.
                Do NOT output markdown, just raw json. The first character should be {{ and the last character should be }}.

                Here are some examples of json you might output:

                {{
                "Summary": "ABC1 has no evidence of {role}. The experiments are not convincing. There is some evidence for ABC2, which may indicate something."; // string
                "Gene": "ABC1", // string
                "Cancer type": "", // string
                "Evidence": "No evidence as {role}", // string
                "Score": 0 // integer from 0-10
                }}

                {{
                "Summary": "There is some evidence for ABC2, which may indicate something. The evidence is medium, e.g. a paper in a mid-tier journal" // string
                "Gene": "DEF2", // string
                "Cancer type": "Bladder cancer", // string
                "Evidence": "A CRISPR screen identified DEF2 as {role} in bladder cancer (Jones et al., Nature Cancer, 2001)", // string
                "Score": 5 // integer from 0-10
                }}

                {{
                "Summary": "GHI3 has been cited as {role} many times. The experiments are convincing. The evidence is strong: a recent Nature paper and a Science paper" // string
                "Gene": "GHI3" // string
                "Cancer type": "Colorectal cancer" // string
                "Evidence": "A cell-based assay using a small molecule inhibitor identified GHI3 as {role} in colorectal cancer (Smith et al., Science, 2022, Thomson et al., Nature, 2019)", // string
                "Score": 8 // integer from 0-10
                }}
                """)
                }
            ]
            )
            open(gpt_out, 'w').write(completion.choices[0].message.content)
        elif not Path(gpt_out).exists():
            # just log null results
            open(gpt_out, 'w').write("")

if __name__ == "__main__":
    main(genes = ["MAP2K1", "F5"], role = "cancer driver")
