Using modal for bioinformatics

Brian Naughton | Sun 26 November 2023 | datascience | datascience modal workflow machine learning ai

For a long time, the dream has been to be able to test code on your laptop and transparently scale it to infinite compute on the cloud. There are many, many tools that can help you do this, but modal, a new startup, comes closer to a seamless experience than anything I've used before.

There is a really nice quickstart, and they even include a generous $30/month to help get your feet wet.

pip install modal
python3 -m modal setup
git clone https://github.com/hgbrian/biomodals
cd biomodals
modal run modal_omegafold.py --input-fasta modal_in/omegafold/insulin.fasta

OmegaFold

Here's an example of how to use modal to run OmegaFold. OmegaFold is an AlphaFold/ESMFold/ColabFold-like algorithm. It is much easier to run than AlphaFold (which needs 2TB+ of reference data!) and it performs well according to a recent benchmark by 310.ai.

import glob
from subprocess import run
from pathlib import Path
from modal import Image, Mount, Stub

FORCE_BUILD = False
MODAL_IN = "./modal_in"
MODAL_OUT = "./modal_out"

stub = Stub()

image = (Image
         .debian_slim()
         .apt_install("git")
         .pip_install("git+https://github.com/HeliXonProtein/OmegaFold.git", force_build=FORCE_BUILD)
        )

@stub.function(image=image, gpu="T4", timeout=600,
               mounts=[Mount.from_local_dir(MODAL_IN, remote_path="/in")])
def omegafold(input_fasta:str) -> list[tuple[str, str]]:
    input_fasta = Path(input_fasta)
    assert input_fasta.parent.resolve() == Path(MODAL_IN).resolve(), f"wrong input_fasta dir {input_fasta.parent}"
    assert input_fasta.suffix in (".faa", ".fasta"), f"not fasta file {input_fasta}"

    run(["mkdir", "-p", MODAL_OUT], check=True)
    run(["omegafold", "--model", "2", f"/in/{input_fasta.name}", MODAL_OUT], check=True)

    return [(pdb_file, open(pdb_file, "rb").read())
            for pdb_file in glob.glob(f"{MODAL_OUT}/**/*.pdb", recursive=True)]

@stub.local_entrypoint()
def main(input_fasta):
    outputs = omegafold.remote(input_fasta)

    for (out_file, out_content) in outputs:
        Path(out_file).parent.mkdir(parents=True, exist_ok=True)
        if out_content:
            with open(out_file, 'wb') as out:
                out.write(out_content)

Hopefully the code is relatively self-explanatory. It's just Python code with code to set up the docker image, and some decorators to tell modal how to run the code on the cloud.

There are only really two important lines: installing OmegaFold with pip:

.pip_install("git+https://github.com/HeliXonProtein/OmegaFold.git", force_build=FORCE_BUILD)

and running OmegaFold:

run(["omegafold", "--model", "2", f"/in/{input_fasta.name}", MODAL_OUT], check=True)

The rest of the code could be left unchanged and reused for many bioinformatics tools. For example, to run minimap2 I would just add:

.run_commands("git clone https://github.com/lh3/minimap2 && cd minimap2 && make")

Finally, to run the code:

modal run modal_omegafold.py --input-fasta modal_in/omegafold/insulin.fasta

Outputs

Modal has extremely nice, almost real-time logging and billing.

CPU and GPU usage for an OmegaFold run. Note how it tracks fractional CPU/GPU use.

Billing information for the same run, split into CPU, GPU, RAM.

This OmegaFold run cost me 3c and took about 3 minutes. So if I wanted to run OmegaFold on a thousand proteins I could probably do the whole thing for ~$100 in minutes. (I'm not sure since my test protein is short, and I don't know how many would run in parallel.) In theory, someone reading this article could go from never having heard of modal or OmegaFold to folding thousands proteins in well under an hour. That is much faster than any alternative I can think of.

Modal vs Docker

Why not just use Docker? Docker has been around forever and you can just make a container and run it the same way! Modal even uses Docker under the hood!

There are significant differences:

Dockerfiles are weird, have their own awful syntax, and are hard to debug;
you have to create and manage your own images;
you have to manage running your containers and collecting the output.

I have a direct equivalent to the above OmegaFold modal script that uses Docker, and it includes:

a Dockerfile (30 lines);
a Python script (80 lines, including copying files to and from buckets);
a simple bash script to build and push the image (5 lines);
and finally a script to run the Dockerfile on GCP (40 lines, specifying machine types, GPUs, RAM, etc).

Also, Docker can be slow to initialize, at least when I run my Docker containers on GCP. Modal has some impressive optimizations and cacheing (which I do not understand). I find I am rarely waiting more than a minute or two for scripts to start.

Modal vs snakemake, nextflow, etc

There is some overlap here, but in general the audience is different. Nextflow Tower may be a sort-of competitor, I have not tried to use it.

The advantages of these workflow systems over modal / Python scripts are mainly:

you can separate your process into many atomic steps;
you can parameterize different runs and keep track of parameters;
you can cache steps and restart after a failure (a major advantage of e.g., redun).

However, for many common tasks like protein folding (OmegaFold, ColabFold), genome assembly (SPAdes), read mapping (minimap2, bwa), there is one major step — executing a binary — and it's unclear if you need to package the process in a workflow.

Pricing

Modal is priced per second so you don't pay for more than you use.

Modal runs (transparently) on AWS, GCP, and Oracle. Another blogpost claims that Modal adds around a 12% margin. However, since Modal charges per cycle, I think it's possible you could end up saving quite a bit of money? If you run your own Docker image on a VM, you may end up paying a lot for idle time; for example, if your GPU is utilized for a fraction of the time (as with AlphaFold). It's very unclear to me how modal makes this work (or if I am misunderstanding something), but it's really nice to not have to worry about maximizing utilization.

One challenge with modal — and all cloud compute — for bioinformatics is having to push large files around (e.g., TB of sequencing reads). If you want to do that efficiently you may have to look into the modal's Volumes feature so larger files only get uploaded once.

Conclusion

Not too long ago, I felt I had to do almost everything on a linux server, since my laptop did not have enough RAM, or any kind of GPU. Now with the M-series MacBook Pros, you get a GPU and as much RAM as high end server (128GB!) I still need access to hundreds of cores and powerful GPUs, but only for defined jobs.

I am pretty excited about modal's vision of "serverless" compute. I think it's a big step forward in how to manage compute. Their many examples are illustrative. Never having to think about VMs or even having to choose a cloud is a big deal, especially these days when GPUs are so hard to find (rumor has it Oracle has spare A100's, etc!) Although it's an early startup, there is almost no lock-in with modal since everything is just decorated Python.

I made a basic biomodals repo and added some examples.

Comment

A simple way to simulate protein–ligand interactions (that does not work)

Brian Naughton | Mon 04 September 2023 | biotech | biotech machine learning ai

Molecular dynamics (MD) means simulating the forces acting on atoms. In drug discovery, MD usually means simulating protein–ligand interactions. This is clearly a crucial step in modern drug discovery, yet MD remains a pretty arcane corner of computational science.

This is a different problem to docking, where molecules are for the most part treated as rigid, and the problem is finding the right ligand orientation and the right pocket. Since in MD simulations the atoms can move, there are many more degrees of freedom, and so a lot more computation is required. For a great primer on this topic, see Molecular dynamics simulation for all (Hollingsworth, 2018).

What about deep learning?

Quantum chemical calculations, though accurate, are too computationally expensive to use for MD simulation. Instead, "force fields" are used, which enable computationally efficient calculation of the major forces. As universal function approximators, deep nets are potentially a good way to get closer to ground truth.

Analogously, fluid mechanics calculations are very computationally expensive, but deep nets appear to do a good job of approximating these expensive functions.

A deep net approximating Navier-Stokes

Recently, the SPICE dataset (Eastman, 2023) was published, which is a reference dataset that can be used to train deep nets for MD.

We describe the SPICE dataset, a new quantum chemistry dataset for training potentials relevant to simulating drug-like small molecules interacting with proteins. It contains over 1.1 million conformations for a diverse set of small molecules, dimers, dipeptides, and solvated amino acids.

This dataset has enabled new ML force fields like Espaloma (Takaba, 2023) and the recent Allegro paper (Musaelian, 2023), where they simulated a 44 million atom system of a HIV capsid. Interestingly, they scaled their system as high as 5120 A100's (which would cost $10k an hour to run!)

There are also hybrid ML/MM approaches (Rufa, 2020) based on the ANI2x ML force field (Devereux, 2020).

All of this work is very recent, and as I understand it, runs too slowly to replace regular force fields any time soon. Despite MD being a key step in drug development, only a small number of labs (e.g., Chodera lab) appear to work on OpenMM, OpenFF, and the other core technologies here.

Doing an MD simulation

I have only a couple of use-cases in mind:

does this ligand bind this protein in a human cell?
does this mutation affect ligand binding in a human cell?

Doing these MD simulations is tricky since a lot of background is expected of the user. There are many parameter choices to be made, and sensible options are not obvious. For example, you may need to choose force fields, ion concentrations, temperature, timesteps, and more.

By comparison, with AlphaFold you don't need to know how many recycles to run, or specify how the relaxation step works. You can just paste in a sequence and get a structure. As far as I can tell, there is no equivalent "default" for MD simulations.

A lot of MD tutorials I have found are geared towards simulating the trajectory of a system for inspection. However, with no specific numerical output, I don't know what to do with these results.

Choosing an approach

There are several MD tools out there for doing protein–ligand simulations, and calculating binding affinities:

Schrodinger is the main player in computational drug discovery, and a mature suite of tools. It's not really suitable for me, since it's expensive, geared toward chemists, designed for interactive use over scripting, and not even necessarily cutting-edge.
OpenEye also appears to be used a lot, and has close ties to open-source. Like Schrodinger, the tools are high quality, mostly interactive and designed for chemists.
HTMD from Acellera is not open-source, but it has a nice quickstart and tutorials.
GROMACS is open-source, actively maintained, and has tutorials, but is still a bit overwhelming with a lot of boilerplate.
Amber, like GROMACS, has been around for decades. It gets plenty of use (e.g., AlphaFold uses it as a final "relaxation" step), but is not especially user-friendly.
OpenMM seems to be where most of the open-source effort has been over the past five years or so, and is the de facto interface for a lot of the recent ML work in MD (e.g., Espaloma). A lot of tools are built on top or OpenMM:
- yank is a tool for free energy binding calculations. Simulations are parameterized by a yaml file.
- perses is also used for free energy calculation. It is pre-alpha software but under active development — e.g., this recent paper on protein–protein interaction. (Note, I will not claim to understand the differences between yank and perses!)
- SEEKR2 is a tool that enables free energy calculation, among other things.
- Making it rain is a suite of tools and colabs. It is a very well organized repo that guides you through running simulations on the cloud. For example, they include a friendly colab to run protein–ligand simulations. The authors did a great job and I'd recommend this repo broadly.
- BAT, the Binding Affinity Tool, calculates binding affinity using MD (also see the related GHOAT).

OpenMM quickstart for protein simulation

Since I am not a chemist, I am really looking for a system with reasonable defaults for typical drug development scenarios. I found a nice repo by tdudgeon that appears to have the same goal. It uses OpenMM, and importantly has input from experts on parameters and settings. For example, I'm not sure I would have guessed you can multiply the mass of Hydrogen by 4.

This keeps their total mass constant while slowing down the fast motions of hydrogens. When combined with constraints (typically constraints=AllBonds), this often allows a further increase in integration step size.

I forked the repo, with the idea that I could keep the simulation parameters intact but change the interface a bit to make it focused on the problems I am interested in.

Calculating affinity

I am interested in calculating ligand–protein affinity (or binding free energy) — in other words, how well does the ligand bind the protein. There's a lot here I do not understand, but here is my basic understanding of how to calculate affinity:

Using MD: This is the most accurate way to measure affinity, but the techniques are challenging. There are "end-point" approaches (e.g., MM/PBSA) and Free Energy Perturbation (FEP) / alchemical approaches. Alchemical free energy approaches are more accurate, and have been widely used for years. (I believe Schrodinger were the first to show accurate results (Wang, 2015).) Still, I found it difficult to figure out a straightforward way to do these calculations.
Using a scoring function: This is how most docking programs like vina or gnina work. Docking requires a very fast, but precise, score to optimize.
Using a deep net: Recently, several deep nets trained to predict affinity have been published. For example, HAC-Net is a CNN trained on PDBbind. This is a very direct way to estimate binding affinity, and should be accurate if there is enough training data.

The SQM/COSMO docking scoring function (Ajani, 2017)

Unfortunately, I do know know of a benchmark comparing all the above approaches, so I just tried out a few things.

Predicting cancer drug resistance

One interesting but tractable problem is figuring out if a mutation in a protein will affect ligand binding. For example, let's say we sequence a cancer genome, and see a mutation in a drug target, do we expect that drug will still bind?

There are many examples of resistance mutations evolving in cancer.

Common cancer resistance mutations (Hamid, 2020)

Experiments

BRAF V600E is a famous cancer target. Vemurafenib is a drug that targets V600E, and L505H is known to be a resistance mutation. There is a crystal structure of BRAF V600E bound to Vemurafenib (PDB:3OG7). Can I see any evidence of reduced binding of Vemurafenib if I introduce an L505H mutation?

PDB:3OG7, showing the distance between vemurafenib (cyan) and L505 (yellow)

I ran a simple simulation: starting with the crystal structure, introduce each possible mutation at position 505, allow the protein–ligand system to relax, and check to see if the new protein–ligand interactions are less favorable according to some measure of affinity.

I first used gnina's scoring function, which is fast and should be relatively precise (in order for gnina to work!) The rationale here was that the "obstruction" due to the resistance mutation would be detectable as the new atom positions of the amino acid and ligand would lead to a lower affinity.

Estimated affinity given mutations at position 505 in 3OG7

Nope. The resistance mutation has higher affinity (realistically, there are no distinguishable differences for any mutation).

We also know that MEK1 V215E acts as a resistance mutation to PD0325901, and the PDB has a crystal structure of MEK1 bound to PD0325901 (PDB:70MX).

Estimated affinity given mutations at position 215 in 70MX

Again, I can't detect any difference in affinity due to the resistance mutation.

HAC-Net

I also tried a deep-learning based affinity calculator, HAC-Net. HAC-Net has a nice colab and is relatively easy to run Dockerized.

The HAC-Net colab gives me a pKd of 8.873 for 3OG7 (wild-type)

Estimated pKd given mutations at position 505 in 3OG7 using HAC-Net

I still see no difference in affinity with HAC-Net.

Each of these simulations (relaxing a protein–ligand system with solvent present) took a few minutes on a single CPU. If I wanted to simulate a full trajectory, which could be 50 nanoseconds or longer, it would take hundreds or thousands of times as long.

Conclusions

On the one hand, I can run state-of-the-art MD simulations pretty easily with this system. On the other hand, I could not discriminate cancer resistance mutations from neutral mutations.

There are several possible reasons. Most likely, the short relaxation I am doing is insufficient and I need longer simulations. The simulations may also be insufficiently accurate, either intrinsically or due to poor parameterization. It's difficult to feel confident in these kinds of simulations, since there's no simple way to verify anything.

If anyone knows of any obvious fix for the approach here, let me know! Probably the next thing I would try would be adapting the Making It Rain tools, which include (non-alchemical) free energy calculation. For some reason the Making It Rain colab specifies "This notebook is NOT a standard protocol for MD simulations! It is just simple MD pipeline illustrating each step of a simulation protocol." which begs the questions, why not and where is such a notebook?

I do think that enabling anyone to run such simulations — specifically, with default parameters blessed by experts for common use-cases — would be a very good thing.

There are already several cancer drug selection companies like Oncobox, so maybe there should be a company doing this kind of MD for predicting cancer resistance. Maybe there is and I just have not heard of it?

Addendum: modal labs

I have been experimenting with modal labs for running code like this, where there are very specific environment requirements (i.e., painful to install libraries) and heavy CPU/GPU load. Previously, I would have used Docker, which is fundamentally awkward, and still requires manually provisioning compute. Modal can be a joy to use and I'll probably write up a separate blogpost on it.

To do your own simulation (bearing in mind all the failed experiments above!), you can either use my MD_protein_ligand colab or if you have a modal account, clone the MD_protein_ligand repo and run

mkdir -p input && modal run run_simulation_modal.py --pdb-id 3OG7 --ligand-id 032 --ligand-chain A

This basic simulation (including solvent) should cost around 10c on modal. That means we could relax all 5000 protein–ligand complexes in the PDB for around $500, perhaps in just a day or two (depending on how many machines modal allows in parallel). I'm not sure if there's any point to that, but pretty cool that things can scale up so easily these days!

Comment

Colab and computational tools for drug design

Brian Naughton | Sun 25 June 2023 | biotech | biotech machine learning ai

Last year I wrote a post about computational tools for drug development. Since then quite a lot has happened, especially the appearance of several generative models based on equivariant neural networks for drug design. This article is a sort of update to that post, and also a collection of colabs I have found or developed over the past year or so that can be stitched together to design drugs.

The tools listed here are focused on the earliest stages of drug development. Specifically:

Generate a new molecule or use a virtual screen to find one
Evaluate the molecule's potential as a drug
Synthesize or purchase the molecule

1a. Generate a new molecule

Pocket2Mol: Protein structure → small molecules

Pocket2Mol is one of a new crop of generative models that start with a binding pocket and generate molecules that fit in the pocket. VD-Gen (with animation below) is similar to Pocket2Mol but since it has no code available so I cannot tell if it works.

I created a Pocket2Mol colab that enables easy molecule generation. The input is a PDB file and 3D coordinates to search around. The centroid of a bound ligand in the PDB file can serve as the coordinates. The output is a list of generated molecules. I rank the generated molecules with gnina, a fast, easy to install, and relatively accurate way to measure binding affinity.

ColabDesign: Protein structure → peptide binders

ColabDesign is an extremely impressive project that democratizes a lot of the recent work in generative protein models. As you may guess from the name, ColabDesign — and sister project ColabFold — allow anyone to run them via colab.

For example, even the recent state-of-the-art RFDiffusion algorithm from the Baker lab has been incorporated and made available as an RFDiffusion colab.

ColabDesign can generate proteins that conform to a specific shape or reference protein backbone (structure → sequence, i.e., AlphaFold in reverse). It can also generate peptides that can bind to a specific protein.

The same group also developed AfCycDesign, which uses a nice trick to get Alphafold to fold cyclic peptides — an increasingly important drug type, and often an alternative to antibodies. There is an AfCycDesign colab too!

In this animation it is attempting to generate a cyclic peptide covid spike protein binder, as published recently.

1b. Run a virtual screen

gnina: PDB structure + ligand → posed ligand + binding affinity

gnina is the deep-learning–based successor to the extremely popular smina, itself an AutoDock Vina fork. There is a minimal implementation available as a gnina colab. gnina is a successor to smina and appears to be strictly better. gnina can be used for virtual screening and performs very well at that.

PointVS is a new, deep-learning approach. Like many recent methods, it uses an EGNN (equivariant graph neural network). PointVS's performance is impressive, comparable to gnina, but it's unclear which method runs faster. Uni-Dock is a new GPU-accelerated Autodock Vina that claims impressive speed but unfortunately there is no open code.

PointVS and gnina perform comparably

DiffDock: protein structure + ligand → posed ligand

DiffDock is a generative diffusion model with impressive performance. Given a protein and a ligand, DiffDock can try to find the best pose for that ligand. I created a DiffDock colab that runs DiffDock and again ranks the results using gnina.

DiffDock is a pose prediction method and is not designed to do virtual screening per se. It returns a "confidence" score that correlates with smina/gnina affinity.

DiffDock confidence and smina/gnina binding affinities correlate fairly well

As for which algorithm produces superior results, it's unclear, but according to a recent review, gnina comes out ahead.

2. Evaluate molecule properties

SMILES property predictor: molecule → predicted properties

A molecule the binds to a protein has one property necessary to become a drug, but that is far from the only thing that matters. We can also predict a molecule's intrinsic properties using machine learning.

I created a Smiles to properties colab based on chemprop, soltrannet, and SMILES2Caption. Most of the training data is from MoleculeNet. For some reason there was no pre-trained network available so I had to retrain from scratch.

The graphs generated show the predicted properties of your SMILES as compared to FDA-approved drugs (grey bounds). In the example below, acetaminophen appears as the most toxic of the group, as expected.

SwissADME also provides an excellent property prediction service, though anecdotally analyzing proprietary molecules on someone else's server is not commonly done in drug development.

Espaloma Charge: molecule → charge

Espaloma Charge is a standalone part of Chodera lab's Espaloma system. I include it here as they make available a very simple Espaloma Charge colab that will assign charges for your molecule.

HAC-Net: Protein structure + posed ligand → binding affinity

HAC-Net is a new deep learning method specifically designed to estimate protein–ligand binding affinity.

The HAC-Net colab takes as input a protein and posed ligand and returns a dissociation constant (pKd) and a nice pymol image. This method may or may not be more accurate than gnina — I have not attempted to benchmark.

3. Synthesize or purchase molecule

Postera Manifold: SMILES → molecule or building blocks

Postera Manifold is a machine learning tool that helps figure out how to synthesize small molecules. It can supply the building blocks and reactions necessary for synthesis, or there is a "Have PostEra make it for you" button (at least according to their documentation, but I could not find it!) However, usually synthesis of a new small molecule is a custom, CRO-driven process, that could be expensive and take a long time depending on the molecule's complexity.

Biomatik: SMILES → purchasable molecule

For peptides — which are linear, modular molecules — synthesis is much easier. A service like Biomatik (or GenScript, or many others) can supply peptides for as little as $80 per. They can use proteinogenic or non-proteinogenic amino acids, and can cyclize the peptide too.

Small World: SMILES → purchasable molecule

Small world allows you to search for molecules in the largest compound databases: ZINC, Mcule, Enamine REAL... There is also a nice, unofficial Small World Python API.

If you find a close match to your designed molecule, you could evaluaate it to see if it works as a binder, or use it as a starting point for modification.

Orforglipron

As an example of using these tools, I'll use a new GLP-1 agonist, orforglipron, that showed promising results in a phase 2 study published in NEJM just this week.

First, I can generate an image of the protein GLP-1 with a different bound ligand using my pdb2png colab. (PDB:7S15, "Pfizer small molecule bound to GLP-1").

I can look at the molecule and its charges using the Espaloma Charge colab. (Nice image, but to be honest this doesn't mean much to me!)

I can get the molecule's SMILES from pubchem, and search for it using Small World.

orforglipron in Small World: dark green is a match; light green is a mismatch

A little surprisingly, I find a matching molecule that is only two edits from orforglipron in the ZINC20 "for sale" set. It is labeled as orforglipron in MCE and interestingly the text says it was "extracted from a patent". It's unclear to me why it's not an exact match. Most sources charge $2k for 1mg, but there is one source that charges $1k for 5mg.

I can evaluate its properties with SwissADME, and it's interesting to note how far outside the ideal orforglipron is, in terms of size (883 Da!) and predicted solubility.

I can use the DiffDock colab to see how well it might bind to GLP-1

The best binding score I could get after three attempts was a DiffDock confidence of -2 (very unconfident) and a gnina affinity of -6 (poor-to-mediocre affinity). Because this molecule is so large, there are many possible conformers and it may be impossible to adequately sample them adequately. It does at least bind to the same place as the Pfizer small molecule.

When I feed this top pose from DiffDock into the HAC-Net colab, I get a pKd of 11, which is very high affinity.

The tools for drug design are getting very powerful, and colab is a fantastic way to make them widely available and easy to run. Even when I have a command-line equivalent set up, I often find running the colab to be quicker and easier. In theory, Docker provides the same kind of advantages, but it's never quite so easy, as you still have to provision compute, disk, etc. Having easy access to A100's (on the extremely affordable Colab Pro) is not bad either.

Comment

Using GPT-3 as a knowledge-base in biotech

Brian Naughton | Sat 25 February 2023 | biotech | biotech machine learning ai

Using GPT-3 as a knowledge-base for a biotech

Convolutional Neural Net Playing a 3D Game

Brian Naughton | Sun 10 April 2016 | data | data machine learning neural nets deep learning threejs webgl

Convolutional neural net playing a 3D game using convnetjs and three.js