In a previous blog post I described a pipeline for synthesizing arbitrary proteins on the Transcriptic robotic lab platform using only Python code. The ultimate goal of that project was to be able to run a program that takes a protein sequence as input, and "returns" a tube of bacteria expressing that protein. Here I'll describe some progress towards that goal.

pipeline diagram

Pipelining

The usual way to chain together different programs in bioinformatics is with a pipeline management system, for example, snakemake, nextflow, toil, WDL, and many many more. I've recently become a big fan of nextflow for computational pipelines, but its major advantages (e.g., containerization) don't help much here because so much of the work happens outside of the computer. For this project I've been using the slightly simpler snakemake, mainly for tracking which steps have been completed, and deciding which steps can be run in parallel based on their dependencies.

Each protocol has four associated steps in the pipeline:

  • generate protocol: create an autoprotocol file describing the protocol
  • submit protocol: submit the autoprotocol file to Transcriptic
  • get results: download images, data, etc. from Transcriptic
  • create report: create an HTML report from the downloaded data
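
To make the dependency tracking concrete, here is a minimal sketch in plain Python (not an actual Snakefile) of how the four steps per protocol could be chained and grouped into batches that can run in parallel. The protocol names are taken from the reports later in this post; the dependencies between protocols are my assumption for illustration, and `graphlib` needs Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# The four pipeline steps each protocol goes through, in order.
STEPS = ["generate_protocol", "submit_protocol", "get_results", "create_report"]

# ASSUMPTION: illustrative protocol-level dependencies, not the real pipeline's.
PROTOCOL_DEPS = {
    "cut_plasmid": [],
    "synthesize_primers": [],
    "add_flanks": ["synthesize_primers"],
    "run_gibson_and_transform": ["cut_plasmid", "add_flanks"],
    "pick_colonies_and_culture": ["run_gibson_and_transform"],
}


def build_graph(protocol_deps):
    """Map each 'protocol:step' node to the set of nodes it waits on."""
    graph = {}
    for protocol, upstream in protocol_deps.items():
        previous = None
        for step in STEPS:
            node = f"{protocol}:{step}"
            graph[node] = {previous} if previous else set()
            previous = node
        # A protocol can only be generated once upstream reports exist.
        first = f"{protocol}:{STEPS[0]}"
        graph[first].update(f"{dep}:{STEPS[-1]}" for dep in upstream)
    return graph


def parallel_batches(graph):
    """Group nodes into batches whose members have no unmet dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(build_graph(PROTOCOL_DEPS))
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Under these assumed dependencies the first batch contains the generate step for both cut_plasmid and synthesize_primers, since neither waits on anything else.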

snakemake pipeline for protein synthesis

Metaprotocol

In my terminology, a "metaprotocol" defines the complete process, which is turned into a series of protocols. Ideally, the output of a single protocol will be a decision point: for example, whether or not a gel image includes the expected bands.

The metaprotocol is defined in YAML, which has its issues, but is more readable than JSON, and well supported. The code that builds the metaprotocol depends heavily on Pydna, a Python package for cloning and assembly. Given an insert and a vector, Pydna will design primers and a PCR program. The following is my metaprotocol YAML for expressing GFP:

- meta:
    assembly: |-
      Assembly:
      Sequences........................: [2690] [786]
      Sequences with shared homologies.: [2690] [786]
      Homology limit (bp)..............: 25
      Number of overlaps...............: 2
      Nodes in graph(incl. 5' & 3')....: 4
      Only terminal overlaps...........: No
      Circular products................: [3412]
      Linear products..................: [3446] [3442] [34] [30]
    assembly_figure: |2-
       -|SYNPUC19V|31
      |            \/
      |            /\
      |            31|786bp_PCR_prod|30
      |                              \/
      |                              /\
      |                              30-
      |                                 |
       ---------------------------------
    metaprotocol_id: 1k9ginus
    pcr_figure: |2-
                                    5AGGAGGACAGCTATGTCGAAAGGA...CATTACCCATGGAATGGATGAACTGTATAAA3
                                                                ||||||||||||||||||||||||||||||| tm 59.8 (dbd) 70.6
                                                               3GTAATGGGTACCTTACCTACTTGACATATTTTTAAGTGACCGGCAGCAAAATGTTGCAGCA5
      5ACTCTAGAGGATCCCCGGGTACCGAGCTCGAGGAGGACAGCTATGTCGAAAGGA3
                                     |||||||||||||||||||||||| tm 62.1 (dbd) 69.3
                                    3TCCTCCTGTCGATACAGCTTTCCT...GTAATGGGTACCTTACCTACTTGACATATTT5
    pcr_program: |2

      Pfu-Sso7d (rate 15s/kb)
      Two-step|    30 cycles |      |786bp
      98.0°C  |98.0C         |      |Tm formula: Pydna tmbresluc
      _____ __|_____         |      |SaltC 50mM
      00min30s|10s  \        |      |Primer1C 1.0µM
              |      \ 72.0°C|72.0°C|Primer2C 1.0µM
              |       \______|______|GC 49%
              |       0min11s|10min |4-12°C
    project_name: pUC19_sfGFP_cloning_v1
- linearize:
    restriction_enzyme: EcoRI
    vector: pUC19
- oligosynthesize:
    p1: ACTCTAGAGGATCCCCGGGTACCGAGCTCGAGGAGGACAGCTATGTCGAAAGGA
    p2: ACGACGTTGTAAAACGACGGCCAGTGAATTTTTATACAGTTCATCCATTCCATGGGTAATG
- thermocycle:
    insert: AGGAGGACAGCTATGTCGAAAGGAGAAGAACTGTTTACCGGTGTGGTTCCGATTCTGGTAGAACTGGATGGGGACGTGAACGGCCATAAATTTAGCGTCCGTGGTGAGGGTGAAGGGGATGCCACAAATGGCAAACTTACCCTTAAATTCATTTGCACTACCGGCAAGCTGCCGGTCCCTTGGCCGACCTTGGTCACCACACTGACGTACGGGGTTCAGTGTTTTTCGCGTTATCCAGATCACATGAAACGCCATGACTTCTTCAAAAGCGCCATGCCCGAGGGCTATGTGCAGGAACGTACGATTAGCTTTAAAGATGACGGGACCTACAAAACCCGGGCAGAAGTGAAATTCGAGGGTGATACCCTGGTTAATCGCATTGAACTGAAGGGTATTGATTTCAAGGAAGATGGTAACATTCTCGGTCACAAATTAGAATACAACTTTAACAGTCATAACGTTTATATCACCGCCGACAAACAGAAAAACGGTATCAAGGCGAATTTCAAAATCCGGCACAACGTGGAGGACGGGAGTGTACAACTGGCCGACCATTACCAGCAGAACACACCGATCGGCGACGGCCCGGTGCTGCTCCCGGATAATCACTATTTAAGCACCCAGTCAGTGCTGAGCAAAGATCCGAACGAAAAACGTGACCATATGGTGCTGCTGGAGTTCGTGACCGCCGCGGGCATTACCCATGGAATGGATGAACTGTATAAA
    p1: ACTCTAGAGGATCCCCGGGTACCGAGCTCGAGGAGGACAGCTATGTCGAAAGGA
    p2: ACGACGTTGTAAAACGACGGCCAGTGAATTTTTATACAGTTCATCCATTCCATGGGTAATG
    program:
      extension_time: 11.0
      forward_primer_concentration: 0.001
      rate: 15.0
      reverse_primer_concentration: 0.001
      saltc: 50.0
      ta: 72.0
- assemble:
    insert: sfGFP
    vector: pUC19
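
As a sanity check on the primers in this metaprotocol, the sketch below confirms that each primer's 3' end anneals to the insert: p1 should end with the first bases of the insert, and p2 should end with the reverse complement of the last bases. It uses plain Python rather than Pydna, and the two insert fragments are copied from the ends of the insert sequence above rather than the full sequence.

```python
# Complement table for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


# Primers as listed under oligosynthesize / thermocycle.
P1 = "ACTCTAGAGGATCCCCGGGTACCGAGCTCGAGGAGGACAGCTATGTCGAAAGGA"
P2 = "ACGACGTTGTAAAACGACGGCCAGTGAATTTTTATACAGTTCATCCATTCCATGGGTAATG"

# First 24 and last 31 bases of the insert (the annealing regions in pcr_figure).
INSERT_START = "AGGAGGACAGCTATGTCGAAAGGA"
INSERT_END = "CATTACCCATGGAATGGATGAACTGTATAAA"

forward_ok = P1.endswith(INSERT_START)
reverse_ok = P2.endswith(reverse_complement(INSERT_END))

# Whatever precedes the annealing region is the flank added to the insert.
forward_flank = P1[: len(P1) - len(INSERT_START)]
reverse_flank = P2[: len(P2) - len(INSERT_END)]

print(forward_ok, reverse_ok)
print(len(forward_flank), len(reverse_flank))
```

Both checks pass for these sequences, and each primer carries a 30-base flank ahead of its annealing region.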

DNA synthesis

Of course, before you can run this pipeline, you need to have the appropriate insert DNA in your Transcriptic inventory. As far as I know, none of the major synthetic DNA suppliers has an API. However, you can order DNA from IDT by filling in an Excel file. I have automated filling in and emailing this file, so DNA synthesis can be included in the pipeline too! It should take about a week from ordering for DNA to appear at Transcriptic.
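
The general shape of the "fill in a file and email it" step can be sketched with the Python standard library. This is not my actual ordering code and it does not reproduce IDT's real template: a CSV file stands in for the Excel file, the column names are hypothetical, the addresses are placeholders, and the message is only built, never sent.

```python
import csv
import io
from email.message import EmailMessage


def build_order_email(oligos, sender, recipient):
    """Build (but do not send) an email with the order file attached."""
    # Write the order as CSV; a real order would use the supplier's template.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "sequence"])  # hypothetical column names
    for name, sequence in oligos.items():
        writer.writerow([name, sequence])

    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"DNA order: {len(oligos)} sequence(s)"
    message.set_content("Order file attached.")
    message.add_attachment(buffer.getvalue(), subtype="csv", filename="order.csv")
    return message


order = build_order_email(
    {"p1": "ACTCTAGAGGATCCCCGGGTACCGAGCTCGAGGAGGACAGCTATGTCGAAAGGA"},
    sender="me@example.com",        # placeholder address
    recipient="orders@example.com",  # placeholder address
)
attachment = next(order.iter_attachments())
print(order["Subject"], attachment.get_filename())
```

Sending would be a separate step (for example with `smtplib`), which is deliberately left out here.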

Reporting

After each protocol finishes, an HTML report is generated. This allows the user to evaluate protocol results manually before initiating the next step. There are ways to automate this more, such as automated band mapping of gel images, but I think that kind of thing will work better once the Transcriptic API settles down a bit. The HTML report also serves as a log of the experiment.
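
A report header of this kind could be rendered roughly as follows. This is a simplified sketch, not my real report generator: the function name and the HTML layout are made up for illustration, and the queue-time and run-time rows are extras computed from the timestamps. The example values are the cut_plasmid timestamps from the report in this section.

```python
from datetime import datetime
from html import escape

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def render_report_header(name, status, submitted, started, completed):
    """Render a protocol's status and timings as a small HTML fragment."""
    t_submitted, t_started, t_completed = (
        datetime.strptime(stamp, TIME_FORMAT)
        for stamp in (submitted, started, completed)
    )
    rows = [
        ("Submitted at UTC", submitted),
        ("Started at UTC", started),
        ("Completed at UTC", completed),
        # Derived values, not part of the original report.
        ("Time in queue", str(t_started - t_submitted)),
        ("Run time", str(t_completed - t_started)),
    ]
    lines = [f"<h2>{escape(name)} <strong>{escape(status)}</strong></h2>", "<table>"]
    lines += [
        f"  <tr><td>{escape(label)}</td><td>{escape(value)}</td></tr>"
        for label, value in rows
    ]
    lines.append("</table>")
    return "\n".join(lines)


header = render_report_header(
    "cut_plasmid",
    "FINISHED",
    submitted="2016-08-20 19:42:48",
    started="2016-08-20 22:52:06",
    completed="2016-08-21 01:28:49",
)
print(header)
```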

cut_plasmid FINISHED

 Submitted at UTC 2016-08-20 19:42:48
   Started at UTC 2016-08-20 22:52:06
 Completed at UTC 2016-08-21 01:28:49
Ran report at UTC 2016-10-26 22:15:03

Expected DNA bands of size: 2686bp

synthesize_primers FINISHED

 Submitted at UTC 2016-08-23 19:35:37
   Started at UTC 2016-08-23 20:00:11
 Completed at UTC 2016-08-24 20:01:03
Ran report at UTC 2016-10-26 22:15:44

Synthesized primers:

ACTCTAGAGGATCCCCGGGTACCGAGCTCGAGGAGGACAGCTATGTCGAAAGGA
ACGACGTTGTAAAACGACGGCCAGTGAATTTTTATACAGTTCATCCATTCCATGGGTAATG

add_flanks FINISHED

 Submitted at UTC 2016-10-08 15:12:38
   Started at UTC 2016-10-10 22:22:16
 Completed at UTC 2016-10-11 01:04:12
Ran report at UTC 2016-11-17 15:36:02

PCR program

Pfu-Sso7d (rate 15s/kb)
Two-step|    30 cycles |      |786bp
98.0°C  |98.0C         |      |Tm formula: Pydna tmbresluc
_____ __|_____         |      |SaltC 50mM
00min30s|10s  \        |      |Primer1C 1.0µM
        |      \ 72.0°C|72.0°C|Primer2C 1.0µM
        |       \______|______|GC 49%
        |       0min11s|10min |4-12°C

run_gibson_and_transform FINISHED

 Submitted at UTC 2016-10-24 23:10:06
   Started at UTC 2016-10-28 17:45:56
 Completed at UTC 2016-10-29 17:01:26
Ran report at UTC 2016-10-30 15:12:18

gibson_and_transform_v1_gibson_and_transform_v1_amp_6_flat_t18.png
gibson_and_transform_v1_gibson_and_transform_v1_amp_6_flat_t9.png
gibson_and_transform_v1_gibson_and_transform_v1_noAB_6_flat_t18.png
gibson_and_transform_v1_gibson_and_transform_v1_noAB_6_flat_t9.png

pick_colonies_and_culture FINISHED

 Submitted at UTC 2016-11-02 19:46:39
   Started at UTC 2016-11-04 00:11:21
 Completed at UTC 2016-11-04 17:54:20
Ran report at UTC 2016-11-11 01:10:22

Absorbance readings

pick_colonies_and_culture_v1_abs_t16

b1     0.047
a2     0.048
a10    0.050
a7     0.050
a6     0.051
a5     0.051
b6     0.054
b5     0.056
a11    0.058
a3     0.060
b7     0.066
a9     0.071
a1     0.073
b4     0.080
a12    0.080
b10    0.081
a4     0.082
b9     0.083
b2     0.086
b3     0.088
a8     0.090
b11    0.116
b8     0.148
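
Readings like these are a natural place for an automated decision point. As a minimal sketch, the check below flags wells whose absorbance reaches a cutoff; the cutoff of 0.2 is an assumption chosen for illustration, not a value used by the pipeline, and the function name is hypothetical.

```python
# Absorbance readings from the pick_colonies_and_culture report above.
READINGS = {
    "b1": 0.047, "a2": 0.048, "a10": 0.050, "a7": 0.050, "a6": 0.051,
    "a5": 0.051, "b6": 0.054, "b5": 0.056, "a11": 0.058, "a3": 0.060,
    "b7": 0.066, "a9": 0.071, "a1": 0.073, "b4": 0.080, "a12": 0.080,
    "b10": 0.081, "a4": 0.082, "b9": 0.083, "b2": 0.086, "b3": 0.088,
    "a8": 0.090, "b11": 0.116, "b8": 0.148,
}


def wells_above(readings, cutoff):
    """Return wells at or above the cutoff, highest absorbance first."""
    hits = [(value, well) for well, value in readings.items() if value >= cutoff]
    return [well for value, well in sorted(hits, reverse=True)]


GROWTH_CUTOFF = 0.2  # ASSUMPTION: illustrative threshold only

grown = wells_above(READINGS, GROWTH_CUTOFF)
print(f"{len(grown)} of {len(READINGS)} wells at or above {GROWTH_CUTOFF}")
```

With this assumed cutoff none of the 23 wells would pass, so an automated pipeline would stop here rather than continue to the next protocol.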

Conclusions

There is still plenty to do before the pipeline is completely automatic. For example, attentive readers will notice that the HTML report above shows an unsuccessful transformation, one of many! The first complete transformation took several months to get right. The biggest challenge is making the process robust to changes in the protein sequence — even basic PCR can go wrong in many ways. Currently, debugging is a major undertaking; unlike regular programming, iterations are slow and expensive. However, if the protocols can be made robust enough, which I think they can, then synthesizing a new protein could become as simple as running BLAST.

It feels apt to write about virtual companies from the beautiful new Hanahaus space in downtown Palo Alto. $3 an hour for a seat, and coffee by Blue Bottle. Rent in Palo Alto is actually not so bad if you share...

*Hanahaus, Palo Alto*

My impression of the typical "virtual biotech" is a company that is spun out to develop a compound originally discovered in an academic lab, or licensed from a larger biotech. There are only a few employees, usually pharma veterans, whose job is to shepherd the compound from CRO to CRO, and develop just enough evidence that the compound can be sold.

Recently, developments in biotech — analogous to the move to cloud computing in IT — may allow for a more complete virtual drug development company. Below I summarize how this might work, and the companies and technologies that enable it.

Choose your therapy

Generally, biologics are going to be a better fit for a virtual model than small molecules.

The chemistry of drug development requires very specialized expertise, and large pharma/biotech has institutional knowledge that is extremely difficult to compete with. Also, because small molecules can be made of anything, their off-target effects can be difficult to predict (even aspirin is not completely understood).

Nucleotide-based technologies like RNAi (Alnylam), mRNA (Moderna), and CRISPR/Cas9 (Caribou, Editas) would be ideal. Since they are nucleotide-based, binding relies on sequence identity, so it's much closer to a digital system. Theoretically, you can change targets simply by changing the nucleotide sequence, which makes the process much more predictable. Nucleotide binding is generally easier to predict because a 1D search space (the human genome, plus perhaps commensal bacterial genomes) is so much more constrained than a 3D search space (all structures/epitopes present in and on cells). Of course, these technologies have their own issues in that they are new and untested.
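
To illustrate what a 1D search means in practice, here is a toy sketch that scans a sequence for sites matching a guide with at most a given number of mismatches. Both sequences are made up for illustration, and real off-target prediction involves much more than counting mismatches (both strands, PAM sites, position-dependent effects).

```python
def find_sites(genome, guide, max_mismatches=1):
    """Return (start, mismatches) for every window within the mismatch limit."""
    hits = []
    for start in range(len(genome) - len(guide) + 1):
        window = genome[start:start + len(guide)]
        mismatches = sum(a != b for a, b in zip(window, guide))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


# Made-up sequences: one exact site and one site with a single mismatch.
GENOME = "TTGACCTGAAGGTTGACGTGAACC"
GUIDE = "GACCTGAA"

sites = find_sites(GENOME, GUIDE, max_mismatches=1)
print(sites)
```

Retargeting is then just a matter of changing `GUIDE`, which is the sense in which these therapies behave like a digital system.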

Protein-based biologics are arguably a good compromise. For example: enzymes (enzyme replacement therapy is worth billions of dollars a year), antibodies (seven of the eight top selling drugs in 2013 were antibodies), BiTEs and CAR-Ts (cancer immunotherapy companies like Juno are showing extremely promising results). These technologies provide a more consistent design template (i.e., DNA) than a small molecule, but there is still a lot that remains unpredictable, such as off-target binding for antibodies, or even how the protein will fold.

Drug repositioning

Another reasonable option is to use an existing library of small molecules (e.g., from NCATS) with some additional data that can be mined (e.g., expression changes in model organisms). This process is usually called drug repositioning, and there are indeed many such companies springing up as the amount of available data increases and methods for prediction using statistical models (machine learning) improve (twoXAR, NuMedii, AtomWise).

You can also combine these two concepts, by applying CRISPR/Cas9 to a model organism to create a disease model, and then testing that model against a library of compounds (Recursion Pharma, Perlstein lab). Creating these disease models straightforwardly may be one of the major initial uses of CRISPR/Cas9 (amazingly, now mainstream enough that you can order yours from Agilent).

Choose your advantage

Without the resources of a large biotech, how can a virtual company compete? After all, pharma/biotech has thousands of potential therapies sitting on the shelf. A therapy that works great in yeast, or even mouse, is not necessarily worth much because most of the risk in drug development happens after the preclinical research stage (an orphan disease with no treatments is an easier sell).

Since the eventual goal is a safe and effective therapy, that means there are three advantages your therapy could have:

  • More Safety: The therapy has already been shown to be safe in a clinical trial, or is a generic/off-patent drug (twoXAR, Recursion, NuMedii)
  • More Efficacy: The therapy works in multiple distinct organisms, so it should work in humans (Perlstein lab)
  • More Safety and More Efficacy: The therapy comes directly from a human, therefore there is some indication that it's safe and effective in a human (X01, Neurimmune). Recent applications of human genetics in drug discovery (e.g., PCSK9 inhibitors) rely on a similar concept.

Create and test your therapy

  • Create
    • Design: A good example of where protein engineering is important is BiTEs. You can think of two things that should be colocated (like T-cells and cancer cells), and synthesize a molecule that binds both.
    • Find: A surprising fraction of drugs are still "natural products", many discovered through bioprospecting. Recently, with the incredible amount of sequencing capacity available, we can do this at scale from microbes (Warp Drive Bio) or maybe even from humans.
    • Repurpose: You can just try all the compounds in a commercial screening library. They may have already been picked over though!
*Percentage of drug approvals that were natural products ("N")* (Newman & Cragg 2013: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3721181/)
  • Test
    • Model organism: Testing in simple model organisms is great, if you have a good model (apparently, yeast is a good model for Alzheimer's). It also helps you parallelize your experiments since you can grow these little organisms in wells.
    • Human cells: This method becomes especially powerful when combined with CRISPR/Cas9, even with a relatively low yield for now. Rooster Bio and Extem Bio are two startups providing MSCs (mesenchymal stem cells — not iPSCs) at competitive prices (Extem claims to have the largest stem cell library by several orders of magnitude). Of course, every large biotech is using stem cells too (e.g., AstraZeneca).
    • Animal: Mammalian animal models (usually mice or rats) are expensive (probably $10k-100k per experiment), but currently necessary for any kind of serious drug development effort.

Choose your development methods

Since this company is virtual, there are severe limitations on what is possible, so the choice of development methods is extremely important. The experiments must be inherently amenable to virtualization.

Synthetic biology

If you are going to develop a biologic virtually and on the cheap, then you'll probably want to use synthetic biology. The iGEM synthetic biology competition gives some indication of how that might work (list of iGEM projects). iGEM is mostly focused on bacterial sensors and the like, but when the worlds of iGEM and drug development collide it is fascinating.

Synthetic biology allows you to iterate on and parallelize your experiments in ways that are very suited to virtualization. For example, if you want to do some mutagenesis on your protein, you can use a kit or write some code to edit the sequence directly. The use of synthetic DNA means you can worry less about the experimental process (purification of PCR products, general lab hygiene) and rely less on the hard-won expertise of lab science. You get increased reproducibility for free.
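
As a small illustration of editing the sequence directly, the sketch below swaps one codon in a coding sequence. The function name and the short example sequence are hypothetical, and a real workflow would also check things like codon usage and restriction sites before ordering the DNA.

```python
def mutate_codon(cds, residue, new_codon):
    """Replace the codon for a 1-indexed residue in a coding sequence."""
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    if len(new_codon) != 3:
        raise ValueError("new codon must be exactly 3 bases")
    if not 1 <= residue <= len(cds) // 3:
        raise ValueError("residue is outside the coding sequence")
    start = (residue - 1) * 3
    return cds[:start] + new_codon + cds[start + 3:]


# Toy coding sequence: Met-Ser-Lys-Gly.
CDS = "ATGTCGAAAGGA"

# Change residue 2 from serine (TCG) to alanine (GCG).
mutant = mutate_codon(CDS, residue=2, new_codon="GCG")
print(mutant)  # ATGGCGAAAGGA
```

Generating a whole panel of mutants is then a loop over positions and codons, with each result sent off for synthesis.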

Synthesizing DNA is still expensive at 10-20c per base (that's at least a million times more expensive than sequencing) but companies like Gen9, Twist Bioscience and Cambrian Genomics should be able to bring the price down an order of magnitude within a few years. That will mean $10-50 proteins and antibody fragments, which should enable a lot more kinds of parallel experimentation.
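
The arithmetic behind those numbers is simple enough to write down. The 1,000-base gene length below is an assumption for illustration, and the future prices just apply the order-of-magnitude drop mentioned above.

```python
def synthesis_cost_usd(n_bases, cents_per_base):
    """Cost in dollars of synthesizing n_bases at a per-base price in cents."""
    return n_bases * cents_per_base / 100


GENE_LENGTH = 1000  # ASSUMPTION: illustrative gene length in bases

# Current pricing: 10-20 cents per base.
today = (synthesis_cost_usd(GENE_LENGTH, 10), synthesis_cost_usd(GENE_LENGTH, 20))

# After a tenfold price drop: 1-2 cents per base.
future = (synthesis_cost_usd(GENE_LENGTH, 1), synthesis_cost_usd(GENE_LENGTH, 2))

print(f"today:  ${today[0]:.0f}-{today[1]:.0f}")
print(f"future: ${future[0]:.0f}-{future[1]:.0f}")
```

So a gene of that length goes from $100-200 today to $10-20, with longer constructs landing toward the upper end of the $10-50 range.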

You can get a bit of help with designing your vector and protein using software like Genome Compiler or Benchling (as used by Gen9).

*Rob Carlson's DNA synthesis cost curve* (http://www.synthesis.cc/2014/02/time-for-new-cost-curves-2014.html)

Cloud labs

The other crucial ingredient in the modern virtual biotech is the cloud lab (as I've discussed in several previous posts): Transcriptic, Autodesk Wet Lab Accelerator (in beta just this week, and built on top of Transcriptic), Arcturus BioCloud, Emerald Cloud Lab (in beta), Synthego (not yet live) and Riffyn (not yet live). None of these companies existed just a year or two ago.

Just like Snapchat can build a massive messaging app on top of Google App Engine to compete with Facebook et al., and many other lean internet companies build on top of AWS, the virtual biotech should take advantage of scalable cloud services for experiments too.

Sadly, you cannot do everything with synthetic biology and cloud labs just yet. For experiments that don't fit into these boxes, there is always Science Exchange and Assay Depot. There are also a couple of exciting stealth animal experimentation startups coming out soon! The CRO world, like IT vendors before the era of AWS, is set for disruption.

I first learned about Arcturus BioCloud a few months ago from their irreverent YouTube videos.

Arcturus is one of the new crop of robotic labs that allow you to run wet-lab experiments from a web or command-line interface. They appear to be smaller and less well funded than Transcriptic and Emerald Cloud Lab, both of which have opened enormous new lab spaces recently. Arcturus is much more focused on the biohacker / synthetic biology space, whereas Transcriptic and ECL are focused on taking on common wet-lab procedures like Western blots. (For completeness, the other robot lab I know about is Riffyn, though I don't know how that one fits in yet.)

The most impressive aspect of Arcturus to me was that I could create and execute a synthetic biology project within literally a few minutes. Granted, the experiment simply grows a bacterium up with one of three available genes and takes a photograph, but I am still extremely impressed by how easy it was, and by the $80 price.

I'm looking forward to seeing how Arcturus develops their interface to allow for custom proteins and analyses.

Brian Naughton | Fri 12 September 2014 | biotech | scrapy iGEM synthetic biology

iGem projects

First look at GenoCAD

Finding a vector

Brian Naughton | Tue 12 August 2014 | biotech | synthetic biology vaccine

What is the flagellin DNA sequence?

Brian Naughton | Mon 11 August 2014 | biotech | synthetic biology

What is the strongest mammalian promoter?

Boolean Biotech © Brian Naughton Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More