In a previous blog post I described a pipeline for synthesizing arbitrary proteins on the Transcriptic robotic lab platform using only Python code. The ultimate goal of that project was to run a program that takes a protein sequence as input and "returns" a tube of bacteria expressing that protein. Here I'll describe some progress towards that goal.
Pipelining
The usual way to chain together different programs in bioinformatics is with a pipeline management system such as Snakemake, Nextflow, Toil, or WDL, among many others. I've recently become a big fan of Nextflow for computational pipelines, but its major advantages (e.g., containerization) don't help much here because so much of the work happens outside the computer. For this project I've been using the slightly simpler Snakemake, mainly to track which steps have been completed and to decide which steps can run in parallel based on their dependencies.
Each protocol has four associated steps in the pipeline:
- generate protocol: create an autoprotocol file describing the protocol
- submit protocol: submit the autoprotocol file to Transcriptic
- get results: download images, data, etc. from Transcriptic
- create report: create an HTML report from the downloaded data
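The bookkeeping behind these four steps can be sketched in plain Python. This is not my Snakemake setup, just a minimal illustration of the same idea: each step is considered done once its output file exists, and the next step to run is the first one whose output is missing. The step names, file names, and directory layout below are all hypothetical.

```python
from pathlib import Path
from typing import Dict, Optional

# The four per-protocol steps, in dependency order (names are illustrative).
STEPS = ("generate", "submit", "results", "report")

# Hypothetical marker file written by each step when it completes.
OUTPUT_NAMES = {
    "generate": "autoprotocol.json",
    "submit": "run_id.txt",
    "results": "results.json",
    "report": "report.html",
}


def step_outputs(protocol: str, workdir: Path) -> Dict[str, Path]:
    """Map each step to the file that marks it as complete."""
    return {step: workdir / protocol / OUTPUT_NAMES[step] for step in STEPS}


def next_step(protocol: str, workdir: Path) -> Optional[str]:
    """Return the first step whose output is missing, or None if all are done."""
    outputs = step_outputs(protocol, workdir)
    for step in STEPS:
        if not outputs[step].exists():
            return step
    return None
```

A workflow manager does the same thing declaratively: each step lists its input and output files, and the tool works out what is stale and what can run in parallel.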
Metaprotocol
In my terminology, a "metaprotocol" defines the complete process, which is turned into a series of protocols. Ideally, the output of a single protocol will be a decision point: for example, whether or not a gel image includes the expected bands.
The metaprotocol is defined in YAML, which has its issues but is more readable than JSON and is well supported. This code depends heavily on Pydna, a Python package for cloning and assembly. Given an insert and a vector, Pydna will design primers and a PCR program. The following is my metaprotocol YAML for expressing GFP:
    - meta:
        assembly: |-
          Assembly:
          Sequences........................: [2690] [786]
          Sequences with shared homologies.: [2690] [786]
          Homology limit (bp)..............: 25
          Number of overlaps...............: 2
          Nodes in graph(incl. 5' & 3')....: 4
          Only terminal overlaps...........: No
          Circular products................: [3412]
          Linear products..................: [3446] [3442] [34] [30]
        assembly_figure: |2-
           -|SYNPUC19V|31
          |            \/
          |            /\
          |            31|786bp_PCR_prod|30
          |                              \/
          |                              /\
          |                              30-
          |                                 |
           ---------------------------------
        metaprotocol_id: 1k9ginus
        pcr_figure: |2-
          5AGGAGGACAGCTATGTCGAAAGGA...CATTACCCATGGAATGGATGAACTGTATAAA3
                                      ||||||||||||||||||||||||||||||| tm 59.8 (dbd) 70.6
                                     3GTAATGGGTACCTTACCTACTTGACATATTTTTAAGTGACCGGCAGCAAAATGTTGCAGCA5
          5ACTCTAGAGGATCCCCGGGTACCGAGCTCGAGGAGGACAGCTATGTCGAAAGGA3
                                         |||||||||||||||||||||||| tm 62.1 (dbd) 69.3
                                        3TCCTCCTGTCGATACAGCTTTCCT...GTAATGGGTACCTTACCTACTTGACATATTT5
        pcr_program: |2
          Pfu-Sso7d (rate 15s/kb)
          Two-step|    30 cycles |      |786bp
          98.0°C  |98.0C         |      |Tm formula: Pydna tmbresluc
          _____ __|_____         |      |SaltC 50mM
          00min30s|10s  \        |      |Primer1C 1.0µM
                  |      \ 72.0°C|72.0°C|Primer2C 1.0µM
                  |       \______|______|GC 49%
                  |       0min11s|10min |4-12°C
        project_name: pUC19_sfGFP_cloning_v1
    - linearize:
        restriction_enzyme: EcoRI
        vector: pUC19
    - oligosynthesize:
        p1: ACTCTAGAGGATCCCCGGGTACCGAGCTCGAGGAGGACAGCTATGTCGAAAGGA
        p2: ACGACGTTGTAAAACGACGGCCAGTGAATTTTTATACAGTTCATCCATTCCATGGGTAATG
    - thermocycle:
        insert: AGGAGGACAGCTATGTCGAAAGGAGAAGAACTGTTTACCGGTGTGGTTCCGATTCTGGTAGAACTGGATGGGGACGTGAACGGCCATAAATTTAGCGTCCGTGGTGAGGGTGAAGGGGATGCCACAAATGGCAAACTTACCCTTAAATTCATTTGCACTACCGGCAAGCTGCCGGTCCCTTGGCCGACCTTGGTCACCACACTGACGTACGGGGTTCAGTGTTTTTCGCGTTATCCAGATCACATGAAACGCCATGACTTCTTCAAAAGCGCCATGCCCGAGGGCTATGTGCAGGAACGTACGATTAGCTTTAAAGATGACGGGACCTACAAAACCCGGGCAGAAGTGAAATTCGAGGGTGATACCCTGGTTAATCGCATTGAACTGAAGGGTATTGATTTCAAGGAAGATGGTAACATTCTCGGTCACAAATTAGAATACAACTTTAACAGTCATAACGTTTATATCACCGCCGACAAACAGAAAAACGGTATCAAGGCGAATTTCAAAATCCGGCACAACGTGGAGGACGGGAGTGTACAACTGGCCGACCATTACCAGCAGAACACACCGATCGGCGACGGCCCGGTGCTGCTCCCGGATAATCACTATTTAAGCACCCAGTCAGTGCTGAGCAAAGATCCGAACGAAAAACGTGACCATATGGTGCTGCTGGAGTTCGTGACCGCCGCGGGCATTACCCATGGAATGGATGAACTGTATAAA
        p1: ACTCTAGAGGATCCCCGGGTACCGAGCTCGAGGAGGACAGCTATGTCGAAAGGA
        p2: ACGACGTTGTAAAACGACGGCCAGTGAATTTTTATACAGTTCATCCATTCCATGGGTAATG
        program:
          extension_time: 11.0
          forward_primer_concentration: 0.001
          rate: 15.0
          reverse_primer_concentration: 0.001
          saltc: 50.0
          ta: 72.0
    - assemble:
        insert: sfGFP
        vector: pUC19
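As a sanity check on the primers above, here is a small stdlib-only sketch (it does not use Pydna) that splits each primer into its 5' tail and its annealing part, using only the two ends of the insert shown in pcr_figure. The helper names are my own, made up for illustration. The last function reproduces the 11 s extension time from the 786 bp product and the 15 s/kb rate; that Pydna truncates rather than rounds is an assumption that merely happens to match the number in the program.

```python
# Primer sequences copied from the metaprotocol.
P1 = "ACTCTAGAGGATCCCCGGGTACCGAGCTCGAGGAGGACAGCTATGTCGAAAGGA"
P2 = "ACGACGTTGTAAAACGACGGCCAGTGAATTTTTATACAGTTCATCCATTCCATGGGTAATG"

# Only the two ends of the insert are needed here (taken from pcr_figure).
INSERT_START = "AGGAGGACAGCTATGTCGAAAGGA"
INSERT_END = "CATTACCCATGGAATGGATGAACTGTATAAA"

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_primer(primer: str, annealing: str) -> tuple:
    """Split a primer into (5' tail, annealing part).

    Raises ValueError if the primer's 3' end does not match `annealing`.
    """
    if not primer.endswith(annealing):
        raise ValueError("primer 3' end does not match the template")
    return primer[: -len(annealing)], annealing


def extension_time_s(product_bp: int, rate_s_per_kb: float) -> int:
    """Extension time in whole seconds (truncation is an assumption)."""
    return int(product_bp * rate_s_per_kb / 1000)


# The forward primer anneals to the start of the insert; the reverse
# primer anneals to the reverse complement of its end.
fwd_tail, fwd_anneal = split_primer(P1, INSERT_START)
rev_tail, rev_anneal = split_primer(P2, reverse_complement(INSERT_END))
```

Both tails come out 30 nt long. Presumably these are the stretches that overlap the linearized vector during assembly, which would also explain why the PCR product is 60 bp longer than the insert.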
DNA synthesis
Of course, before you can run this pipeline, you need to have the appropriate insert DNA in your Transcriptic inventory. As far as I know, none of the major synthetic DNA suppliers has an API. However, you can order DNA from IDT by filling in an Excel file. I have automated filling in and emailing this file, so DNA synthesis can be included in the pipeline too! It should take about a week from ordering for the DNA to appear at Transcriptic.
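The emailing half of that automation can be sketched with Python's standard library alone. This is only an outline under stated assumptions: the function name, subject line, and body text are invented for illustration, and filling in the spreadsheet itself (which needs a third-party Excel library and the supplier's actual template) is left out, so the function simply takes the finished file as bytes.

```python
from email.message import EmailMessage

# Standard MIME type for .xlsx files.
XLSX_MAINTYPE = "application"
XLSX_SUBTYPE = "vnd.openxmlformats-officedocument.spreadsheetml.sheet"


def build_order_email(
    xlsx_bytes: bytes,
    filename: str,
    sender: str,
    recipient: str,
    order_name: str,
) -> EmailMessage:
    """Build (but do not send) an email with the order form attached."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"DNA synthesis order: {order_name}"
    msg.set_content("Please find the completed order form attached.")
    msg.add_attachment(
        xlsx_bytes,
        maintype=XLSX_MAINTYPE,
        subtype=XLSX_SUBTYPE,
        filename=filename,
    )
    return msg
```

Sending the message is then a matter of passing it to `smtplib.SMTP(...).send_message(msg)` with whatever mail server and addresses apply.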
Reporting
After each protocol finishes, an HTML report is generated. This allows the user to evaluate protocol results manually before initiating the next step. There are ways to automate this further, such as automated band mapping of gel images, but I think that kind of thing will work better once the Transcriptic API settles down a bit. The HTML report also serves as a log of the experiment.
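A report of this kind does not need much machinery. The following is a minimal sketch, not my actual report code: it takes a title, a dictionary of summary fields, and a list of image paths (all hypothetical inputs) and returns a self-contained HTML page, escaping everything so that sequence names or error messages cannot break the markup.

```python
from html import escape
from typing import Dict, List


def render_report(title: str, fields: Dict[str, object], image_paths: List[str]) -> str:
    """Render a simple HTML report with a summary table and images."""
    rows = "\n".join(
        f"<tr><th>{escape(str(key))}</th><td>{escape(str(value))}</td></tr>"
        for key, value in fields.items()
    )
    images = "\n".join(
        f'<img src="{escape(path, quote=True)}" alt="{escape(path, quote=True)}">'
        for path in image_paths
    )
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        f"<head><meta charset=\"utf-8\"><title>{escape(title)}</title></head>\n"
        "<body>\n"
        f"<h1>{escape(title)}</h1>\n"
        f"<table>\n{rows}\n</table>\n"
        f"{images}\n"
        "</body>\n"
        "</html>\n"
    )
```

Writing the returned string to a file per protocol gives both the page to review before the next step and a permanent record of the run.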
Conclusions
There is still plenty to do before the pipeline is completely automatic. For example, attentive readers will notice that the HTML report above shows an unsuccessful transformation, one of many! The first complete transformation took several months to get right. The biggest challenge is making the process robust to changes in the protein sequence: even basic PCR can go wrong in many ways. Currently, debugging is a major undertaking because, unlike regular programming, iterations are slow and expensive. However, if the protocols can be made robust enough, which I think they can, then synthesizing a new protein could become as simple as running BLAST.