
It turns out that most research is not terribly reproducible, and that's a problem when you are trying to turn R into D, as Amgen and others have discovered. That's why reproducible research is currently a hot topic in science — so hot you can even take a course on it on Coursera.

Most reproducible research literature I've seen focuses on the data analysis component, which makes sense since this is the most straightforward place to start. Generally, this means that you package your code and data in such a way that others can rerun your experiment.

Robotic labs may also help with reproducibility. What is a robotic lab? To me, a robotic lab is a lab that is automated enough that protocols are defined in a machine-readable language. No research lab looks like this yet (with good reason), but some commercial labs (Transcriptic and Emerald Cloud Lab) are moving quickly in this direction. It's an awesome trend that could lead to a lot of changes in how science is done, including the reproducibility of research.


I'll ignore some of the current problems with robots, which are mainly around manual dexterity, skill and flexibility. For many experiments, these limitations will be too great to overcome for now, but historically, automation and mass production have eventually beaten manual labor, starting with the most boring and repeatable tasks.

The potential advantages of robotic labs are highly analogous to those of cloud computing:

  • high scalability
  • low cost (thanks to economies of scale and automation)
  • protocols can be transferred from lab to lab
  • reproducibility within and between labs

Research Papers

The current model of writing research papers dates back to Robert Boyle in the 17th century (at least according to a recent episode of In Our Time). A 21st century research paper could look pretty different:

  • Introduction: currently, these are usually just derivative summaries, so let's replace this with a Wikipedia link or one of a standardized set of introductions: "#34, why cancer research matters"
  • Methods: this is now a machine-readable protocol written in the Wolfram Language or YAML or Python
  • Results: analysis and results are in an IPython notebook, so you can see exactly what was done with the data and in what order
  • Conclusion: here we can go to town describing in free-text what happened in the experiment and why it matters
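To make the machine-readable Methods idea concrete, here is a minimal sketch in Python. Every operation name, field and value below is invented for illustration; a real cloud lab would define its own controlled vocabulary of instructions and units.

```python
# A minimal sketch of a machine-readable Methods section.
# All operation names and parameters are invented for illustration;
# a real robotic lab would define its own instruction vocabulary.
protocol = {
    "name": "example_pcr_genotyping",
    "steps": [
        {"op": "transfer", "source": "sample_plate/A1",
         "dest": "pcr_plate/A1", "volume_ul": 5},
        {"op": "thermocycle", "plate": "pcr_plate", "cycles": 30,
         "program": [{"temp_c": 95, "seconds": 30},
                     {"temp_c": 55, "seconds": 30},
                     {"temp_c": 72, "seconds": 60}]},
        {"op": "gel", "plate": "pcr_plate", "percent_agarose": 1.0},
    ],
}

def validate(protocol):
    """Sanity check: every step declares an operation name."""
    return all("op" in step for step in protocol["steps"])

print(validate(protocol))  # True
```

The point is not the specific format but that the protocol is data: it can be versioned, diffed, validated, and handed to another lab (or robot) verbatim.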

Maybe I am being too harsh on the introduction, but in general this model makes a lot of sense to me. (I recently learned that there is a company, Standard Analytics, trying to help people write papers this way).

I would guess that currently, most papers cannot be written this way, because the experimental technique is too temperamental to standardize, or the code to analyze the data is in flux, or the analysis only lives in an Excel spreadsheet, or a hundred other reasons. However, it's also possible you just shouldn't read these papers on the grounds that they are likely to contain errors.

Conclusion

It's unfortunate that even purely computational biology papers usually lack a simple way for the reader to run the code and reproduce the key results. Many of these papers could be written up as IPython notebooks instead of as papers. Of course, this is no way to get tenure, so only people who have nothing to prove, like Peter Norvig, do this.

(Some experiments rely on proprietary data, and so cannot provide full reproducibility. However, you can usually at least demonstrate that the method you devised works on simulated data...)
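As a toy illustration of that simulated-data idea (the "method" here is just a mean estimator, standing in for something more interesting): simulate data with a known ground truth, run the method, and show that it recovers the truth, all without touching the proprietary data.

```python
import random

random.seed(0)  # fixed seed so the demonstration itself is reproducible

# Stand-in for "your method": here, simply estimating a mean.
def estimate_mean(values):
    return sum(values) / len(values)

# Simulate data with a known ground truth, so readers can check that
# the method recovers it without access to any proprietary dataset.
true_mean = 5.0
simulated = [true_mean + random.gauss(0, 1) for _ in range(10_000)]

estimate = estimate_mean(simulated)
print(abs(estimate - true_mean) < 0.1)  # True
```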

One recent project that took a novel approach was the ENCODE project, where the authors published a virtual machine image along with the paper to help readers reproduce many of the key results. As VMs get simpler and more portable, we may see more of this too.


So you want to do some biological experiments, but you don't own a lab. Currently, unlike many other technology sectors, it is not easy to exchange money for a wet-lab experiment. This post is my interpretation of the evolution and latest developments in outsourcing wet-lab experiments.

Virtual Biotechs and CROs

Virtual Biotechs have been around for many years, but are becoming increasingly common. Basically, a Virtual Biotech is a biotech that outsources its research, usually to a Contract Research Organization (CRO). Virtual Biotechs are small: usually between one and twenty people, and they may not even have office space. You can tell it's cool to be a Virtual Biotech when you see a trend piece in the WSJ. Here's a description of one Virtual Biotech and a recent success story from the excellent Life Science VC blog.

There are a number of ways a Virtual Biotech can work. For example: a pharma exec sees value in a compound that the pharma has given up on, so he spins it out, raises money to pay for a Phase I or Phase II trial (conducted by a CRO), and if that is successful, he sells it back to the pharma for a profit. Everybody wins.

In other words, Virtual Biotech is often about asset arbitrage rather than research, since developing drugs from scratch is too expensive and slow, and preclinical work does not generate value quickly enough.

Interestingly, huge pharma companies are increasingly acting like Virtual Biotechs, in that they are either using CROs to conduct primary research, or buying up biotechs that already developed a promising compound. Less and less R&D is being done in Big Pharma due to a lack of productivity (see Eroom's Law).

Figuring out which CRO to use still seems to be a trial-and-error process, more like hiring a consultant than buying some compute on Amazon.

Web 2.0 CRO

Academic labs, core labs, CROs and even biotech companies often have excess lab capacity that they want to utilize. About 5-10 years ago, Assay Depot and Science Exchange launched with the intention of connecting that excess capacity with academic and biotech labs. Assay Depot and Science Exchange act as clearing houses, connecting you to labs that can perform experiments for you. They provide contact information and billing services but not too much else.

On the positive side, this is often very cost-effective and a great way to leverage the expertise of (for example) the UC Davis Mouse Biology Program, but on the downside the service provider is not necessarily set up to act as a CRO, and you will probably end up having to contract with several labs to get all your experiments done. If there is a time-critical step spread over two labs, like RNA sequencing, you may have a problem.


I think the major use-case here is for an academic lab that lacks the ability to do a certain type of experiment. Instead of finessing a collaboration with another lab, you just pay a small amount and get your results back fast.

Here Come the Robots

Starting very recently, there is an exciting new trend in wet-lab outsourcing: robots!

Two Bay Area companies, Transcriptic and Emerald Cloud Lab, decided that they could make the process of outsourced wet-lab research more efficient and more reproducible by using robots.

Transcriptic is already up and running, with competitive pricing on cloning, genotyping and biobanking. I believe that Transcriptic will already perform many other types of experiment upon request, and that their advertised experiments are just their foot in the door. Emerald Cloud Lab will be launching in 2015 with a large suite of services. It is currently in beta.


The advantages of doing experiments through a lab like this are tantalizing:

  • experiments can be cheaper and faster, with economies of scale and machines running all day and night
  • research becomes more reproducible, since the protocol is defined by a machine-readable script
  • you can scale up from one sample to a huge number, potentially without changing the protocol of your successful pilot experiment
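The scaling advantage can be sketched in a few lines of Python: the same protocol-building function (operation names invented for illustration) serves both a one-sample pilot and a full 384-sample run.

```python
def genotyping_protocol(sample_ids):
    """Build identical per-sample steps for any number of samples.

    Operation names here are invented for illustration; a real cloud
    lab would have its own instruction set.
    """
    steps = []
    for sample in sample_ids:
        steps.append({"op": "extract_dna", "sample": sample})
        steps.append({"op": "pcr", "sample": sample, "cycles": 30})
    return steps

pilot = genotyping_protocol(["s1"])                               # 1-sample pilot
full_run = genotyping_protocol([f"s{i}" for i in range(1, 385)])  # 384 samples
print(len(pilot), len(full_run))  # 2 768
```

Because the pilot and the full run come from the same function, there is no hand-transcription step where errors can creep in between small-scale and large-scale experiments.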

Interestingly, Emerald is using the Wolfram Language (very similar to Mathematica). I would prefer Python or something similar, but clearly the Wolfram Language has some great capabilities, and it seems to be highly expressive for data analysis (see the Wolfram Blog for some great examples).


Conclusion

We're clearly not yet in a world where companies and academic labs can just run all their experiments virtually (as has happened analogously with AWS, Azure, etc. for web companies), and of course there are many types of experiment that necessitate hands-on time and expertise (for example, developing a new protocol or technology).

However, there are also thousands of labs doing their own genotyping, cloning, sequencing, mass spec, etc. All of them are doing it slightly differently, and half of them are doing it worse than the median (and none of them think so)... We know that science is generally not very reproducible at the best of times (see Amgen's experiments reproducing 53 landmark cancer papers or Ioannidis' famous paper), so reducing experimental variation must be a good thing long-term. I really hope to see more competitors to Transcriptic and Emerald soon, and maybe even new approaches to defining and publishing experiments in a reproduction-friendly, machine-readable format.


Boolean Biotech © Brian Naughton Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More