
Modern biotech data infrastructure

Brian Naughton | Sat 30 October 2021 | biotech

So you are starting a biotech, and because it's 2021, you intend to generate a good amount of data. Maybe some genomics, proteomics, microscopy images, even video. You want to set up data infrastructure you will not regret later. This article has some thoughts on how to start.

Use Google Workspace or Office 365

OK this is pretty basic these days, but the important thing here is to exclude desktop applications as much as possible. Many biotechs still default to desktop Powerpoint, Excel, etc., but it seems crazy not to be able to reliably store documents past the tenure of a single employee, back them up easily, and search them. If someone leaves the company, they may well have files you need on their hard drive! The hard part is not so much having the online versions of these tools, which is almost the default now, but stopping people from using the desktop versions.

I haven't actually tried Office 365, but I believe the browser-based Powerpoint, Excel, etc. are pretty good. I have used Google Workspace (previously called G Suite) for many years, so I am used to the tools, and for me it is the natural choice.

I especially like Workspace's integration with the rest of Google Cloud. For example, if you create an email address for a new employee, then that email address can be automatically authenticated for any internal tools you have. You hardly have to think about authentication — a not-fun security problem — at all. (AWS also has single sign-on and it's probably not so hard to set up.)

Use the cloud (AWS or GCP) for everything

There are three major cloud providers: Amazon (AWS), Google (GCP), and Microsoft (Azure). Unless you are closely tied to the Microsoft ecosystem, I don't know of a good reason to use Azure.

GCP

  • great support for compute / ML, with TPUs and colab
  • integrates very nicely with Google Workspace
  • more modern and consistent API than AWS

AWS

  • more capabilities and tools than GCP
  • better support from the AWS team
  • the "default" cloud, so more tools support it or require it

GCP and AWS are both fine choices, and have more similarities than differences. In general, you do have to choose just one though, both for simplicity and because "egress" (moving data out) is one of the most expensive parts of using the cloud.

Put everything in buckets

I love buckets (S3 on AWS, Cloud Storage on GCP). In fact, this blog is served straight out of an S3 bucket.

Apart from managing servers, one of the most annoying things about scaling up data is running out of disk space. Once your data no longer fits on a disk, everything gets 10 times more complicated: you have to think about how to distribute data across disks, which adds huge administrative overhead. If you have ever had to shard data, I am sorry.

By comparison, buckets: scale to infinity, are at least 10X cheaper than disks ($0.001 to $0.02 per GB per month), have many more 9s of uptime than anything else in your stack, and basically just sit there serving data and never complaining.

There are two main downsides: accessing data over the network is never going to compete with a local disk, and buckets are basically a simple key–value store, searchable by only one thing: the name of the file (for example, the ID of an experiment). For many use-cases this is good enough, but if you want to search for other data within the file, you'll likely need a database too.

Tools like AWS's Athena or Google's BigQuery (essentially an SQL interface on top of files in buckets) may help, at least as a stop-gap, but they still have to read in every file for every search you do, so they are not a substitute for indexed/searchable data.

Even though you almost certainly need a database eventually, depending on how you are accessing data, buckets could take you a long way. For biotechs, accessing files by the ID of an experiment is pretty natural. Later, when you want to layer a database on top, it should be a natural progression to store the index-worthy data and the bucket URLs in the database.
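As a sketch of that progression, here is a minimal metadata index using Python's built-in sqlite3: searchable fields go in the database, raw data stays in the bucket, referenced by URL. The bucket name, schema, and IDs here are hypothetical.

```python
import sqlite3

# Index-worthy metadata lives in the database; raw data stays in the bucket.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE experiments (experiment_id TEXT PRIMARY KEY, "
    "assay TEXT, date TEXT, bucket_url TEXT)"
)
con.execute(
    "INSERT INTO experiments VALUES "
    "('EXP001', 'rnaseq', '2021-10-01', 's3://my-biotech-data/EXP001/counts.parquet')"
)

# Search by any indexed field, then fetch the raw file from the bucket URL.
(url,) = con.execute(
    "SELECT bucket_url FROM experiments WHERE assay = 'rnaseq'"
).fetchone()
print(url)  # s3://my-biotech-data/EXP001/counts.parquet
```

The nice property is that nothing about the bucket layout has to change when the database arrives: files keep their experiment-ID keys, and the database just points at them.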

Comparison of data storage options by @tsuname

Most files you store will either be raw data or tabular. For tabular data, parquet seems to be the safe choice these days. It has native support in pandas, Athena, BigQuery, and can be optimized for space or speed. SQLite files are likely much bulkier than parquet, but may shine when you have more complex queries. You can even do HTTP range requests on SQLite files in buckets! This method could be a huge win for some use-cases.

Which brings us on to "big data".

Big data: tech vs biotech

"Big data" as a term was a big thing for a while, and it still lingers as a concept. What is big though? "Big data" has famously been described as data that will not fit in an Excel file, which could be anything larger than a gigabyte.

Biotechs use all these data-heavy instruments like sequencers, so surely modern biotech requires big data? As a simple example, we can compare a hypothetical consumer-focused tech company to a data-driven biotech.

  • One key difference between a biotech and almost any tech company — even if they have the same amount of data — is how many people are accessing the data. It could be up to millions (or billions!) of users accessing data on a website, but at a biotech it can never be more than thousands of users, and more likely, tens.

  • A consumer website or app may need to respond to a request in a couple of seconds to retain a user, while a scientist at a biotech may be more willing to wait much longer for a batch of results.

  • Another important difference is how much the data can be compressed. Usually, at a biotech you are trying to make some inference based on data. For example, how are certain molecular features (gene expression, metabolism, cell structure) correlated with a treatment? The actual sequencing reads or pixels are usually not relevant. In contrast, for any website with text communication, you will need to retrieve the specific text written. A biotech can throw away (or archive) the raw reads from a sequencing run and keep the variants, but a tech company cannot throw away the raw text from a chat and keep the gist.

These differences have a big impact on how difficult it is to manage the data.

How many hard disks is that?

One thing that is pretty counter-intuitive to me is how big hard disks are these days. For example, a 16 TB hard drive now costs around $500. If you are like me, this doesn't really register as 1000 times bigger than a 16 GB hard drive.

So how big is 16 TB really?

  • Genomes: a human genome is 3 GB naively uncompressed, but less than 1 MB compressed; so you can theoretically store tens of millions of complete human genomes on one consumer hard disk.
  • Sequencing reads: a sequencing run from a large Illumina machine like a NovaSeq can produce terabytes of data, but these essentially always get collapsed into much smaller derived files (assemblies, variant files, transcript counts). The raw, archived reads can sit in a bucket (at a cost of about $1 per TB per month), or in most cases you could just delete the reads (though nobody actually does this!)
  • Images: obviously this depends on the instrument producing the images, and how much lossy compression is ok for your inference. Your images could be much larger than 1 MB, but trivially, you can store 16 million 1 MB images. That is a lot of experiments! I'm not an expert here, but I have not seen deep learning applied to really high resolution images, I think for practical reasons.
  • Assays: this includes biochemical assays, cytotoxicity assays, any in vivo assays, etc. These are negligible in size and will never fill a hard disk.

It may be useful to differentiate between the amount of data you need locally (e.g., in memory), and the amount of data you need to be able to access at short notice. If you want to access summary statistics, or derived data, instantly, but also be able to access the raw data from an experiment in a reasonable timeframe of seconds to hours (downloading from a fast or slow bucket), then you may not have big data problems at all. This is good news! It means you may not need a cluster, expensive RAID (NetApp, Isilon), IT staff, etc.

Relative scale of a GB, TB, PB, and EB. If you can see the GB, congrats, your monitor is very large or very clean.

Example data costs for biotechs

  • An old Recursion Tx blogpost states that they gather "65 terabytes of data per month". This would cost $650/month to store in a bucket for fast access (retrieval in milliseconds), or $65/month for archival access (retrieval "within 12 hours").

  • A 2020 blogpost on insitro says they "generate hundreds of terabytes (and eventually petabytes) of proprietary data specifically for training machine learning models". A hundred terabytes is 10 hard disks ($5,000 worth), or $1k-$10k/month in bucket storage.

The numbers above oversimplify the complex data problems I am sure these companies have, but these are also among the most data-hungry biotechs out there. For most biotechs, the numbers will be a tiny fraction of this.

The main point here is just to translate "65 terabytes" into dollars and cents, and put in a good word for buckets. For an unfair contrast, Apple stores "8 exabytes" of data with Google. That is eight million terabytes. I think it's useful to recognize the difference in scale.
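To make the translation into dollars explicit, here is the arithmetic behind the numbers above as a small Python sketch (the per-GB rates are the ones implied by this post, not quoted cloud prices):

```python
# Per-GB-per-month rates implied by the examples above.
STANDARD = 0.01   # "fast" bucket storage, millisecond retrieval
ARCHIVE = 0.001   # archival tier, retrieval "within 12 hours"

def monthly_cost_usd(terabytes: float, rate_per_gb: float) -> float:
    """Monthly bucket storage cost for a given number of terabytes."""
    return terabytes * 1000 * rate_per_gb

print(monthly_cost_usd(65, STANDARD))   # ~$650: Recursion's monthly data, fast
print(monthly_cost_usd(65, ARCHIVE))    # ~$65: the same data, archival
print(monthly_cost_usd(100, STANDARD))  # ~$1000 per 100 TB at insitro scale
```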

Run your compute on the cloud

Running compute on the cloud is almost certainly the right way to start, since it takes very little time to set up, and can scale up and down very quickly. However, it can be shockingly expensive compared to just buying computers.

For example, a hard disk you can buy for $500 costs >$600 a month to rent on the cloud (assuming $0.04 per GB per month for a 16 TB disk). That is not even an SSD!
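As a sanity check on that claim, a one-liner comparing buying versus renting, using the prices assumed above:

```python
# Buy a 16 TB disk outright, or rent the same capacity on the cloud.
BUY_PRICE = 500                  # one 16 TB consumer drive
RENT_PER_MONTH = 16_000 * 0.04   # 16,000 GB at $0.04/GB/month, ~$640

months_to_break_even = BUY_PRICE / RENT_PER_MONTH
print(round(months_to_break_even, 2))  # buying pays for itself in under a month
```

Of course this ignores redundancy, maintenance, and the person who has to plug the disk in, which is much of what the cloud premium buys you.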

GPU boxes are also extremely expensive, which can be limiting if you need to do a lot of deep learning. The counterargument here is that if you later need to train on 100 GPUs or TPUs, the cloud is the only practical way to do this.

Building a cluster of computers on site is a big task, requiring at least one IT person to administer, and it severely limits your ability to scale compute up and down. Depending on the location of your company, one IT person could cost >$300k a year in salary, benefits, etc. At a certain scale it may make sense to build a cluster, like Recursion's BioHive supercomputer, but not to start.

If you want to save money on cloud costs, one untested idea that I find very appealing is to just buy powerful desktop PCs for employees — a $3k gaming PC, a monster $20k deep learning box, or other Big Ass Server — and access them like a distributed cloud via cloudflare tunnels. Obviously shuttling data between machines can be difficult, but in some situations this could save six figures and this may suit, say, a lean deep-learning startup. Depending on how much data you have — and it could be a lot! — you could still plausibly rsync all the company's data to your local disk.

Tools for computation on the cloud

In the olden days, you would run computations with MPI and SGE or Slurm on a local cluster. Many (most?) universities still use these systems, because they still own their own clusters. For cost reasons, it could be a good idea for large institutions to do this. For startups it's neither realistic nor desirable.

Then there was an awkward intermediate stage with cloud computing. You could spin up clusters on the cloud, and tear them down when you were done, but you still had to manage a bunch of stuff with the servers yourself. For example, you might have to know how to set up hadoop, or later kubernetes. These are devops tasks, and really fiddly and annoying.

Things have thankfully changed pretty fast, and now you don't need to know much about the nuts and bolts of clusters and Docker pods to do distributed computing.

The options for scaling compute on the cloud today are summarized in this twitter poll:

Results of a twitter poll by @AlbertVilella

Next-gen tools

The next generation of tools should require even less experience with the command-line, server provisioning, and explicit scaling. Here are some examples of new models of hands-off cloud computing:

  • Google colab: These Google-managed jupyter notebooks are an amazing deal if you need access to a GPU or TPU every now and then. In particular the pro tier is a fantastic deal at $10/mo. (They must lose so much money on this.) There is zero setup and installation, so colabs are also a great way to deploy a tool to the community without setting up a server. A lean startup or lab could get amazing value here.

  • Domino Data Lab: Domino Data Lab is a very simple way to get access to compute on-demand, and it's good for analyzing and experimenting with data via jupyter. It is not designed for running routine workflows.

  • numpywren: This is a defunct project, built on pywren and limited in many ways, but very interesting, and maybe an indication of where things are going. The serverless compute concept got a boost recently now that lambda can mount EFS.

  • anyscale: This project looks very promising. The tagline here is "program your cluster as easily as your laptop". I think that captures it! It uses Ray on the backend, which is a good sign, I think. (If you are a pandas user, also check out Modin, aka Pandas on Ray.)

  • Dask Cloud Provider "This package provides classes for constructing and managing ephemeral Dask clusters on various cloud platforms". I have not tried this, but Dask is probably the most popular parallelization tool for Python/pandas.

  • polytope: The name of the project is no longer polytope, but I believe the plan is still hands-off compute on the cloud. The author, Erik Bernhardsson, also wrote an interesting blogpost on data tools and organization.

  • Temporal: "The open source microservice orchestration platform for writing durable workflows as code". I don't really know anything about this one, and it's not aimed at biotech data, but its focus on robust workflows seems interesting.

  • Athena for Genomics: There are a few projects like this from both Google and Amazon, but I'm not sure these use-cases make sense. In particular, this one will likely be very expensive for any routine tasks.

  • LatchBio: "Find and launch dozens of cloud workflows and visualize your biological data all within your browser" Latch is a new startup in the space, trying to accelerate data/ML infrastructure adoption for biotechs. It's currently no-code, so even more hands-off than the other tools listed here. (I am an investor in Latch).

Experimental lab data: ELN and LIMS

Up to now, I've mainly used the output of instruments like sequencers as my examples, since these are the largest source of raw data.

Apart from raw instrument data, there is also "experimental lab" data. This is usually a small amount of data, but more heterogeneous, interconnected, and difficult to organize. The world of lab data software is not well-defined, but at least two types of systems that come up are ELNs and LIMSs.

An ELN (electronic lab notebook) is just what it sounds like: a semi-structured data-store for experimental data. ELNs may also include common helper tools that scientists find useful in the lab (e.g., unit conversions).

A LIMS (laboratory information management system) is similar to an ELN, in that it helps log experimental data, but it's less of a personal notebook, and more like the software that runs lab operations. At a certain scale — and not very large — a biotech will need a LIMS to (a) manage the instrument data they are generating; (b) manage all the inventory in the lab; (c) manage lab workflows ("after process A, queue up for process B").

Despite how important LIMS are to biotech, I don't know how a biotech is supposed to get one up and running. Most of the modern LIMS I am aware of were developed in-house, despite the large amount of redundancy in functionality. Even today, most LIMS are developed by small companies with small software teams. Unfortunately, I don't know of any specific LIMS that works well for general use.

Modern experimental lab data tools

The modern tools I am aware of that overlap with ELNs and/or LIMSs include:

  • Benchling: a "life sciences R&D cloud". Benchling began as an ELN, and has gradually added LIMS functionality. It is the most popular tool listed here, probably by a wide margin.
  • Radix: an "operating system for your lab". Radix appears to have a hardware focus that sounds promising ("connect all hardware and software together"). I do worry that you also need boots on the ground, working directly with your machines, to make this integration work properly.
  • Synthace: an "R&D cloud" with a focus on experimental design and automation.
  • TetraScience: an "R&D Data Cloud for life sciences" with instrument integrations.
  • SciCord: "combines the compliance and structured aspects of a Laboratory Information Management System (LIMS) with the flexibility of an Electronic Laboratory Notebook (ELN)".
  • Elemental Machines: "Distills a lab full of data onto a single dashboard". Elemental Machines is more focused on sensors and data monitoring than the others on this list, and hence may complement an ELN or LIMS.

If you are wondering how these tools fit into your biotech, and what problems they specifically solve, well I am too. My impression is that most of these tools want to be a solution for data management and analysis, but without promising specific LIMS functionality, reflecting the breadth of the problem. LIMS means interfacing with devices, and that is often slow, custom work. You may need to just custom build your system, perhaps with help from a consulting group like Monomer Bio.

This 2019 Nature news item also discusses this topic — including remarks from Radix and Synthace — but from the perspective of cloud labs, which is a whole separate topic.

Compliance, traceability, etc.

There are also tools like Veeva that are more focused on later stage biotech problems like compliance, traceability, manufacturing, marketing, etc. These tools are out of scope for early-stage biotechs.

This is getting pretty long

The point of this post was supposed to be listing some technology choices I think would make sense for a new, data-driven biotech startup. So, here are a few loose conclusions:

  • Start out with all your data in one place on the cloud, and disallow storing files on laptops.
  • Use the cloud for all compute and storage, especially buckets where possible.
  • Put off doing anything with servers, databases — and especially multiples of these — as long as you can.
  • One big hard drive and a big CPU can store and process more data than you might think, possibly even all of your data. It's unlikely you have more than terabytes of locally-useful data. (Raw data should live in buckets!)
  • Try out all the available data tools and platforms before building anything yourself — maintenance asymptotes at 100% of the total work required.
  • It's easy to trial a couple of ELNs and pick your favorite but I have no idea what you are supposed to do about LIMS. If anyone figures this out I'd love to know.

It would be useful to have some best practices out there for biotech data (or even more generally, hard tech data). Unfortunately, the tech end of biotech is much smaller and more insular (read, IP-focused) than regular tech, so there is much less material around on what's going on inside companies in the space. In general, everyone is building their own thing, and mostly starting from scratch, incurring huge inefficiencies. I hope that the rapidly increasing number of "techbio" companies will improve this (e.g., EQRx had a recent video on their data stack), and we'll soon commoditize the nuts-and-bolts data problems and put the focus back on the biological problems.

