The Sequencing Gap

Brian Naughton | Sun 11 November 2018 | sequencing | biotech sequencing dna

I took a look at the data in Albert Vilella's very useful NGS specs spreadsheet using Google's slick Colab notebooks. (If you have yet to try Colab, it's worth a look.)

Doing this in colab was a bit trickier than normal, so I include the code here for reference.

First, I need the gspread library to read Google Sheets data, and the ID of the sheet itself.

!pip install --upgrade -q gspread

sheet_id = "1GMMfhyLK0-q8XkIo3YxlWaZA5vVMuhU1kg41g4xLkXc"

Then I authorize myself with Google (a bit awkward but it works).

from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default

# oauth2client is deprecated; recent gspread releases expect google-auth credentials
creds, _ = default()
gc = gspread.authorize(creds)

I parse the data into a pandas DataFrame.

sheet = gc.open_by_key(sheet_id)

import pandas as pd
rows = sheet.worksheet("T").get_all_values()
df = pd.DataFrame.from_records([r[:10] for r in rows if r[3] != ''])

I have to clean up the data a bit so that all the sequencing rates are Gb/day numbers: placeholder entries are dropped, and a range is replaced by its midpoint.

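import re
# promote the first row to column names, then keep only the rate column
dfr = (df.rename(columns=df.iloc[0]).drop(index=0)
         .rename(columns={"Rate: (Gb/d) ": "Rate: (Gb/d)"})
         .set_index("Platform")["Rate: (Gb/d)"])
# drop placeholder entries
dfr = dfr[(dfr != "--") & (dfr != "TBC")].copy()
for n, val in enumerate(dfr):
  if "-" in val:
    # a range such as "10-20": replace it with the midpoint
    rg = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", val).groups()
    # the Series is indexed by platform name, so assign by position
    dfr.iloc[n] = (float(rg[0]) + float(rg[1])) / 2
dfr = pd.DataFrame(dfr.astype(float)).reset_index()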
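The clean-up step can also be written as a small function that is easy to check on its own. This is a minimal sketch, not the code used for the plot: the sample values and platform names below are made up, and `to_rate` is a hypothetical helper name.

```python
import re

import pandas as pd

# Hypothetical sample values, not taken from the spreadsheet.
raw = pd.Series(
    ["6000", "10-20", "1.5-3", "--", "TBC"],
    index=["Platform A", "Platform B", "Platform C", "Platform D", "Platform E"],
)

# Matches "low-high" where each side is an integer or a decimal.
RANGE = re.compile(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)")


def to_rate(val):
    """Return a float rate, the midpoint of a 'low-high' range, or None."""
    m = RANGE.fullmatch(val.strip())
    if m:
        low, high = map(float, m.groups())
        return (low + high) / 2
    try:
        return float(val)
    except ValueError:
        return None  # placeholders such as "--" or "TBC"


# Placeholders become missing values and are dropped.
rates = raw.map(to_rate).dropna().astype(float)
print(rates)
```

Keeping the parsing in one function means a new placeholder string in the sheet ends up as a dropped row rather than an exception halfway through the loop.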

I tacked on some data I think is representative of Sanger throughput, if not 100% comparable to the NGS data.

A large ABI 3730xl can apparently output up to 1-2 Mb of data a day in total (across thousands of samples). A lower-throughput ABI SeqStudio can output 1-100 kb a day (maybe more). I use 1 Mb/day (0.001 Gb/day) and 100 kb/day (0.0001 Gb/day) respectively.

dfr_x = pd.concat([dfr, 
                   pd.DataFrame.from_records([{"Platform":"ABI 3730xl", "Rate: (Gb/d)":.001}, 
                                              {"Platform": "ABI SeqStudio", "Rate: (Gb/d)":.0001}])])

dfr_x["Rate: (Mb/d)"] = dfr_x["Rate: (Gb/d)"] * 1000

If I plot the data, there's a pretty striking three-orders-of-magnitude gap between 1 Mb/day and 1 Gb/day. Maybe there's not enough demand for this range, but I think it's actually just an artifact of how these technologies evolved, and especially how quickly Illumina's technology scaled up.

import seaborn as sns
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(16,8))
fax = sns.stripplot(data=dfr_x, y="Platform", x="Rate: (Mb/d)", size=8, ax=ax);
fax.set(xscale="log");
fax.set(xlim=(.01, None));

[Figure: sequencing gap plot. Platform on the y-axis, rate in Mb/d on a log-scaled x-axis.]
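The size of the gap can also be checked numerically by sorting the rates and looking for the widest jump between neighbours on a log scale. A minimal sketch, assuming illustrative values: the list below is made up to resemble the shape of the data (Sanger-scale and NGS-scale outputs), and `largest_log_gap` is a hypothetical helper.

```python
import math

# Illustrative daily outputs in Mb/d; hypothetical values, not the spreadsheet data.
rates_mb = [0.1, 1.0, 1_000.0, 20_000.0, 6_000_000.0]


def largest_log_gap(values):
    """Return (low, high, gap) for the widest jump between sorted neighbours.

    The gap is measured in orders of magnitude (log10 of the ratio).
    """
    ordered = sorted(values)
    pairs = zip(ordered, ordered[1:])
    low, high = max(pairs, key=lambda p: math.log10(p[1] / p[0]))
    return low, high, math.log10(high / low)


low, high, gap = largest_log_gap(rates_mb)
print(f"largest gap: {low} to {high} Mb/d ({gap:.1f} orders of magnitude)")
```

With the real data, the same function could be applied to `dfr_x["Rate: (Mb/d)"]`.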

Getting a single 1 kb sequencing reaction done by a service in a day for a couple of dollars is easy, so the very low-throughput end is pretty well catered for.

However, if you are a small lab or biotech doing any of:

microbial genomics: low or high coverage WGS
synthetic biology: high coverage plasmid sequencing
disease surveillance: pathogen detection, assembly
human genetics: HLA sequencing, immune repertoire sequencing, PGx or other panels
CRISPR edits: validating your edit, checking for large deletions

you could probably use a few megabases of sequence now and then without having to multiplex 96X.
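As a rough back-of-the-envelope check on "a few megabases", the amount of sequence needed is just target size times coverage times number of samples. The sketch below uses illustrative sizes and coverages that I am assuming for the sake of the arithmetic; they are not measurements or recommendations, and `megabases_needed` is a hypothetical helper.

```python
def megabases_needed(target_size_bp, coverage, n_samples=1):
    """Total sequence required, in megabases."""
    return target_size_bp * coverage * n_samples / 1e6


# Illustrative scenarios; every size and coverage here is an assumption.
examples = {
    "one 10 kb plasmid at 100x": megabases_needed(10_000, 100),
    "five 1 kb amplicons at 1000x": megabases_needed(1_000, 1_000, n_samples=5),
    "one 5 Mb microbial genome at 10x": megabases_needed(5_000_000, 10),
}

for name, mb in examples.items():
    print(f"{name}: {mb:g} Mb")
```

Under these assumptions the answers run from about 1 Mb to a few tens of Mb, which sits inside the gap rather than at either end of it.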

If it's cheap enough, I think this is an interesting market that Oxford Nanopore's new Flongle can take on, and for now there's no competition at all.
