# The Sequencing Gap

Sun 11 November 2018 // Filed under sequencing //

I took a look at the data in Albert Vilella's very useful NGS specs spreadsheet using Google's slick colab notebook. (If you have yet to try colab it's worth a look.)

Doing this in colab was a bit trickier than normal, so I include the code here for reference.

First, I need the gspread lib to parse google sheets data, and the id of the sheet itself.

```
!pip install --upgrade -q gspread
```

```
sheet_id = "1GMMfhyLK0-q8XkIo3YxlWaZA5vVMuhU1kg41g4xLkXc"
```

Then I authorize myself with Google and create a gspread client (a bit awkward, but it works).

```
from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials
gc = gspread.authorize(GoogleCredentials.get_application_default())
```

I parse the data into a pandas DataFrame.

```
sheet = gc.open_by_key(sheet_id)

import pandas as pd
rows = sheet.worksheet("T").get_all_values()
# Keep the first ten columns and skip completely empty rows
df = pd.DataFrame.from_records([r[:10] for r in rows if any(r)])
```

I have to clean up the data a bit so that all the sequencing rates are numeric Gb/day values: some entries are missing ("--" or "TBC") and some are ranges, which I replace with their midpoints.

```
import re

# Promote the first row to column headers, fix a stray trailing space in
# the rate column name, and keep just the rate indexed by platform
dfr = (df.rename(columns=df.iloc[0])
         .drop(index=0)
         .rename(columns={"Rate: (Gb/d) ": "Rate: (Gb/d)"})
         .set_index("Platform")["Rate: (Gb/d)"])
dfr = dfr[(dfr != "--") & (dfr != "TBC")]

# Replace ranges like "100-200" with their midpoint
for n, val in enumerate(dfr):
    if "-" in val:
        lo, hi = re.search(r"([\d.]+)\s*-\s*([\d.]+)", val).groups()
        dfr.iloc[n] = (float(lo) + float(hi)) / 2
dfr = pd.DataFrame(dfr.astype(float)).reset_index()
```

I tacked on some data that I think is representative of Sanger throughput, even if it's not 100% comparable to the NGS data.

A large ABI 3730xl can apparently output up to 1-2 Mb of data a day in total (across thousands of samples). A lower-throughput ABI SeqStudio can output 1-100 kb (maybe more).

```
dfr_x = pd.concat([dfr,
                   pd.DataFrame.from_records([
                       {"Platform": "ABI 3730xl", "Rate: (Gb/d)": .001},
                       {"Platform": "ABI SeqStudio", "Rate: (Gb/d)": .0001}])])

dfr_x["Rate: (Mb/d)"] = dfr_x["Rate: (Gb/d)"] * 1000
```

If I plot the data there's a pretty striking three-orders-of-magnitude gap between 1 Mb/day and 1 Gb/day. Maybe there's not enough demand for this range, but I think it's actually just an artifact of how these technologies evolved, and especially of how quickly Illumina's technology scaled up.

```
import seaborn as sns
import matplotlib.pyplot as plt

f, ax = plt.subplots(figsize=(16, 8))
fax = sns.stripplot(data=dfr_x, y="Platform", x="Rate: (Mb/d)", size=8, ax=ax)
fax.set(xscale="log")
fax.set(xlim=(.01, None))
```

Getting a single 1 kb sequencing reaction done by a service in a day for a couple of dollars is easy, so the very low-throughput end is pretty well catered for.
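To put that "couple of dollars" in perspective, here's the implied cost of buying megabase-scale data one Sanger reaction at a time. The $2-per-read figure is just my reading of the sentence above, not a quoted price.

```python
# Implied Sanger service cost per megabase, assuming ~$2 per ~1 kb read
# (an illustrative figure, not a real quote)
cost_per_read_usd = 2.0
read_length_kb = 1.0

reads_per_mb = 1000 / read_length_kb  # 1 Mb = 1000 kb
cost_per_mb = cost_per_read_usd * reads_per_mb

print(f"~${cost_per_mb:,.0f} per Mb")  # → ~$2,000 per Mb
```

Fine for a single confirmation read, but clearly not how you'd want to buy a few megabases.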

However, if you are a small lab or biotech doing any of:

- microbial genomics: low or high coverage WGS
- synthetic biology: high coverage plasmid sequencing
- disease surveillance: pathogen detection, assembly
- human genetics: HLA sequencing, immune repertoire sequencing, PGx or other panels
- CRISPR edits: validating your edit, checking for large deletions

you could probably use a few megabases of sequence now and then without having to multiplex 96X.
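A quick back-of-envelope sketch of what "a few megabases" means for some of these use cases. The genome sizes, read counts, and coverages below are illustrative assumptions of mine, not figures from the spreadsheet.

```python
# Rough daily throughput needed for a few of the use cases above,
# in Mb/day (sizes and coverages are illustrative assumptions)
use_cases = {
    "microbial WGS (5 Mb genome, 30x)":      5 * 30,            # Mb
    "plasmid sequencing (96 x 10 kb, 100x)": 96 * 0.01 * 100,   # Mb
    "amplicon panel (500 x 1 kb, 50x)":      500 * 0.001 * 50,  # Mb
}

for name, mb in use_cases.items():
    print(f"{name}: ~{mb:g} Mb/day")
```

All of these land in the tens-to-hundreds-of-megabases range, i.e. squarely inside the 1 Mb-1 Gb gap in the plot.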

If it's cheap enough, I think this is an interesting market that Nanopore's new Flongle can take on, and for now there's no competition at all.