Brian Naughton // Sun 11 November 2018 // Filed under sequencing // Tags biotech sequencing dna

I took a look at the data in Albert Vilella's very useful NGS specs spreadsheet using Google's slick colab notebook. (If you have yet to try colab it's worth a look.)

Doing this in colab was a bit trickier than normal, so I include the code here for reference.

First, I need the gspread library to parse Google Sheets data, and the ID of the sheet itself.

!pip install --upgrade -q gspread
sheet_id = "1GMMfhyLK0-q8XkIo3YxlWaZA5vVMuhU1kg41g4xLkXc"

Then I authorize myself with Google (a bit awkward but it works).

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

I parse the data into a pandas DataFrame.

sheet = gc.open_by_key(sheet_id)

import pandas as pd
rows = sheet.worksheet("T").get_all_values()
df = pd.DataFrame.from_records([r[:10] for r in rows if r[3] != ''])

I have to clean up the data a bit so that all the sequencing rates are Gb/day numbers.

import re
dfr = df.rename(columns=df.iloc[0]).drop(index=0).rename(columns={"Rate: (Gb/d) ":"Rate: (Gb/d)"}).set_index("Platform")["Rate: (Gb/d)"]
dfr = dfr[(dfr != "--") & (dfr != "TBC")]
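for n, val in enumerate(dfr):
  if "-" in val:
    # A range like "10-20": replace it with the midpoint of the two numbers
    rg = re.search(r"(\d+).(\d+)", val).groups()
    val = (float(rg[0]) + float(rg[1])) / 2
    dfr.iloc[n] = val  # positional assignment, since the index holds Platform names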
dfr = pd.DataFrame(dfr.astype(float)).reset_index()
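
The range-to-midpoint step can also be pulled out into a small standalone function, which makes it easier to check on its own. This is only a sketch: `rate_to_float` is a hypothetical helper, not part of the original notebook, and it assumes cells are either a single number or a "low-high" range (it also accepts decimals, which the regex above does not).

```python
import re

def rate_to_float(val):
    """Convert a rate cell such as "48" or "10-20" to a float.

    Hypothetical helper: a "low-high" range becomes its midpoint.
    """
    val = val.strip()
    # Two numbers (integer or decimal) separated by a hyphen
    m = re.fullmatch(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", val)
    if m:
        low, high = (float(g) for g in m.groups())
        return (low + high) / 2
    return float(val)

print(rate_to_float("10-20"))    # 15.0
print(rate_to_float("0.5-1.5"))  # 1.0
print(rate_to_float("48"))       # 48.0
```

With pandas, something like this could be applied to the whole column with `dfr.map(rate_to_float)` after the "--" and "TBC" entries have been filtered out.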

I tacked on some data I think is representative of Sanger throughput, if not 100% comparable to the NGS data.

A large ABI 3730XL can apparently output up to 1-2 Mb of data a day in total (across thousands of samples). A lower-throughput ABI SeqStudio can output 1-100kb (maybe more).

dfr_x = pd.concat([dfr, 
                   pd.DataFrame.from_records([{"Platform":"ABI 3730xl", "Rate: (Gb/d)":.001}, 
                                              {"Platform": "ABI SeqStudio", "Rate: (Gb/d)":.0001}])])

dfr_x["Rate: (Mb/d)"] = dfr_x["Rate: (Gb/d)"] * 1000

If I plot the data there's a pretty striking, three-orders-of-magnitude gap from 1Mb-1Gb. Maybe there's not enough demand for this range, but I think it's actually just an artifact of how these technologies evolved, and especially how quickly Illumina's technology scaled up.

import seaborn as sns
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(16,8))
fax = sns.stripplot(data=dfr_x, y="Platform", x="Rate: (Mb/d)", size=8, ax=ax);
fax.set(xlim=(.01, None));

[Figure: sequencing gap plot]
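
The gap can also be measured rather than eyeballed by sorting the throughputs and finding the widest jump on a log10 scale. The sketch below uses illustrative numbers: only the two Sanger values come from this post, and the two "hypothetical NGS" entries are placeholders, not values from the spreadsheet.

```python
import math

# Throughputs in Mb/day. Only the two Sanger values come from the post;
# the "hypothetical NGS" entries are made-up placeholders.
rates_mb_per_day = {
    "ABI SeqStudio": 0.1,
    "ABI 3730xl": 1.0,
    "hypothetical NGS A": 1_000.0,
    "hypothetical NGS B": 50_000.0,
}

def largest_log_gap(rates):
    """Return (orders_of_magnitude, lower_name, upper_name) for the widest gap."""
    ordered = sorted(rates.items(), key=lambda kv: kv[1])
    best = (0.0, None, None)
    # Compare each platform with the next-fastest one
    for (lo_name, lo), (hi_name, hi) in zip(ordered, ordered[1:]):
        gap = math.log10(hi / lo)
        if gap > best[0]:
            best = (gap, lo_name, hi_name)
    return best

gap, lower, upper = largest_log_gap(rates_mb_per_day)
print(f"{gap:.1f} orders of magnitude between {lower} and {upper}")
```

Run on the real `dfr_x` column instead of the placeholder dictionary, the same function would report where the gap in the plot actually sits.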

Getting a single 1kb sequencing reaction done by a service in a day for a couple of dollars is easy, so the very low throughput end is pretty well catered for.

However, if you are a small lab or biotech doing any of:

  • microbial genomics: low or high coverage WGS
  • synthetic biology: high coverage plasmid sequencing
  • disease surveillance: pathogen detection, assembly
  • human genetics: HLA sequencing, immune repertoire sequencing, PGx or other panels
  • CRISPR edits: validating your edit, checking for large deletions

you could probably use a few megabases of sequence now and then without having to multiplex 96X.

If it's cheap enough, I think this is an interesting market that Nanopore's new Flongle can take on, and for now there's no competition at all.

Brian Naughton // Thu 14 May 2015 // Filed under genomics // Tags sequencing nanopore minion

The unveiling of the MinION MkII at the 2015 Oxford Nanopore Conference may be remembered as a very big deal in the history of genomics. A number of tweets even compared it to the 2007 iPhone unveiling. Of course, that's crazy — it's a much bigger deal than a simple Candy Crush vehicle.

I look forward to reading accounts from people who actually attended the conference, but I wanted to assemble what I learned from the #nanoporeconf tweets and this great genomeweb article.


The major news at the conference was the announcement of a new MinION sequencer, the MkII, replacing the MkI, itself only available as part of an early-access program.

The MkII has a new "fast mode" that's about 10X as fast as before, and the output is now an impressively Illumina-like 5 Gb per hour. The error rates are also apparently improved.

Apart from its size, one of the most interesting things about the tiny MinION is that it streams sequence in real-time to your computer (I think it even needs USB 3). That means you can do things like sequence until you find what you are looking for, then stop, or, perhaps in the future, leave it on all the time, monitoring your sewer system etc.

Because of how the machine works, you are billed by the hour, like an AWS instance. The new price is $20 an hour, down from about $90 per hour for the MkI. At that price you really start to think about sequencing in a new way. This is the most mind-blowing aspect of this machine to me, and I think it will radically change how sequencing will be used, especially outside core human genomics applications.
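
To get a feel for what hourly billing means, here is some back-of-the-envelope arithmetic in Python. It takes the quoted throughput as 5 gigabases per hour, and it assumes the MkI ran at one tenth of that (reading "10X as fast" as a throughput figure), which the announcement may not have meant. `cost_per_gb` is just an illustrative helper.

```python
def cost_per_gb(price_per_hour, gb_per_hour):
    """Dollars per gigabase at a given hourly price and throughput."""
    return price_per_hour / gb_per_hour

# MkII: $20/hour at a quoted 5 Gb/hour
mkii = cost_per_gb(20, 5)

# MkI: about $90/hour; throughput ASSUMED to be one tenth of the MkII's
mki = cost_per_gb(90, 5 / 10)

print(f"MkII: ${mkii:.0f} per Gb")  # MkII: $4 per Gb
print(f"MkI:  ${mki:.0f} per Gb")   # MkI:  $180 per Gb
```

Under those assumptions the price per gigabase falls by a factor of about 45, not just the 4.5X drop in the hourly rate.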


VolTRAX is a lab-on-a-chip that attaches directly to your MinION and does sample prep for you. That allows you to sequence from samples while you're out and about. It's programmable in Python (an excellent choice). Sadly, it's not out yet and they didn't give a timeline for it.

What's MinION for?

Although it's an amazing little machine, MinION doesn't compete with MiSeq/HiSeq for typical human genomics applications, mainly because the error rate is still very high. For the MkI the error rate is apparently up to 30% (!). For the MkII there is talk of getting it down to 5%, which is still very high compared to Illumina.

The major applications I've seen proposed so far are around:

  • genome scaffolding (like PacBio; you just need long reads)
  • pathogen/environmental sequencing (like @pathogenomenick sequencing Ebola in Guinea; here you need something fast and portable, and long reads help)
  • sequencing messy parts of the genome (like HLA and CYP2D6, previously requiring Sanger sequencing or other difficult methods; here you need long and reasonably accurate reads)


This is a bigger, benchtop instrument with higher throughput than the MinION (6.4 terabases per run). It's less obvious to me how this machine will be used, but it's extremely impressive throughput.

Brian Naughton // Tue 11 November 2014 // Filed under data // Tags bayesian pymc sequencing

I recently watched a couple of videos about the new PyMC (PyMC3), and they got me pretty excited about it (one from Thomas Wiecki and one from Chris Fonnesbeck, both core developers.)

I've used PyMC2 a little bit, but this new version is a complete rewrite. It now uses Theano as a backend, which helps with computing gradients, but it also means PyMC3 is now pure Python, where previously it had a bunch of Fortran. It also seems to be better integrated with the current state-of-the-art tools, like pandas and scikit-learn.

PyMC3 is alpha software, so it has bugs and little documentation, but it's nice to get some exposure to Theano and NUTS. (My favorite Python distribution, Anaconda, includes Theano by default.)

At the same time, I was playing around with some example sequencing data from the pRESTO toolkit, so I thought I would try to apply PyMC3 to these data.

Every sequencing read gives you a DNA sequence, but also an estimate of the error for every nucleotide. The sequencing read files are in fastq format, which means that the quality information is encoded as a line of ASCII characters, one character per nucleotide.


It looks awful, but I guess if you really need to use a text file that's how it has to work. You can map these characters onto a quality (Phred) score in Python with a simple ord(c)-33. You can then map a score v onto an error probability for that nucleotide by taking 10^(-v/10); summing those probabilities over a read gives its expected number of errors.
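
As a minimal sketch of that decoding, assuming the Phred+33 offset described above: the quality string here is made up for illustration and is not taken from the pRESTO data, and the function names are my own.

```python
def phred_scores(quality_line, offset=33):
    """Map each fastq quality character to its Phred score (Phred+33 assumed)."""
    return [ord(c) - offset for c in quality_line]

def error_probabilities(scores):
    """Map each Phred score v to an error probability 10^(-v/10)."""
    return [10 ** (-v / 10) for v in scores]

quality_line = "II5+!"  # made-up example string, one character per nucleotide

scores = phred_scores(quality_line)
probs = error_probabilities(scores)
expected_errors = sum(probs)  # expected number of wrong bases in the read

print(scores)  # [40, 40, 20, 10, 0]
print(f"{expected_errors:.4f}")  # 1.1102
```

A score of 40 means a 1 in 10,000 chance the base is wrong, while a score of 0 means the base call carries no information at all, so a single "!" contributes a full expected error.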

I decided to look at read quality as a function of read-length using PyMC3.


Boolean Biotech © Brian Naughton Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More