Brian Naughton // Thu 14 May 2015 // Filed under genomics // Tags sequencing nanopore minion

The unveiling of the MinION MkII at the 2015 Oxford Nanopore Conference may be remembered as a very big deal in the history of genomics. A number of tweets even compared it to the 2007 iPhone unveiling. Of course, that's crazy — it's a much bigger deal than a simple Candy Crush vehicle.

I look forward to reading accounts from people who actually attended the conference, but I wanted to assemble what I learned from the #nanoporeconf tweets and this great GenomeWeb article.


The major news at the conference was the announcement of a new MinION sequencer, the MkII, replacing the MkI, itself only available as part of an early-access program.

The MkII has a new "fast mode" that's about 10X as fast as before and the output is now an impressively Illumina-like 5GB per hour. The error rates are also apparently improved.

Apart from its size, one of the most interesting things about the tiny MinION is that it streams sequence in real-time to your computer (I think it even needs USB 3). That means you can do things like sequence until you find what you are looking for, then stop, or, perhaps in the future, leave it on all the time, monitoring your sewer system etc.

Because of how the machine works, you are billed by the hour, like an AWS instance. The new price is $20 an hour, down from about $90 per hour for the MkI. At that price you really start to think about sequencing in a new way. This is the most mind-blowing aspect of this machine to me, and I think it will radically change how sequencing will be used, especially outside core human genomics applications.
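To make the hourly pricing concrete, here is a back-of-the-envelope sketch using only the figures quoted above. The function names are made up for illustration, the numbers are conference-tweet figures rather than a price list, and I'm reading "5GB per hour" as gigabases, which is an assumption.

```python
# Figures quoted in this post; treat them as rough assumptions.
MKII_PRICE_PER_HOUR = 20.0    # USD per hour, new MkII price
MKI_PRICE_PER_HOUR = 90.0     # USD per hour, approximate MkI price
MKII_OUTPUT_PER_HOUR = 5.0    # "5GB per hour", read here as gigabases (assumption)


def run_cost(hours, price_per_hour=MKII_PRICE_PER_HOUR):
    """Cost in USD of leaving the sequencer running for `hours`."""
    return hours * price_per_hour


def cost_per_gigabase(price_per_hour=MKII_PRICE_PER_HOUR,
                      output_per_hour=MKII_OUTPUT_PER_HOUR):
    """Cost in USD per gigabase at a steady hourly output."""
    return price_per_hour / output_per_hour


if __name__ == "__main__":
    print("3 hours on a MkII:", run_cost(3))
    print("3 hours on a MkI: ", run_cost(3, MKI_PRICE_PER_HOUR))
    print("MkII cost per gigabase:", cost_per_gigabase())
```

The point of the hourly model is that stopping early saves money, so the cost of an experiment depends on how quickly you find what you are looking for rather than on a fixed per-run fee.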


VolTRAX is a lab-on-a-chip that attaches directly to your MinION and does sample prep for you. That allows you to sequence from samples while you're out and about. It's programmable in Python (an excellent choice). Sadly, it's not out yet and they didn't give a timeline for it.

What's MinION for?

Although it's an amazing little machine, MinION doesn't compete with MiSeq/HiSeq for typical human genomics applications, mainly because the error rate is still very high. For the MkI, the error rate is apparently up to 30% (!). For the MkII, there is talk of getting it down to 5%, which is still very high compared to Illumina.

The major applications I've seen proposed so far are around: genome scaffolding (like PacBio — you just need long reads); pathogen/environmental sequencing (like @pathogenomenick sequencing Ebola in Guinea — here you need something fast and portable, and long reads help); sequencing messy parts of the genome (like HLA and CYP2D6, previously requiring Sanger sequencing or other difficult methods — here you need long and reasonably accurate reads).


PromethION is a bigger, benchtop instrument with much higher throughput than the MinION (6.4 terabases per run). It's less obvious to me how this machine will be used, but the throughput is extremely impressive.

Brian Naughton // Tue 11 November 2014 // Filed under data // Tags bayesian pymc sequencing

I recently watched a couple of videos about the new PyMC (PyMC3), and they got me pretty excited about it (one from Thomas Wiecki and one from Chris Fonnesbeck, both core developers.)

I've used PyMC2 a little bit, but this new version is a complete rewrite. It now uses Theano as a backend, which helps with computing gradients, but it also means PyMC3 is now pure Python, where previously it had a bunch of Fortran. It also seems to be better integrated with the current state-of-the-art tools, like pandas and scikit-learn.

PyMC3 is alpha software, so it has bugs and little documentation, but it's nice to get some exposure to Theano and NUTS, the No-U-Turn Sampler (my favorite Python distribution, Anaconda, includes Theano by default).

At the same time, I was playing around with some example sequencing data from the pRESTO toolkit, so I thought I would try to apply PyMC3 to these data.

Every sequencing read gives you a DNA sequence, but also an estimate of the error for every nucleotide. The sequencing read files are in fastq format, which means that the quality information is encoded like this:


It looks awful, but I guess if you really need to use a text file that's how it has to work. You can map these characters onto a quality (Phred) score in Python with a simple ord(c)-33. You can then map that score v onto a per-base error probability by taking 10^(-v/10); summing those probabilities over a read gives its expected number of errors.
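As a minimal sketch of that conversion: the function names and the example quality string below are made up for illustration (they are not from the pRESTO data), and the offset of 33 assumes the Phred+33 encoding implied by ord(c)-33.

```python
def phred_scores(quality_string, offset=33):
    """Convert a fastq quality string into a list of Phred scores."""
    return [ord(c) - offset for c in quality_string]


def error_probabilities(quality_string, offset=33):
    """Per-base error probability, 10^(-v/10), for each Phred score v."""
    return [10 ** (-v / 10) for v in phred_scores(quality_string, offset)]


def expected_errors(quality_string, offset=33):
    """Expected number of errors in a read: the sum of per-base probabilities."""
    return sum(error_probabilities(quality_string, offset))


if __name__ == "__main__":
    quals = "!+5I"  # hypothetical quality string, not real data
    print(phred_scores(quals))         # Phred scores 0, 10, 20, 40
    print(error_probabilities(quals))  # roughly 1, 0.1, 0.01, 0.0001
    print(expected_errors(quals))      # roughly 1.11 expected errors
```

A Phred score of 0 means the base call is no better than a guess, while 40 means about a 1 in 10,000 chance of error, so a handful of low-quality bases dominates the expected error count for a read.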

I decided to look at read quality as a function of read-length using PyMC3.

