Analyzing Sequencing Read Errors with PyMC3

Brian Naughton // Tue 11 November 2014 // Filed under data // Tags bayesian pymc sequencing

I recently watched a couple of videos about the new PyMC (PyMC3), and they got me pretty excited about it (one from Thomas Wiecki and one from Chris Fonnesbeck, both core developers.)

I've used PyMC2 a little bit, but this new version is a complete rewrite. It now uses Theano as a backend, which helps with computing gradients, but it also means PyMC3 is now pure Python, where previously it had a bunch of Fortran. It also seems to be better integrated with the current state-of-the-art tools, like pandas and scikit.

PyMC 3 is alpha software, so it has bugs and little documentation, but it's nice to get some exposure to Theano and NUTS (my favorite Python distribution, Anaconda, includes Theano by default.)

At the same time, I've was playing around with some example sequencing data from the pRESTO toolkit so I thought I would try to apply PyMC3 to these data.

Every sequencing read gives you a DNA sequence, but also an estimate of the error for every nucleotide. The sequencing read files are in fastq format, which means that the quality information is encoded like this:


It looks awful, but I guess if you really need to use a text file that's how it has to work. You can map these characters onto a quality (Phred) score in Python with a simple ord(c)-33. You can then map that onto expected number of errors by taking 10^(-v/10).

I decided to look at read quality as a function of read-length using PyMC3.


Boolean Biotech © Brian Naughton Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More