Brian Naughton // Tue 17 October 2017 // Filed under genomics // Tags bioinformatics genomics programming

This small project started when I was looking for an implementation of Needleman-Wunsch (pairwise global DNA alignment) in javascript. I just wanted to align two sequences on a website and in a google sheet (using Google Apps Script).

I looked around for a simple javascript implementation (preferably one self-contained javascript file) but I was surprised that I couldn't find one. Needleman-Wunsch is a pretty simple algorithm so I decided to implement it. I did get a little bit side-tracked...

The first step was to find someone else's implementation to copy, so I started with some numpy code from @brent_p. Based on his other work, I think it's a safe assumption it's implemented correctly. (There is also a complete Cython version in this repo, which implements gap_extend and other parameters, and is obviously much faster. I really just need a very basic alignment so the simpler numpy version is fine for me).
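
For reference, here is a minimal sketch of the algorithm in plain Python. It is not the numpy code mentioned above; the function name and the scores (match=1, mismatch=-1, a single linear gap penalty of -1) are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GATTACA", "GCATGCG")
print(score)
print(top)
print(bottom)
```

The two nested loops over the score matrix are the part every implementation below has in common; the rest is bookkeeping.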

numpy and friends

There are lots of ways to tweak the numpy implementation of Needleman-Wunsch to try to make it faster. Here are the things I tried:

  1. orig: the original numpy implementation.
  2. orig3: the original numpy implementation run with python 3.
    This is just to test how much faster or slower Python 3.6 is than 2.7.
  3. numpy: my numpy implementation.
    This is like the original numpy code, but modified a bit to make it more like my code.
  4. numba: my numpy implementation, but with numba applied.
    Numba is a pretty amazing JIT compiler you can turn on by adding one line of code. It comes with anaconda, and it's always worth trying just in case.
  5. torch: my numpy implementation, but with numpy replaced with PyTorch.
    PyTorch seems like a friendly alternative to TensorFlow, especially with its numpy-like syntax. Without explicitly applying .cuda() to my arrays it just uses the CPU, so it should not be too different to regular numpy.
  6. torchcuda: my numpy implementation, but with numpy replaced with PyTorch, and .cuda() applied to each array.
    The same as torch except using the GPU.
  7. cupy: my numpy implementation, but with numpy replaced with CuPy.
    CuPy is a drop-in replacement for numpy and, like PyTorch, only requires changing a couple of lines.
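
To show what "adding one line of code" means for numba, here is a small sketch. It is not the code from the repo: the function names, the integer encoding of the sequences and the scores are assumptions, and the try/except is only there so the example still runs (without any speedup) when numba is not installed.

```python
import numpy as np

try:
    from numba import njit  # the "one line" is the @njit decorator below
except ImportError:
    # Fallback so the sketch runs without numba: a decorator that does nothing.
    def njit(func):
        return func


@njit
def nw_score(a, b, match, mismatch, gap):
    """Fill the Needleman-Wunsch matrix with explicit loops; return the final score."""
    n, m = a.shape[0], b.shape[0]
    score = np.zeros((n + 1, m + 1), dtype=np.int64)
    for i in range(1, n + 1):
        score[i, 0] = i * gap
    for j in range(1, m + 1):
        score[0, j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                diag = score[i - 1, j - 1] + match
            else:
                diag = score[i - 1, j - 1] + mismatch
            score[i, j] = max(diag, score[i - 1, j] + gap, score[i, j - 1] + gap)
    return score[n, m]


def encode(seq):
    """Turn a DNA string into a uint8 array so the compiled loop only sees numbers."""
    return np.frombuffer(seq.encode("ascii"), dtype=np.uint8)


print(nw_score(encode("GATTACA"), encode("GCATGCG"), 1, -1, -1))
```

The first call includes compilation time, which may be part of why numba looks slow on the short sequences in the timings.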

Nim

Nim is an interesting language that can compile to C (nimc) or javascript (nimjs). I thought this was a pretty good use-case for nim since I need javascript but writing scientific code in javascript is not fun. I started with a numpy-like library called arraymancer, which worked well, but since it relies on BLAS it would not compile to javascript (I could have checked that earlier...) Luckily, changing the code to simple for loops was pretty easy. Nim's syntax is a lot like Python, with some unnecessary differences like using echo instead of print. As someone used to Python, I didn't find it to be as friendly as I expected. The dream of Python-with-static-types is still a dream...

Javascript

Finally, I just programmed the alignment in javascript (js). All of the implementations are almost line-for-line identical, so this did not take long.

Speed comparison

I ran all the above implementations on some random DNA of various lengths and the results are plotted below.

import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
data = {'orig': {500: 2.23, 1000: 8.50, 1500: 18.96, 2000: 33.45, 2500: 52.70, 3000: 76.44, 5000: 209.90}, 
        'orig3': {500: 2.62, 1000: 10.34, 1500: 22.68, 2000: 40.22, 2500: 62.03, 3000: 90.15, 5000: 248.93}, 
        'numpy': {500: 1.54, 1000: 3.41, 1500: 6.27, 2000: 10.60, 2500: 16.11, 3000: 22.87, 5000: 67.45}, 
        'numba': {500: 5.54, 1000: 7.05, 1500: 9.28, 2000: 13.08, 2500: 17.40, 3000: 22.21, 5000: 56.69}, 
        'torch': {500: 1.67, 1000: 3.92, 1500: 7.61, 2000: 12.86, 2500: 19.55, 3000: 27.48, 5000: 82.90}, 
        'torchcuda': {500: 8.54, 1000: 22.47, 1500: 46.26, 2000: 80.12, 2500: 119.92, 3000: 169.95, 5000: 467.04}, 
        'cupy': {500: 35.71, 1000: 138.86, 1500: 951.97, 2000: 1713.57, 2500: 2660.11, 3000: 3798.51}, 
        'nimc': {500: 0.016, 1000: 0.041, 1500: 0.08, 2000: 0.13, 2500: 0.20, 3000: 0.31, 5000: 0.85}, 
        'nimjs': {500: 0.14, 1000: 0.28, 1500: 0.48, 2000: 0.75, 2500: 1.12, 3000: 1.53, 5000: 4.06}, 
        'js': {500: 0.09, 1000: 0.14, 1500: 0.20, 2000: 0.34, 2500: 0.41, 3000: 0.82, 5000: 1.64}}
vfast_ones = ["nimjs", "js", "nimc"]
fast_ones = ["torch", "numpy", "numba"] + vfast_ones
ok_ones = ["torchcuda", "orig3", "orig"] + fast_ones
df = pd.DataFrame(data)
df
         cupy    js  nimc  nimjs  numba  numpy    orig   orig3  torch  torchcuda
500     35.71  0.09  0.01   0.13   5.54   1.54    2.23    2.62   1.67       8.54
1000   138.86  0.13  0.04   0.28   7.05   3.41    8.50   10.34   3.92      22.47
1500   951.97  0.20  0.08   0.48   9.28   6.27   18.96   22.68   7.61      46.26
2000  1713.57  0.34  0.13   0.75  13.08  10.60   33.45   40.22  12.86      80.12
2500  2660.11  0.41  0.20   1.12  17.49  16.11   52.70   62.03  19.55     119.92
3000  3798.51  0.82  0.31   1.53  22.21  22.87   76.44   90.15  27.48     169.95
5000      NaN  1.64  0.85   4.06  56.69  67.45  209.90  248.93  82.90     467.04

I'll skip cupy since it's much slower than everything else and throws the plots off. That doesn't imply anything negative about cupy and I'd use it again. It's extremely easy to replace numpy with cupy, and for properly vectorized code I'm sure it's much faster than numpy.

f,ax = plt.subplots(figsize=(16,12))
ax.set_title("fast: everything except cupy")
_ = df[ok_ones].plot(ax=ax)

plot_fast

f,ax = plt.subplots(figsize=(16,12))
ax.set_title("faster: numpy vs C vs js")
_ = df[fast_ones].plot(ax=ax)

plot_faster

f,ax = plt.subplots(figsize=(16,12))
ax.set_title("fastest: C vs js")
_ = df[vfast_ones].plot(ax=ax)

plot_fastest
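
Needleman-Wunsch fills a matrix whose size is the product of the two sequence lengths, so the run time should grow roughly with the square of the length. A quick way to check that against the timings is to fit a straight line on a log-log scale. This is a sketch using three of the columns from the data dict above; the helper name is made up.

```python
import numpy as np

# Timings copied from the data dict above, for three of the implementations.
lengths = np.array([500, 1000, 1500, 2000, 2500, 3000, 5000])
timings = {
    "orig":  [2.23, 8.50, 18.96, 33.45, 52.70, 76.44, 209.90],
    "numpy": [1.54, 3.41, 6.27, 10.60, 16.11, 22.87, 67.45],
    "nimc":  [0.016, 0.041, 0.08, 0.13, 0.20, 0.31, 0.85],
}


def scaling_exponent(n, t):
    """Slope of log(t) against log(n); about 2 means quadratic scaling."""
    slope, _intercept = np.polyfit(np.log(n), np.log(t), 1)
    return slope


exponents = {name: scaling_exponent(lengths, np.array(t)) for name, t in timings.items()}
for name, k in exponents.items():
    print(f"{name}: time grows roughly as length^{k:.2f}")
```

A slope noticeably below 2 probably means fixed overhead still dominates at the shorter lengths, rather than the algorithm being sub-quadratic.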

Conclusions

I learned some interesting things here...

  • numba is good. It didn't speed this code up very much, but it was a bit faster than numpy for large alignments and didn't cost anything. I expected this to be the fastest Python-based code because there are several Python for loops (i.e., unvectorized code), which is where numba can help a lot.

  • I'm not sure why my numpy is faster than the original numpy since my changes were minimal. The original version is not coded for speed anyway.

  • GPUs don't help unless your code is written for GPUs. That basically means one repetitive task handed off to the GPU along with the data (no back-and-forth). There are ways to implement Needleman-Wunsch in a GPU-friendly way, but it complicates the code a lot. On the one hand this is a very obvious result to anyone who has used GPUs for computation — on the other hand, maybe a really smart compiler could use the CPU where appropriate and GPU where appropriate...

  • Nim is a pretty interesting language. I have seen it described as either "an easier C" or a "statically typed Python". To me it's definitely more like the former. It's not all that friendly compared to Python, but I think I'd try it again as a C/Cython replacement. Don't forget to compile with -d:release.

  • Javascript is fast! If nim is not compiled with -d:release, javascript is even faster than nim's C code. Sadly, javascript in Google Apps Script is extremely slow for some reason. That was an unfortunate surprise, especially since a script times out after about five minutes, so long alignments just fail! I can't explain why it's so slow...
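
On the GPU point above: one common GPU-friendly ordering is to fill the matrix one anti-diagonal at a time, because every cell on an anti-diagonal depends only on the two previous anti-diagonals. The sketch below shows that ordering with plain numpy on the CPU. It is not code from the repo, it only computes the score (no traceback), and the function name and scores are assumptions.

```python
import numpy as np


def nw_score_wavefront(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score, computing one whole anti-diagonal per step."""
    x = np.frombuffer(a.encode("ascii"), dtype=np.uint8)
    y = np.frombuffer(b.encode("ascii"), dtype=np.uint8)
    n, m = len(x), len(y)
    score = np.zeros((n + 1, m + 1), dtype=np.int64)
    score[:, 0] = gap * np.arange(n + 1)
    score[0, :] = gap * np.arange(m + 1)
    # Cells with i + j == d depend only on diagonals d-1 and d-2,
    # so all cells on diagonal d can be updated in one vectorized step.
    for d in range(2, n + m + 1):
        i = np.arange(max(1, d - m), min(n, d - 1) + 1)
        j = d - i
        sub = np.where(x[i - 1] == y[j - 1], match, mismatch)
        score[i, j] = np.maximum.reduce([
            score[i - 1, j - 1] + sub,   # match or mismatch
            score[i - 1, j] + gap,       # gap in b
            score[i, j - 1] + gap,       # gap in a
        ])
    return int(score[n, m])


print(nw_score_wavefront("GATTACA", "GCATGCG"))
```

The Python loop now runs once per anti-diagonal instead of once per cell, which is the shape of work a GPU wants, at the cost of less obvious indexing.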

Finally, just to note that this implementation is good enough for my purposes, but I haven't really spent any time making sure it works in all situations (apart from affirming that its output is the same as the original numpy code), so I wouldn't trust it too much. The code is available in this repo.

Brian Naughton // Mon 10 October 2016 // Filed under genomics // Tags genomics nanopore

Many genomics people, especially in the US, are still unfamiliar with Oxford Nanopore's MinION sequencer. I was lucky enough to join their early access program last year, so I've been using it for a while. In that time I've become more and more excited about its potential. In fact, I think it's the most exciting thing to happen in genomics in a long time. I'll try to explain why.

MinION vs Illumina

The MinION is a tiny little sequencer that has some serious advantages over Illumina sequencers:

  • it's very portable (see the photos!) and doesn't require any special equipment to run
  • it's simple to run: there's a 10 minute prep with just a couple of pipetting steps
  • the sequencer itself is essentially free, with a cost of $500-900 per flow-cell (which can be reused several times).
  • the reads are very long, about as long as the input DNA (100kb is not unusual)
  • it's a single molecule sequencer, so you can detect per molecule variation, including base modifications (this is still low accuracy though)
  • it can read RNA directly, giving you full-length transcripts
  • the turnaround time is very quick: you can generate tens to hundreds of megabases of data in an afternoon
  • data analysis is easier than for short-read sequencers, since alignment and assembly are simpler. You may not even really need any alignment if you are sequencing a plasmid or insert.
  • the data arrives per read instead of per base: so in one hour you can have thousands of long reads (as opposed to Illumina, where you'd have millions of partial reads, each only a few bases)
  • seeing reads appear in real-time is amazing and you can literally pull the USB plug when you have enough data

There are also two big disadvantages:

  • its accuracy is at least an order of magnitude worse than Illumina (~90% vs >99%)
  • its per base cost is at least an order of magnitude higher than an Illumina HiSeq ($0.5/Mbase vs <$0.02/Mbase) and 2–10X more expensive than a MiSeq. Of course, these numbers are rough and in flux. For example, a HiSeq or MiSeq will require a service contract that could be $20k/yr — the cost of an Illumina run is highly volume-dependent.

Something that's not often discussed is the error rate of short-read sequencers. On a per base level they are extremely accurate, but incorrectly determined structural variants are also errors. In a human genome a miscalled 3Mb inversion could by itself be considered a 0.1% error rate, and there are lots of structural variants in humans. Unlike incorrect base-calls, it is often impossible to overcome this issue with greater read-depth.

Despite these advantages, many scientists remain skeptical of the MinION. There are probably two things going on here: (a) Oxford has consistently overpromised since announcing in 2012; (b) the MinION only started to be really competitive in the past few months, so there is a lag.

What changed?

About six months ago, you could expect to get about 500Mb of DNA from a flow-cell, with each pore reading at 70 bases/second and accuracy of 70-80% (at least in our novice hands).

Earlier this year, Oxford made two important changes that improved performance: they updated their pore from an unspecified pore ("R7", which was tangled up in a patent dispute with Illumina) to an E. coli pore ("R9"), which has both better throughput and better accuracy than R7. At the same time, they updated their base-calling algorithm to a deep learning-based method, further improving accuracy.

They are still incrementally improving R9, and are already on version R9.4. At the time of writing, this version is only in the hands of the inner circle of nanoporati, but luckily they are all on Twitter so we can get a pretty good sense of how well it works. People are reporting excellent results, with runs of over 5Gb at the new R9 speed of 240 bases/second (this should be 500 bases/second soon, apparently with no loss in accuracy). Accuracy is also up, with 1D reads perhaps even edging over 90% in experienced hands.

So, compared to six months ago, you are probably getting 5-10 times as much data with half the error rate.

OK, what can I do with one of these gizmos?

The stats are definitely exciting, but I don't think they really capture why I think the MinION is so interesting. The MinION has several key areas where it can do some damage, and other areas where it opens up new possibilities.

sequencing microbial genomes de novo

This is very doable. I wouldn't say it's easy yet, but long reads negate a lot of the computational problems of de novo assembly: finding overlapping 10,000mers is a very different problem to finding overlapping 100mers.

infectious agent detection

Once you have prepped DNA, which takes from 10 minutes (with the "rapid" kit) to two hours, the actual process of detecting a pathogen could be under ten minutes. In practice I don't think anybody is going from blood sample to diagnosis this quickly, but the potential is there.

There is even software (Mykrobe) that detects drug-resistance genes in bacteria, and recommends appropriate antibiotics. When this is done cheaply and routinely it should help a lot with drug resistance and overprescription of antibiotics.

Since the data comes in one read at a time, as soon as you get one read from the infectious agent you are done.

direct RNA sequencing

If you want to read full-length transcripts, and see base modifications too, then the MinION is the only option that I know of. This capability is new, and the base modification detection is not accurate, but there's still plenty of interesting research to do with this.

barcoding

Sequencing often requires barcoding, which adds fiddly extra steps before and after sequencing. But, if your reads are long enough, then you may not need to barcode. For example, you can sequence 96 plasmids at the same time — simply throw away any reads that are not the full length of the plasmid.

other long-read problems

There are a few classic long-read problems like HLA sequencing, VDJ sequencing and structural variant detection (especially for cancer). These are reasonably good applications for MinION, though VDJ sequencing probably needs more accuracy, and structural variant detection might need more throughput. (10X + Illumina makes the most sense for anything like this)

MinION in the Field

Oxford is making an effort to eliminate the "cold chain" for the MinION. The flow-cell itself already seems to keep well at temperatures well above refrigeration, and they claim they can lyophilize the other reagents. Even before that happens, with basically just a cooler, a laptop, and a way to extract DNA, doctors, ecologists, and other scientists can go out into the field and do sequencing anywhere.

Earlier this year, as part of the Zibra project, scientists from the UK and Brazil drove a van through Brazil, sampling and sequencing Zika virus along the way.

Biology labs and Biotechs

The advantage of MinION for non-genomics–focused biology labs is not really widely discussed, but I think it's one of the most important.

Basically, if you want a few megabases sequenced and you have a MinION and a flow-cell, you can have the data in your hands today. When you're done you can put the flow-cell away and use it again tomorrow. Depending on your needs, you might get 4-10 uses out of the flow-cell, meaning each run costs $150-300 including sample prep.

In contrast, if you want to get some data from a MiSeq, you are probably signing up for a gigabase of sequence. That's overkill for most labs, and it produces many gigabytes of raw data to manage too... If you want reasonable length reads (2x150bp), then sequencing will take at least 24 hours. If you are lucky enough to have a core lab at your institute then that helps, but you may still have to wait your turn.

If you don't work at a university — perhaps you're at a small biotech — then the alternative is buying a MiSeq (or MiniSeq) at $50-150k plus service contract, or sending your samples out to a CRO for sequencing. A CRO will have a turnaround time of at least a week, and that's after you've explained to them what you need and agreed on the terms.

It's hard to imagine a one-off MiSeq run happening in under a week, so being able to just do it yourself is a huge increase in efficiency.

If you're sequencing a thousand of anything, then Illumina is much cheaper, but I wonder how many biology labs need megabase-scale sequencing occasionally, but don't do it because of the current barriers to entry, including the computational burden of aligning and assembling short reads. There are cases where I would not have bothered with the hassle of getting something sequenced except that we could just do it ourselves with the MinION.

Genomics for Everyone

I think the most exciting thing going on here is just taking sequencing and genomics out of the lab and into the real world. Admittedly, this does require some improvements and inventions from Oxford, like easier DNA preparation, so there are caveats here, but nothing too crazy.

Oxford's metrichor site spells out some of the use-cases too. I'll just give some scattered examples of things to sequence, some more realistic than others, but I think each plausibly represents something new that has real economic value:

  • hospital surfaces and employees for MRSA
  • food at factories (detecting E. coli etc)
  • the environment at airports, workplaces, etc for flu (flu is expensive!)
  • at crime scenes (also a big deal since the current methods of forensic DNA analysis are awful)
  • at home, to see if you have a cold or flu, the same cold or a new one, and even figure out where you picked up the virus
  • the air to detect mold in buildings
  • farm animals' microbiomes to monitor gut health and improve growth
  • at methane farms, wine fermenters, beer fermenters, to monitor and manage the process
  • various kinds of labs for bacterial contamination
  • the sewage system of a city to monitor the city's diet and health
  • for educational purposes, and at competitions like iGEM
  • fish and other foods to detect mislabeling (a surprisingly big problem)
  • animals out in the wild for conservation purposes
  • your own microbiome to monitor your gut health
  • soil, plants, droppings, insects at farms to monitor pests etc.
  • at the dentist's to detect decay-causing bacteria
  • at the dermatologist's (cosmetologist?) to detect and treat acne-causing bacteria

These applications (apps?) can potentially be run by anyone. Stick some DNA in, wait a bit, processing happens on the cloud and the answer appears on your phone in a few minutes to a few hours. You don't need to know anything about genetics or molecular biology, you'll just see a readout that says "E. coli detected" in food or "DNA from new rhino detected" in droppings.

There's already a teaser of this with Oxford's What's in my Pot app. It figures out which microbes are in a sample, and draws a nice cladogram for you.

To realize this potential, the sequencer still needs to be cheaper, but the lower bound on that seems good, since the number of molecules involved is really tiny. (That's another advantage of single-molecule sequencers.)

Finally, coming back to present-day reality a little bit, Oxford will need to execute on their plans to make sequencing easier and cheaper (reagent lyophilization, Zumbador, SmidgION, VolTRAX, FPGAs, etc.; watch Oxford's latest tech update for more on that), but I think MinION is going to become a very big deal in the next few years.

Brian Naughton // Sun 14 August 2016 // Filed under data // Tags data genomics statistics pymc3 nanopore

Oxford Nanopore (ONT) sells an amazing, inexpensive sequencer called the MinION. It's an unusual device in that the sequencing "flowcells" use protein pores. Unlike a silicon chip, the pores are sensitive to their environment (especially heat) and can get damaged or degraded and stop working.

When you receive a flowcell from ONT, only a fraction of the possible 2048 pores are active, perhaps 800–1500. Because of how the flowcell works, you start a run using zero or one pores from each of the 512 "channels" (each of which contains 4 pores).

As the pores interact with the DNA and other stuff in your solution, they can get gummed up and stop working. It's not unusual to start a run with 400 pores and end a few hours later with half that many still active. It's also not unusual to put a flowcell in the fridge and find it has 20% fewer pores when you take it out. (Conversely, pores can come back to life after a bit of a rest.)

All told, this means it's quite difficult to tell how much sequence you can expect from a run. If you want to sequence a plasmid at 100X, and you are starting with 400 pores, will you get enough data in two hours? That's the kind of question we want an answer to.

import pymc3 as pm
import numpy as np
import scipy as sp
import theano.tensor as T

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

from functools import partial
from IPython.display import display, HTML, Image

For testing purposes — and because I don't have nearly enough real data — I will simulate some data.

In this toy example, DNA of length 1000+/-50 bases is sequenced for 1000, 20000 or 40000 seconds at 250bp/s. After 100 reads, a pore gets blocked and can no longer sequence. The device I am simulating has only 3 pores (where a real MinION would have 512).

datatypes = [
    # builtin int/float: np.int and np.float were only aliases for these and are gone from newer numpy
    ('total_num_reads', int),           ('total_dna_read', int),
    ('num_minutes', float),             ('num_active_pores_start', int),
    ('num_active_pores_end', int),      ('mean_input_dna_len_estimate', float)]

MAX_PORES = 3
AXIS_RUNS, AXIS_PORES = 0, 1

data = np.array([
  ( MAX_PORES*9,   MAX_PORES*  9*1000*.95,    1000.0/60,  MAX_PORES,  MAX_PORES,  1000.0),
  ( MAX_PORES*10,  MAX_PORES* 10*1000*1.05,   1000.0/60,  MAX_PORES,  MAX_PORES,  1000.0),
  ( MAX_PORES*10,  MAX_PORES* 10*1000*.95,    1000.0/60,  MAX_PORES,  MAX_PORES,  1000.0),
  ( MAX_PORES*11,  MAX_PORES* 11*1000*1.05,   1000.0/60,  MAX_PORES,  MAX_PORES,  1000.0),
  ( MAX_PORES*100, MAX_PORES*100*1000*.95,   20000.0/60,  MAX_PORES,          0,  1000.0),
  ( MAX_PORES*100, MAX_PORES*100*1000*1.05,  20000.0/60,  MAX_PORES,          0,  1000.0),
  ( MAX_PORES*100, MAX_PORES*100*1000*.95,   40000.0/60,  MAX_PORES,          0,  1000.0),
  ( MAX_PORES*100, MAX_PORES*100*1000*1.05,  40000.0/60,  MAX_PORES,          0,  1000.0),
], dtype=datatypes)

MinION Simulator

To help figure out how long to leave our sequencer running, I made a MinION simulator in a jupyter notebook. I then turned the notebook into an app using dappled.io, an awesome new service that can turn jupyter notebooks into apps running on the cloud (including AWS lambda). Here is the dappled nanopore simulator app / notebook.

The model is simple: a pore reads DNA until it's completely read, then with low probability it gets blocked and can no longer sequence. I've been fiddling with the parameters to give a reasonable result for our runs, but we don't have too much data yet, so it's not terribly accurate. (The model itself may also be inaccurate, since how the pore gets blocked is not really discussed by ONT.)

Nevertheless, the simulator is a useful tool for ballparking how much sequence we should expect from a run.
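
The model described above is small enough to sketch in a few lines. This is not the notebook's code: the function name, the exponential waiting time for capturing DNA, and every default value (60 s mean capture time, 1% chance of blocking after a read, 1000-base fragments at 250 bases/s) are assumptions for illustration.

```python
import random


def simulate_run(n_pores, run_seconds, dna_len=1000, read_speed=250.0,
                 mean_capture_seconds=60.0, p_block=0.01, seed=0):
    """Simulate one run; return (total_reads, total_bases, active_pores_at_end)."""
    rng = random.Random(seed)
    total_reads = total_bases = active = 0
    for _ in range(n_pores):
        clock, blocked = 0.0, False
        while not blocked:
            # Wait for a molecule to be captured, then read it to the end.
            clock += rng.expovariate(1.0 / mean_capture_seconds)
            clock += dna_len / read_speed
            if clock > run_seconds:
                break  # the run ended before this read finished
            total_reads += 1
            total_bases += dna_len
            # After each completed read the pore blocks with small probability.
            blocked = rng.random() < p_block
        if not blocked:
            active += 1
    return total_reads, total_bases, active


reads, bases, active = simulate_run(n_pores=400, run_seconds=2 * 3600)
print(reads, bases, active)
```

Dividing the simulated bases by the plasmid length gives a rough coverage estimate for the "400 pores, two hours" kind of question, for whatever parameter values you believe.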

Hierarchical Bayesian Model

Here's where things get more complicated...

In theory, given some data from real MinION runs, we should be able to learn the parameters for a model that would enable us to predict how much data we would get from a new run. Like many problems, this is a good fit for Bayesian analysis, since we have data and we want to learn the most appropriate model given the data.

For each run, I need to know:

  • what length DNA I think I started with
  • how long the run was in minutes
  • how many reads I got
  • how much sequence I got
  • the number of active pores at the start and at the end of a run

I'll use PyMC3 for this problem. First, I need to specify the model.

Input DNA

The input DNA depends a lot on its preparation. I believe the distribution of input DNA lengths could be:

  • exponential if it is genomic DNA breaking randomly
  • normal with a small variance if it is from a plasmid
  • normal but sharply truncated if it is cut out of a gel

The length of input DNA is different to the length of reads produced by the sequencer, which is affected by breakage, capture biases, secondary structure, etc. The relationship between input DNA length and read length could be learned. We could get arbitrarily complex here: for example, given enough data, we could include a mixture model for different DNA types (genomic vs plasmid).

A simple distribution with small variance but fat tails seems reasonable here. I don't have much idea what the standard deviation should be, so I'll make it a fraction of the mean.

with pm.Model() as model:
    mean_input_dna_len = pm.StudentT('mean_input_dna_len', nu=3, mu=data['mean_input_dna_len_estimate'],
                                     sd=data['mean_input_dna_len_estimate']/5, shape=data.shape[0])

This is the first PyMC3 code, so here are some notes on what's going on:

  • the context must always specify the pymc3 model
  • Absent better ideas, I'll generally be using a T distribution instead of a normal, as MacKay recommends. This distribution should be truncated at zero, but I can't do this for multidimensional data because of what I think is a bug in pymc3.
  • I am not modeling the input_dna_len here, just the mean of that distribution. As I mention above, the distribution of DNA lengths could be modeled several ways, but I currently only need the mean in my model.

the shape parameter

In this experiment, we will encounter three different kinds of distribution shape:

  • shape=1: this is a scalar value. For example, we might think the speed of the nanopore machine is the same for all runs.
  • shape=data.shape[0]: data.shape[0] is the number of runs. For example, the length of DNA in solution varies from run to run.
  • shape=(data.shape[0],MAX_PORES): we estimate a value for every pore in every run. For example, how many reads until a pore gets blocked? This varies by run, but needs to be modeled per pore too.
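
A plain numpy analogy for the three shapes in the list above, using the sizes of the toy data (8 runs, 3 pores). The variable names and values are made up, and this only shows how the shapes combine, not how pymc3 treats them internally.

```python
import numpy as np

n_runs, max_pores = 8, 3  # 8 simulated runs and 3 pores, as in the toy data

speed = np.full(1, 250.0)                             # one value shared by all runs
dna_len = np.full(n_runs, 1000.0)                     # one value per run
reads_to_block = np.full((n_runs, max_pores), 100.0)  # one value per pore per run

# Broadcasting combines them: a per-run read time, then a per-pore quantity.
seconds_per_read = dna_len / speed                                   # shape (8,)
reading_seconds_to_block = reads_to_block * seconds_per_read[:, None]  # shape (8, 3)

print(speed.shape, seconds_per_read.shape, reading_seconds_to_block.shape)
```

The [:, None] adds a pore axis to the per-run array so it lines up with the per-pore matrix.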

R9 Read Speed

I know that the R9 pore is supposed to read at 250 bases per second. I believe the mean could be a little bit more or less than that, and I believe that all flowcells and devices should be about the same speed, therefore all runs will sample from the same distribution of mean_read_speed.

Because this is a scalar, pymc3 lets me set a lower bound of 0 bases per second. (Interestingly, unlike DNA length, the speed could technically be negative, since the voltage across the pore can be positive or negative, as exploited by Read Until...)

Truncated0T1D = pm.Bound(pm.StudentT, lower=0)
with pm.Model() as model:
    mean_read_speed = Truncated0T1D('mean_read_speed', nu=3, mu=250, sd=10, shape=data.shape)

Capturing DNA

DNA is flopping around in solution, randomly entering pores now and then. How long will a pore have to wait for DNA? I have a very vague idea that it should take around a minute but this is basically an unknown that the sampler should figure out.

Note again that I am using the mean time only, not the distribution of times. The actual times to capture DNA would likely follow an exponential distribution (the waiting time for a Poisson process).

Hierarchical model

This is the first time I am using a hierarchical/multilevel model. For more on what this means, see this example from pymc3 or Andrew Gelman's books.

There are three options for modeling mean_time_to_capture_dna: (a) it's the same for each run (e.g., everybody is using the same recommended DNA concentration); (b) it's independent for each run; (c) each run has a different mean, but the means are drawn from the same distribution (and probably should not be too different).

Image("hierarchical_pymc3.png")

Diagram taken from PyMC3 dev twiecki's blog

with pm.Model() as model:
    prior_mean_time_to_capture_dna = Truncated0T1D('prior_mean_time_to_capture_dna', nu=3, mu=60, sd=30)
    mean_time_to_capture_dna = pm.StudentT('mean_time_to_capture_dna', nu=3, mu=prior_mean_time_to_capture_dna,
                                           sd=prior_mean_time_to_capture_dna/10, shape=data.shape)
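
To make the difference between options (a), (b) and (c) concrete, here is a sketch that treats each one as a way of generating per-run capture times. It uses plain numpy and normal draws rather than pymc3 and Student-T distributions, and taking the absolute value is a crude stand-in for the truncation at zero; all of that is simplification on my part.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 8

# (a) pooled: one capture time shared by every run
pooled = np.full(n_runs, 60.0)

# (b) unpooled: every run draws its own mean from a wide, independent prior
unpooled = np.abs(rng.normal(60.0, 30.0, size=n_runs))

# (c) hierarchical: draw one group-level mean, then per-run means close to it
group_mean = abs(rng.normal(60.0, 30.0))
hierarchical = rng.normal(group_mean, group_mean / 10, size=n_runs)

for name, values in [("pooled", pooled), ("unpooled", unpooled),
                     ("hierarchical", hierarchical)]:
    print(f"{name:>12}: spread across runs = {values.std():.1f} s")
```

In option (c) the runs share information through group_mean, which is the behaviour the two-level pymc3 code above is after.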

Reading DNA

I can use pymc3's Deterministic type to calculate how long a pore spends reading a chunk of DNA, on average. That's just a division, but note it uses theano's true_div function instead of a regular division. This is because neither value is a number; they are random variables. Theano will calculate this as it's being sampled. (A simple "a/b" also works, but I like to keep in mind what's being done by theano if possible.)

with pm.Model() as model:
    mean_time_to_read_dna = pm.Deterministic('mean_time_to_read_dna', T.true_div(mean_input_dna_len, mean_read_speed))

Then each pore can do how many reads in this run? I have to be a bit careful to specify that I mean the number of reads possible per pore.

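with pm.Model() as model:
    # num_seconds_run is not defined in this excerpt; assumed here to be the run length in seconds
    num_seconds_run = data['num_minutes'] * 60
    num_reads_possible_in_time_per_pore = pm.Deterministic('num_reads_possible_in_time_per_pore',
        T.true_div(num_seconds_run, T.add(mean_time_to_capture_dna, mean_time_to_read_dna)))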
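
With point values in place of random variables, the two Deterministic steps are just this arithmetic. The helper name is made up, and the 60 s capture time is only the prior mean used earlier, not a measured value.

```python
def reads_possible_per_pore(run_seconds, dna_len, read_speed, capture_seconds):
    """Reads one pore could finish in the run if it never got blocked."""
    seconds_to_read = dna_len / read_speed  # 1000 bases / 250 bases per s = 4 s
    return run_seconds / (capture_seconds + seconds_to_read)


# 1000-base fragments at 250 bases/s, with an assumed 60 s mean capture time
for run_seconds in (1000, 20000, 40000):
    print(run_seconds, reads_possible_per_pore(run_seconds, 1000, 250, 60))
```

Working backwards, the simulated data has about 10 reads per pore in 1000 seconds, which under this formula corresponds to a capture time of roughly 96 seconds.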

Blocking Pores

In my model, after each read ends, a pore can get blocked, and once it's blocked it does not become unblocked. I believe about one in a hundred times, a pore will be blocked after it finishes sequencing DNA. If it were more than that, sequencing would be difficult, but it could be much less.

We can think of the Beta distribution as modeling coin-flips where the probability of heads (pore getting blocked) is 1/100.

with pm.Model() as model:
    prior_p_pore_blocked_a, prior_p_pore_blocked_b, = 1, 99
    p_pore_blocked = pm.Beta('p_pore_blocked', alpha=prior_p_pore_blocked_a, beta=prior_p_pore_blocked_b)

Then, per pore, the number of reads before blockage is distributed as a geometric distribution (the distribution of the number of tails you will flip before you flip a head). I have to approximate this with an exponential distribution — the continuous version of a geometric — because ADVI (see below) requires continuous distributions. I don't think it makes an appreciable difference.

Here the shape parameter is the number of runs x number of pores since here I need a value for every pore.

with pm.Model() as model:
    num_reads_before_blockage = pm.Exponential('num_reads_before_blockage', lam=p_pore_blocked,
        shape=(data.shape[0], MAX_PORES))
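
A quick numerical check of the two claims above: that Beta(1, 99) is centered on 1/100, and that swapping the geometric for an exponential barely changes the expected number of reads before blockage. This uses numpy sampling rather than pymc3, and the sample size is arbitrary.

```python
import numpy as np

alpha, beta = 1, 99
prior_mean = alpha / (alpha + beta)         # mean of Beta(1, 99) is 1/100

p = prior_mean
geometric_mean_reads = (1 - p) / p          # mean number of "tails" before the first "head"
exponential_mean_reads = 1 / p              # mean of Exponential(lam=p)

rng = np.random.default_rng(0)
geo = rng.geometric(p, size=100_000) - 1    # numpy counts the successful trial too
expo = rng.exponential(1 / p, size=100_000) # numpy takes the scale, i.e. 1/lam

print(prior_mean, geometric_mean_reads, exponential_mean_reads)
print(geo.mean(), expo.mean())
```

The two means differ by one read out of about a hundred, so the continuous approximation looks harmless at this blocking rate.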

Constraints

Here things get a little more complicated. It is not possible to have two random variables x and y, set x + y = data, and sample values of x and y.

testdata = np.random.normal(loc=10,size=100) + np.random.normal(loc=1,size=100)
with pm.Model() as testmodel:
    x = pm.Normal("x", mu=0, sd=10)
    y = pm.Normal("y", mu=0, sd=1)
    # This does not work
    #z = pm.Deterministic(x + y, observed=data)
    # This does work
    z = pm.Normal('x+y', mu=x+y, sd=1, observed=testdata)
    trace = pm.sample(1000, pm.Metropolis())

I was surprised by this at first but it makes sense. This process is more like a regression where you minimize error than a perfectly fixed constraint. The smaller the standard deviation, the more you penalize deviations from the constraint. However, you need some slack so that it's always possible to estimate a logp for the data given the model. If the standard deviation goes too low, you end up with numerical problems (e.g., nans). Unfortunately, I do not know of a reasonable way to set this value.

I encode constraints in my model in a similar way: First I define a Laplace distribution in theano to act as my likelihood. Why Laplace instead of Normal? It returned nans less often...

Using your own DensityDist allows you to use any function as a likelihood. I use DensityDist just to have more control over the likelihood, and to differentiate from a true Normal/Laplacian distribution. This can be finicky, so I've had to spend a bunch of time tweaking this.

def T_Laplace(val, obs, b=1):
    return T.log(T.mul(1/(2*b), T.exp(T.neg(T.true_div(T.abs_(T.sub(val, obs)), b)))))
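
One possible contributor to -inf and nan values is the order of operations: taking exp() first and log() second underflows to zero when the deviation is large. The sketch below shows this in plain Python with an algebraically identical form that skips the exp(). It is an assumption that this matters for the model above; it has not been tested against it, and both function names are made up for illustration.

```python
import math

def laplace_logp_naive(val, obs, b=1.0):
    # Same order of operations as T_Laplace: build the density, then take the log.
    density = (1.0 / (2.0 * b)) * math.exp(-abs(val - obs) / b)
    # math.log(0.0) raises in plain Python; array libraries return -inf instead.
    return math.log(density) if density > 0.0 else float("-inf")

def laplace_logp_stable(val, obs, b=1.0):
    # log(1/(2b)) - |val - obs| / b, with no exp() to underflow.
    return -math.log(2.0 * b) - abs(val - obs) / b

print(laplace_logp_naive(0.0, 5.0), laplace_logp_stable(0.0, 5.0))
print(laplace_logp_naive(0.0, 1000.0), laplace_logp_stable(0.0, 1000.0))
```

For small deviations the two agree; for a deviation of 1000 with b=1 the naive form gives -inf while the rearranged form stays finite.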

Constraint #1

The total DNA I have read must be the product of mean_input_dna_len and total_num_reads. mean_input_dna_len can be different to my estimate.

This is a bit redundant since I know total_num_reads and total_dna_read exactly. In a slightly more complex model we would have different mean_read_length and mean_input_dna_len. Here it amounts to just calculating mean_input_dna_len as total_dna_read / total_num_reads.

def apply_mul_constraint(f1, f2, observed):
    b = 1 # 1 fails with NUTS (logp=-inf), 10/100 fails with NUTS too (not positive definite)
    return T_Laplace(T.mul(f1,f2), observed, b)

with pm.Model() as model:
    total_dna_read_constraint = partial(apply_mul_constraint, mean_input_dna_len, data['total_num_reads'])
    constrain_total_dna_read = pm.DensityDist('constrain_total_dna_read', total_dna_read_constraint, observed=data['total_dna_read'])
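
Stripped of theano, this constraint just scores candidate values of mean_input_dna_len by how close their product with total_num_reads is to total_dna_read. A minimal sketch in plain Python, with made-up totals and a hypothetical helper mul_constraint_logp:

```python
import math

def laplace_logp(val, obs, b=1.0):
    """Log density of a Laplace distribution centred on obs with scale b."""
    return -math.log(2.0 * b) - abs(val - obs) / b

total_num_reads = 30     # hypothetical: 3 pores x 10 reads
total_dna_read = 28500   # hypothetical total number of bases read

def mul_constraint_logp(mean_len):
    # Score how well mean_len * total_num_reads matches the observed total.
    return laplace_logp(mean_len * total_num_reads, total_dna_read)

candidates = [900.0, 950.0, 1000.0]
best = max(candidates, key=mul_constraint_logp)
print(best, total_dna_read / total_num_reads)
```

The best-scoring candidate is the one equal to total_dna_read / total_num_reads, which is why the constraint is redundant in this simple version of the model.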

Constraint #2

The number of reads per pore is whichever is lower:

  • the number of reads the pore could manage in the length of the run
  • the number of reads it manages before getting blocked

To calculate this, I compare num_reads_before_blockage and num_reads_possible_in_time_per_pore.

First, I use T.tile to replicate num_reads_possible_in_time_per_pore for all pores. That turns it from an array of length #runs to a matrix of shape #runs x #pores (this should use broadcasting but I wasn't sure it was working properly...)

Then I compare these two arrays elementwise (T.lt). If the number of reads before blockage is the smaller of the two, that pore is blocked by the end of the run (0); otherwise it is still active (1). This is T.switch(T.lt(f1,f2),0,1). I sum these 0/1s over all pores (axis=AXIS_PORES) to get a count of the number of active pores for each run.

The value of this count is constrained to be equal to data['num_active_pores_end'].

def apply_count_constraint(num_reads_before_blockage, num_reads_possible_in_time_broadcast, observed):
    b = 1
    num_active_pores = T.sum(T.switch(T.lt(num_reads_before_blockage, num_reads_possible_in_time_broadcast),0,1),axis=AXIS_PORES)
    return T_Laplace(num_active_pores, observed, b)

with pm.Model() as model:
    num_reads_possible_in_time_broadcast = T.tile(num_reads_possible_in_time_per_pore, (MAX_PORES,1)).T
    num_active_pores_end_constraint = partial(apply_count_constraint, num_reads_before_blockage, num_reads_possible_in_time_broadcast)
    constrain_num_active_pores_end = pm.DensityDist('constrain_num_active_pores_end', num_active_pores_end_constraint, observed=data['num_active_pores_end'])
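
The tile, compare and sum steps can be followed on a tiny example in plain Python. All the numbers here are made up (2 runs, 3 pores), and count_active_pores is a hypothetical stand-in for the theano expression, not the code the model runs.

```python
MAX_PORES = 3

# Hypothetical per-run limit on reads per pore, set by the run length (2 runs).
num_reads_possible_in_time_per_pore = [10.0, 200.0]

# Hypothetical draws of reads-before-blockage, shape runs x pores.
num_reads_before_blockage = [
    [450.0, 30.0, 120.0],   # run 0: every pore outlasts the 10-read run
    [95.0, 210.0, 101.0],   # run 1: two pores block before 200 reads
]

# Plain-Python equivalent of T.tile(...).T: repeat each run's limit once per pore.
possible_broadcast = [[limit] * MAX_PORES
                      for limit in num_reads_possible_in_time_per_pore]

def count_active_pores(before_blockage, possible):
    """Per run, count pores whose blockage would come after the run ends."""
    return [
        sum(0 if blocked_at < limit else 1
            for blocked_at, limit in zip(b_row, p_row))
        for b_row, p_row in zip(before_blockage, possible)
    ]

num_active_pores = count_active_pores(num_reads_before_blockage, possible_broadcast)
print(num_active_pores)
```

In the model, this per-run count is what gets compared against data['num_active_pores_end'] through the Laplace likelihood.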

Constraint #3

Using the same matrix num_reads_possible_in_time_broadcast, this time I sum the total number of reads from each pore in a run. I simply sum the minimum value from each pore: either the number before blockage occurs or the total number possible.

def apply_minsum_constraint(num_reads_before_blockage, num_reads_possible_in_time_broadcast, observed):
    b = 1 # b=1 fails with ADVI and >100 pores (nan)
    min_reads_per_run = T.sum(T.minimum(num_reads_before_blockage, num_reads_possible_in_time_broadcast),axis=AXIS_PORES)
    return T_Laplace(min_reads_per_run, observed, b)

with pm.Model() as model:
    total_num_reads_constraint = partial(apply_minsum_constraint, num_reads_before_blockage, num_reads_possible_in_time_broadcast)
    constrain_total_num_reads = pm.DensityDist('constrain_total_num_reads', total_num_reads_constraint, observed=data['total_num_reads'])

Sampling

There are three principal ways to estimate a posterior in PyMC3:

  • Metropolis-Hastings: the simplest sampler, generally works fine on simpler problems
  • NUTS: more efficient than Metropolis, but I've found it to be slow and tricky to get to work
  • ADVI: this is "variational inference", the new, fast way to estimate a posterior. This seems to work great, though as I mention above, it needs continuous distributions only.

I used ADVI most of the time in this project, since NUTS was too slow and had more numerical issues than ADVI, and the ADVI results seemed more sensible than Metropolis.

with pm.Model() as model:
    v_params = pm.variational.advi(n=500000, random_seed=1)
    trace = pm.variational.sample_vp(v_params, draws=5000)
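
For intuition about the simplest option in the list above, here is a minimal random-walk Metropolis sketch in plain Python. It is not PyMC3's implementation; the target (a Normal with mean 3), the step size and the burn-in length are all arbitrary choices for illustration.

```python
import math
import random

def log_target(x, mu=3.0, sd=1.0):
    """Unnormalized log density of the hypothetical target, Normal(mu, sd)."""
    return -0.5 * ((x - mu) / sd) ** 2

def metropolis(n_draws, step_size=1.0, start=0.0, seed=1):
    rng = random.Random(seed)
    x = start
    logp = log_target(x)
    draws = []
    for _ in range(n_draws):
        proposal = x + rng.gauss(0.0, step_size)
        logp_proposal = log_target(proposal)
        # Accept with probability min(1, p(proposal) / p(current)).
        if rng.random() < math.exp(min(0.0, logp_proposal - logp)):
            x, logp = proposal, logp_proposal
        draws.append(x)
    return draws

draws = metropolis(20000)
kept = draws[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

Each proposal is a small random step from the current value, so on a high-dimensional or tightly constrained model most steps get rejected and the chain moves slowly.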

Results

In my toy model: there are 8 runs in total; each device has 3 pores; it takes about 4 seconds to sequence a DNA molecule (~1000 bases / 250 bases per second) and 96 seconds to capture a DNA molecule (so 100 seconds in total to read one DNA molecule).

In my 8 runs (numbered 0 to 7), I expect that [10, 10, 10, 10, 200, 200, 400, 400] reads are possible per pore. However, I expect that runs 4, 5, 6, 7 will have a blockage at 100 reads per pore.

Finally, I expect there to be 3 pores remaining in the first four runs, and 0 pores remaining in the last four.
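
The arithmetic behind these expectations can be written out in a few lines of plain Python. The constant names are made up for this sketch, and it assumes every pore in runs 4 to 7 blocks after exactly 100 reads, which is a simplification of the toy data.

```python
BASES_PER_SECOND = 250
MEAN_DNA_LEN = 1000      # approximate read length in bases
CAPTURE_SECONDS = 96
PORES_PER_RUN = 3

# Time to capture and then sequence one DNA molecule.
seconds_per_read = MEAN_DNA_LEN / BASES_PER_SECOND + CAPTURE_SECONDS

reads_possible_per_pore = [10, 10, 10, 10, 200, 200, 400, 400]
# Assumption: in the last four runs every pore blocks after 100 reads.
blockage_per_pore = [None, None, None, None, 100, 100, 100, 100]

expected_total_reads = [
    PORES_PER_RUN * (possible if blocked is None else min(possible, blocked))
    for possible, blocked in zip(reads_possible_per_pore, blockage_per_pore)
]

# Run lengths implied by the per-pore read counts.
implied_run_seconds = [n * seconds_per_read for n in reads_possible_per_pore]
print(seconds_per_read, expected_total_reads, implied_run_seconds)
```

The 300 reads per blocked run lines up with the num_reads_before_blockage_per_run estimates of roughly 297 to 301 for runs 4 to 7 in the ADVI output.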

ADVI Results

mean_input_dna_len__0                    950.0 (+/- 0.1) # All the input_dna_lens are ok.
mean_input_dna_len__1                   1050.0 (+/- 0.1) # This is a simple calculation since I know
mean_input_dna_len__2                    950.0 (+/- 0.1) # how many reads there are and how long the reads are
mean_input_dna_len__3                   1050.0 (+/- 0.1)
mean_input_dna_len__4                    950.0 (+/- 0.1)
mean_input_dna_len__5                   1049.9 (+/- 0.1)
mean_input_dna_len__6                    950.0 (+/- 0.1)
mean_input_dna_len__7                   1050.0 (+/- 0.1)
num_reads_possible_in_time_per_pore__0    10.1 (+/- 1) # The number of reads possible is also a simple calculation
num_reads_possible_in_time_per_pore__1    10.3 (+/- 1) # It depends on the speed of the sequencer and the length of DNA
num_reads_possible_in_time_per_pore__2    10.2 (+/- 1)
num_reads_possible_in_time_per_pore__3    10.5 (+/- 1)
num_reads_possible_in_time_per_pore__4   210.9 (+/- 36)
num_reads_possible_in_time_per_pore__5   207.4 (+/- 35)
num_reads_possible_in_time_per_pore__6   419.8 (+/- 67)
num_reads_possible_in_time_per_pore__7   413.2 (+/- 66)
num_reads_before_blockage_per_run__0     501.3 (+/- 557) # The total number of reads before blockage per run
num_reads_before_blockage_per_run__1     509.8 (+/- 543) # is highly variable when there is no blockage (runs 0-3).
num_reads_before_blockage_per_run__2     501.9 (+/- 512)
num_reads_before_blockage_per_run__3     502.4 (+/- 591)
num_reads_before_blockage_per_run__4     297.2 (+/- 39)  # When there is blockage (runs 4-7), then it's much less variable
num_reads_before_blockage_per_run__5     298.7 (+/- 38)
num_reads_before_blockage_per_run__6     299.6 (+/- 38)
num_reads_before_blockage_per_run__7     301.2 (+/- 38)
num_active_pores_per_run__0                2.8 (+/- 0.4) # The number of active pores per run is estimated correctly
num_active_pores_per_run__1                2.8 (+/- 0.4) # as we expect for a value we've inputted
num_active_pores_per_run__2                2.8 (+/- 0.4)
num_active_pores_per_run__3                2.8 (+/- 0.3)
num_active_pores_per_run__4                0.0 (+/- 0.1)
num_active_pores_per_run__5                0.0 (+/- 0.1)
num_active_pores_per_run__6                0.0 (+/- 0.0)
num_active_pores_per_run__7                0.0 (+/- 0.0)

ADVI Plots

We can plot the values above too. No real surprises here, though num_active_pores_per_run doesn't show the full distribution for some reason.

with pm.Model() as model:
    pm.traceplot(trace[-1000:],
                 figsize=(12,len(trace.varnames)*1.5),
                 lines={k: v['mean'] for k, v in pm.df_summary(trace[-1000:]).iterrows()})

Conclusions

It's a pretty complex model. Is it actually useful? I think it's almost useful, but in its current form it suffers from a lack of robustness.

The results I get are sensitive to changes in several independent areas:

  • the sampler/inference method: for example, Metropolis returns different answers to ADVI
  • the constraint parameters: changing the "b" parameter can lead to numerical problems
  • the priors I selected: changes in my priors that I do not perceive as important can lead to different results (more data would help here)

A big problem with my current model is that scaling it up to 512 pores seems to be difficult numerically. Metropolis just fails, I think because it can't sample efficiently; NUTS fails for reasons I don't understand (it throws an error about a positive definite matrix); ADVI works best, but starts to get nans as the number of pores grows, unless I loosen the constraints. It's possible that I need to begin with loose constraints and tighten them over time.

Finally, the model currently expects the same number of pores to be available for every run. I haven't addressed that yet, though I think it should be pretty straightforward. There may be a nice theano trick I am overlooking.

More Conclusions

Without ADVI, I think I would have failed to get an answer here. I'm pretty ignorant of how it works, but it seems like a significant advance and I'll definitely use it again. In fact, it would be interesting to try applying Edward, a variational inference toolkit with PyMC3 support, to this problem (this would mean swapping theano for tensorflow).

The process of modeling your data with a PyMC3/Stan/Edward model forces you to think a lot about what is really going on with your data and model (and even after spending quite a bit of time on it, my model still needs quite a bit of work to be more than a toy). When your model has computational problems, as I had several times, it often means the model wasn't described correctly (Gelman's folk theorem).

Although it is still a difficult, technical process, I'm excited about this method of doing science. It seems like the right way to tackle a lot of problems. Maybe with advances like ADVI and theano/tensorflow; groups like the Blei and Gelman labs developing modeling tools like PyMC3, Stan and Edward; and experts like Kruschke and the Stan team creating robust models for us to copy, it will become more common.

Brian Naughton // Sun 05 June 2016 // Filed under biotech // Tags data biotech genomics statistics

A review of interesting things in biotech, genomics, data analysis


Boolean Biotech © Brian Naughton