Oxford Nanopore (ONT) sells an amazing, inexpensive sequencer called the
MinION. It's an unusual device in
that the sequencing "flowcells" use protein pores. Unlike a silicon
chip, the pores are sensitive to their environment (especially heat) and
can get damaged or degraded and stop working.
When you receive a flowcell from ONT, only a fraction of the 2048 possible
pores are active, perhaps 800–1500. Because of how the flowcell
works, a run uses at most one pore from each of the 512
"channels" (each of which contains 4 pores).
As the pores interact with the DNA and other stuff in your solution,
they can get gummed up and stop working. It's not unusual to start a run
with 400 pores and end a few hours later with half that many still
active. It's also not unusual to put a flowcell in the fridge and find
it has 20% fewer pores when you take it out. (Conversely, pores can come
back to life after a bit of a rest.)
All told, this means it's quite difficult to tell how much sequence you
can expect from a run. If you want to sequence a plasmid at 100X, and
you are starting with 400 pores, will you get enough data in two hours?
That's the kind of question we want an answer to.
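For scale, here is a naive back-of-envelope version of that question. The 250 bases/second read speed is the nominal R9 rate; the 50% duty cycle and 5 kb plasmid length are made-up assumptions. The rest of this post is about replacing guesses like these with a model:

```python
# Back-of-envelope yield estimate. The duty cycle and plasmid length
# below are made-up assumptions, not ONT specifications.
def expected_yield(num_pores, seconds, bases_per_second=250, duty_cycle=0.5):
    """Rough estimate of total bases, assuming pores spend
    `duty_cycle` of their time actually reading DNA."""
    return num_pores * seconds * bases_per_second * duty_cycle

bases = expected_yield(400, 2 * 3600)   # 400 pores, two hours
coverage = bases / 5000                 # hypothetical 5 kb plasmid
print(bases, coverage)
```

Under those assumptions, 400 pores for two hours would be far more than 100X of a small plasmid; the hard part is that the number of active pores and the duty cycle are exactly the things we don't know.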
import pymc3 as pm
import numpy as np
import scipy as sp
import theano.tensor as T
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from functools import partial
from IPython.display import display, HTML, Image
For testing purposes — and because I don't have nearly enough real data
— I will simulate some data.
In this toy example, DNA of length 1000+/-50 bases is sequenced for
1000, 20000 or 40000 seconds at 250bp/s. After 100 reads, a pore gets
blocked and can no longer sequence. The device I am simulating has only
3 pores (where a real MinION would have 512).
datatypes = [
    ('total_num_reads', int), ('total_dna_read', int),
    ('num_minutes', float), ('num_active_pores_start', int),
    ('num_active_pores_end', int), ('mean_input_dna_len_estimate', float)]
MAX_PORES = 3
AXIS_RUNS, AXIS_PORES = 0, 1
data = np.array([
( MAX_PORES*9, MAX_PORES* 9*1000*.95, 1000.0/60, MAX_PORES, MAX_PORES, 1000.0),
( MAX_PORES*10, MAX_PORES* 10*1000*1.05, 1000.0/60, MAX_PORES, MAX_PORES, 1000.0),
( MAX_PORES*10, MAX_PORES* 10*1000*.95, 1000.0/60, MAX_PORES, MAX_PORES, 1000.0),
( MAX_PORES*11, MAX_PORES* 11*1000*1.05, 1000.0/60, MAX_PORES, MAX_PORES, 1000.0),
( MAX_PORES*100, MAX_PORES*100*1000*.95, 20000.0/60, MAX_PORES, 0, 1000.0),
( MAX_PORES*100, MAX_PORES*100*1000*1.05, 20000.0/60, MAX_PORES, 0, 1000.0),
( MAX_PORES*100, MAX_PORES*100*1000*.95, 40000.0/60, MAX_PORES, 0, 1000.0),
( MAX_PORES*100, MAX_PORES*100*1000*1.05, 40000.0/60, MAX_PORES, 0, 1000.0),
], dtype=datatypes)
MinION Simulator
To help figure out how long to leave our sequencer running, I made a
MinION simulator in a jupyter notebook. I then turned the notebook into
an app using dappled.io, an awesome new
service that can turn jupyter notebooks into apps running on the cloud
(including AWS lambda). Here is the dappled nanopore simulator app /
notebook.
The model is simple: a pore reads DNA until it's completely read, then
with low probability it gets blocked and can no longer sequence. I've
been fiddling with the parameters to give a reasonable result for our
runs, but we don't have too much data yet, so it's not terribly
accurate. (The model itself may also be inaccurate, since how the pore
gets blocked is not really discussed by ONT.)
Nevertheless, the simulator is a useful tool for ballparking how much
sequence we should expect from a run.
Hierarchical Bayesian Model
Here's where things get more complicated...
In theory, given some data from real MinION runs, we should be able to
learn the parameters for a model that would enable us to predict how
much data we would get from a new run. Like many problems, this is a
good fit for Bayesian analysis, since we have data and we want to learn
the most appropriate model given the data.
For each run, I need to know:
- what length DNA I think I started with
- how long the run was in minutes
- how many reads I got
- how much sequence I got
- the number of active pores at the start and at the end of a run
I'll use PyMC3 for this
problem. First, I need to specify the model.
R9 Read Speed
I know that the R9 pore is supposed to read at 250 bases per second. I
believe the mean could be a little bit more or less than that, and I
believe that all flowcells and devices should be about the same speed,
therefore all runs will sample from the same distribution of
mean_read_speed.
Because this is a scalar, pymc3 lets me set a lower bound of 0 bases per
second. (Interestingly, unlike DNA length, the speed could technically
be negative, since the voltage across the pore can be positive or
negative, as exploited by Read
Until...)
Truncated0T1D = pm.Bound(pm.StudentT, lower=0)
with pm.Model() as model:
    mean_read_speed = Truncated0T1D('mean_read_speed', nu=3, mu=250, sd=10, shape=data.shape)
Capturing DNA
DNA is flopping around in solution, randomly entering pores now and
then. How long will a pore have to wait for DNA? I have a very vague
idea that it should take around a minute but this is basically an
unknown that the sampler should figure out.
Note again that I am using the mean time only, not the distribution of
times. The actual distribution of times to capture DNA would likely be
distributed by an exponential distribution (waiting time for a Poisson
process).
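To get a feel for this (using an assumed 60-second mean, which is just my vague guess from above), the exponential waiting times of a Poisson process are highly variable even when the mean is fixed:

```python
import numpy as np

rng = np.random.RandomState(0)
mean_capture_time = 60.0  # assumed mean wait in seconds (a guess, not a measurement)

# If DNA capture is a Poisson process, the waiting times are exponential:
waits = rng.exponential(scale=mean_capture_time, size=100_000)

# For an exponential, the standard deviation equals the mean, so individual
# waits are spread widely around 60 s:
print(waits.mean(), waits.std())
```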
Hierarchical model
This is the first time I am using a hierarchical/multilevel model. For
more on what this means, see this example from pymc3
or Andrew Gelman's books.
There are three options for modeling mean_time_to_capture_dna:
(a) it's the same for each run (e.g., everybody is using the same recommended DNA concentration);
(b) it's independent for each run;
(c) each run has a different mean, but the means are drawn from the same distribution (and
probably should not be too different).
Image("hierarchical_pymc3.png")
with pm.Model() as model:
    prior_mean_time_to_capture_dna = Truncated0T1D('prior_mean_time_to_capture_dna', nu=3, mu=60, sd=30)
    mean_time_to_capture_dna = pm.StudentT('mean_time_to_capture_dna', nu=3, mu=prior_mean_time_to_capture_dna,
                                           sd=prior_mean_time_to_capture_dna/10, shape=data.shape)
Reading DNA
I can use pymc3's Deterministic type to calculate how long a pore spends
reading a chunk of DNA, on average. That's just a division, but note it
uses theano's true_div function instead of a regular division. This
is because neither value is a number; they are random variables. Theano
will calculate this as it's being sampled. (A simple "a/b" also
works, but I like to keep in mind what's being done by theano if
possible.)
with pm.Model() as model:
    mean_time_to_read_dna = pm.Deterministic('mean_time_to_read_dna', T.true_div(mean_input_dna_len, mean_read_speed))
How many reads, then, can each pore complete in this run? I have to be a bit
careful to specify that I mean the number of reads possible per pore.
with pm.Model() as model:
    num_reads_possible_in_time_per_pore = pm.Deterministic('num_reads_possible_in_time_per_pore',
        T.true_div(num_seconds_run, T.add(mean_time_to_capture_dna, mean_time_to_read_dna)))
Blocking Pores
In my model, after each read ends, a pore can get blocked, and once it's
blocked it does not become unblocked. I believe a pore gets blocked after
about one in a hundred reads. If it happened much more often than that,
sequencing would be difficult, but it could be much less often.
We can think of the Beta
distribution
as a prior over the probability of heads in a coin flip (here, the
probability that a pore gets blocked); Beta(1, 99) has a mean of 1/100.
with pm.Model() as model:
    prior_p_pore_blocked_a, prior_p_pore_blocked_b = 1, 99
    p_pore_blocked = pm.Beta('p_pore_blocked', alpha=prior_p_pore_blocked_a, beta=prior_p_pore_blocked_b)
Then, per pore, the number of reads before blockage is distributed as a
geometric distribution (the distribution of the number of tails you will
flip before you flip a head). I have to approximate this with an
exponential distribution — the continuous version of a geometric —
because ADVI (see below) requires continuous distributions. I don't
think it makes an appreciable difference.
Here the shape parameter is (number of runs) x (number of pores),
since I need a value for every pore.
with pm.Model() as model:
    num_reads_before_blockage = pm.Exponential('num_reads_before_blockage', lam=p_pore_blocked,
                                               shape=(data.shape[0], MAX_PORES))
Constraints
Here things get a little more complicated. It is not possible to have
two random variables x and y, set x + y = data, and sample values of x
and y.
testdata = np.random.normal(loc=10, size=100) + np.random.normal(loc=1, size=100)

with pm.Model() as testmodel:
    x = pm.Normal("x", mu=0, sd=10)
    y = pm.Normal("y", mu=0, sd=1)
    # This does not work:
    # z = pm.Deterministic('x+y', x + y, observed=testdata)
    # This does work:
    z = pm.Normal('x+y', mu=x+y, sd=1, observed=testdata)
    trace = pm.sample(1000, step=pm.Metropolis())
I was surprised by this at first but it makes sense. This process is
more like a regression where you minimize error than a perfectly fixed
constraint. The smaller the standard deviation, the more you penalize
deviations from the constraint. However, you need some slack so that
it's always possible to estimate a logp for the data given the
model. If the standard deviation goes too low, you end up with numerical
problems (e.g., nans). Unfortunately, I do not know of a
reasonable way to set this value.
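To illustrate the numerical problem (this is just scipy, outside any model): for a fixed residual between x + y and the observed value, the log-probability collapses as the standard deviation shrinks:

```python
import scipy.stats as st

residual = 0.5  # an arbitrary fixed deviation from the "constraint"
for sd in (1.0, 0.1, 1e-3, 1e-9):
    print(sd, st.norm.logpdf(residual, loc=0.0, scale=sd))
# As sd shrinks, logp plunges toward -inf; in a sampler, values like these
# are where the nans start.
```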
I encode constraints in my model in a similar way: First I define a
Laplace distribution in theano to act as my likelihood. Why Laplace
instead of Normal? It returned nans less often...
Using your own DensityDist allows you to use any function as a
likelihood. I use DensityDist just to have more control over the
likelihood, and to differentiate from a true Normal/Laplacian
distribution. This can be finicky, so I've had to spend a bunch of time
tweaking this.
def T_Laplace(val, obs, b=1):
    return T.log(T.mul(1/(2*b), T.exp(T.neg(T.true_div(T.abs_(T.sub(val, obs)), b)))))
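A numpy mirror of the same density (the function below is mine, for checking, not part of the model) agrees with scipy's built-in Laplace log-pdf:

```python
import numpy as np
import scipy.stats as st

def np_laplace_logp(val, obs, b=1.0):
    # numpy version of T_Laplace above: log of the Laplace density
    return np.log(1/(2*b) * np.exp(-np.abs(val - obs) / b))

# log(1/2) - 2 ≈ -2.693 both ways:
print(np_laplace_logp(3.0, 5.0), st.laplace.logpdf(3.0, loc=5.0, scale=1.0))
```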
Constraint #1
The total DNA I have read must be the product of mean_input_dna_len
and total_num_reads. mean_input_dna_len can be different from my
estimate.
This is a bit redundant since I know total_num_reads and
total_dna_read exactly. In a slightly more complex model we would
have different mean_read_length and mean_input_dna_len. Here it
amounts to just calculating mean_input_dna_len as
total_dna_read / total_num_reads.
def apply_mul_constraint(f1, f2, observed):
    b = 1  # b=1 fails with NUTS (logp=-inf); b=10 and b=100 also fail with NUTS (not positive definite)
    return T_Laplace(T.mul(f1, f2), observed, b)

with pm.Model() as model:
    total_dna_read_constraint = partial(apply_mul_constraint, mean_input_dna_len, data['total_num_reads'])
    constrain_total_dna_read = pm.DensityDist('constrain_total_dna_read', total_dna_read_constraint, observed=data['total_dna_read'])
Constraint #2
The number of reads per pore is whichever is lower:
- the number of reads the pore could manage in the length of the run
- the number of reads it manages before getting blocked
To calculate this, I compare num_reads_before_blockage and
num_reads_possible_in_time_per_pore.
First, I use T.tile to replicate
num_reads_possible_in_time_per_pore for all pores. That turns it
from an array of length #runs to a matrix of shape #runs x #pores (this
should use
broadcasting
but I wasn't sure it was working properly...)
Then I compare the two arrays elementwise (T.lt): if the number of
reads before blockage is the smaller of the two
(T.switch(T.lt(f1,f2),0,1)), that pore got blocked during the run;
otherwise it is still active. I sum these 0/1s over all pores
(axis=AXIS_PORES) to get a count of the number of active pores for each run.
The value of this count is constrained to be equal to
data['num_active_pores_end'].
def apply_count_constraint(num_reads_before_blockage, num_reads_possible_in_time_broadcast, observed):
    b = 1
    num_active_pores = T.sum(T.switch(T.lt(num_reads_before_blockage, num_reads_possible_in_time_broadcast), 0, 1), axis=AXIS_PORES)
    return T_Laplace(num_active_pores, observed, b)

with pm.Model() as model:
    num_reads_possible_in_time_broadcast = T.tile(num_reads_possible_in_time_per_pore, (MAX_PORES, 1)).T
    num_active_pores_end_constraint = partial(apply_count_constraint, num_reads_before_blockage, num_reads_possible_in_time_broadcast)
    constrain_num_active_pores_end = pm.DensityDist('constrain_num_active_pores_end', num_active_pores_end_constraint, observed=data['num_active_pores_end'])
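The counting logic is easier to step through in plain numpy than in the theano graph; here is the same compare-switch-sum on toy numbers (the values below are made up for illustration):

```python
import numpy as np

AXIS_PORES = 1
# Toy values: 2 runs x 3 pores. A pore ends the run blocked if it hits its
# blockage count before running out of time.
reads_before_blockage = np.array([[120.,  90., 300.],
                                  [ 50.,  40.,  60.]])
reads_possible        = np.array([[100., 100., 100.],
                                  [200., 200., 200.]])

blocked = reads_before_blockage < reads_possible  # the T.lt comparison
active = np.where(blocked, 0, 1)                  # the T.switch
print(active.sum(axis=AXIS_PORES))                # active pores per run: [2 0]
```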
Constraint #3
Using the same matrix num_reads_possible_in_time_broadcast this time
I sum the total number of reads from each pore in a run. I simply sum
the minimum value from each pore: either the number before blockage
occurs or the total number possible.
def apply_minsum_constraint(num_reads_before_blockage, num_reads_possible_in_time_broadcast, observed):
    b = 1  # b=1 fails with ADVI and >100 pores (nan)
    min_reads_per_run = T.sum(T.minimum(num_reads_before_blockage, num_reads_possible_in_time_broadcast), axis=AXIS_PORES)
    return T_Laplace(min_reads_per_run, observed, b)

with pm.Model() as model:
    total_num_reads_constraint = partial(apply_minsum_constraint, num_reads_before_blockage, num_reads_possible_in_time_broadcast)
    constrain_total_num_reads = pm.DensityDist('constrain_total_num_reads', total_num_reads_constraint, observed=data['total_num_reads'])
Sampling
There are three principal ways to sample in PyMC3:
- Metropolis-Hastings: the simplest sampler, generally works fine on simpler problems
- NUTS: more efficient than Metropolis, but I've found it to be slow and tricky to get to work
- ADVI: this is "variational inference", the new, fast way to estimate a posterior.
This seems to work great, though as I mention above, it needs continuous distributions only.
I used ADVI most of the time in this project, since NUTS was too slow
and had more numerical issues than ADVI, and the ADVI results seemed
more sensible than Metropolis.
with pm.Model() as model:
    v_params = pm.variational.advi(n=500000, random_seed=1)
    trace = pm.variational.sample_vp(v_params, draws=5000)
Results
In my toy model: there are 8 runs in total; each device has 3 pores; it
takes about 4 seconds to sequence a DNA molecule (~1000 bases / 250
bases per second) and 96 seconds to capture a DNA molecule (so 100
seconds in total to read one DNA molecule).
In my 8 runs, I expect that [10, 10, 10, 10, 200, 200, 400, 400]
reads are possible. However, I expect that runs 4, 5, 6, 7 will have a
blockage at 100 reads.
Finally, I expect there to be 3 pores remaining in the first four runs,
and 0 pores remaining in the last four.
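These expectations follow directly from the simulation's parameters; a quick check:

```python
read_time = 1000 / 250            # ~4 s to sequence a ~1000-base molecule
capture_time = 96                 # ~96 s to capture the next one
cycle = read_time + capture_time  # 100 s per molecule, per pore

per_pore_reads = [seconds / cycle for seconds in (1000, 20000, 40000)]
print(per_pore_reads)  # [10.0, 200.0, 400.0], matching the expectations above
```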
ADVI Results
mean_input_dna_len__0 950.0 (+/- 0.1) # All the input_dna_lens are ok.
mean_input_dna_len__1 1050.0 (+/- 0.1) # This is a simple calculation since I know
mean_input_dna_len__2 950.0 (+/- 0.1) # how many reads there are and how long the reads are
mean_input_dna_len__3 1050.0 (+/- 0.1)
mean_input_dna_len__4 950.0 (+/- 0.1)
mean_input_dna_len__5 1049.9 (+/- 0.1)
mean_input_dna_len__6 950.0 (+/- 0.1)
mean_input_dna_len__7 1050.0 (+/- 0.1)
num_reads_possible_in_time_per_pore__0 10.1 (+/- 1) # The number of reads possible is also a simple calculation
num_reads_possible_in_time_per_pore__1 10.3 (+/- 1) # It depends on the speed of the sequencer and the length of DNA
num_reads_possible_in_time_per_pore__2 10.2 (+/- 1)
num_reads_possible_in_time_per_pore__3 10.5 (+/- 1)
num_reads_possible_in_time_per_pore__4 210.9 (+/- 36)
num_reads_possible_in_time_per_pore__5 207.4 (+/- 35)
num_reads_possible_in_time_per_pore__6 419.8 (+/- 67)
num_reads_possible_in_time_per_pore__7 413.2 (+/- 66)
num_reads_before_blockage_per_run__0 501.3 (+/- 557) # The total number of reads before blockage per run
num_reads_before_blockage_per_run__1 509.8 (+/- 543) # is highly variable when there is no blockage (runs 0-3).
num_reads_before_blockage_per_run__2 501.9 (+/- 512)
num_reads_before_blockage_per_run__3 502.4 (+/- 591)
num_reads_before_blockage_per_run__4 297.2 (+/- 39) # When there is blockage (runs 4-7), then it's much less variable
num_reads_before_blockage_per_run__5 298.7 (+/- 38)
num_reads_before_blockage_per_run__6 299.6 (+/- 38)
num_reads_before_blockage_per_run__7 301.2 (+/- 38)
num_active_pores_per_run__0 2.8 (+/- 0.4) # The number of active pores per run is estimated correctly
num_active_pores_per_run__1 2.8 (+/- 0.4) # as we expect for a value we've inputted
num_active_pores_per_run__2 2.8 (+/- 0.4)
num_active_pores_per_run__3 2.8 (+/- 0.3)
num_active_pores_per_run__4 0.0 (+/- 0.1)
num_active_pores_per_run__5 0.0 (+/- 0.1)
num_active_pores_per_run__6 0.0 (+/- 0.0)
num_active_pores_per_run__7 0.0 (+/- 0.0)
ADVI Plots
We can plot the values above too. No real surprises here, though
num_active_pores_per_run doesn't show the full distribution for some
reason.
with pm.Model() as model:
    pm.traceplot(trace[-1000:],
                 figsize=(12, len(trace.varnames)*1.5),
                 lines={k: v['mean'] for k, v in pm.df_summary(trace[-1000:]).iterrows()})
Conclusions
It's a pretty complex model. Is it actually useful? I think it's almost
useful, but in its current form it suffers from a lack of robustness.
The results I get are sensitive to changes in several independent areas:
- the sampler/inference method: for example, Metropolis returns different answers from ADVI
- the constraint parameters: changing the "b" parameter can lead to numerical problems
- the priors I selected: changes in my priors that I do not perceive as important can lead to
different results (more data would help here)
A big problem with my current model is that scaling it up to 512 pores
seems to be difficult numerically. Metropolis just fails, I think
because it can't sample efficiently; NUTS fails for reasons I don't
understand (it throws an error about a positive definite matrix); ADVI
works best, but starts to get nans as the number of pores grows,
unless I loosen the constraints. I may need to begin with loose
constraints and tighten them over time.
Finally, the model currently expects the same number of pores to be
available for every run. I haven't addressed that yet, though I think it
should be pretty straightforward. There may be a nice theano trick I am
overlooking.
More Conclusions
Without ADVI, I think I would have failed to get an answer here. I'm
pretty ignorant of how it works, but it seems like a significant advance
and I'll definitely use it again. In fact, it would be interesting to
try applying Edward, a variational inference
toolkit with PyMC3 support, to this problem (this would mean swapping
theano for tensorflow).
The process of modeling your data with a PyMC3/Stan/Edward model forces
you to think a lot about what is really going on with your data and
model (and even after spending quite a bit of time on it, my model still
needs quite a bit of work to be more than a toy...) When your model has
computational problems, as I had several times, it often means the model
wasn't described correctly (Gelman's folk
theorem).
Although it is still a difficult, technical process, I'm excited about
this method of doing science. It seems like the right way to tackle a
lot of problems. Maybe with advances like ADVI and theano/tensorflow;
groups like the Blei and Gelman labs developing modeling tools like
PyMC3, Stan and Edward; and experts like Kruschke and the Stan team
creating robust models for us to copy, it will become more common.