Brian Naughton // Tue 17 October 2017 // Filed under genomics // Tags bioinformatics genomics programming

This small project started when I was looking for an implementation of Needleman-Wunsch (pairwise global DNA alignment) in javascript. I just wanted to align two sequences on a website and in a google sheet (using Google Apps Script).

I looked around for a simple javascript implementation (preferably one self-contained javascript file) but I was surprised that I couldn't find one. Needleman-Wunsch is a pretty simple algorithm so I decided to implement it. I did get a little bit side-tracked...

The first step was to find someone else's implementation to copy, so I started with some numpy code from @brent_p. Based on his other work, I think it's a safe assumption that it's implemented correctly. (There is also a complete Cython version in the same repo, which implements gap_extend and other parameters and is obviously much faster, but I really just need a very basic alignment, so the simpler numpy version is fine for me.)
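
For reference, the core of the algorithm is just a dynamic-programming matrix fill. Here is a minimal sketch of that recurrence (my own simplified version with a linear gap penalty and made-up default scores, not @brent_p's code or the exact code benchmarked below):

import numpy as np

def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    # fill an (len(a)+1) x (len(b)+1) dynamic-programming matrix
    n, m = len(a), len(b)
    H = np.zeros((n + 1, m + 1))
    H[0, :] = np.arange(m + 1) * gap   # first row/column: all gaps
    H[:, 0] = np.arange(n + 1) * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i, j] = max(H[i - 1, j - 1] + s,   # (mis)match
                          H[i - 1, j] + gap,     # gap in b
                          H[i, j - 1] + gap)     # gap in a
    return H[n, m]   # global alignment score (traceback omitted)

print(needleman_wunsch_score("GATTACA", "GCATGCU"))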

numpy and friends

There are lots of ways to tweak the numpy implementation of Needleman-Wunsch to try to make it faster. Here are the things I tried:

  1. orig: the original numpy implementation.
  2. orig3: the original numpy implementation run with python 3.
    This is just to test how much faster or slower Python 3.6 is than 2.7.
  3. numpy: my numpy implementation.
    This is like the original numpy code, but modified a bit to make it more like my code.
  4. numba: my numpy implementation, but with numba applied.
    Numba is a pretty amazing JIT compiler you can turn on by adding one line of code. It comes with anaconda, and it's always worth trying just in case.
  5. torch: my numpy implementation, but with numpy replaced with PyTorch.
    PyTorch seems like a friendly alternative to TensorFlow, especially with its numpy-like syntax. Without explicitly applying .cuda() to my arrays, it just uses the CPU, so it should not be too different from regular numpy.
  6. torchcuda: my numpy implementation, but with numpy replaced with PyTorch, and .cuda() applied to each array.
    The same as torch except using the GPU.
  7. cupy: my numpy implementation, but with numpy replaced with CuPy.
    CuPy is a drop-in replacement for numpy and, like PyTorch, only requires changing a couple of lines (see the sketch just after this list).
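
To give a sense of how little changes between these variants: the numba version is essentially the numpy version with one decorator, and the CuPy version is mostly an import swap. A toy illustration (my own, not the benchmarked code):

import numpy as np
import numba

@numba.njit   # the "one line of code" from item 4
def pairwise_scores(a, b, match=1, mismatch=-1):
    # explicit Python loops like these are exactly what numba's JIT speeds up
    out = np.empty((a.shape[0], b.shape[0]))
    for i in range(a.shape[0]):
        for j in range(b.shape[0]):
            out[i, j] = match if a[i] == b[j] else mismatch
    return out

seq1 = np.frombuffer(b"GATTACA", dtype=np.uint8)
seq2 = np.frombuffer(b"GCATGCU", dtype=np.uint8)
scores = pairwise_scores(seq1, seq2)

# for CuPy (item 7), the change is mostly `import cupy as np` in place of
# `import numpy as np`, plus moving results back to the host with cupy.asnumpy()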

Nim

Nim is an interesting language that can compile to C (nimc) or javascript (nimjs). I thought this was a pretty good use-case for nim, since I need javascript but writing scientific code in javascript is not fun. I started with a numpy-like library called arraymancer, which worked well, but since it relies on BLAS it would not compile to javascript (I could have checked that earlier...). Luckily, changing the code to simple for loops was pretty easy. Nim's syntax is a lot like Python's, with some unnecessary differences, like using echo instead of print. As someone used to Python, I didn't find it as friendly as I expected. The dream of Python-with-static-types is still a dream...

Javascript

Finally, I just programmed the alignment in javascript (js). All of the implementations are almost line-for-line identical, so this did not take long.

Speed comparison

I ran all the above implementations on some random DNA of various lengths and the results are plotted below.

import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
# alignment times for each implementation, keyed by sequence length
data = {'orig': {500: 2.23, 1000: 8.50, 1500: 18.96, 2000: 33.45, 2500: 52.70, 3000: 76.44, 5000: 209.90}, 
        'orig3': {500: 2.62, 1000: 10.34, 1500: 22.68, 2000: 40.22, 2500: 62.03, 3000: 90.15, 5000: 248.93}, 
        'numpy': {500: 1.54, 1000: 3.41, 1500: 6.27, 2000: 10.60, 2500: 16.11, 3000: 22.87, 5000: 67.45}, 
        'numba': {500: 5.54, 1000: 7.05, 1500: 9.28, 2000: 13.08, 2500: 17.40, 3000: 22.21, 5000: 56.69}, 
        'torch': {500: 1.67, 1000: 3.92, 1500: 7.61, 2000: 12.86, 2500: 19.55, 3000: 27.48, 5000: 82.90}, 
        'torchcuda': {500: 8.54, 1000: 22.47, 1500: 46.26, 2000: 80.12, 2500: 119.92, 3000: 169.95, 5000: 467.04}, 
        'cupy': {500: 35.71, 1000: 138.86, 1500: 951.97, 2000: 1713.57, 2500: 2660.11, 3000: 3798.51}, 
        'nimc': {500: 0.016, 1000: 0.041, 1500: 0.08, 2000: 0.13, 2500: 0.20, 3000: 0.31, 5000: 0.85}, 
        'nimjs': {500: 0.14, 1000: 0.28, 1500: 0.48, 2000: 0.75, 2500: 1.12, 3000: 1.53, 5000: 4.06}, 
        'js': {500: 0.09, 1000: 0.14, 1500: 0.20, 2000: 0.34, 2500: 0.41, 3000: 0.82, 5000: 1.64}}
vfast_ones = ["nimjs", "js", "nimc"]
fast_ones = ["torch", "numpy", "numba"] + vfast_ones
ok_ones = ["torchcuda", "orig3", "orig"] + fast_ones
df = pd.DataFrame(data)
df
length     cupy     js   nimc  nimjs  numba  numpy    orig   orig3  torch  torchcuda
   500    35.71   0.09  0.016   0.14   5.54   1.54    2.23    2.62   1.67       8.54
  1000   138.86   0.14  0.041   0.28   7.05   3.41    8.50   10.34   3.92      22.47
  1500   951.97   0.20   0.08   0.48   9.28   6.27   18.96   22.68   7.61      46.26
  2000  1713.57   0.34   0.13   0.75  13.08  10.60   33.45   40.22  12.86      80.12
  2500  2660.11   0.41   0.20   1.12  17.40  16.11   52.70   62.03  19.55     119.92
  3000  3798.51   0.82   0.31   1.53  22.21  22.87   76.44   90.15  27.48     169.95
  5000      NaN   1.64   0.85   4.06  56.69  67.45  209.90  248.93  82.90     467.04

I'll skip cupy since it's much slower than everything else and throws the plots off. That doesn't imply anything negative about cupy and I'd use it again. It's extremely easy to replace numpy with cupy, and for properly vectorized code I'm sure it's much faster than numpy.

f,ax = plt.subplots(figsize=(16,12))
ax.set_title("fast: everything except cupy")
_ = df[ok_ones].plot(ax=ax)

[plot_fast: everything except cupy]

f,ax = plt.subplots(figsize=(16,12))
ax.set_title("faster: numpy vs C vs js")
_ = df[fast_ones].plot(ax=ax)

[plot_faster: numpy vs C vs js]

f,ax = plt.subplots(figsize=(16,12))
ax.set_title("fastest: C vs js")
_ = df[vfast_ones].plot(ax=ax)

[plot_fastest: C vs js]

Conclusions

I learned some interesting things here...

  • numba is good. It didn't speed this code up very much, but it was a bit faster than numpy for large alignments and didn't cost anything. I expected this to be the fastest Python-based code because there are several Python for loops (i.e., unvectorized code), which is where numba can help a lot.

  • I'm not sure why my numpy is faster than the original numpy since my changes were minimal. The original version is not coded for speed anyway.

  • GPUs don't help unless your code is written for GPUs. That basically means one repetitive task handed off to the GPU along with the data (no back-and-forth). There are ways to implement Needleman-Wunsch in a GPU-friendly way (see the anti-diagonal sketch after this list), but it complicates the code a lot. On the one hand this is a very obvious result to anyone who has used GPUs for computation; on the other hand, maybe a really smart compiler could use the CPU where appropriate and the GPU where appropriate...

  • Nim is a pretty interesting language. I have seen it described as either "an easier C" or a "statically typed Python". To me it's definitely more like the former. It's not all that friendly compared to Python, but I think I'd try it again as a C/Cython replacement. Don't forget to compile with -d:release.

  • Javascript is fast! If nim is not compiled with -d:release, the javascript is even faster than nim's C code. Sadly, Google Apps Script's javascript is extremely slow for some reason. That was an unfortunate surprise, especially since it times out after about five minutes, so long alignments just fail! I can't explain why it's so slow...
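
For what it's worth, here is a rough sketch of that anti-diagonal ("wavefront") trick: every cell on an anti-diagonal depends only on the two previous anti-diagonals, so each diagonal can be computed as a single vectorized operation, which is the kind of batched work a GPU (via cupy or torch) is good at. This is my own illustration, not the code I benchmarked above:

import numpy as np

def nw_antidiagonal(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    a_arr, b_arr = np.array(list(a)), np.array(list(b))
    H = np.zeros((n + 1, m + 1))
    H[0, :] = np.arange(m + 1) * gap
    H[:, 0] = np.arange(n + 1) * gap
    # cells (i, j) with i + j == d form one anti-diagonal; they depend only
    # on diagonals d-1 and d-2, so each diagonal is one vectorized update
    for d in range(2, n + m + 1):
        i = np.arange(max(1, d - m), min(n, d - 1) + 1)
        j = d - i
        sub = np.where(a_arr[i - 1] == b_arr[j - 1], match, mismatch)
        H[i, j] = np.maximum.reduce([H[i - 1, j - 1] + sub,
                                     H[i - 1, j] + gap,
                                     H[i, j - 1] + gap])
    return H[n, m]

print(nw_antidiagonal("GATTACA", "GCATGCU"))   # same score as the row-by-row version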

Finally, a note of caution: this implementation is good enough for my purposes, but I haven't spent any time making sure it works in all situations (apart from checking that its output matches the original numpy code), so I wouldn't trust it too much. The code is available in this repo.

Brian Naughton // Fri 22 September 2017 // Filed under stats // Tags stats probability bayesianism maxent
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import animation, rc
from IPython.display import HTML, Image

rc('animation', html='html5')
plt.style.use('ggplot')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'

Like many scientists, I've always been confused by and uncomfortable with the menagerie of available statistical tests, how they relate, and when to use them ("use a Chi-square test, unless n<5..."). It's hard to recover the logic underlying diagrams like the one below. Many of these tests rely on the data "being normal", but why? And how normal do they have to be?

statistical tests flowchart

Normal-looking distributions

As we learn, the normal (or Gaussian) distribution arises from the central limit theorem: when you add together many random samples from the same distribution, the sum tends toward a normal. There are some conditions: the samples must be independent and the distribution must have finite variance.
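
A quick simulation of the central limit theorem (my own illustrative numbers: 50 uniform draws, averaged 100,000 times):

import numpy as np
from scipy import stats

means = np.random.uniform(size=(100_000, 50)).mean(axis=1)
print(means.mean(), means.std())                 # ~0.5 and ~sqrt(1/12/50) ~ 0.04
print(stats.skew(means), stats.kurtosis(means))  # both ~0, as expected for a normal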

If we know the underlying process producing the samples meets these conditions, then we can use the normal distribution with confidence. We also learn that even if the above conditions are not met, it's probably ok to just use the normal anyway (e.g., multiplying values instead of adding them, samples not all from the same distribution), or we can apply a transformation to make the distribution "look normal". Then, once we have a normal-looking distribution, we can use a standard \(t\)-test, etc. To many of us, this is not a hugely satisfying way to apply statistics, and it definitely does not feel very scientific.

Additionally, the normal doesn't seem particularly differentiated from other distributions. The binomial, Poisson, normal and t-distributions all look very similar... In fact, they can look very very similar, as we can see below. (Of course, the first three are members of the exponential family, so they are not unrelated, but I'll leave that aside.)

fig, ax = plt.subplots(1, 4, figsize=(16,5))

x = np.linspace(0, 30, 100)
ax[0].set_title("normal")
_ = ax[0].plot(x, stats.norm(10, 3).pdf(x), 'r-', lw=3, alpha=0.6)

x = np.arange(30)
ax[1].set_title("Poisson")
_ = ax[1].plot(x, stats.poisson(10).pmf(x), 'r-', lw=3, alpha=0.6)

x = np.arange(30)
ax[2].set_title("binomial")
_ = ax[2].plot(x, stats.binom(30, 1/3).pmf(x), 'r-', lw=3, alpha=0.6)

x = np.arange(30)
ax[3].set_title("t (df=10)")
_ = ax[3].plot(x, stats.t(10, 10, 3).pdf(x), 'r-', lw=3, alpha=0.6)

[plot: normal, Poisson, binomial, and t (df=10) distributions side by side]

It turns out that these distributions are indeed all closely related, and you can derive them and understand how they are related using relatively simple (high school-ish level) maths. The trick is that you have to use the principle of "maximum entropy".

What is entropy?

I won't attempt to define entropy rigorously, since I am not a statistician and definitions are fiddly things, but I think it suffices to think of a maximum entropy distribution as the smoothest, most even distribution that still satisfies its constraints. In other words, the distribution should not prefer any particular value unless its constraints force it to.

In the simplest example, the maximum entropy distribution bounded by two finite values is just a flat (uniform) distribution (the "principle of indifference").

You can calculate the entropy of a distribution using \(H(X) = -\sum_{k \geq 1} p_k \log p_k\), or for continuous distributions: \(H(X) = -\int_{-\infty}^{\infty} p(x) \log p(x)\,dx\). The concept of entropy is fundamental across several fields including information theory and statistical mechanics, so these formulas may look familiar.
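
For example, scipy can compute these directly (illustrative values for the distributions plotted above):

from scipy import stats
import numpy as np

print(stats.norm(10, 3).entropy())                          # differential entropy of N(10, 3): ~2.52 nats
print(stats.entropy(stats.poisson(10).pmf(np.arange(50))))  # entropy of a Poisson(10) pmf: ~2.6 nats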

Because we can measure the entropy of any distribution, we can define some constraints (e.g., a distribution has bounds 0 to 1, the mean is \(\mu\), etc), and derive the maximum entropy distribution given those constraints, using differentiation to find the maximum and Lagrange multipliers to enforce the constraints.
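
As a concrete (if crude) version of that recipe, the constrained maximization can also be done numerically. Here is a sketch using scipy's SLSQP optimizer, with a made-up support (integers 0 to 20) and a made-up mean constraint of 5; with only a mean constraint, the answer decays geometrically, i.e. a discrete exponential:

import numpy as np
from scipy import optimize

x = np.arange(21)      # support: integers 0..20 (arbitrary)
target_mean = 5.0      # made-up constraint

def neg_entropy(p):
    return np.sum(p * np.log(p + 1e-12))   # minimize -H(p)

constraints = [{'type': 'eq', 'fun': lambda p: p.sum() - 1},
               {'type': 'eq', 'fun': lambda p: (p * x).sum() - target_mean}]
p0 = np.full(len(x), 1 / len(x))
res = optimize.minimize(neg_entropy, p0, bounds=[(0, 1)] * len(x),
                        constraints=constraints, method='SLSQP')
print(res.x[:5])   # roughly geometric decay, as expected for a mean-only constraint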

For a proper description and derivation, see Data Analysis, A Bayesian Tutorial by Sivia & Skilling (Section 5.3). That is where I first learned about this, and it is still the best treatment I have seen. Statistical Rethinking by McElreath (Section 9.1) also discusses this topic; Information Theory by MacKay (Section 22.13) (a free book) discusses it in passing, and I'm sure a lot of other books do too.

I will just give the gist, but hopefully it's near enough correct.

Deriving the normal distribution

To derive the normal distribution, we start with this formula (Sivia & Skilling, Section 5.3.2):

Sivia & Skilling, 5.3.2

It looks complicated, but it's really just the entropy formula with two Lagrange multipliers (and three constraints): the sum of \(p\)'s is \(1\) (because it's a probability distribution), the mean is \(\mu\), and the variance is \(\sigma^2\). If we take the derivative to find the maximum entropy — by setting \({\partial}Q/{\partial}p = 0\) — out pops the normal distribution in just a couple of steps (see the book for details!)
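
Written out (my reconstruction, treating \(\mu\) as given so that only two multipliers appear; the book's notation may differ slightly), the functional being maximized is roughly:

\[
Q = -\sum_i p_i \log p_i \,+\, \lambda_0 \Big(1 - \sum_i p_i\Big) \,+\, \lambda_1 \Big(\sigma^2 - \sum_i (x_i - \mu)^2 p_i\Big)
\]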

So, the normal distribution is the "fewest-assumptions" distribution there is if you have only a mean and a variance, and that is partially why it's so broadly applicable! That also explains (I think) why it's generally ok to use the normal distribution even when its conditions (i.i.d. samples, finite variance, etc) have not been met. It's about as general a distribution as you can get.

Other distributions

Sivia derives a few other fundamental distributions using maximum entropy:

  • Exponential: this is derived identically to the normal, but with no variance constraint (just the mean).
  • Binomial: this is derived using similar methods, but starting with the formula for \({n \choose k}\).
  • Poisson: this is derived similarly to the binomial, though it also requires a Taylor expansion.

Sivia also draws some interesting connections between the distributions:

  • The binomial can be approximated by the Poisson, for large \(N\) and small \(p\).
  • The Poisson can be approximated by the normal, for large \(\mu\), using Stirling's approximation.

I think this sums up pretty nicely how these common distributions are related and why they all look so similar.
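
As a quick numerical sanity check of those two approximations (arbitrary example parameters, my own illustration):

import numpy as np
from scipy import stats

k = np.arange(25)
binom = stats.binom(1000, 0.01).pmf(k)       # large N, small p
poisson = stats.poisson(10).pmf(k)           # same mean (10)
normal = stats.norm(10, np.sqrt(10)).pdf(k)  # same mean and variance

print(np.abs(binom - poisson).max())   # small: binomial ~ Poisson for large N, small p
print(np.abs(poisson - normal).max())  # small: Poisson ~ normal for large mu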

Which distributions correspond to which constraints?

There is a nice wikipedia article on maximum entropy probability distributions, with a table of twenty or so distributions and the associated constraints (below are just a subset):

maximum entropy distributions

Interestingly, sometimes apparently simple constraints (e.g., bounds are \(a\) to \(b\), mean is \(\mu\)) can produce complicated answers.

The fat-tailed \(t\)-distribution

The \(t\)-distribution, of \(t\)-test fame, is not covered in the above section of Sivia & Skilling, though it can be derived using maximum entropy too. One description of the \(t\)-distribution that shows its connection to the normal is that it represents samples from a normal distribution of uncertain variance (hence its common description as "like the normal distribution with fatter tails"). As the number of degrees of freedom grows, the \(t\)-distribution approaches the normal distribution. At \(\geq10\) degrees of freedom, the \(t\) and normal distributions are difficult to distinguish (see my plots at the start of this post).

Another interesting thing about the \(t\)-distribution versus the normal is that MacKay claims the normal distribution is not really a good "real world" distribution, since its tails are too light (Information Theory, Section 23.2). For example, fatter tails are more forgiving of outliers, and the natural world has a lot of outliers: height is almost the canonical example of a normally distributed trait, but dwarfism and gigantism produce heights many more standard deviations from the mean than a normal distribution expects.
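
To put numbers on those fatter tails, here is the probability of seeing a value greater than 4 under a standard normal versus a \(t\)-distribution with 3 degrees of freedom (an arbitrary choice):

from scipy import stats

print(stats.norm.sf(4))     # ~3e-5
print(stats.t(df=3).sf(4))  # ~1.4e-2, hundreds of times larger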

For any natural data, it might actually be better overall to use a \(t\)-distribution or another fatter-tailed distribution. Of course, that makes things like ordinary least squares harder to solve analytically, but we have powerful computers now that can solve them numerically, so maybe it shouldn't matter? Similarly, when choosing priors, experts will often recommend fatter-tailed distributions like the \(t\)-distribution or the related half-Cauchy over the normal, especially for scale parameters (e.g., standard deviation).

Conclusion

Many of the common distributions we come across in science can be understood from the perspective of deriving the maximum entropy distribution subject to some constraints. Relatedly, some of the most fundamental distributions can be understood as approximations of each other under different conditions. This perspective on probability distributions was not at all obvious to me for a long time, but I wish probability and statistics were taught more in this style instead of the "cookbook" style.

With the growth in probabilistic programming (Stan, PyMC3, Edward), sometimes replacing standard statistical tests, it is becoming more important to think about which distributions really represent your data and why. It's not always normal.




Appendix

It's not difficult to prove that the maximum entropy distribution (with no constraints other than fixed bounds) is uniform, but here's a simulation anyway. The highest-entropy distribution found by the simulation looks uniform, and the lowest-entropy distribution looks very skewed.

# simulate many random discrete distributions over N bins and keep the
# highest- and lowest-entropy ones
min_ent, max_ent = np.inf, -np.inf
N = 11

for _ in range(10000):
    # build a random empirical distribution from 500 draws over N bins
    vals = np.zeros(N)
    for _ in range(500):
        vals[np.random.randint(N)] += 1
    vals /= sum(vals)

    ent = (-vals*np.log(vals)).sum()   # discrete entropy, in nats

    if ent > max_ent:
        max_vals = vals.copy()
        max_ent = ent
    if ent < min_ent:
        min_vals = vals.copy()
        min_ent = ent

fig, ax = plt.subplots(1, 2, figsize=(16,5))
ax[0].set_ylim([0, .2])
ax[1].set_ylim([0, .2])
ax[0].set_title("highest entropy ({:.2f})".format(max_ent))
ax[1].set_title("lowest entropy ({:.2f})".format(min_ent))

x = np.linspace(0, 10, N)
_ = ax[0].plot(x, max_vals)
_ = ax[1].plot(x, min_vals)

[plot: highest-entropy (roughly uniform) vs lowest-entropy (skewed) simulated distributions]

The beta distribution is also a maximum entropy distribution bounded by 0 and 1 (given constraints on the expected values of \(\log x\) and \(\log (1-x)\), per the wikipedia table above). It also starts to look pretty normal after a few observations (coin flips in the animation below). (Thanks to the louistao.me blog for instructions on how to animate it.)

ps = np.linspace(0, 1, 101)
n_start, k_start = 0, 0

def animate(i):
    # map the frame number i to the number of heads j: ramp up (i <= 40),
    # hold (40 < i <= 50), ramp back down (50 < i <= 90), then reset
    if i > 90:   j = 0
    elif i > 50: j = 90 - i
    elif i > 40: j = 40
    else:        j = i

    n = j*2 + n_start   # total flips (heads stay at half of flips)
    k = j + k_start     # heads

    y = (ps**k)*((1-ps)**(n-k))   # binomial likelihood, proportional to Beta(k+1, n-k+1)
    y /= max(y)
    line.set_data(ps, y)
    line.set_label("heads:{:2d} flips:{:2d}".format(j,n))
    legend = plt.legend(prop={'family': 'monospace'})
    return (line,)

def init():
    line.set_data([], [])
    return (line,)

fig, ax = plt.subplots(figsize=(8,4))
ax.set_xlim((0, 1))
ax.set_ylim((0, 1.15))
line, = ax.plot([], [], lw=2)
anim = animation.FuncAnimation(fig, animate, init_func=init, frames=100, interval=100, blit=True)
anim
Brian Naughton // Mon 11 September 2017 // Filed under biotech // Tags biotech vc

Y Combinator recently announced that they want to do more biotech, specifically "health and synthetic biologies". This seems like a good thing in general, since there aren't too many incubators for biotech out there. IndieBio is the biggest, but they are completely focused on biotech, even providing lab space.

So what are YC actually investing in? Here are the biotech companies from the 2017 Winter and Summer batches (data from techcrunch: 1 2 3 4):

  • Darmiyan: Alzheimer's diagnostic
    We definitely need better Alzheimer's diagnostics, so this seems like a great thing. Usually these diagnostics are designed to help enroll prodromal patients in clinical trials (e.g., Avid), but this one seems to be for screening too, at least according to techcrunch. It's unclear to me what their technology is, though it appears to be MRI-based.
  • HelpWear: heart attack-detecting wearable
    This watch monitors heart palpitations, arrhythmias and heart attacks. I'd buy one if I were at risk... It will enter clinical trials in the "near future."
  • Oncobox: cancer genetic test
    This test appears to match cancer drugs to patients, so I guess it's similar to Foundation Medicine, though with "full DNA and RNA profiles". Usually, it's pretty expensive to do anything novel like this, since you have to convince doctors that it makes sense. So you need to fund a trial, or ten.
  • Forever Labs: autologous stem cells
    Pretty cool, but reminiscent of the whole cord blood thing. It also reminds me of the companies you can send your surplus teeth to (there are several!), which is an odd, but noninvasive, way to get stem cells from kids.
  • Cambridge Cancer Genomics: "next gen liquid biopsy, AI, smart genomics"
    This is quite a few buzzwords, especially for a British company. They say they are "applying our proprietary analysis to a tumour's genomic features", to help guide treatment. The team certainly looks solid, but similarly to Oncobox, they will likely need a bunch of money to prove this works.
  • Modern Fertility: at-home fertility test
    These guys appear to be packaging a panel of useful fertility tests for home use. I don't think they have to invent anything new here, which is probably a good thing. It's "physician-ordered", which avoids FDA involvement (and is also a good phrase to search for in "DTC" genetic tests...)
  • BillionToOne: NIPT for developing countries
    (Disclaimer: I know these guys). This one makes sense to me. NIPT is a great technology that should be brought to every country.
  • PreDxion: blood test for the ER
    This appears to be a point-of-care (POC) blood test. I don't doubt the need for new tests like this, but they'll need to go through FDA, which is a long road.
  • Clear Genetics: automated genetic counseling
    Automating genetic counseling is necessary because there are only a couple of thousand genetic counselors in the US and the tests are getting more common and more complex. It's a bit hard to believe genetic counseling is a $5B market in the US though, since that implies $2M+ revenue per genetic counselor.
  • Delee: a circulating tumor cell diagnostic
    I thought CTCs were done after On-Q-Ity but it's been a while and there's probably something new and interesting to do here. They've completed a small trial already, which is great.
  • AlemHealth: radiology imaging diagnostic
    Like BillionToOne, AlemHealth is bringing a known-useful technology to emerging markets. Perhaps surprisingly, I believe this kind of global radiology outsourcing is already common in the US (one large healthcare system sends images to Brazil and India, apparently).
  • Indee: CRISPR research tool
    This is apparently for "developing and manufacturing" cell therapies by gene editing. It sounds novel, cell therapies are here for the long-term, and there are dozens of CAR-T companies around to pay for it, so it could be cool.
  • InnaMed: home blood-testing device
    This seems to be a cartridge-based blood diagnostic, kind of like Cepheid, except for home use. I don't know if doctors like to bill for these visits or not, which can kill adoption, but if InnaMed can pull it off it seems like a great thing... Like PreDxion, they'll need FDA clearance.
  • Volt Health: electrical stimulation medical device
    This seems to be a topical neurostimulation device to treat incontinence. It looks like one of those electrical muscle stimulation belts for your abs. I instinctively like this because incontinence is one of those massive problems nobody wants to work on. Maybe the trial will be inexpensive too, since the risks seem low.

My sole criterion for a company to make the list was that it may involve FDA, CLIA/CAP, or similar. There may be omissions, since I just skimmed through the techcrunch articles.

In 2015, 18 out of 108 YC companies were labeled "biomedical". My criteria are stricter, and I count 14 out of 210 companies from 2017 as "biotech". As a proportion (roughly 7% vs 17%), this appears to be lower than, or at least not significantly higher than, 2015, somewhat contradicting the YC quote.

Interestingly, the 2015 batch were also much more computation- and therapeutics-focused, including 20n, atomwise, Transcriptic and Notable Labs. Synbio leader Ginkgo was funded in 2014. 11 of the 14 biotechs from 2017 are diagnostics (including Clear Genetics), one is a therapy / medical device (Volt), one a research tool (Indee), and one a service I can't classify (Forever Labs).

It's a curious set of companies, surprisingly light on computation- or data-driven companies, which you'd think would be YC's strength. Also notably, I don't see any synthetic biology companies (arguably Indee?) and only Volt Health is therapeutic. By contrast, IndieBio has many (at least half?) synthetic biology companies, and several therapeutics.

Because the list is so diagnostics-focused, many of the companies on this list will need expensive trials to properly enter their markets. Perhaps the YC program is setting them up for a raise from traditional biotech VCs? I don't yet see what YC wants to do in biotech, but their statement about doing more is only a few months old, so it will be interesting to see what happens in 2018.

Brian Naughton // Tue 27 June 2017 // Filed under biotech // Tags biotech drug development

Some notes on drug development: classes of drug targets and therapies.

Read More
Brian Naughton // Mon 06 February 2017 // Filed under biotech // Tags biotech transcriptic snakemake synthetic biology

How to automate a protein synthesis pipeline using Transcriptic and Snakemake

Read More
Brian Naughton // Thu 26 January 2017 // Filed under biotech // Tags biotech iolt nodemcu arduino

An internet-connected lab thermometer

Read More
Brian Naughton // Mon 10 October 2016 // Filed under genomics // Tags genomics nanopore

What's been going on with Oxford Nanopore in 2016

Read More
Brian Naughton // Sun 14 August 2016 // Filed under data // Tags data genomics statistics pymc3 nanopore

Estimating parameters for a toy nanopore simulator using a hierarchical Bayesian model in PyMC3

Read More
Brian Naughton // Sun 05 June 2016 // Filed under biotech // Tags data biotech genomics statistics

A review of interesting things in biotech, genomics, data analysis

Read More
Brian Naughton // Sun 10 April 2016 // Filed under data // Tags data machine learning neural nets deep learning threejs webgl

Convolutional neural net playing a 3D game using convnetjs and three.js

Read More
