Brian Naughton // Sun 02 June 2019 // Filed under datascience // Tags datascience cloud gcp aws bioinformatics

There are a lot of bioinformatics workflow languages and systems these days. Albert Vilella maintains a comprehensive spreadsheet and it's massive. A little confusingly, some of these are actual languages (CWL), some are engines for running workflows (Cromwell), and some are self-contained systems (reflow, nextflow, snakemake).

The point of these tools is to take a multi-part process — usually a combination of scripts and binaries — and run it many times, keeping a log of what succeeded and what failed, and managing the resources required.

Among the most popular (or at least ones I've heard of people using) are snakemake, toil, nextflow, luigi, bpipe, scipipe, and Cromwell/WDL.

For a comprehensive review, see this 2017 paper by Jeremy Leipzig. Here, I'm just going to compare an example script from the nextflow website to the equivalent in snakemake and reflow. These are the three workflow systems I am most familiar with, and I think the differences between them are pretty interesting.

Nextflow

Nextflow is probably the most popular workflow system right now. It is powerful, simple to get started with, and very well supported by Paolo Di Tommasso, with plenty of example scripts. It supports running pipelines locally, on a local cluster, or on the cloud. It can use containers (Docker or Singularity) for reproducibility. Nextflow uses AWS Batch as its main cloud backend (apparently everyone is these days).

The basic script below, from the nextflow site, is pretty self-explanatory.

script

#!/usr/bin/env nextflow

params.in = "$baseDir/data/sample.fa"
sequences = file(params.in)

/*
 * split a fasta file in multiple files
 */
process splitSequences {
    input:
    file 'input.fa' from sequences

    output:
    file 'seq_*' into records

    """
    awk '/^>/{f="seq_"++d} {print > f}' < input.fa
    """
}

/*
 * Simple reverse the sequences
 */
process reverse {
    input:
    file x from records

    output:
    stdout result

    """
    cat $x | rev
    """
}

/*
 * print the channel content
 */
result.subscribe { println it }

output

N E X T F L O W  ~  version 19.04.1
Launching basic.rf [hungry_celsius] - revision: c3e3b933c7
[warm up] executor > local
executor >  local (1)
[ea/b4bcb4] process > splitSequences [  0%] 0 of 1
1FDSA>
ACCAAACACAGTA
2FDSA>
ACCAAACACAGTA
3FDSA>
ACCAAACACAGTA
executor >  local (2)
[ea/b4bcb4] process > splitSequences [100%] 1 of 1 
[ef/0377ba] process > reverse        [100%] 1 of 1 
Completed at: 19-May-2019 10:17:39
Duration    : 951ms
CPU hours   : (a few seconds)
Succeeded   : 2

The output is quite comprehensive and easy to understand.

Snakemake

Snakemake, developed by Johannes Köster (also the founder of the bioconda project!) is the oldest of the three workflow systems here. As the name implies, it's sort of like make, but in Python.

Snakemake existed before containers and cloud computing were common. It is filename-focused, which makes it simple to get started with, and pretty easy to debug. In contrast, nextflow and reflow try to abstract away the names of files, sometimes at the cost of simplicity.

Even though I have used snakemake quite a bit, I found it pretty difficult to port the nextflow script above, mainly because it involves splitting one file into multiple. I think there are at least three ways to handle file splitting in snakemake: dynamic files (soon to be deprecated), the new checkpoints (I did not look into this much), and just using directories.

The directory option is the simplest to me, but the result is probably far from ideal snakemake. In previous snakemake pipelines I often relied on something similar: when one step of the pipeline is done, instead of relying on the appearance of a specific filename, I would just output a "sentinel" file, e.g., echo done > {output.sentinel}, and use that as the trigger for the next step.

Since Snakemake is based on Python, it is easy to extend and integrate with other scripts. To copy the nextflow script, here I just use the bash though.

script

rule all:
    input: "rev"

rule split_fasta:
    input: "input.fa"
    output: directory("split")
    run: shell("""awk '/^>/{{f="{output}/seq_"++d}} {{print > f}}' < {input}""")

rule reverse_fasta:
    input: "split"
    output: directory("rev")
    run: shell("""for f in {input}/*; do cat $f |rev > {output}/rev_$(basename $f); done""")

output

Snakemake outputs a directory, rev, containing the files rev_seq_1, rev_seq_2, rev_seq_3.

Reflow

Reflow is a relatively new workflow language developed by at Grail by @marius and colleagues. I've been using Reflow for a year or so, and it's the system I am most familiar with.

One big difference between reflow and the other systems is how constrained it is: simplifying a bit, every function takes (a) a docker container, (b) a file or directory stored in s3, and (c) some bash; and outputs another file or directory stored in s3.

Everything gets run on AWS. You cannot run it on a regular cluster. You can run a reflow pipeline locally but it's mostly for debugging, and feels more like emulating the real AWS pipeline.

script

val files = make("$/files")
val dirs = make("$/dirs")

input := file("s3://booleanbiotech/input.fa")

func Split(input file) dir =
    exec(image := "ubuntu") (out dir) {"
        echo Split {{input}};
        awk '/^>/{f="{{out}}/seq_"++d} {print > f}' < {{input}}
    "}

func Reverse(input file) file =
    exec(image := "ubuntu") (out file) {"
        echo Reverse {{input}};
        cat {{input}} | rev > {{out}}
    "}

split_dir := Split(input)
rev_files := [Reverse(f) | f <- dirs.Files(split_dir)]

val Main = rev_files

output

[file(sha256=sha256:15a48aae8b19682a84dcf39f9260901e0b8331e8f9f974efd11a9a15c950320f, size=16),
 file(sha256=sha256:adad5843ce952755cd5b5b71fd92c7a67be3063d047350cf7fffdd6031ee5c2e, size=16),
 file(sha256=sha256:441ff0551203c350251abc5782698371116cca12a735fa2dbed66de4820b9201, size=16)]

There is logging too, but overall the output of reflow is pretty opaque!

To see the contents of the generated files, you have to use reflow cat 15a48aae8b19682a84dcf39f9260901e0b8331e8f9f974efd11a9a15c950320f. Generally, the last step in your pipeline should copy the outputs to S3, but debugging this can be tough.

Comparing the Three

Snakemake has been great for local pipelines. Its "reports" — html output from stages of the pipeline — are especially nice. It also now supports containers (and conda), so it is far from stagnant. However, I haven't tried to run snakemake pipelines on the cloud, and I suspect since it is so file-focused it might be less suited to this.

I only seriously experimented with nextflow before it had AWS Batch support, which made it unsuitable for my purposes. At that time running pipelines on the cloud required quite a lot of setup, but now this issue seems to be solved. Nextflow development also seems to be the most active of the three. There are even Nextflow meetings, periodically.

Reflow is the least popular, least flexible, and least documented of the three. Nevertheless, as a user who exclusively runs pipelines on the cloud, I still like the reflow model best. It gets closest to my ideal for a workflow system, which is basically just annotating code with the resources required to run it.

This is because to reflow, everything is a hash: containers, code, inputs, outputs. For example, if you have genome A (md5sum(genome_sequence_bytes)=<hash1>), and you run adapter trimming function B (md5sum(script_bytes)=<hash2>), using container C (md5sum(docker_container_bytes)=<hash3>), then you "know" the output is the hash of md5sum(<hash1>+<hash2>+<hash3>)=<hash4>. If the file s3://my-reflow-cache/<hash4> exists, even if it's from an old workflow, you don't need to recalculate! (The above is the gist of how it works, anyway. As usual with caches, this system can also result in tricky bugs...)

For some technical details on differences between nextflow and reflow, this Twitter thread may also be of interest, since it includes some discussion from the authors of the tools.

Conclusions

If I were starting today, I would probably choose nextflow as the safe option, mainly due to its new AWS Batch support. Snakemake still works great, though it may be showing its age a bit, while reflow is excellent if you are AWS-exclusive and can deal with its lack of documentation and support.

I think my ideal workflow language would have reflow-style decorators on Python functions (or a subset of python). I'll be interested to see if the reflow model inspires some new workflow systems like this.

Comment
Brian Naughton // Tue 17 October 2017 // Filed under genomics // Tags bioinformatics genomics programming

This small project started when I was looking for an implementation of Needleman-Wunsch (pairwise global DNA alignment) in javascript. I just wanted to align two sequences on a website and in a google sheet (using Google Apps Script).

I looked around for a simple javascript implementation (preferably one self-contained javascript file) but I was surprised that I couldn't find one. Needleman-Wunsch is a pretty simple algorithm so I decided to implement it. I did get a little bit side-tracked...

The first step was to find someone else's implementation to copy, so I started with some numpy code from @brent_p. Based on his other work, I think it's a safe assumption it's implemented correctly. (There is also a complete Cython version in this repo, which implements gap_extend and other parameters, and is obviously much faster. I really just need a very basic alignment so the simpler numpy version is fine for me).

numpy and friends

There are lots of ways to tweak the numpy implementation of Needleman-Wunsch to try to make it faster. Here are the things I tried:

  1. orig: the original numpy implementation.
  2. orig3: the original numpy implementation run with python 3.
    This is just to test how much faster or slower Python 3.6 is than 2.7.
  3. numpy: my numpy implementation.
    This is like the original numpy code, but modified a bit to make it more like my code.
  4. numba: my numpy implementation, but with numba applied.
    Numba is a pretty amazing JIT compiler you can turn on by adding one line of code. It comes with anaconda, and it's always worth trying just in case.
  5. torch: my numpy implementation, but with numpy replaced with PyTorch.
    PyTorch seems like a friendly alternative to TensorFlow, especially with its numpy-like syntax. Without explicitly applying .cuda() to my arrays it just uses the CPU, so it should not be too different to regular numpy.
  6. torchcuda: my numpy implementation, but with numpy replaced with PyTorch, and .cuda() applied to each array.
    The same as torch except using the GPU.
  7. cupy: my numpy implementation, but with numpy replaced with CuPy.
    CuPy is a drop-in replacement for numpy and, like PyTorch, only requires changing a couple of lines.

Nim

Nim is an interesting language that can compile to C (nimc) or javascript (nimjs). I thought this was a pretty good use-case for nim since I need javascript but writing scientific code in javascript is not fun. I started with a numpy-like library called arraymancer, which worked well, but since it relies on BLAS it would not compile to javascript (I could have checked that earlier...) Luckily, changing the code to simple for loops was pretty easy. Nim's syntax is a lot like Python, with some unnecessary differences like using echo instead of print. As someone used to Python, I didn't find it to be as friendly as I expected. The dream of Python-with-static-types is still a dream...

Javascript

Finally, I just programmed the alignment in javascript (js). All of the implementations are almost line-for-line identical, so this did not take long.

Speed comparison

I ran all the above implementations on some random DNA of various lengths and the results are plotted below.

import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
data = {'orig': {500: 2.23, 1000: 8.50, 1500: 18.96, 2000: 33.45, 2500: 52.70, 3000: 76.44, 5000: 209.90}, 
        'orig3': {500: 2.62, 1000: 10.34, 1500: 22.68, 2000: 40.22, 2500: 62.03, 3000: 90.15, 5000: 248.93}, 
        'numpy': {500: 1.54, 1000: 3.41, 1500: 6.27, 2000: 10.60, 2500: 16.11, 3000: 22.87, 5000: 67.45}, 
        'numba': {500: 5.54, 1000: 7.05, 1500: 9.28, 2000: 13.08, 2500: 17.40, 3000: 22.21, 5000: 56.69}, 
        'torch': {500: 1.67, 1000: 3.92, 1500: 7.61, 2000: 12.86, 2500: 19.55, 3000: 27.48, 5000: 82.90}, 
        'torchcuda': {500: 8.54, 1000: 22.47, 1500: 46.26, 2000: 80.12, 2500: 119.92, 3000: 169.95, 5000: 467.04}, 
        'cupy': {500: 35.71, 1000: 138.86, 1500: 951.97, 2000: 1713.57, 2500: 2660.11, 3000: 3798.51}, 
        'nimc': {500: 0.016, 1000: 0.041, 1500: 0.08, 2000: 0.13, 2500: 0.20, 3000: 0.31, 5000: 0.85}, 
        'nimjs': {500: 0.14, 1000: 0.28, 1500: 0.48, 2000: 0.75, 2500: 1.12, 3000: 1.53, 5000: 4.06}, 
        'js': {500: 0.09, 1000: 0.14, 1500: 0.20, 2000: 0.34, 2500: 0.41, 3000: 0.82, 5000: 1.64}}
vfast_ones = ["nimjs", "js", "nimc"]
fast_ones = ["torch", "numpy", "numba"] + vfast_ones
ok_ones = ["torchcuda", "orig3", "orig"] + fast_ones
df = pd.DataFrame(data)
df
cupy js nimc nimjs numba numpy orig orig3 torch torchcuda
500 35.71 0.09 0.01 0.13 5.54 1.54 2.23 2.62 1.67 8.54
1000 138.86 0.13 0.04 0.28 7.05 3.41 8.50 10.34 3.92 22.47
1500 951.97 0.20 0.08 0.48 9.28 6.27 18.96 22.68 7.61 46.26
2000 1713.57 0.34 0.13 0.75 13.08 10.60 33.45 40.22 12.86 80.12
2500 2660.11 0.41 0.20 1.12 17.49 16.11 52.70 62.03 19.55 119.92
3000 3798.51 0.82 0.31 1.53 22.21 22.87 76.44 90.15 27.48 169.95
5000 NaN 1.64 0.85 4.06 56.69 67.45 209.90 248.93 82.90 467.04

I'll skip cupy since it's much slower than everything else and throws the plots off. That doesn't imply anything negative about cupy and I'd use it again. It's extremely easy to replace numpy with cupy, and for properly vectorized code I'm sure it's much faster than numpy.

f,ax = plt.subplots(figsize=(16,12))
ax.set_title("fast: everything except cupy")
_ = df[ok_ones].plot(ax=ax)

plot_fast

f,ax = plt.subplots(figsize=(16,12))
ax.set_title("faster: numpy vs C vs js")
_ = df[fast_ones].plot(ax=ax)

plot_faster

f,ax = plt.subplots(figsize=(16,12))
ax.set_title("fastest: C vs js")
_ = df[vfast_ones].plot(ax=ax)

plot_fastest

Conclusions

I learned some interesting things here...

  • numba is good. It didn't speed this code up very much, but it was a bit faster than numpy for large alignments and didn't cost anything. I expected this to be the fastest Python-based code because there are several Python for loops (i.e., unvectorized code), which is where numba can help a lot.

  • I'm not sure why my numpy is faster than the original numpy since my changes were minimal. The original version is not coded for speed anyway.

  • GPUs don't help unless your code is written for GPUs. That basically means one repetitive task handed off to the GPU along with the data (no back-and-forth). There are ways to implement Needleman-Wunsch in a GPU-friendly way, but it complicates the code a lot. On the one hand this is a very obvious result to anyone who has used GPUs for computation — on the other hand, maybe a really smart compiler could use the CPU where appropriate and GPU where appropriate...

  • Nim is a pretty interesting language. I have seen it described as either "an easier C" or a "statically typed Python". To me it's definitely more like the former. It's not all that friendly compared to Python, but I think I'd try it again as a C/Cython replacement. Don't forget to compile with -d:release.

  • Javascript is fast! If nim is not compiled with -d:release it's even faster than nim's C code. Sadly, Google Apps Scripts' javascript is extremely slow for some reason. That was an unfortunate surprise, especially since it times out after about five minutes, so long alignments just fail! I can't explain why it's so slow...

Finally, just to note that this implementation is good enough for my purposes, but I haven't really spent any time making sure it works in all situations (apart from affirming that its output is the same as the original numpy code), so I wouldn't trust it too much. The code is available in this repo.

Comment

Boolean Biotech © Brian Naughton Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More