Comparing bioinformatics workflow systems: nextflow, snakemake, reflow

Brian Naughton // Sun 02 June 2019 // Filed under datascience // Tags datascience cloud gcp aws bioinformatics

There are a lot of bioinformatics workflow languages and systems these days. Albert Vilella maintains a comprehensive spreadsheet and it's massive. A little confusingly, some of these are actual languages (CWL), some are engines for running workflows (Cromwell), and some are self-contained systems (reflow, nextflow, snakemake).

The point of these tools is to take a multi-part process — usually a combination of scripts and binaries — and run it many times, keeping a log of what succeeded and what failed, and managing the resources required.

Among the most popular (or at least ones I've heard of people using) are snakemake, toil, nextflow, luigi, bpipe, scipipe, and Cromwell/WDL.

For a comprehensive review, see this 2017 paper by Jeremy Leipzig. Here, I'm just going to compare an example script from the nextflow website to the equivalent in snakemake and reflow. These are the three workflow systems I am most familiar with, and I think the differences between them are pretty interesting.

Nextflow

Nextflow is probably the most popular workflow system right now. It is powerful, simple to get started with, and very well supported by Paolo Di Tommasso, with plenty of example scripts. It supports running pipelines locally, on a local cluster, or on the cloud. It can use containers (Docker or Singularity) for reproducibility. Nextflow uses AWS Batch as its main cloud backend (apparently everyone is these days).

The basic script below, from the nextflow site, is pretty self-explanatory.

script

#!/usr/bin/env nextflow

params.in = "$baseDir/data/sample.fa"
sequences = file(params.in)

/*
 * split a fasta file in multiple files
 */
process splitSequences {
    input:
    file 'input.fa' from sequences

    output:
    file 'seq_*' into records

    """
    awk '/^>/{f="seq_"++d} {print > f}' < input.fa
    """
}

/*
 * Simple reverse the sequences
 */
process reverse {
    input:
    file x from records

    output:
    stdout result

    """
    cat $x | rev
    """
}

/*
 * print the channel content
 */
result.subscribe { println it }

output

N E X T F L O W  ~  version 19.04.1
Launching basic.rf [hungry_celsius] - revision: c3e3b933c7
[warm up] executor > local
executor >  local (1)
[ea/b4bcb4] process > splitSequences [  0%] 0 of 1
1FDSA>
ACCAAACACAGTA
2FDSA>
ACCAAACACAGTA
3FDSA>
ACCAAACACAGTA
executor >  local (2)
[ea/b4bcb4] process > splitSequences [100%] 1 of 1 
[ef/0377ba] process > reverse        [100%] 1 of 1 
Completed at: 19-May-2019 10:17:39
Duration    : 951ms
CPU hours   : (a few seconds)
Succeeded   : 2

The output is quite comprehensive and easy to understand.

Snakemake

Snakemake, developed by Johannes Köster (also the founder of the bioconda project!) is the oldest of the three workflow systems here. As the name implies, it's sort of like make, but in Python.

Snakemake existed before containers and cloud computing were common. It is filename-focused, which makes it simple to get started with, and pretty easy to debug. In contrast, nextflow and reflow try to abstract away the names of files, sometimes at the cost of simplicity.

Even though I have used snakemake quite a bit, I found it pretty difficult to port the nextflow script above, mainly because it involves splitting one file into multiple. I think there are at least three ways to handle file splitting in snakemake: dynamic files (soon to be deprecated), the new checkpoints (I did not look into this much), and just using directories.

The directory option is the simplest to me, but the result is probably far from ideal snakemake. In previous snakemake pipelines I often relied on something similar: when one step of the pipeline is done, instead of relying on the appearance of a specific filename, I would just output a "sentinel" file, e.g., echo done > {output.sentinel}, and use that as the trigger for the next step.

Since Snakemake is based on Python, it is easy to extend and integrate with other scripts. To copy the nextflow script, here I just use the bash though.

script

rule all:
    input: "rev"

rule split_fasta:
    input: "input.fa"
    output: directory("split")
    run: shell("""awk '/^>/{{f="{output}/seq_"++d}} {{print > f}}' < {input}""")

rule reverse_fasta:
    input: "split"
    output: directory("rev")
    run: shell("""for f in {input}/*; do cat $f |rev > {output}/rev_$(basename $f); done""")

output

Snakemake outputs a directory, rev, containing the files rev_seq_1, rev_seq_2, rev_seq_3.

Reflow

Reflow is a relatively new workflow language developed by at Grail by @marius and colleagues. I've been using Reflow for a year or so, and it's the system I am most familiar with.

One big difference between reflow and the other systems is how constrained it is: simplifying a bit, every function takes (a) a docker container, (b) a file or directory stored in s3, and (c) some bash; and outputs another file or directory stored in s3.

Everything gets run on AWS. You cannot run it on a regular cluster. You can run a reflow pipeline locally but it's mostly for debugging, and feels more like emulating the real AWS pipeline.

script

val files = make("$/files")
val dirs = make("$/dirs")

input := file("s3://booleanbiotech/input.fa")

func Split(input file) dir =
    exec(image := "ubuntu") (out dir) {"
        echo Split {{input}};
        awk '/^>/{f="{{out}}/seq_"++d} {print > f}' < {{input}}
    "}

func Reverse(input file) file =
    exec(image := "ubuntu") (out file) {"
        echo Reverse {{input}};
        cat {{input}} | rev > {{out}}
    "}

split_dir := Split(input)
rev_files := [Reverse(f) | f <- dirs.Files(split_dir)]

val Main = rev_files

output

[file(sha256=sha256:15a48aae8b19682a84dcf39f9260901e0b8331e8f9f974efd11a9a15c950320f, size=16),
 file(sha256=sha256:adad5843ce952755cd5b5b71fd92c7a67be3063d047350cf7fffdd6031ee5c2e, size=16),
 file(sha256=sha256:441ff0551203c350251abc5782698371116cca12a735fa2dbed66de4820b9201, size=16)]

There is logging too, but overall the output of reflow is pretty opaque!

To see the contents of the generated files, you have to use reflow cat 15a48aae8b19682a84dcf39f9260901e0b8331e8f9f974efd11a9a15c950320f. Generally, the last step in your pipeline should copy the outputs to S3, but debugging this can be tough.

Comparing the Three

Snakemake has been great for local pipelines. Its "reports" — html output from stages of the pipeline — are especially nice. It also now supports containers (and conda), so it is far from stagnant. However, I haven't tried to run snakemake pipelines on the cloud, and I suspect since it is so file-focused it might be less suited to this.

I only seriously experimented with nextflow before it had AWS Batch support, which made it unsuitable for my purposes. At that time running pipelines on the cloud required quite a lot of setup, but now this issue seems to be solved. Nextflow development also seems to be the most active of the three. There are even Nextflow meetings, periodically.

Reflow is the least popular, least flexible, and least documented of the three. Nevertheless, as a user who exclusively runs pipelines on the cloud, I still like the reflow model best. It gets closest to my ideal for a workflow system, which is basically just annotating code with the resources required to run it.

This is because to reflow, everything is a hash: containers, code, inputs, outputs. For example, if you have genome A (md5sum(genome_sequence_bytes)=<hash1>), and you run adapter trimming function B (md5sum(script_bytes)=<hash2>), using container C (md5sum(docker_container_bytes)=<hash3>), then you "know" the output is the hash of md5sum(<hash1>+<hash2>+<hash3>)=<hash4>. If the file s3://my-reflow-cache/<hash4> exists, even if it's from an old workflow, you don't need to recalculate! (The above is the gist of how it works, anyway. As usual with caches, this system can also result in tricky bugs...)

For some technical details on differences between nextflow and reflow, this Twitter thread may also be of interest, since it includes some discussion from the authors of the tools.

Conclusions

If I were starting today, I would probably choose nextflow as the safe option, mainly due to its new AWS Batch support. Snakemake still works great, though it may be showing its age a bit, while reflow is excellent if you are AWS-exclusive and can deal with its lack of documentation and support.

I think my ideal workflow language would have reflow-style decorators on Python functions (or a subset of python). I'll be interested to see if the reflow model inspires some new workflow systems like this.


Comments


Boolean Biotech © Brian Naughton Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More