Brian Naughton // Sun 02 June 2019 // Filed under datascience // Tags datascience cloud gcp aws bioinformatics

There are a lot of bioinformatics workflow languages and systems these days. Albert Vilella maintains a comprehensive spreadsheet and it's massive. A little confusingly, some of these are actual languages (CWL), some are engines for running workflows (Cromwell), and some are self-contained systems (reflow, nextflow, snakemake).

The point of these tools is to take a multi-part process — usually a combination of scripts and binaries — and run it many times, keeping a log of what succeeded and what failed, and managing the resources required.

Among the most popular (or at least ones I've heard of people using) are snakemake, toil, nextflow, luigi, bpipe, scipipe, and Cromwell/WDL.

For a comprehensive review, see this 2017 paper by Jeremy Leipzig. Here, I'm just going to compare an example script from the nextflow website to the equivalent in snakemake and reflow. These are the three workflow systems I am most familiar with, and I think the differences between them are pretty interesting.

Nextflow

Nextflow is probably the most popular workflow system right now. It is powerful, simple to get started with, and very well supported by Paolo Di Tommaso, with plenty of example scripts. It supports running pipelines locally, on a local cluster, or on the cloud. It can use containers (Docker or Singularity) for reproducibility. Nextflow uses AWS Batch as its main cloud backend (as apparently everyone does these days).

The basic script below, from the nextflow site, is pretty self-explanatory.

script

#!/usr/bin/env nextflow

params.in = "$baseDir/data/sample.fa"
sequences = file(params.in)

/*
 * split a fasta file in multiple files
 */
process splitSequences {
    input:
    file 'input.fa' from sequences

    output:
    file 'seq_*' into records

    """
    awk '/^>/{f="seq_"++d} {print > f}' < input.fa
    """
}

/*
 * Simple reverse the sequences
 */
process reverse {
    input:
    file x from records

    output:
    stdout result

    """
    cat $x | rev
    """
}

/*
 * print the channel content
 */
result.subscribe { println it }

output

N E X T F L O W  ~  version 19.04.1
Launching basic.rf [hungry_celsius] - revision: c3e3b933c7
[warm up] executor > local
executor >  local (1)
[ea/b4bcb4] process > splitSequences [  0%] 0 of 1
1FDSA>
ACCAAACACAGTA
2FDSA>
ACCAAACACAGTA
3FDSA>
ACCAAACACAGTA
executor >  local (2)
[ea/b4bcb4] process > splitSequences [100%] 1 of 1 
[ef/0377ba] process > reverse        [100%] 1 of 1 
Completed at: 19-May-2019 10:17:39
Duration    : 951ms
CPU hours   : (a few seconds)
Succeeded   : 2

The output is quite comprehensive and easy to understand.

Snakemake

Snakemake, developed by Johannes Köster (also the founder of the bioconda project!), is the oldest of the three workflow systems here. As the name implies, it's sort of like make, but in Python.

Snakemake existed before containers and cloud computing were common. It is filename-focused, which makes it simple to get started with, and pretty easy to debug. In contrast, nextflow and reflow try to abstract away the names of files, sometimes at the cost of simplicity.

Even though I have used snakemake quite a bit, I found it pretty difficult to port the nextflow script above, mainly because it involves splitting one file into multiple. I think there are at least three ways to handle file splitting in snakemake: dynamic files (soon to be deprecated), the new checkpoints (I did not look into this much), and just using directories.

The directory option is the simplest to me, but the result is probably far from idiomatic snakemake. In previous snakemake pipelines I often relied on something similar: when one step of the pipeline is done, instead of relying on the appearance of a specific filename, I would just output a "sentinel" file, e.g., echo done > {output.sentinel}, and use that as the trigger for the next step (sketched below).
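
A minimal sketch of that sentinel pattern — the rule and file names are invented for illustration, and the actual port of the nextflow script follows further down:

rule split_fasta_sentinel:
    input: "input.fa"
    output: sentinel="split/.done"
    shell:
        """
        mkdir -p split
        awk '/^>/{{f="split/seq_"++d}} {{print > f}}' < {input}
        echo done > {output.sentinel}
        """

rule reverse_fasta_sentinel:
    # depends only on the sentinel, not on the individual split filenames
    input: "split/.done"
    output: sentinel="rev/.done"
    shell:
        """
        mkdir -p rev
        for f in split/seq_*; do rev < $f > rev/rev_$(basename $f); done
        echo done > {output.sentinel}
        """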

Since Snakemake is based on Python, it is easy to extend and integrate with other scripts. To mirror the nextflow script, though, I just use bash here.

script

rule all:
    input: "rev"

rule split_fasta:
    input: "input.fa"
    output: directory("split")
    run: shell("""awk '/^>/{{f="{output}/seq_"++d}} {{print > f}}' < {input}""")

rule reverse_fasta:
    input: "split"
    output: directory("rev")
    run: shell("""for f in {input}/*; do cat $f |rev > {output}/rev_$(basename $f); done""")

output

Snakemake outputs a directory, rev, containing the files rev_seq_1, rev_seq_2, rev_seq_3.

Reflow

Reflow is a relatively new workflow language developed at Grail by @marius and colleagues. I've been using Reflow for a year or so, and it's the system I am most familiar with.

One big difference between reflow and the other systems is how constrained it is: simplifying a bit, every function takes (a) a docker container, (b) a file or directory stored in s3, and (c) some bash; and outputs another file or directory stored in s3.

Everything gets run on AWS. You cannot run it on a regular cluster. You can run a reflow pipeline locally but it's mostly for debugging, and feels more like emulating the real AWS pipeline.

script

val files = make("$/files")
val dirs = make("$/dirs")

input := file("s3://booleanbiotech/input.fa")

func Split(input file) dir =
    exec(image := "ubuntu") (out dir) {"
        echo Split {{input}};
        awk '/^>/{f="{{out}}/seq_"++d} {print > f}' < {{input}}
    "}

func Reverse(input file) file =
    exec(image := "ubuntu") (out file) {"
        echo Reverse {{input}};
        cat {{input}} | rev > {{out}}
    "}

split_dir := Split(input)
rev_files := [Reverse(f) | f <- dirs.Files(split_dir)]

val Main = rev_files

output

[file(sha256=sha256:15a48aae8b19682a84dcf39f9260901e0b8331e8f9f974efd11a9a15c950320f, size=16),
 file(sha256=sha256:adad5843ce952755cd5b5b71fd92c7a67be3063d047350cf7fffdd6031ee5c2e, size=16),
 file(sha256=sha256:441ff0551203c350251abc5782698371116cca12a735fa2dbed66de4820b9201, size=16)]

There is logging too, but overall the output of reflow is pretty opaque!

To see the contents of the generated files, you have to use reflow cat 15a48aae8b19682a84dcf39f9260901e0b8331e8f9f974efd11a9a15c950320f. Generally, the last step in your pipeline should copy the outputs to S3, but debugging this can be tough.

Comparing the Three

Snakemake has been great for local pipelines. Its "reports" — html output from stages of the pipeline — are especially nice. It also now supports containers (and conda), so it is far from stagnant. However, I haven't tried to run snakemake pipelines on the cloud, and I suspect since it is so file-focused it might be less suited to this.

I only seriously experimented with nextflow before it had AWS Batch support, which made it unsuitable for my purposes. At that time, running pipelines on the cloud required quite a lot of setup, but now this issue seems to be solved. Nextflow development also seems to be the most active of the three. There are even periodic Nextflow meetings.

Reflow is the least popular, least flexible, and least documented of the three. Nevertheless, as a user who exclusively runs pipelines on the cloud, I still like the reflow model best. It gets closest to my ideal for a workflow system, which is basically just annotating code with the resources required to run it.

This is because to reflow, everything is a hash: containers, code, inputs, outputs. For example, if you have genome A (sha256(genome_sequence_bytes)=<hash1>), and you run adapter trimming function B (sha256(script_bytes)=<hash2>), using container C (sha256(docker_container_bytes)=<hash3>), then you "know" the output is the hash of sha256(<hash1>+<hash2>+<hash3>)=<hash4>. If the file s3://my-reflow-cache/<hash4> exists, even if it's from an old workflow, you don't need to recalculate! (The above is the gist of how it works, anyway. As usual with caches, this system can also result in tricky bugs...)
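
A back-of-the-envelope sketch of that content-addressed caching idea in plain Python (not reflow's actual implementation — the real key also folds in details like resource requirements):

import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def cache_key(input_digest: str, script_digest: str, image_digest: str) -> str:
    # The key depends only on what goes in (data, code, container),
    # not on when or where the step runs.
    return sha256_hex((input_digest + script_digest + image_digest).encode())

key = cache_key(
    sha256_hex(b"genome sequence bytes"),     # <hash1>
    sha256_hex(b"adapter trimming script"),   # <hash2>
    sha256_hex(b"docker container bytes"),    # <hash3>
)
# If s3://my-reflow-cache/<key> already exists, the step can be skipped and the
# cached output reused -- even if it was produced by a different workflow.
print(key)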

For some technical details on differences between nextflow and reflow, this Twitter thread may also be of interest, since it includes some discussion from the authors of the tools.

Conclusions

If I were starting today, I would probably choose nextflow as the safe option, mainly due to its new AWS Batch support. Snakemake still works great, though it may be showing its age a bit, while reflow is excellent if you are AWS-exclusive and can deal with its lack of documentation and support.

I think my ideal workflow language would have reflow-style decorators on Python functions (or a subset of python). I'll be interested to see if the reflow model inspires some new workflow systems like this.
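
To make that concrete, here is a purely hypothetical sketch of what such annotations could look like — the decorator, resource fields, and function are all invented:

def task(image, cpu=1, mem_gb=4):
    # Hypothetical decorator: annotate a function with the container and
    # resources it needs; a workflow engine could read these to schedule,
    # containerize, and cache the step.
    def wrap(fn):
        fn.requirements = {"image": image, "cpu": cpu, "mem_gb": mem_gb}
        return fn
    return wrap

@task(image="ubuntu", cpu=2, mem_gb=8)
def reverse(seq: str) -> str:
    return seq[::-1]

print(reverse("ACCAAACACAGTA"), reverse.requirements)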

Brian Naughton // Sat 23 March 2019 // Filed under datascience // Tags datascience cloud gcp aws

These days most everything is on the cloud. However, probably the most common mode of working is to develop locally on a laptop, then deploy to the cloud when necessary. Instead, I like to try to run everything remotely, on a cloud instance.

Why?

  • All your files are together in one place.
  • You can backup your cloud instance very easily (especially on GCP), and even spawn clone machines as necessary with more CPU etc.
  • You can work with large datasets. Laptops usually max out at 16GB RAM, but on a cloud instance you can get 50GB+. You can also expand the disk size on your cloud instance as necessary.
  • You can do a lot of computational work without making your laptop fans explode.
  • Your laptop becomes more like a dumb terminal. When your laptop dies, or is being repaired, it's easy to continue work without interruption.

The big caveats here are that this requires having an always-on cloud instance, which is relatively expensive, and you cannot work without an internet connection. As someone who spends a lot of time in jupyter notebooks munging large dataframes, I find the trade-offs are worth it.

Here are some tools I use that help make working on the cloud easier.

Mosh

Mosh is the most important tool here for working on the cloud. Instead of sshing into a machine once or more a day, now my laptop is continuously connected to a cloud instance, essentially until the instance needs to reboot (yearly?). Mosh also works better than regular ssh on weak connections, so it's handy for working on the train, etc. It makes working on a remote machine feel like working locally. If you use ssh a lot, try mosh instead.

Tmux

Since I just have one mosh connection at a time, I need to have some tabs. Most people probably already use screen or tmux anyway. I have a basic tmux setup, with ten or so tabs, each with a two-letter name just to keep things a bit neater.

tmux resurrect is the only tmux add-on I use. It works ok: if your server needs to restart, tmux resurrect will at least remember the names of your tabs.

Autossh

Mosh works amazingly well for a primary ssh connection, but to run everything on the cloud I also need a few ssh tunnels. Mosh cannot do this, so instead I need to use autossh. Like mosh, autossh tries to keep a connection open indefinitely. It seems to be slightly less reliable and fiddlier to set up than mosh, but has been working great for me recently.

Here's the command I use, based on this article and others. It took a while to get working via trial and error, so there may well be better ways. The ssh part of this autossh command comes from running gcloud compute ssh --dry-run.

autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -f -t -i $HOME/.ssh/google_compute_engine -o CheckHostIP=no -o HostKeyAlias=compute.1234567890 -o IdentitiesOnly=yes -o StrictHostKeyChecking=yes -o UserKnownHostsFile=/Users/briann/.ssh/google_compute_known_hosts brian@12.345.67.890 -N -L 2288:localhost:8888 -L 2280:localhost:8880 -L 8385:localhost:8384 -L 2222:localhost:22 -L 8443:localhost:8443

The tunnels I set up:

  • 2288->8888 : to access jupyter running on my cloud instance (I keep 8888 for local jupyter)
  • 2280->8880 : to access a remote webserver (e.g., if I run python -m http.server on my cloud instance)
  • 8385->8384 : syncthing (see below)
  • 2222->22 : sshfs (see below)
  • 8443->8443 : coder (see below)

So to access jupyter, I just run jupyter notebook in a tmux tab on my cloud box, and go to https://localhost:2288.

To view a file, I run python -m http.server on my cloud box and go to http://localhost:2280.

Syncthing

Syncthing is a dropbox-like tool that syncs files across a group of computers. Unlike dropbox, the connections between machines are direct (i.e., there is no centralized server). It's pretty simple: you run syncthing on your laptop and on your cloud instance, they find each other and start syncing. Since 8384 is the default syncthing port, I can see syncthing's local and remote dashboards on https://localhost:8384 and https://localhost:8385 respectively. In my experience, syncthing works pretty well, but I recently stopped using it because I've found it unnecessary to have files synced to my laptop.

Sshfs

sshfs is a tool that lets you mount a remote filesystem over ssh. As with syncthing, I don't use sshfs much any more, since it's pretty slow and can fail on occasion. It is handy if you want to browse PDFs or similar files stored on your cloud instance, though.

Coder

I recently started using Coder, which is Visual Studio Code (my preferred editor), but running in a browser. Amazingly, it's almost impossible to tell the difference between "native" VS Code (an Electron app) and the browser version, especially if it's running in full-screen mode.

It's very fast to get started. You run this on your cloud instance:

docker run -t -p 127.0.0.1:8443:8443 -v "${PWD}:/root/project" codercom/code-server code-server --allow-http --no-auth

then go to http://localhost:8443 and that's it!

Coder is new and has had some glitches and limitations for me. For example, I don't know how you are supposed to install extensions without also updating the Docker image, which is less than ideal, and the documentation is minimal. Still, the VS Code team seems to execute very quickly, so I am sticking with it for now. I think it will improve and stabilize soon.

Annoyances

One annoyance with having everything on the cloud is viewing files. X11 is the typical way to solve this problem, but I've never had much success with X11. Even at its best, it's ugly and slow. Most of my graphing, etc. happens in jupyter, so this is usually not a big issue.

However, for infrequent file viewing, this python code has come in handy.

def view(filename):
    # Serve a single file over HTTP from the cloud box, then view it locally
    # through the ssh tunnel (local 2280 -> remote 8880, per the autossh command above)
    from pathlib import Path
    from flask import Flask, send_file
    FLASK_PORT_LOCAL = 8880  # must match the remote end of the 2280 tunnel
    app = Flask(__name__)
    def get_view_func(_filename):
        def fn(): return send_file(filename_or_fp=str(Path(_filename).resolve()))
        return fn
    # e.g., view("plot.png") on the cloud box, then open http://localhost:2280 locally
    print('python -m webbrowser -t "http://localhost:2280"')
    app.add_url_rule(rule='/', view_func=get_view_func(filename))
    app.run("127.0.0.1", FLASK_PORT_LOCAL, debug=False)

Appendix: GCP activation script

This is the bash script I use to set up the above tools from my mac for my GCP instance. People using GCP might find something useful in here.

_gcp_activate () {
    # example full command: _gcp_activate myuserid myuserid@mydomain.com my-instance my-gcp-project us-central1-c $HOME/gcp/

    clear
    printf "#\n#    [[ gcp_activate script ]]   \n#\n"
    printf "# mosh: on mac, 'brew install mosh'\n"
    printf "# autossh: on mac, 'brew install autossh'\n"
    printf "# sshfs: on mac, download osxfuse and sshfs from https://osxfuse.github.io/\n"
    printf "#     https://www.everythingcli.org/ssh-tunnelling-for-fun-and-profit-autossh/\n"
    printf "#     sshfs may need a new entry in $HOME/.ssh/known_hosts if logging in to a new host\n"
    printf "#     The error is \"remote host has disconnected\"\n"
    printf "#     to achieve that, delete the localhost:2222 entry from $HOME/.ssh/known_hosts\n"
    printf "#\n"

    [ $# -eq 0 ] && printf "No arguments supplied\n" && return 1

    user=$1
    account=$2
    instance=$3
    gcpproject=$4
    zone=$5
    mountpoint=$6

    printf "#\n# 1. set gcp project, if it's not set \n#\n"

    # automatically set project
    echo gcloud config set account ${account};
    echo gcloud config set project ${gcpproject};

    # unmount sshfs
    printf "#\n# 2. unmount sshfs (this command fails if it's already unmounted, which is ok)\n#\n"
    echo umount -f ${mountpoint};

    # commands
    ssh_cmd=$(gcloud compute ssh ${user}@${instance} --zone=${zone} --dry-run) && \
    external_ip=$(printf "${ssh_cmd}" | sed -E 's/.+@([0-9\.]+)/\1/') && \
    autossh_cmd=$(printf "${ssh_cmd}" | sed s/'\/usr\/bin\/ssh'/'autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -f'/) && \
    fullssh_cmd=$(printf "${autossh_cmd} -N -L 2288:localhost:8888 -L 2222:localhost:22 -L 2280:localhost:8000 -L 8385:localhost:8384 -L 8443:localhost:8443") && \
    printf "#\n# 3. run autossh to set up ssh tunnels for jupyter (2288), web (2280), and sshfs (2222)\n#\n" && \
    echo "${fullssh_cmd}" && \
    printf "#\n# 4. run sshfs to mount to ${mountpoint}\n#\n" && \
    echo sshfs -p 2222 -o reconnect,compression=yes,transform_symlinks,defer_permissions,IdentityFile=$HOME/.ssh/google_compute_engine,ServerAliveInterval=30,ServerAliveCountMax=0 -f \
    ${user}@localhost:. ${mountpoint} && \
    printf "#\n# 5. run mosh\n#\n" && \
    echo mosh -p 60000 --ssh=\"ssh -i $HOME/.ssh/google_compute_engine\" ${user}@${external_ip} -- tmux a
    printf "#\n# 6. if mosh fails run this\n#\n"
    echo gcloud compute ssh ${user}@${instance} -- killall mosh-server
}

gcp_activate () {
  _gcp_activate myuserid myuserid@mydomain.com my-instance my-project us-central1-c $HOME/gcp/
}
Brian Naughton // Sun 20 January 2019 // Filed under deeplearning // Tags deeplearning datascience neuralnet

Jake Dangerback is an instagram celebrity and influencer (of sorts). He's unusual in that he does not exist; he was created using millions of matrix multiplications. This post is a look at some of the freely available state-of-the-art neural networks I used to create him.

Background

Neural nets, specifically GANs, are getting really good at hallucinating realistic faces. Notably, NVIDIA had a paper in December 2018 that showed some pretty amazing results.


Since these faces do not belong to anyone, they are perfect for use as celebrities. They don't need to be paid, never complain, and can be tailored to appeal to any niche demographic.

Step One: Making A Face

I used this awesome tool on kaggle (github) to create a face. I did not know that kaggle supported this colab-like interface but it's quite advanced. The tool has 21 levers to pull so you can create the perfect face for your audience.


Using this tool I created country music star Jake Dangerback's face (his fans call him jdb). He is handsome but rugged and most importantly he has no legal personhood so I can use his likeness to endorse and sell products of all kinds.


Photoshopping jdb

It's fairly easy to create new images using this face. There are several sources of royalty-free images (e.g., pexels) that I can photoshop jdb's face into.

Photoshopping a face is not that hard — at least at this quality — but it would be easier if a neural net did the photoshopping for me. Luckily there are many papers and github repos that do face swapping since it produces funny pictures. Most face swapping tools are mobile apps, but I did find Reflect face swap online. It seems to do a good job generally, but the result below looks a bit weird. It seems to be trying to mix the photos for realism rather than just replace the face.

Reagan wearing a cowboy hat; jdb wearing a cowboy hat

Optimizing

If we had enough photos, we might consider automatically optimizing the image for likes using some kind of selfie-rating neural net like @karpathy's. I think the photoshopped images would have to be autogenerated to make this worthwhile.

Image filters, which are popular on instagram, can also use neural nets. The most famous example is neural style, which maps the style of one image — usually a painting — onto the content of another. The website deepart.io does this as a service. These neural style filters are very cool but not that useful for instagram content.


What About DeepFake?

DeepFake is a powerful technique for swapping faces in video that can produce very realistic results. There's even a tool called FakeApp that automates some of the steps. There's a nice blogpost showing how to swap Jimmy Fallon's face for John Oliver's. It looks pretty convincing.


DeepFake creates video, which I do not really need, and to train it you need video of the subject, which I do not have. I suppose theoretically you could create a 3D model and use that to generate the source video...

Step Two: The Third Dimension

It is limiting if every photo has to have jdb's face looking head-on. Luckily, there is a very cool 3D face reconstruction neural net that works based on a single photo.


The results are great and it takes less than a minute to run. You can even load the result into Blender.

Taking it a step further, the free MakeHuman software will create a mesh of a body with various parameters.

This could probably be made to work well but it's way beyond my Blender skills.

There's also an interesting iOS app called mug life that will animate photos based on an inferred 3D mesh. The results are impressive, if creepy. jdb looks so alive! I don't think you can download the mesh though.

Enhance!

Sometimes the resulting photoshopped image can be pretty blurry, partially because the 3D model's texture resolution is not that high. Luckily there is another deep net to help here, called neural-enhance. It really does enhance a photo by doubling the resolution, which is pretty slick. The author includes a docker container, which makes running it very simple.

The results are very impressive in general, and it only takes a few minutes even on a CPU. Since it's trained on real photos, I am guessing it might also remove artifacts and rogue pixels due to photoshopping.

From blurry (left) to enhanced (right). The shirt buttons are the most obvious improvement.

Step Three: Captioning

Instagram posts have captions, which I assume are important for engagement. There have been many attempts to caption or describe images using neural nets. The oldest one I remember is an influential Stanford paper from 2015. There are a few tools online too. I first tried Microsoft's captionbot.ai, assuming it used a neural net. The results were very meh, and it turns out it's not a neural net and is famously vague/bad.

Google's Cloud Vision API does much better though it's still not super-engaging content. For now, computers have not solved the instagram captioning problem.

For a photo of jdb walking on a beach:
Captionbot: I think it's a person standing on a beach and he seems 😐.
Google Cloud Vision: couple looking at each other on beach

Step Four: Songs

jdb is a country music celebrity, so he may need some songs. Thankfully there's a neural net for everything, including country music lyrics!

Some example lyrics:

No one with the danger in the world
I love my black fire as I know
But the short knees just around me
Fun the heart couldnes fall to back

It's not terrible ("couldnes"?), but it's also pretty dark for instagram... You can also generate music using RNNs but I did not find an easy way to generate country music.

Singing

Creating a speaking/singing voice from lyrics was not as easy as I thought it would be. I tried a few iOS apps, including LyreBird, but got strange results. Macs also have the say command (just type say hello into the terminal), which works ok.

I ended up using Google Cloud text-to-speech, which uses WaveNet, to turn lyrics into speech. It works via a simple json upload. Sadly, none of the available voices sounded particularly country.
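
As a rough sketch of that json upload — the voice name, output filename, and the gcloud auth shortcut are my assumptions, so check the Cloud TTS docs for your own setup:

import base64, json, subprocess, urllib.request

# Get an OAuth token from gcloud (assumes application-default credentials are configured)
token = subprocess.check_output(
    ["gcloud", "auth", "application-default", "print-access-token"]
).decode().strip()

payload = {
    "input": {"text": "No one with the danger in the world"},
    "voice": {"languageCode": "en-US", "name": "en-US-Wavenet-D"},
    "audioConfig": {"audioEncoding": "MP3"},
}
req = urllib.request.Request(
    "https://texttospeech.googleapis.com/v1/text:synthesize",
    data=json.dumps(payload).encode(),
    headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    audio = base64.b64decode(json.load(resp)["audioContent"])
with open("jdb_line.mp3", "wb") as f:
    f.write(audio)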

To produce actual singing, there are several autotune apps, e.g., Voloco. However, the ones I tried sounded pretty autotuned, so perhaps more suitable for another genre.

I replaced the RNN lyrics with wikipedia-derived specs for a truck (a product jdb could endorse and a scalable lyric-generation system), added some royalty-free music, and the result is really something.

The Future

One of the most difficult aspects here is finding working neural nets and getting them to run. Even if the code is on github there are often missing files or onerous installation steps. Things will be easier when more neural nets get converted to javascript/tensorflow.js or appear on colab.

I don't know how many instagram celebrities are computer-generated today, though computer-generated celebrities are not a new thing, especially in Japan. Lil Miquela has >1M followers on instagram, though it's clear she is computer-generated.

It's pretty obvious there will be a lot of this kind of thing in the future. We can evolve new celebrities as ecological niches emerge, catering to new audiences. As people lose interest in one celebrity, we can just create others. They could even inhabit the same world and date each other. Eventually they will outnumber us, then maybe skynet us.

In the meantime, jdb is available to endorse products of all kinds.
