Tools for working on the cloud

Brian Naughton // Sat 23 March 2019 // Filed under datascience // Tags datascience cloud gcp aws

These days most everything is on the cloud. However, probably the most common mode of working is to develop locally on a laptop, then deploy to the cloud when necessary. Instead, I like to try to run everything remotely on an instance on the cloud.

Why?

  • All your files are together in one place.
  • You can back up your cloud instance very easily (especially on GCP), and even spawn clone machines as necessary with more CPU etc.
  • You can work with large datasets. Laptops usually max out at 16GB RAM, but on a cloud instance you can get 50GB+. You can also expand the disk size on your cloud instance as necessary.
  • You can do a lot of computational work without making your laptop fans explode.
  • Your laptop becomes more like a dumb terminal. When your laptop dies, or is being repaired, it's easy to continue work without interruption.

The big caveats here are that this requires having an always-on cloud instance, which is relatively expensive, and you cannot work without an internet connection. As someone who spends a lot of time in jupyter notebooks munging large dataframes, I find the trade-offs are worth it.

Here are some tools I use that help make working on the cloud easier.

Mosh

Mosh is the most important tool here for working on the cloud. Instead of sshing into a machine once or more a day, now my laptop is continuously connected to a cloud instance, essentially until the instance needs to reboot (yearly?). Mosh also works better than regular ssh on weak connections, so it's handy for working on the train, etc. It makes working on a remote machine feel like working locally. If you use ssh a lot, try mosh instead.

Tmux

Since I just have one mosh connection at a time, I need to have some tabs. Most people probably already use screen or tmux anyway. I have a basic tmux setup, with ten or so tabs, each with a two-letter name just to keep things a bit neater.

tmux resurrect is the only tmux add-on I use. It works ok: if your server needs to restart, tmux resurrect will at least remember the names of your tabs.

Autossh

Mosh works amazingly well for a primary ssh connection, but to run everything on the cloud I also need a few ssh tunnels. Mosh cannot do this, so instead I need to use autossh. Like mosh, autossh tries to keep a connection open indefinitely. It seems to be slightly less reliable and fiddlier to set up than mosh, but has been working great for me recently.

Here's the command I use, based on this article and others. It took a while to get working via trial and error, so there may well be better ways. The ssh part of this autossh command comes from running gcloud compute ssh --dry-run.

autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -f -t -i $HOME/.ssh/google_compute_engine -o CheckHostIP=no -o HostKeyAlias=compute.1234567890 -o IdentitiesOnly=yes -o StrictHostKeyChecking=yes -o UserKnownHostsFile=/Users/briann/.ssh/google_compute_known_hosts brian@12.345.67.890 -N -L 2288:localhost:8888 -L 2280:localhost:8880 -L 8385:localhost:8384 -L 2222:localhost:22 -L 8443:localhost:8443

The tunnels I set up:

  • 2288->8888 : to access jupyter running on my cloud instance (I keep 8888 for local jupyter)
  • 2280->8880 : to access a remote webserver (e.g., if I run python -m http.server 8880 on my cloud instance; the default port is 8000, so the port has to be given explicitly)
  • 8385->8384 : syncthing (see below)
  • 2222->22 : sshfs (see below)
  • 8443->8443 : coder (see below)
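
Each tunnel in this list corresponds to one -L flag in the autossh command. As a minimal sketch (the TUNNELS table and the forward_flags helper are hypothetical names used only for illustration), the flags could be generated from a single table so that the list and the command stay in sync:

```python
# Hypothetical helper: build ssh -L flags from a table of tunnels.
# Keys are local (laptop) ports, values are remote (cloud instance) ports.
TUNNELS = {
    2288: 8888,  # jupyter
    2280: 8880,  # web server
    8385: 8384,  # syncthing
    2222: 22,    # sshfs
    8443: 8443,  # coder
}


def forward_flags(tunnels):
    """Return the -L arguments as a flat list, e.g. for subprocess."""
    flags = []
    for local_port, remote_port in tunnels.items():
        flags += ["-L", f"{local_port}:localhost:{remote_port}"]
    return flags


if __name__ == "__main__":
    # Print the flags in the form they take on the command line.
    print(" ".join(forward_flags(TUNNELS)))
```

The output can be pasted after the -N in the autossh command. Adding a tunnel then means adding one line to the table.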

So to access jupyter, I just run jupyter notebook in a tmux tab on my cloud box, and go to http://localhost:2288.

To view a file on my cloud box I run python -m http.server 8880 on my cloud box (8880 being the remote end of the 2280 tunnel), and go to http://localhost:2280.
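
When one of these URLs does not load, it helps to know whether the tunnel itself is down or the remote service is just not running. Here is a rough sketch of a check to run on the laptop; the port_open helper is a hypothetical name, and the port list is taken from the tunnels above. Note that it only tests whether something is listening on the local end of each tunnel, not whether jupyter or the web server is actually running on the cloud box.

```python
import socket


def port_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or otherwise unreachable.
        return False


if __name__ == "__main__":
    # Local ends of the tunnels set up by the autossh command.
    for port in (2288, 2280, 8385, 2222, 8443):
        print(port, "listening" if port_open(port) else "not listening")
```

If a port shows as not listening, the autossh process has probably died and needs to be restarted.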

Syncthing

Syncthing is a Dropbox-like tool that syncs files across a group of computers. Unlike Dropbox, the connections between machines are direct (i.e., there is no centralized server). It's pretty simple: you run syncthing on your laptop and on your cloud instance, they find each other and start syncing. Since 8384 is the default syncthing port, I can see syncthing's local and remote dashboards on http://localhost:8384 and http://localhost:8385 respectively. In my experience, syncthing works pretty well, but I recently stopped using it because I've found it unnecessary to have files synced to my laptop.

Sshfs

sshfs is a tool that lets you mount a filesystem over ssh. As with syncthing, I don't use sshfs much any more, since it's pretty slow and can fail on occasion. It is handy if you want to browse PDFs or similar files stored on your cloud instance though.

Coder

I recently started using Coder, which is Visual Studio Code (my preferred editor), but running in a browser. Amazingly, it's almost impossible to tell the difference between "native" VS Code (an Electron app) and the browser version, especially if it's running in full-screen mode.

It's very fast to get started. You run this on your cloud instance:

docker run -t -p 127.0.0.1:8443:8443 -v "${PWD}:/root/project" codercom/code-server code-server --allow-http --no-auth

Then go to http://localhost:8443 and that's it!

Coder is new and has had some glitches and limitations for me. For example, I don't know how you are supposed to install extensions without also updating the Docker image, which is less than ideal, and the documentation is minimal. Still, the VS Code team seems to execute very quickly, so I am sticking with it for now. I think it will improve and stabilize soon.

Annoyances

One annoyance with having everything on the cloud is viewing files. X11 is the typical way to solve this problem, but I've never had much success with X11. Even at its best, it's ugly and slow. Most of my graphing, etc. happens in jupyter, so this is usually not a big issue.

However, for infrequent file viewing, this python code has come in handy.

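from pathlib import Path

from flask import Flask, send_file

# Port Flask listens on, on the cloud box. Assumed to be the remote end of the
# 2280 tunnel (8880 in the autossh command above; 8000 in the appendix script).
FLASK_PORT_LOCAL = 8880

def view(filename):
    """Serve a single file at / so it can be opened in the laptop's browser."""
    app = Flask(__name__)
    path = str(Path(filename).resolve())
    def fn():
        # Positional, since the keyword name differs between Flask versions.
        return send_file(path)
    # Command to run on the laptop to open the file through the tunnel.
    print('python -m webbrowser -t "http://localhost:2280"')
    app.add_url_rule(rule='/', view_func=fn)
    app.run("127.0.0.1", FLASK_PORT_LOCAL, debug=False)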
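
If Flask is not installed on the cloud box, roughly the same thing can be done with the standard library alone. This is a sketch of an alternative, not the code I use: make_single_file_server is a hypothetical name, and the default port of 8880 is an assumption that should match the remote end of whichever tunnel you use for web traffic.

```python
import mimetypes
import sys
from http.server import BaseHTTPRequestHandler, HTTPServer
from pathlib import Path


def make_single_file_server(filename, port=8880):
    """Return an HTTPServer that serves only `filename`, at the path /."""
    path = Path(filename).resolve()
    # Guess the content type from the extension so browsers render PDFs etc.
    content_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Anything other than / is rejected, so no other files are exposed.
            if self.path != "/":
                self.send_error(404)
                return
            data = path.read_bytes()
            self.send_response(200)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

    # Bind to localhost only; the file is reached through the ssh tunnel.
    return HTTPServer(("127.0.0.1", port), Handler)


if __name__ == "__main__" and len(sys.argv) > 1:
    make_single_file_server(sys.argv[1]).serve_forever()
```

Run it on the cloud box with the file to view as the only argument, then open http://localhost:2280 on the laptop.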

Appendix: GCP activation script

This is the bash script I use to set up the above tools from my mac for my GCP instance. People using GCP might find something useful in here.

_gcp_activate () {
    # example full command: _gcp_activate myuserid myuserid@mydomain.com my-instance my-gcp-project us-central1-c $HOME/gcp/

    clear
    printf "#\n#    [[ gcp_activate script ]]   \n#\n"
    printf "# mosh: on mac, 'brew install mosh'\n"
    printf "# autossh: on mac, 'brew install autossh'\n"
    printf "# sshfs: on mac, download osxfuse and sshfs from https://osxfuse.github.io/\n"
    printf "#     https://www.everythingcli.org/ssh-tunnelling-for-fun-and-profit-autossh/\n"
    printf "#     sshfs may need a new entry in $HOME/.ssh/known_hosts if logging in to a new host\n"
    printf "#     (the error you see is \"remote host has disconnected\")\n"
    printf "#     to fix that, delete the localhost:2222 entry from $HOME/.ssh/known_hosts\n"
    printf "#\n"

    [ $# -ne 6 ] && printf "Expected 6 arguments: user account instance project zone mountpoint\n" && return 1

    user=$1
    account=$2
    instance=$3
    gcpproject=$4
    zone=$5
    mountpoint=$6

    printf "#\n# 1. set gcp project, if it's not set \n#\n"

    # automatically set project
    echo gcloud config set account ${account};
    echo gcloud config set project ${gcpproject};

    # unmount sshfs
    printf "#\n# 2. unmount sshfs (this command fails if it's already unmounted, which is ok)\n#\n"
    echo umount -f ${mountpoint};

    # commands
    ssh_cmd=$(gcloud compute ssh ${user}@${instance} --zone=${zone} --dry-run) && \
    external_ip=$(printf '%s' "${ssh_cmd}" | sed -E 's/.+@([0-9\.]+)/\1/') && \
    autossh_cmd=$(printf '%s' "${ssh_cmd}" | sed s/'\/usr\/bin\/ssh'/'autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -f'/) && \
    fullssh_cmd=$(printf '%s' "${autossh_cmd} -N -L 2288:localhost:8888 -L 2222:localhost:22 -L 2280:localhost:8000 -L 8385:localhost:8384 -L 8443:localhost:8443") && \
    printf "#\n# 3. run autossh to set up ssh tunnels for jupyter (2288), web (2280), sshfs (2222), syncthing (8385), and coder (8443)\n#\n" && \
    echo "${fullssh_cmd}" && \
    printf "#\n# 4. run sshfs to mount to ${mountpoint}\n#\n" && \
    echo sshfs -p 2222 -o reconnect,compression=yes,transform_symlinks,defer_permissions,IdentityFile=$HOME/.ssh/google_compute_engine,ServerAliveInterval=30,ServerAliveCountMax=0 -f \
    ${user}@localhost:. ${mountpoint} && \
    printf "#\n# 5. run mosh\n#\n" && \
    echo mosh -p 60000 --ssh=\"ssh -i $HOME/.ssh/google_compute_engine\" ${user}@${external_ip} -- tmux a
    printf "#\n# 6. if mosh fails run this\n#\n"
    echo gcloud compute ssh ${user}@${instance} -- killall mosh-server
}

gcp_activate () {
  _gcp_activate myuserid myuserid@mydomain.com my-instance my-project us-central1-c $HOME/gcp/
}
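
The two sed steps in _gcp_activate are the least readable part of the script, so here is a sketch of the same string manipulation in Python. The function names are hypothetical, and it assumes (as the sed pattern does) that the dry-run output is a single /usr/bin/ssh command line ending in user@address.

```python
import re


def external_ip(ssh_cmd):
    """Pull the address after the final '@' out of an ssh command line."""
    match = re.search(r"@([0-9.]+)\s*$", ssh_cmd)
    if match is None:
        raise ValueError("no user@address found at the end of the command")
    return match.group(1)


def to_autossh(ssh_cmd):
    """Swap the leading /usr/bin/ssh for autossh with keepalive options."""
    prefix = 'autossh -M 0 -o "ServerAliveInterval 30" -o "ServerAliveCountMax 3" -f'
    return ssh_cmd.replace("/usr/bin/ssh", prefix, 1)


if __name__ == "__main__":
    # Made-up sample in the shape of the autossh command shown earlier.
    sample = "/usr/bin/ssh -t -i /home/me/.ssh/google_compute_engine brian@12.345.67.890"
    print(external_ip(sample))
    print(to_autossh(sample))
```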
