Recently, I've been using IPython notebook for some data analysis. It's pretty janky in places, like most browser-based software, but it's the closest thing Python has to an interactive environment. It definitely saves a bunch of time if you have a long series of independent data transformations and analyses.
Ideally, I'd be able to work on a notebook locally, but if I have some heavy computation to do, I could transparently send that to the cloud. With IPython it's technically possible to do that on a cell by cell basis using ipcluster, but it would be difficult to integrate that with a third-party cloud provider. It's also not an elegant system: for example, you have to manually do imports on each of your ipcluster nodes.
A simpler method is just to dump the IPython notebook to a Python file, then send the entire script to the cloud. Assuming there is one long-running bottleneck task in there, this should take about the same amount of time to run.
Using Domino/AWS with IPython
I'm still not convinced about Domino as an AWS intermediary — for example, once my free trial ends I think am limited to just one project on there — but that's ok since the following process has only a few Domino-specific elements.
I needed to add seaborn since it was not included in Domino's standard set of imports. For some reason I needed to include numexpr to get pandas to work. It's probably a good idea to include as little as possible in requirements.txt (i.e., don't pip freeze) since Domino has to install everything you ask for anew with each run.
It's important that I use all available CPUs when I am running on AWS/Domino, otherwise I am wasting money. I need a few CPUs free on my laptop though... I test this by just checking if the platform is Linux.
IPython's nbconvert preprocessor
IPython's handy nbconvert function converts IPython notebooks into other formats: most commonly, pure Python, HTML or PDF.
An IPython preprocessor is a little function that takes the cells in your notebook, and does something to them before nbconvert gets to them. Here I am using a preprocessor to do a few little things:
- comment out IPython "magic" commands (these will cause errors if run outside IPython)
- skip cells that have a special "SKIPCELL" comment (surprisingly, there's no IPython magic command for this)
- make the "print" function print to a file. For some reason Domino appears to munge stdout and stderr into one file. By making the "print" function print to a file I can separate my results from warnings etc.
I'd also like to dump all my inline plots to files, but I don't know how to do that.
Running the code on Domino
Finally, I created a small script that generates the cleaned-up Python file, sends it to Domino to run and opens the results page in a browser window.
Using this script I can trivially run my Kaggle script on 32 CPUs without sending my laptop into paroxysms. All in all it works pretty well.
Domino Pricing vs AWS
Domino sells standard "Compute Optimized" AWS instances, and marks them up 100%. It's not a bad deal for light to moderate users, considering that the AWS environments are already spun up and waiting, and that Domino charges by the minute instead of by the hour. For comparison purposes, 32 CPUs costs $1.68 an hour on AWS vs $3.36 on Domino.