Brian Naughton | Mon 22 September 2014 | data | open data kaggle data science

(Or, How To Kaggle)

The Africa Soil Property Prediction Challenge is a Kaggle competition where the goal is to predict various soil measurements (like calcium levels) across parts of Africa from infrared spectroscopy readings. Seems like a worthwhile thing to do!

The Kaggle forums are a great place to pick up information on some of the practicalities of applied machine learning. In that spirit I thought I would share the (moderately successful) IPython/scikit-learn code I used.

Here is the notebook of code I used to make my predictions (HTML version, IPython notebook). When I originally ran this code it was in the top 10% of entries, but now it's way down the leaderboard. The winning regression method in my tests, by quite a distance, was Support Vector Regression.


IPython Notebook

Recently, I've been using IPython notebook for some data analysis. It's pretty janky in places, like most browser-based software, but it's the closest thing Python has to a full interactive analysis environment. It definitely saves a bunch of time if you have a long series of independent data transformations and analyses.

Ideally, I'd be able to work on a notebook locally, but if I have some heavy computation to do, I could transparently send that to the cloud. With IPython it's technically possible to do that on a cell-by-cell basis using ipcluster, but it would be difficult to integrate that with a third-party cloud provider. It's also not an elegant system: for example, you have to manually push imports to each of your ipcluster engines (see the sketch below).
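To illustrate the import annoyance, here's a minimal sketch using IPython.parallel (the API as of IPython 2.x), assuming an ipcluster is already running locally:

from IPython.parallel import Client

# connect to a running cluster (started with e.g. `ipcluster start -n 4`)
rc = Client()
dview = rc[:]  # a DirectView across all engines

# imports do not propagate automatically; each engine needs its own
with dview.sync_imports():
    import numpy

def root(x):
    return numpy.sqrt(x)

# ship the function to the engines and gather the results
print(dview.map_sync(root, range(16)))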

A simpler method is just to dump the IPython notebook to a Python file, then send the entire script to the cloud. Assuming there is one long-running bottleneck task in there, this should take about the same amount of time to run.

Using Domino/AWS with IPython

I'm still not convinced about Domino as an AWS intermediary — for example, once my free trial ends I think I am limited to just one project on there — but that's ok since the following process has only a few Domino-specific elements.

requirements.txt

I needed to add seaborn since it was not included in Domino's standard set of imports. For some reason I needed to include numexpr to get pandas to work. It's probably a good idea to include as little as possible in requirements.txt (i.e., don't pip freeze) since Domino has to install everything you ask for anew with each run.

seaborn==0.3.1
numexpr==2.3.1

IPython setup

It's important that I use all available CPUs when I am running on AWS/Domino, otherwise I am wasting money. I need a few CPUs free on my laptop though... I distinguish the two cases by just checking whether the platform is Linux (Domino runs Linux; my laptop is a Mac).

import platform, multiprocessing

# on Linux (i.e., AWS/Domino) use every CPU; on my Mac laptop, cap it
# so a few cores stay free for other work
N_CPUS = multiprocessing.cpu_count() if platform.system() == 'Linux' else 5

IPython's nbconvert preprocessor

IPython's handy nbconvert tool converts IPython notebooks into other formats: most commonly pure Python, HTML, or PDF.

An nbconvert preprocessor is a little class that gets to transform the cells in your notebook before the exporter sees them. Here I am using a preprocessor to do a few little things:

  • comment out IPython "magic" commands (these will cause errors if run outside IPython)
  • skip cells that have a special "SKIPCELL" comment (surprisingly, there's no IPython magic command for this)
  • make the "print" function print to a file. For some reason Domino appears to munge stdout and stderr into one file; printing to a file of my own lets me separate my results from warnings etc.

I'd also like to dump all my inline plots to files, but I haven't found a clean way to do that. (One untested idea is sketched below.)
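Since the exported script runs outside IPython anyway, one possible workaround would be to prepend something like the following to the script (a rough, untested sketch; the figure-numbering scheme is my own invention):

import matplotlib
matplotlib.use('Agg')    # headless backend, fine for AWS/Domino runs
import matplotlib.pyplot as plt

_fig_num = [0]
def _show_to_file(*args, **kwargs):
    # write the current figure to a numbered png instead of displaying it
    _fig_num[0] += 1
    plt.savefig('figure_{:02d}.png'.format(_fig_num[0]))
    plt.close('all')
plt.show = _show_to_file

Anyway, here are the config and the preprocessor: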

# domino_config.py: the nbconvert config
c = get_config()

# export kaggle.ipynb to pure Python, running our preprocessor on the way
c.NbConvertApp.notebooks = ['kaggle.ipynb']
c.NbConvertApp.export_format = 'python'
c.Exporter.preprocessors = ['domino_preprocessor.DominoPreprocessor']

# domino_preprocessor.py: the preprocessor itself
from IPython.nbconvert.preprocessors import Preprocessor

class DominoPreprocessor(Preprocessor):
    FIRSTCELL = None
    # source for a print() that writes to a file instead of stdout;
    # it gets appended to the first code cell of the exported script
    print_fn = """
global_out = open("stdout.ipy.txt", 'w')
def print(*args, **kwargs):
    kwargs['file'] = global_out
    return __builtins__.print(*args, **kwargs)
"""

    def preprocess_cell(self, cell, resources, index):
        if cell.cell_type == 'code':
            SKIPCELL = "# SKIPCELL" in cell.source
            # True only for the first code cell we see
            DominoPreprocessor.FIRSTCELL = True if DominoPreprocessor.FIRSTCELL is None else False

            newlines = []
            for line in cell.source.splitlines():
                if line.startswith('%'):   # comment out IPython magics
                    line = "## " + line
                elif SKIPCELL:             # comment out skipped cells entirely
                    line = "# " + line
                newlines.append(line)
            if DominoPreprocessor.FIRSTCELL is True:
                newlines.append(DominoPreprocessor.print_fn)
                DominoPreprocessor.FIRSTCELL = False
            cell.source = '\n'.join(newlines)

        return cell, resources

Running the code on Domino

Finally, I created a small script that generates the cleaned-up Python file, sends it to Domino to run, and opens the results page in a browser window.

#!/usr/bin/env python

#
# Create domino-appropriate python file
#
from subprocess import Popen, PIPE
p = Popen(["ipython", "nbconvert", "--config", "domino_config.py"], stdin=PIPE, stdout=PIPE, stderr=PIPE)
out1, err1 = p.communicate()

#
# Upload to Domino
#
p = Popen(["/Applications/domino/domino", "run", "kaggle.py"], stdin=PIPE, stdout=PIPE, stderr=PIPE)
out2, err2 = p.communicate()

#
# Show the results page
#
import re, webbrowser
# the domino CLI prints the run's results URL; pull it out of stdout
rc = re.compile(r'(https://app\.dominoup\.com/\S+)')
url = rc.search(out2).group(1)
webbrowser.open(url)

Using this script I can trivially run my Kaggle code on 32 CPUs without sending my laptop into paroxysms. All in all it works pretty well.

Domino Pricing vs AWS

Domino sells standard "Compute Optimized" AWS instances, and marks them up 100%. It's not a bad deal for light to moderate users, considering that the AWS environments are already spun up and waiting, and that Domino charges by the minute instead of by the hour. For comparison purposes, a 32-CPU instance costs $1.68 an hour on AWS vs $3.36 on Domino.
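To make the per-minute billing concrete, here's a toy comparison (a sketch: the rates are the ones above, and the 20-minute run length is just an example):

AWS_HOURLY    = 1.68  # 32-CPU compute-optimized instance, per hour
DOMINO_HOURLY = 3.36  # same instance via Domino (100% markup), billed by the minute

run_minutes = 20      # a hypothetical short job

aws_cost    = AWS_HOURLY * 1                      # AWS bills whole hours
domino_cost = DOMINO_HOURLY * run_minutes / 60.0  # Domino bills per minute

print("AWS:    $%.2f" % aws_cost)     # $1.68
print("Domino: $%.2f" % domino_cost)  # $1.12

So for short, bursty runs the per-minute billing can more than offset the markup; the break-even point is a 30-minute run.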

Brian Naughton | Sun 14 September 2014 | biotech | open data kaggle data science

I've been messing around in a Kaggle competition, and one of the frustrating aspects is waiting for my laptop to finish running analyses. It makes sense to me that I should be using someone else's cloud for this, since then I could get a bigger processor or more processors when I need them.

There used to be a nice Python-focused platform called PiCloud that did this. Essentially, PiCloud was selling a convenient interface to AWS and taking a cut. I never really used it but the execution was very nice. Sadly, I don't think many people used PiCloud, and it has since been bought by Dropbox and shut down.

The Python-based options for cloud data-analysis I know about right now are Wakari and Domino.

Wakari

Wakari is from the awesome Continuum Analytics guys (makers of Anaconda), so it's Python only. It's simply an IPython notebook that runs in your browser, with an AWS instance on the backend. They have an unlimited free tier, which is very nice for testing it out. It's not too expensive, but there is no pure pay-as-you-go option, which is my preference. To get more compute power than my laptop (which has good oomph at four 2.7GHz cores and 16GB of RAM) would be pretty expensive: even the $100/month premium option only has 3GB of RAM.

Domino

Domino appears to be a new company, and is purely focused on "data science". It supports Python, R, Julia and Matlab. It is a great concept, and has some really nice features, such as the ability to expose your model and results as API endpoints. There's a lot of scope for Domino to add value on top of the output of your analyses with UIs, APIs, etc.

Despite its great promise, I found the execution of Domino very off-putting:

  • it doesn't have seaborn (a Python library) installed. Why not just install every reasonably common library?
  • the introductory/free tier has a two hour limit and a clock counting down at you so you can't just try it out in peace. This should work more like a cellphone data cap, and just slow down when you hit the limit. I don't want to start something if I have to keep watching the clock.
  • a minor point, but every running instance has an associated animated loading spinner, which is just distracting.
  • the lowest tier ("hobby"), where you just pay per minute of CPU, is limited to one(!) project. To have five projects at a time I need to pay an additional $99 per month(!!). Why would I want to start using something with such restrictive limits? I would guess that the more projects I have going at the same time, the more I would end up paying Domino. Bizarre...

So unfortunately, until someone comes up with a good alternative, I am back to putting my laptop's CPU and noisy fans to work.

Brian Naughton | Sat 13 September 2014 | biotech | open data neuroscience drugs

NRDD paper on open data

Brian Naughton | Thu 11 September 2014 | data | open data virtual biotech

Open Data resources

