Brian Naughton // Mon 22 September 2014 // Filed under data // Tags open data kaggle data science

(Or, How To Kaggle)

The Africa Soil Property Prediction Challenge is a Kaggle competition where the goal is to predict soil measurements (like calcium levels) at sites across Africa from infrared spectroscopy readings. Seems like a worthwhile thing to do!

The Kaggle forums are a great place to pick up information on some of the practicalities of applied machine learning. In that spirit I thought I would share the (moderately successful) IPython/scikit-learn code I used.

Here is the notebook of code I used to make my predictions: (HTML version, IPython notebook.) When I originally ran this code it was in the top 10% of entries, but now it's way down the leaderboard. The winning regression method in my tests, by quite a distance, was Support Vector Regression.
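For readers who have not used Support Vector Regression in scikit-learn, here is a minimal sketch of the kind of model comparison described above. It is not the code from the notebook: the data is randomly generated rather than real soil spectra, and the helper names (`make_fake_spectra`, `cv_rmse`), the choice of Ridge as a baseline, and the hyperparameters (`C`, `epsilon`, `alpha`) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def make_fake_spectra(n_samples=200, n_features=50, seed=0):
    """Generate synthetic 'spectra' and a noisy nonlinear target (illustrative only)."""
    rng = np.random.RandomState(seed)
    X = rng.normal(size=(n_samples, n_features))
    w = rng.normal(size=n_features)
    y = np.tanh(X @ w / np.sqrt(n_features)) + 0.1 * rng.normal(size=n_samples)
    return X, y


def cv_rmse(model, X, y, folds=5):
    """Root of the mean squared error averaged over cross-validation folds."""
    scores = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")
    return float(np.sqrt(-scores.mean()))


X, y = make_fake_spectra()

# Both models are scaled first; SVR in particular is sensitive to feature scale.
models = {
    "ridge": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    "svr": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.1)),
}

results = {name: cv_rmse(model, X, y) for name, model in models.items()}
for name, rmse in sorted(results.items(), key=lambda kv: kv[1]):
    print(f"{name}: CV RMSE = {rmse:.3f}")
```

Which model comes out ahead on made-up data says nothing about the competition; the point is only the shape of the workflow: wrap each regressor in a pipeline, score it with cross-validation, and compare.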

Brian Naughton // Sun 14 September 2014 // Filed under biotech // Tags open data kaggle data science

I've been messing around in a Kaggle competition, and one of the frustrating aspects is waiting for my laptop to finish running analyses. It makes sense to me that I should be using someone else's cloud for this, since then I could get a bigger processor or more processors when I need them.

There used to be a nice Python-focused platform called PiCloud that did this. Essentially, PiCloud was selling a convenient interface to AWS and taking a cut. I never really used it but the execution was very nice. Sadly, I don't think many people used PiCloud, and it has since been bought by Dropbox and shut down.

The Python-based options for cloud data analysis that I know about right now are Wakari and Domino.

Wakari

Wakari is from the awesome Continuum Analytics guys (makers of Anaconda), so it's Python only. It's simply an IPython notebook that runs in your browser, with an AWS instance on the backend. They have an unlimited free tier, which is very nice for testing it out. It's not too expensive, but there is no pure pay-as-you-go option, which would be my preference. Getting more compute power than my laptop (which has good oomph at four 2.7GHz cores and 16GB of RAM) would be pretty expensive: even the $100/month premium option only has 3GB of RAM.

Domino

Domino appears to be a new company, and is purely focused on "data science". It supports Python, R, Julia and MATLAB. It is a great concept, and has some really nice features, such as the ability to expose your model and results as API endpoints. There's a lot of scope for Domino to add value on top of the output of your analyses with UIs, APIs, etc.

Despite its great promise, I found the execution of Domino very off-putting:

  • it doesn't have seaborn (a Python library) installed. Why not just install every reasonably common library?
  • the introductory/free tier has a two-hour limit and a clock counting down at you, so you can't just try it out in peace. This should work more like a cellphone data cap, and just slow down when you hit the limit. I don't want to start something if I have to keep watching the clock.
  • a minor point, but every running instance has an associated animated loading spinner, which is just distracting.
  • the lowest tier ("hobby"), where you just pay per minute of CPU, is limited to one(!) project. To have five projects at a time I need to pay an additional $99 per month(!!). Why would I want to start using something with such restrictive limits? I would guess that the more projects I have going at the same time, the more I would end up paying Domino. Bizarre...
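The first complaint in the list is easy to check for yourself before committing to any hosted environment. Here is a small sketch using only the standard library; the function name `missing_packages` and the list of packages are my own choices for illustration, not anything Domino or Wakari provides.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An illustrative wish list; adjust to whatever your analysis needs.
wanted = ["numpy", "pandas", "sklearn", "seaborn"]
missing = missing_packages(wanted)
print("missing:", ", ".join(missing) if missing else "nothing")
```

Running this in a fresh notebook on the service takes a few seconds and tells you up front whether you will be stuck installing things yourself.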

So unfortunately, until someone comes up with a good alternative, I am back to putting my laptop's CPU and noisy fans to work.


Boolean Biotech © Brian Naughton Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More