UK Biobank now allows researchers to request genotype data for their studies.
Data from the first 150,000 individuals (of 500,000 total) will be released in May,
with the rest coming toward the end of the year.
The data comes from an 800k Affymetrix Axiom chip, which, unlike the Affy chips of old,
produces good quality data. You can also request imputed whole genomes, which will be
huge for many kinds of GWAS.
You can check to see if they have your SNP of interest here.
NIH Data Science
NIH has launched a new data science site.
This new site incorporates the old BD2K ("Big Data to Knowledge") site, which is going away.
This seems like a positive indicator for funding opportunities in data science,
though I don't see any direct indication of that on this site.
The Africa Soil Property Prediction Challenge is a Kaggle competition
where you try to predict various soil measurements (like Calcium levels)
across different parts of Africa from infrared spectroscopy readings.
Seems like a worthwhile thing to do!
The Kaggle forums are a great place to pick up information on some of the
practicalities of applied machine learning.
In that spirit I thought I would share the (moderately successful)
IPython/scikit-learn code I used.
Here is the notebook of code I used to make my predictions:
When I originally ran this code it was in the top 10% of entries, but now it's way down
the leaderboard. The winning regression method in my tests, by quite a distance,
was Support Vector Regression.
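As a minimal sketch of the winning approach, here is scikit-learn's SVR fit on made-up spectra-like features (the actual data loading, preprocessing, and cross-validation from my notebook are omitted; the hyperparameters here are illustrative, not the ones I tuned):

```python
import numpy as np
from sklearn.svm import SVR

# Toy stand-in for the spectroscopy data: 100 samples, 20 "wavelength" features.
rng = np.random.RandomState(0)
X = rng.rand(100, 20)
y = X[:, 0] * 2.0 + rng.normal(scale=0.1, size=100)  # fake "Calcium" target

model = SVR(kernel="rbf", C=10.0)  # C and kernel are worth a grid search
model.fit(X, y)
preds = model.predict(X)
print(preds.shape)  # one prediction per sample
```

In practice you would fit one such model per soil property and pick C (and the kernel) by cross-validation.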
Recently, I've been using IPython notebook for some data analysis.
It's pretty janky in places, like most browser-based software,
but it's the closest thing Python has to an interactive environment.
It definitely saves a bunch of time if you have a long series of independent
data transformations and analyses.
Ideally, I'd be able to work on a notebook locally, and when I have some heavy computation
to do, transparently send it to the cloud.
With IPython it's technically possible to do that on a cell by cell basis
using ipcluster, but it would be difficult to integrate that with a third-party cloud service.
It's also not an elegant system: for example, you have to manually do imports on each of
your ipcluster nodes.
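For illustration, the manual-imports chore looks roughly like this. This is an untested sketch: it assumes a cluster already started with `ipcluster start -n 4`, and uses the Client/sync_imports API from IPython.parallel (later split out into the ipyparallel package):

```python
def prepare_engines():
    # Imported inside the function so this sketch loads without a running cluster.
    from IPython.parallel import Client

    rc = Client()
    dview = rc[:]  # a DirectView over all engines
    # Replay these imports on every engine -- this is the manual step
    # you have to remember for each module your parallel cells touch.
    with dview.sync_imports():
        import numpy
        import sklearn
    return dview
```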
A simpler method is just to dump the IPython notebook to a Python file, then
send the entire script to the cloud. Assuming there is one long-running bottleneck
task in there, this should take about the same amount of time to run.
Using Domino/AWS with IPython
I'm still not convinced about Domino as an AWS intermediary — for example, once my
free trial ends I think I am limited to just one project on there — but that's ok since
the following process has only a few Domino-specific elements.
I needed to add seaborn since it was not included in Domino's standard set of imports.
For some reason I needed to include numexpr to get pandas to work.
It's probably a good idea to include as little as possible in requirements.txt
(i.e., don't pip freeze) since Domino has to install everything you ask for anew
with each run.
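In my case that meant a requirements.txt of just two lines (versions left unpinned here; pin them if reproducibility matters to you):

```
seaborn
numexpr
```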
It's important that I use all available CPUs when I am running on AWS/Domino,
otherwise I am wasting money. I need a few CPUs free on my laptop though...
I test this by just checking if the platform is Linux.
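The check itself is tiny. This is roughly what I do (the "leave two CPUs free locally" number is just my preference):

```python
import platform
from multiprocessing import cpu_count

def n_jobs():
    # On the Domino/AWS boxes (Linux) use every CPU we're paying for;
    # on my laptop, keep a couple free so it stays usable.
    if platform.system() == "Linux":
        return cpu_count()
    return max(1, cpu_count() - 2)

print(n_jobs())
```

The returned value can be passed straight to scikit-learn estimators that take an n_jobs parameter.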
IPython's nbconvert preprocessor
IPython's handy nbconvert function converts IPython notebooks into other formats:
most commonly, pure Python, HTML or PDF.
An IPython preprocessor is a little function that takes the cells in your notebook,
and does something to them before nbconvert gets to them.
Here I am using a preprocessor to do a few little things:
comment out IPython "magic" commands (these will cause errors if run outside IPython)
skip cells that have a special "SKIPCELL" comment (surprisingly, there's no IPython magic command for this)
make the "print" function print to a file. For some reason Domino appears to munge
stdout and stderr into one file. By making the "print" function print to a file I can
separate my results from warnings etc.
I'd also like to dump all my inline plots to files, but I don't know how to do that.
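A simplified sketch of the first two transformations, written against plain cell-source strings rather than the real nbconvert Preprocessor class (the function name is mine, and the print-to-file redirection is omitted):

```python
def transform_cell(source):
    """Apply the notebook-cleanup rules to one code cell's source.

    Returns None when the cell should be dropped entirely.
    """
    if "SKIPCELL" in source:
        return None  # cell marked for exclusion from the exported script
    lines = []
    for line in source.splitlines():
        # Comment out IPython magics and shell escapes, which would be
        # syntax errors in a plain Python script.
        if line.lstrip().startswith(("%", "!")):
            line = "# " + line
        lines.append(line)
    return "\n".join(lines)

print(transform_cell("%matplotlib inline\nx = 1"))
```

The real preprocessor does the same thing inside nbconvert's preprocess_cell hook, operating on cell objects instead of raw strings.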
Running the code on Domino
Finally, I created a small script that generates the cleaned-up Python file,
sends it to Domino to run
and opens the results page in a browser window.
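The script amounts to three commands run in sequence. Roughly the following, where the notebook name and results URL are placeholders and the `domino run` invocation is from memory, so treat it as a sketch:

```python
def build_commands(notebook):
    """Commands to convert the notebook, run it remotely, and view results."""
    script = notebook.replace(".ipynb", ".py")
    return [
        ["ipython", "nbconvert", "--to", "python", notebook],  # emit cleaned .py
        ["domino", "run", script],                             # hypothetical Domino CLI call
        ["open", "https://app.dominodatalab.com/"],            # placeholder results URL
    ]

# In the real script each command is handed to subprocess.check_call;
# here we just print them.
for cmd in build_commands("analysis.ipynb"):
    print(" ".join(cmd))
```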
Using this script I can trivially run my Kaggle script on 32 CPUs without sending my
laptop into paroxysms. All in all it works pretty well.
Domino Pricing vs AWS
Domino sells standard "Compute Optimized" AWS instances, and marks them up 100%.
It's not a bad deal for light to moderate users,
considering that the AWS environments are already spun up and waiting,
and that Domino charges by the minute instead of by the hour.
For comparison purposes, 32 CPUs cost $1.68 an hour on AWS vs $3.36 on Domino.
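For short jobs, the by-the-minute billing matters more than the 2x markup. A quick back-of-the-envelope comparison (assuming AWS bills by the full hour, as it did at the time):

```python
import math

AWS_HOURLY = 1.68     # 32-CPU compute-optimized instance, direct from AWS
DOMINO_HOURLY = 3.36  # same instance via Domino (100% markup)

def aws_cost(minutes):
    # AWS rounds up to whole hours.
    return math.ceil(minutes / 60.0) * AWS_HOURLY

def domino_cost(minutes):
    # Domino charges by the minute.
    return (minutes / 60.0) * DOMINO_HOURLY

# A 10-minute job: Domino is actually cheaper despite the markup.
print(aws_cost(10), round(domino_cost(10), 2))  # 1.68 0.56
```

The break-even point is a half-hour job; past that, going direct to AWS wins on price.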