Brian Naughton // Tue 14 April 2015 // Filed under data // Tags biobanks uk biobank big data data science nih

UK Biobank

UK Biobank now allows researchers to request genotype data for their studies. Data from the first 150,000 individuals (of 500,000 total) will be released in May, with the rest coming toward the end of the year. The data comes from an 800k Affymetrix Axiom chip, which, unlike the Affy chips of old, produces good-quality data. You can also request imputed whole genomes, which will be huge for many kinds of GWAS. You can check to see if they have your SNP of interest here.

NIH Data Science

NIH has launched a new data science site, which incorporates the old BD2K ("Big Data to Knowledge") site, now going away. This seems like a positive indicator for funding opportunities in data science, though I don't see any direct statement to that effect on the site.

Brian Naughton // Mon 22 September 2014 // Filed under data // Tags open data kaggle data science

(Or, How To Kaggle)

The Africa Soil Property Prediction Challenge is a Kaggle competition where you try to predict various soil measurements (like calcium levels) at sites across Africa from infrared spectroscopy readings. Seems like a worthwhile thing to do!

The Kaggle forums are a great place to pick up information on some of the practicalities of applied machine learning. In that spirit I thought I would share the (moderately successful) IPython/scikit-learn code I used.

Here is the notebook of code I used to make my predictions: (HTML version, IPython notebook.) When I originally ran this code it was in the top 10% of entries, but now it's way down the leaderboard. The winning regression method in my tests, by quite a distance, was Support Vector Regression.
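For illustration, here's a minimal sketch of fitting Support Vector Regression with scikit-learn on stand-in data (the features, target, and hyperparameters are placeholders, not the actual competition setup):

```python
import numpy as np
from sklearn.svm import SVR

# Stand-in for the real data: rows are infrared spectra, the target is a
# soil measurement such as calcium level. (Synthetic, for illustration only.)
rng = np.random.RandomState(0)
X = rng.rand(200, 50)                # 200 samples, 50 spectral features
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)

# RBF-kernel Support Vector Regression; C and epsilon here are illustrative.
# In practice you'd tune them with cross-validation.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
model.fit(X, y)
predictions = model.predict(X)
```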

Brian Naughton // Mon 15 September 2014 // Filed under biotech // Tags open data domino data science ipython

IPython Notebook

Recently, I've been using IPython notebook for some data analysis. It's pretty janky in places, like most browser-based software, but it's the closest thing Python has to an interactive environment. It definitely saves a bunch of time if you have a long series of independent data transformations and analyses.

Ideally, I'd be able to work on a notebook locally, but if I have some heavy computation to do, I could transparently send that to the cloud. With IPython it's technically possible to do that on a cell-by-cell basis using ipcluster, but it would be difficult to integrate that with a third-party cloud provider. It's also not an elegant system: for example, you have to manually repeat your imports on each of your ipcluster nodes.

A simpler method is just to dump the IPython notebook to a Python file, then send the entire script to the cloud. Assuming there is one long-running bottleneck task in there, this should take about the same amount of time to run.

Using Domino/AWS with IPython

I'm still not convinced about Domino as an AWS intermediary (for example, once my free trial ends I think I am limited to just one project on there), but that's ok since the following process has only a few Domino-specific elements.

requirements.txt

I needed to add seaborn since it was not included in Domino's standard set of imports. For some reason I needed to include numexpr to get pandas to work. It's probably a good idea to include as little as possible in requirements.txt (i.e., don't pip freeze) since Domino has to install everything you ask for anew with each run.
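As a sketch, the resulting requirements.txt stays short; beyond the packages mentioned above, the exact contents are guesses:

```
pandas
numexpr        # needed to get pandas working on Domino
scikit-learn
seaborn        # not in Domino's standard environment
```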

IPython setup

It's important that I use all available CPUs when I am running on AWS/Domino, otherwise I am wasting money. I need a few CPUs free on my laptop though, so I distinguish the two environments by just checking whether the platform is Linux.
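A minimal sketch of that check (the function name `n_jobs` is mine):

```python
import multiprocessing
import platform

def n_jobs():
    """How many CPUs to use: everything on AWS/Domino, fewer locally.

    Assumption: the cloud machines run Linux and my laptop doesn't, so
    the platform name is a usable (if crude) proxy for where we are.
    """
    n_cpus = multiprocessing.cpu_count()
    if platform.system() == "Linux":
        return n_cpus              # on AWS/Domino: use every CPU
    return max(1, n_cpus - 2)      # on the laptop: leave a couple free
```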

IPython's nbconvert preprocessor

IPython's handy nbconvert tool converts IPython notebooks into other formats: most commonly pure Python, HTML, or PDF.

An IPython preprocessor is a little function that takes the cells in your notebook, and does something to them before nbconvert gets to them. Here I am using a preprocessor to do a few little things:

  • comment out IPython "magic" commands (these will cause errors if run outside IPython)
  • skip cells that have a special "SKIPCELL" comment (surprisingly, there's no IPython magic command for this)
  • make the "print" function write to a file. For some reason Domino appears to munge stdout and stderr into one file; printing to a separate file keeps my results apart from warnings etc.

I'd also like to dump all my inline plots to files, but I don't know how to do that.
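The gist of the preprocessor can be sketched as plain functions over notebook cells (the real version hooks into nbconvert's preprocessor API; the function names and the cell-dict shape here are simplified assumptions):

```python
def clean_source(source):
    """Comment out IPython magics and shell escapes, which would be
    syntax errors in a plain Python script."""
    cleaned = []
    for line in source.splitlines():
        if line.lstrip().startswith(("%", "!")):
            cleaned.append("# " + line)
        else:
            cleaned.append(line)
    return "\n".join(cleaned)

def clean_cells(cells):
    """Drop cells marked with a SKIPCELL comment; clean the rest.
    Each cell is assumed to be a dict with a 'source' string."""
    kept = []
    for cell in cells:
        source = cell.get("source", "")
        if "SKIPCELL" in source:
            continue
        kept.append(dict(cell, source=clean_source(source)))
    return kept
```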

Running the code on Domino

Finally, I created a small script that generates the cleaned-up Python file, sends it to Domino to run and opens the results page in a browser window.
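A sketch of what such a script might look like (the `domino run` invocation, the IPython 2.x-era `ipython nbconvert` CLI, and all names here are assumptions; only the command-building logic is shown):

```python
import subprocess
import webbrowser

def build_commands(notebook):
    """Build the convert-and-run commands for a notebook.
    Assumes `ipython nbconvert` and a `domino run` CLI command."""
    script = notebook.rsplit(".", 1)[0] + ".py"
    convert = ["ipython", "nbconvert", "--to", "python", notebook]
    run = ["domino", "run", script]
    return convert, run

def main(notebook, project_url):
    convert, run = build_commands(notebook)
    subprocess.check_call(convert)        # notebook -> cleaned .py file
    subprocess.check_call(run)            # ship the script to Domino
    webbrowser.open(project_url)          # watch the run's results page
```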

Using this script I can trivially run my Kaggle script on 32 CPUs without sending my laptop into paroxysms. All in all it works pretty well.

Domino Pricing vs AWS

Domino resells standard "Compute Optimized" AWS instances at a 100% markup. It's not a bad deal for light-to-moderate users, considering that the AWS environments are already spun up and waiting, and that Domino charges by the minute instead of by the hour. For comparison, 32 CPUs cost $1.68 an hour on AWS vs $3.36 on Domino.

Brian Naughton // Sun 14 September 2014 // Filed under biotech // Tags open data kaggle data science

Paying someone else for their computer time


Boolean Biotech © Brian Naughton Powered by Pelican and Twitter Bootstrap. Icons by Font Awesome and Font Awesome More