Brian Naughton // Mon 15 February 2021 // Filed under genomics // Tags sequencing genomics nanopore protocol

In this post I'll describe how to sequence a human genome at home, something that's only recently become possible. The protocol described here is not necessarily the best way to do this, but it's what has worked best for me. It costs a few thousand dollars in equipment to get started, but the (low-coverage) sequencing itself requires only $150, a couple of hours of work, and almost no lab skills.

What does it mean to sequence a human genome?

First, it's useful to explain some terms: specifically, to differentiate a regular reference-based genome assembly from a de novo genome assembly.

Twenty years or so ago, the Human Genome Project resulted in one complete human genome sequence (actually combining data from several humans). Since all humans are 99.9%+ identical genetically, this reference genome can be used as a template for any human. The easiest way to sequence a human genome is to generate millions of short reads (100-300 base-pairs) and align them to this reference.
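As a rough illustration (not part of my protocol), aligning reads to the reference can be done in a few lines of Python with mappy, the minimap2 bindings; the file names here are placeholders.

import mappy

# build/load an index of the reference; preset "sr" for short reads, "map-ont" for nanopore
aligner = mappy.Aligner("GRCh38.fa", preset="sr")
for name, seq, qual in mappy.fastx_read("reads.fastq"):
    for hit in aligner.map(seq):          # alignments for this read
        print(name, hit.ctg, hit.r_st, hit.r_en, hit.mapq)
        break                             # just report the first hit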

The alternative to this reference-based assembly is a de novo assembly, where you figure out the genome sequence without using the reference, by stitching together overlapping sequencing reads. This is much more difficult computationally (and actually impossible if your reads are too short), but the advantage is that you can potentially see large differences compared to the reference. For example, it's not uncommon to have some sequence in a genome that is not present in the reference genome, so-called structural variants.

There are also gaps in the reference genome, especially at the ends and middle of chromosomes, due to highly repetitive sequence. In fact, the first full end-to-end human chromosome was only sequenced last year, thanks to ultra-long nanopore reads.

For non-human species, genome assembly is usually de novo, either because the genomes are small and non-repetitive (bacteria), or there is no reference (newly sequenced species).

SNP chip

The cheapest way to get human genome sequence data is with a SNP chip, like the 23andMe chip. These chips work by measuring variation at specific, pre-determined positions in the genome. Since we know the positions that commonly vary in the human genome, we can just examine a few hundred thousand of those positions to see most of the variation. You can also accurately impute tons of additional variants not on the chip. The reason this is "genotyping" and not "sequencing" is that you don't get a contiguous sequence of As, Cs, Gs, and Ts. The major disadvantage of SNP chips is that you cannot directly measure variants not on the chip, so you miss things, especially rare and novel variants. On the other hand, the accuracy for a specific variant of interest (e.g., a recessive disease variant like cystic fibrosis ΔF508) is probably higher than from a sequenced genome.

Short-read sequencing

Short-read sequencing is almost always done with Illumina sequencers, though other short-read technologies are emerging. These machines output millions or billions of 100-300 base-pair reads that you can align to the reference human genome. Generally, people like to have on average 30X coverage of the human genome (~100 gigabases) to ensure high accuracy across the genome.
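The arithmetic behind that number is simple (a quick sanity check, assuming a 3.1 gigabase haploid genome and 150bp reads):

genome_size_bp = 3.1e9     # haploid human genome
coverage = 30
read_length_bp = 150
total_bases = genome_size_bp * coverage            # ~93 gigabases
num_reads = total_bases / read_length_bp           # ~620 million reads
print(f"{total_bases/1e9:.0f} Gb, ~{num_reads/1e6:.0f}M x {read_length_bp}bp reads")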

Although you can read variants not present on a SNP chip, this is still not a complete genome: coverage is not equal across the genome, so some regions will likely have too low coverage to call variants; the reference genome is incomplete; some structural variants (insertions, inversions, repetitive regions) cannot be detected with short reads.

Long-read sequencing

The past few years have seen single-molecule long-read sequencing develop into an essential complement and sometimes credible alternative to Illumina. The two main players, Pacific Biosciences and Oxford Nanopore (ONT), now offer mature technologies. The big advantage of these technologies is that you get reads much longer than 300bp — from the low thousands up to megabases on ONT in extreme examples — so assembly is much easier. This enables de novo assembly, and is especially helpful with repetitive sequence. For this reason, long-read sequencing is almost essential for sequencing new species, especially highly repetitive plant genomes.

Sounds great! Why do people still use Illumina then? The per-base accuracy and per-base cost of Illumina is still considerably better than these competitors (though ONT's PromethION is getting close on price).

One huge advantage that ONT has over competitors is that the instrument is a fairly simple solid-state device that reads electrical signals from the nanopores. Since most of the technology is in the consumable "flow-cell" of pores, the instruments can be tiny and almost free to buy.

Instead of spending $50k-1M on a complex machine that requires a service contract, etc., you can get a stapler-sized MinION sequencer for almost nothing, and you can use it almost anywhere. ONT have also done a great job driving the cost per experiment down, especially by releasing a lower-output flow-cell adaptor called the flongle. Flongle flow-cells cost only $90 each, and produce 100 megabases to >1 gigabase of sequence.

There is a great primer on how nanopore sequencing works at nanoporetech.com.


Nanopore Sequencing Equipment

(Note, to make this article stand alone, I copied text from my previous home lab blogpost.)

Surprisingly, you don't actually need much expensive equipment to do nanopore sequencing.

In my home lab, I have the following:

  • A MinION sequencer
  • Pipettes (down to 0.5 µl)
  • An Eppendorf 5415C microcentrifuge
  • An Anova sous vide, which doubles as a water bath
  • A fridge for storing flow-cells

Optional equipment:

  • A wireless fridge thermometer. This was only $25 and it works great! It's useful to be able to keep track of temperatures in your fridge or freezer. Some fridges can get cold enough to freeze, which is deadly for flow-cells.
  • A GeneQuant to check the quality of DNA extractions. A 20-year-old machine cost me about $150 on eBay. It's a useful tool, but does require quite a lot of sample (I use 200 µl). I wrote a bit more about it here.

The lab during a sequencing run. MinION running and used flow-cell on the desk in front

(a) Eppendorf 5415C centrifuge. (b) Anova sous vide in a Costco coffee can

Protocol Part One: extracting DNA

The first step in sequencing is DNA extraction (i.e., isolating DNA from biological material). I use a Zymo Quick-DNA Microprep Plus Kit, costing $132. It's 50 preps, so a little under $3 per prep. There are other kits out there, like NEB's Monarch, but these are harder to buy (requiring a P.O., or business address).

The Zymo kit takes "20 minutes" (it takes me about 40 minutes including setup). It is very versatile: it can work with "cell culture, solid tissue, saliva, and any biological fluid sample". This prep is pretty easy to do, and all the reagents except Proteinase K are just stored at room temperature. They claim it can recover >50kb fragments, and anecdotally, this is the maximum length I have seen. That is far from the megabase-long "whale" reads some labs can achieve, but those preps are much more complex and time-consuming. Generally speaking, 10kb fragments are more than long enough for most use-cases.

Protocol Part Two: library prep

Library prep is the process of preparing the DNA for sequencing, for example by attaching the "motor protein" that ratchets the DNA through the pore one base at a time. The rapid library prep (RAD-004) is the simplest and quickest library prep method available, at $600 for 12 preps ($50 per prep).

Library prep is about as difficult as DNA extraction, and takes around 30 minutes. There are some very low volumes involved (down to 0.5µl, which is as low as my pipettes go), and you need to use two water bath temperatures, but overall it's pretty straightforward.

The total time from acquiring a sample to beginning sequencing could be as little as 60-90 minutes. You do pay for this convenience in lower read lengths and lower throughput though.

The Data

The amount of data you can get from ONT/nanopore varies quite a lot. There is a fundamental difference between Illumina and nanopore in that nanopore is single-molecule sequencing. With nanopore, each read represents a single DNA molecule traversing the pore. With Illumina, a read is an aggregated signal from many DNA molecules (which contributes to the accuracy).

So, nanopore is really working with the raw material you put in. If there are contaminants, then they can jam up the pores. If there are mostly short DNA fragments in the sample, you will get mostly short reads. Over time, the pores degrade, so you won't get as much data from a months-old flow-cell as a new one.

Using the protocol above, I have been able to get around 100-200 megabases of data from one flongle ($1 per megabase!). There are probably a few factors contributing to this relatively low throughput: the rapid kit does not work as well as the more complex ligation kit; I don't do a lot of sequencing, so the protocol is certainly executed imperfectly; my flow-cells are not always fresh.

For a human sample, 100 megabases is less than a 0.1X genome, which raises the fair question: why would you want to do that? Today, the answer is mainly just because you can. You could definitely do some interesting ancestry analyses, but it would be difficult to validate without a reference database. Gencove also has several good population-level use-cases for low-pass sequencing.
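For reference, here are the back-of-the-envelope numbers for a flongle run with this protocol (my own arithmetic, using the costs mentioned above):

genome_size_bp = 3.1e9
flongle_yield_bp = 150e6                     # ~100-200 megabases per run
run_cost = 90 + 50 + 3                       # flongle flow-cell + rapid prep + extraction, USD
coverage = flongle_yield_bp / genome_size_bp
cost_per_mb = run_cost / (flongle_yield_bp / 1e6)
print(f"~{coverage:.2f}X human genome at ~${cost_per_mb:.2f} per megabase")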

The next step up from a flongle is a full-size MinION flow-cell, which runs on the same equipment and uses the same protocol, but costs around $900, and in theory can produce up to 42 gigabases of sequence. This would be a "thousand dollar genome", though the accuracy is probably below what you would want for diagnostic purposes. In a year or two, I may be able to generate a diagnostic-quality human genome at home for around $1000, perhaps even a decent de novo assembly.

Brian Naughton // Sat 02 January 2021 // Filed under datascience // Tags metrology datascience

Youtuber Tesla Bjorn has a great channel reviewing electric cars. Unusually, he tests the range of the cars in a consistent and rigorous way, and posts all his results to a nicely-organized Google Sheet.

One interesting thing I picked up from his channel is that the Porsche Taycan has much worse range than a Tesla with its stock tires, but excellent range of 360mi/580km with smaller, more efficient tires (and better weather). One obvious question I had was: how much do tire size and temperature affect range? Since the data was all there, I decided to take a look.

This was also a good excuse to play with brms, a Bayesian multilevel modeling tool for R. I won't explain multilevel modeling here — others have done a much better job at this than I could (see various links below) — but at least in my mind, it's like regular regression, but can incorporate information on subgroups and subgroup relationships. McElreath argues it should just be the default way to do regression, since the downsides are minimal. This blogpost makes a similar case, and helpfully notes that there is evidence that you should generally "use the 'maximal' random effects structure supported by the design of the experiment". I take this to mean we should not worry too much about overfitting the model.

Other resources I like include:

Bambi vs brms

Even though this exercise was an excuse to try brms, I ended up mostly using Bambi. They are very similar tools, except Bambi is built on Python/PyMC3 and brms is built on R/Stan (pybrms also exists but I could not get it to work). I am assuming Bambi was influenced by brms too.

They both follow very similar syntax to widely-used R regression tools like lme4. Using Bambi also lets me use the excellent Arviz plotting package, and I can still use the more battle-tested brms to confirm my results.

Simulating data

Simulated data is always useful for these models. It really helps you get comfortable with the process that generates the data, and makes sure the model can produce reasonable results under ideal conditions.

My model is an attempt to find the "base efficiency" of each car. Each car has a total tested efficiency, measured in Wh/km, which you can easily calculate based on the range (km) and the battery size (Wh).
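For example (made-up round numbers), a car with a 100 kWh battery that manages 500 km in a test used:

battery_wh = 100 * 1000    # 100 kWh battery
range_km = 500             # tested range
whk = battery_wh / range_km
print(f"{whk:.0f} Wh/km")  # 200 Wh/km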

In my model, a car's measured Wh/km (Whk) is the sum of effects due to:

  • the car's base efficiency (car)
  • the weight of the car in tons (ton)
  • the ambient temperature (dgC)
  • the tire size in inches (tir)
  • the surface being wet or dry (wet)
  • the tested speed (spd — Tesla Bjorn tests every vehicle at 90km/h and 120km/h).

The simulated data below match the real data pretty well.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

num_cars = 10
mu_Whk = 100   # nominal base efficiency, Wh/km (per-car values are set via has_Whk below)
mu_dgC = 2     # temperature effect, Wh/km per degree C (drawn below but not added to Whk)
mu_wet = 10    # effect of a wet surface, Wh/km
mu_tir = 4     # effect of each inch of tire, Wh/km
mu_spd = 50    # effect of testing at 120km/h vs 90km/h, Wh/km
mu_wgt = 2     # mean car weight, tons
mu_ton = 50    # effect of each ton of weight, Wh/km

# parameters of each car
has_Whk = np.arange(50, 150, (150-50)/num_cars)
has_ton = np.random.normal(mu_wgt, mu_wgt/5, num_cars).clip(mu_wgt/2)
per_ton = np.random.normal(mu_ton, mu_ton/5, num_cars).clip(mu_ton/5)
per_dgC = np.random.normal(mu_dgC, mu_dgC/5, num_cars).clip(mu_dgC/5) * -1
per_wet = np.random.normal(mu_wet, mu_wet/5, num_cars).clip(mu_wet/5)
per_tir = np.random.normal(mu_tir, mu_tir/5, num_cars).clip(mu_tir/5)
per_spd = np.random.normal(mu_spd, mu_spd/5, num_cars).clip(mu_spd/5)

num_samples = 100
data = []
for _ in range(num_samples):
    car = np.random.choice(num_cars)
    dgC = np.random.normal(12, 4)
    wet = int(np.random.random() > 0.75)
    tir = max(0, int(np.random.normal(18,1)) - 16)
    for spd in [0, 1]:
        Whk = (has_Whk[car] + per_ton[car] * has_ton[car] + per_tir[car] * tir +
               (per_spd[car] if spd else 0) + (per_wet[car] if wet else 0))
        data.append([f"{car:02d}", has_ton[car], dgC, wet, tir, spd, Whk, has_Whk[car]])

df_sim = pd.DataFrame(data, columns=['car', 'ton', 'dgC', 'wet', 'tir', 'spd', 'Whk', 'has_Whk'])
assert len(set(df_sim.car)) == num_cars, "unlucky sample"
df_sim.to_csv("/tmp/df_ev_range_sim.csv")
df_sim.head()
    car ton         dgC         wet tir spd Whk         has_Whk
0   12  2.386572    15.096731   0   2   0   229.106858  110.0
1   12  2.386572    15.096731   0   2   1   293.139807  110.0
2   10  2.083366    12.632114   0   1   0   162.626350  100.0
3   10  2.083366    12.632114   0   1   1   221.408095  100.0
4   19  1.848348    13.203525   0   2   0   256.221089  145.0

Now I can build my model. The model is pretty simple: all priors are just T distributions (a bit fatter-tailed than normal). I originally bounded some of the priors (since e.g., a car cannot use <0 Wh/km), but since I couldn't get that to work with brms, I removed the bounds to make comparisons easier. (The bounds did not materially change the results).

I have a group per car (1|car), and although I could also have groups for (1|spd) and (1|wet), I opted not to do that since it didn't change the results, but made the interpretation a little messier. I set the intercept to 0 so that the 1|car intercept can represent the base efficiency by itself.

import arviz as az
from bambi import Model, Prior

prior_car = Prior('StudentT', nu=3, mu=100, sigma=50)
prior_ton = Prior('StudentT', nu=3, mu=50, sigma=10)
prior_dgC = Prior('StudentT', nu=3, mu=-1, sigma=1)
prior_wet = Prior('StudentT', nu=3, mu=10, sigma=20)
prior_tir = Prior('StudentT', nu=3, mu=2, sigma=5)
prior_spd = Prior('StudentT', nu=3, mu=50, sigma=20)

model_sim = Model(df_sim)
print(f"{len(df_sim)} samples")
res_sim = model_sim.fit(
    'Whk ~ 0 + ton + dgC + tir + spd + wet',
    group_specific=["1|car"],
    categorical=['car'],
    priors = {'car': prior_car, "ton": prior_ton, "dgC": prior_dgC,
              "wet": prior_wet, "tir": prior_tir, "spd": prior_spd},
    draws=1000, chains=4
)
az.plot_trace(res_sim)
az.summary(res_sim)

The graph below shows that I can recover the base efficiency fairly well, though obviously there are limits to what can be done.
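A minimal way to check this (my own addition, reusing the '1|car' naming that appears in the real-data code below) is to compare the per-car posterior intercept means against the true simulated base efficiencies:

est = res_sim.posterior['1|car'].mean(dim=("chain", "draw"))
truth = df_sim.groupby('car')['has_Whk'].first()
print(est.values.round(1))    # estimated base efficiency per car
print(truth.values.round(1))  # true base efficiency per car (has_Whk)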

Real data

Now that this all looks good, I can try to model the real data.

import pandas as pd

def get_car(carstr):
    if carstr.lower().startswith('model'):
        carstr = 'Tesla ' + carstr
    if carstr.lower().startswith('tesla'):
        return '_'.join(carstr.split(' ')[:3]).replace('-','').lower()
    else:
        return '_'.join(carstr.split(' ')[:2]).replace('-','').lower()

# using google sheets data downloaded to tsv
df_wt = (pd.read_csv("/tmp/ev_weight.tsv", sep='\t'))
df_real = (pd.read_csv("/tmp/ev_range.tsv", sep='\t')
             .merge(df_wt, on='Car')
             .rename(columns={'Total':'weight'})
             .assign(tire_size=lambda df: df["Wheel front"].str.extract(r'-(\d\d)'))
             .dropna(subset=['Wh/km', 'tire_size', 'weight'])
             .assign(tire_size=lambda df: df.tire_size.astype(int))
             .assign(Temp=lambda df: df.Temp.str.replace(u"\u2212",'-').astype(float))
          )

df_clean = (
df_real
    .assign(car = lambda df: df.Car.apply(get_car))
    .assign(Whk = lambda df: df['Wh/km'])
    .assign(ton = lambda df: df.weight/1000)
    .assign(wet = lambda df: (df.Surface=="Wet").astype(int))
    .assign(dgC = lambda df: df.Temp)
    .assign(tir = lambda df: (df.tire_size-14).astype(float))
    .assign(spd = lambda df: (df.Speed==120).astype(int))
    .loc[lambda df: ~df.car.str.startswith('mitsubishi')]
    .loc[lambda df: ~df.car.str.startswith('maxus')]
    [["Whk", "car", "tir", "ton", "wet", "spd", "dgC"]]
    .dropna()
)
df_real.to_csv("/tmp/df_ev_range.csv")
df_clean.to_csv("/tmp/df_ev_range_clean.csv")
prior_car = Prior('StudentT', nu=3, mu=100, sigma=50)
prior_ton = Prior('StudentT', nu=3, mu=50, sigma=20)
prior_dgC = Prior('StudentT', nu=3, mu=-1, sigma=2)
prior_wet = Prior('StudentT', nu=3, mu=10, sigma=20)
prior_tir = Prior('StudentT', nu=3, mu=2, sigma=5)
prior_spd = Prior('StudentT', nu=3, mu=50, sigma=30)

model_clean = Model(df_clean)
print(f"{len(df_clean)} samples")
res_clean = model_clean.fit(
    'Whk ~ 0 + ton + dgC + tir + spd + wet',
    group_specific=["1|car"],
    categorical=['car', 'wet', 'spd'],
    priors = {'car': prior_car, "ton": prior_ton, "dgC": prior_dgC,
              "wet": prior_wet, "tir": prior_tir, "spd": prior_spd},
    draws=2000, chains=4
)

car_means = res_clean.posterior['1|car'].mean(dim=("chain", "draw"))
res_clean.posterior = res_clean.posterior.sortby(-car_means)
g = az.plot_forest(res_clean.posterior, kind='ridgeplot', var_names=['1|car'], combined=True,
                   colors='#eeeeee', hdi_prob=0.999);
g[0].set_xlabel('Wh/km');

Results

There is plenty of overlap in these posteriors, but they do show that e.g., Tesla, Hyundai and Porsche seem to have good base efficiency, while Jaguar, Audi, and Volvo appear to have poor efficiency.

We can also look at the estimated range for each car at 90km/h given dry weather, 18" tires, 25C and a 100kWh battery. (Caveat: I don't know how to adjust the range for additional battery weight, so there is an unfair advantage for cars with small batteries.) Still, these results show that once battery density increases a bit, several cars may be able to hit the 400mi/644km range.
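The arithmetic behind an estimate like that is roughly as follows (made-up round numbers here, chosen from within the effect ranges listed below; the real version plugs in each car's posterior means):

base_whk = 100                            # hypothetical base efficiency, Wh/km
per_ton, per_dgC, per_tir = 30, -1.0, 4   # hypothetical effect sizes, Wh/km per unit
whk_90 = base_whk + per_ton * 2.0 + per_dgC * 25 + per_tir * (18 - 14)  # 2 tons, 25C, 18" tires
range_km = 100 * 1000 / whk_90            # 100 kWh battery
print(f"{whk_90:.0f} Wh/km -> {range_km:.0f} km at 90 km/h, dry")       # 151 Wh/km -> 662 km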

It was also interesting to discover the contributions of other factors, though the numbers below are just +/-1 standard deviation and should be taken with plenty of salt (a quick way to read them off the fit is sketched after this list):

  • traveling at 120km/h adds 60–70 Wh/km compared to 90km/h;
  • each degree C colder adds 0.5–1.25 Wh/km;
  • each ton weight adds 20–40 Wh/km;
  • each inch of tire adds 2–6 Wh/km;
  • wet weather adds 10–20 Wh/km.
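Reading these effect sizes off the fit is a one-liner (exact variable names depend on the Bambi/ArviZ version, hence the 'like' filter):

az.summary(res_clean, var_names=['ton', 'dgC', 'tir', 'spd', 'wet'],
           filter_vars='like', kind='stats')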

My main takeaways: speed is extremely important (drag); tire size was less important than I had guessed based on the Porsche example; and wet weather was more important than I had guessed (maybe equivalent to carrying an extra ton??).

As always, there are things I could improve in the model, especially with more data: better chosen, bounded priors; more prior and posterior predictive checks (I had some issues here); checking for interactions (e.g., wet*dgC); perhaps some multiplicative effects (e.g., wet weather reduces efficiency by 5% instead of subtracting 10 Wh/km). I could also maybe predict optimal conditions for maximum range, though the current model would say the optimum is on the surface of the sun with 0-inch tires.
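For instance, the wet*dgC interaction check would just be a formula change (a sketch using the same older Bambi API as above, with default priors for brevity):

model_interact = Model(df_clean)
res_interact = model_interact.fit(
    'Whk ~ 0 + ton + dgC + tir + spd + wet + wet:dgC',  # wet:dgC adds only the interaction term
    group_specific=["1|car"],
    categorical=['car', 'wet', 'spd'],
    draws=2000, chains=4
)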

I might come back to it when the database has grown a bit, but I am pretty satisfied with these results for now.

data / brms code

If anyone wants to rerun/improve upon the above, I put all the data here: ev_range.tsv, ev_weight.tsv, df_ev_range.csv, df_ev_range_clean.csv, df_ev_range_sim.csv. This could be a pretty nice dataset for teaching.

To validate (or at least reproduce) the model above, I ran the following R code. The results were virtually identical.

library('brms')
df_clean = read.csv("/tmp/df_ev_range_clean.csv")
priors = c(set_prior("student_t(3,100,50)", class = "b", coef = paste("car", sort(unique(df_clean$car)), sep="")),
           set_prior("student_t(3,50,20)", class = "b", coef = "ton"),
           set_prior("student_t(3,-1,2)", class = "b", coef = "dgC"),
           set_prior("student_t(3,10,20)", class = "b", coef = "wet"),
           set_prior("student_t(3,2,5)", class = "b", coef = "tir"),
           set_prior("student_t(3,50,30)", class = "b", coef = "spd")
          )
res_clean = brm(
    Whk ~ 0 + ton + dgC + tir + car + spd + wet + (1|car),
    data=df_clean,
    chains = 4,
    iter = 3000,
    cores = 4,
    prior=priors,
    control = list(adapt_delta = 0.95, max_treedepth = 20))
Brian Naughton // Sun 02 August 2020 // Filed under stats // Tags cgm quantifiedself pymc3 numpyro jax

This post is a numpyro-based probabilistic model of my blood glucose data (see my original blogpost for necessary background). The task: given my blood glucose readings, my heart rate data, and what I ate, can I figure out the contribution of each food to my blood glucose?

This is a classic Bayesian problem, where I have some observed data (blood glucose readings), a model for how food and heart rate influence blood glucose, and I want to find likely values for my parameters.

Normally I would use pymc3 for this kind of thing, but this was a good excuse to try out numpyro, which is built on JAX, the current favorite for automatic differentiation.

To make the problem a little more concrete, say your fasting blood glucose is 80, then you eat something and an hour later it goes to 120, and an hour after that it's back down to 90. How much blood glucose can I attribute to that food? What if I eat the same food on another day, but this time my blood glucose shoots to 150? What if I eat something else 30 minutes after eating this food so the data overlap? In theory, all of these situations can be modeled in a fairly simple probabilistic model.

Imports

I am using numpyro, and jax numpy instead of regular numpy. In most cases, jax is a direct replacement for numpy, but sometimes I need original numpy, so I keep that as onp.

#!pip install numpyro
import jax.numpy as np
from jax import random
from jax.scipy.special import logsumexp

import numpyro
import numpyro.distributions as dist
from numpyro.diagnostics import hpdi
from numpyro import handlers
from numpyro.infer import MCMC, NUTS

import numpy as onp
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

import glob
import json
import io

from tqdm.autonotebook import tqdm
from contextlib import redirect_stdout

%config InlineBackend.figure_format='retina'
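One concrete example of why original numpy is still needed in places (using the np/onp imports above): jax arrays are immutable, so item assignment goes through the .at[] interface instead.

x = np.zeros(3)         # jax.numpy array
x = x.at[0].set(1.0)    # returns a new array; x[0] = 1.0 would raise an error
y = onp.zeros(3)        # regular numpy array
y[0] = 1.0              # mutates in place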

Some simple test data

Simple test data is always a good idea for any kind of probabilistic model. If I can't make the model work with the simplest possible data, then it's best to figure that out early, before trying it on real data. I believe McElreath advocates for something similar, though I can't find a citation for this.

To run inference with my model I need three 1D arrays:

  • bg_5min: blood glucose readings, measured every 5 minutes
  • hr_5min: heart rate, measured every 5 minutes
  • fd_5min: food eaten, measured every 5 minutes
def str_to_enum(arr):
    foods = sorted(set(arr))
    assert foods[0] == '', f"{foods[0]}"
    mapk = {food:n for n, food in enumerate(foods)}
    return np.array([mapk[val] for val in arr]), foods

bg_5min_test = np.array([91, 89, 92, 90, 90, 90, 88, 90, 93, 90,
                         90, 100, 108, 114.4, 119.52, 123.616, 126.89, 119.51, 113.61, 108.88, 105.11,
                         102.089, 99.67, 97.73, 96.18, 94.95, 93.96, 93.16, 92.53, 92.02, 91.62, 91.29, 91.03, 90.83, 90.66, 90.53,
                         90, 90, 90, 90, 90, 90, 90, 89, 90, 90, 89, 90, 90, 90, 90, 90, 90, 90, 90, 95, 95, 95, 95, 100])

_fd_5min_test = ['', '', '', '', '', 'food', '', '', '', '', '', '',
                 '', '', '', '', '', '', '', '', '', '', '', '',
                 '', '', '', '', '', '', '', '', '', '', '', '',
                 '', '', '', '', 'doof', '', '', '', '', '', '', '',
                 '', '', '', '', '', '', '', '', '', '', '', '']

fd_5min_test, fd_5min_test_key = str_to_enum(_fd_5min_test)
hr_5min_test = np.array([100] * len(bg_5min_test))

bg_5min_test = np.tile(bg_5min_test, 1)
fd_5min_test = np.tile(fd_5min_test, 1)
hr_5min_test = np.tile(hr_5min_test, 1)

print(f"Example test data (len {len(bg_5min_test)}):")
print("bg", bg_5min_test[:6])
print("heart rate", hr_5min_test[:6])
print("food as enum", fd_5min_test[:6])

assert len(bg_5min_test) == len(hr_5min_test) == len(fd_5min_test)
Example test data (len 60):
bg [91. 89. 92. 90. 90. 90.]
heart rate [100 100 100 100 100 100]
food as enum [0 0 0 0 0 2]

Read in real data as a DataFrame

I have already converted all my Apple Health data to csv. Here I process the data to include naive day and date (local time) — timezones are impossible otherwise.

df_complete = pd.DataFrame()

for f in glob.glob("HK*.csv"):
    _df = (pd.read_csv(f, sep=';', skiprows=1)
             .assign(date = lambda df: pd.to_datetime(df['startdate'].apply(lambda x:x[:-6]),
                                                      infer_datetime_format=True))
             .assign(day = lambda df: df['date'].dt.date,
                     time = lambda df: df['date'].dt.time)
          )

    if 'unit' not in _df:
        _df['unit'] = _df['type']
    df_complete = pd.concat([df_complete, _df[['type', 'sourcename', 'unit', 'day', 'time', 'date', 'value']]])

# clean up the names a bit and sort
df_complete = (df_complete.assign(type = lambda df: df['type'].str.split('TypeIdentifier', expand=True)[1])
                          .sort_values(['type', 'date']))
df_complete.sample(4)

I can remove a bunch of data from the complete DataFrame.

df = (df_complete
        .loc[lambda r: ~r.type.isin({"HeadphoneAudioExposure", "BodyMass", "UVExposure", "SleepAnalysis"})]
        .loc[lambda r: ~r.sourcename.isin({"Brian Naughton’s iPhone", "Strava"})]
        .loc[lambda r: r.date >= min(df_complete.loc[lambda r: r.sourcename == 'Connect'].date)]
        .assign(value = lambda df: df.value.astype(float)))
df.groupby(['type', 'sourcename']).head(1)
type sourcename unit day time date value
ActiveEnergyBurned Connect kcal 2018-07-08 21:00:00 2018-07-08 21:00:00 8.00
BasalEnergyBurned Connect kcal 2018-07-08 21:00:00 2018-07-08 21:00:00 2053.00
BloodGlucose Dexcom G6 mg/dL 2020-02-24 22:58:54 2020-02-24 22:58:54 111.00
DistanceWalkingRunning Connect mi 2018-07-08 21:00:00 2018-07-08 21:00:00 0.027
FlightsClimbed Connect count 2018-07-08 21:00:00 2018-07-08 21:00:00 0.00
HeartRate Connect count/min 2018-07-01 18:56:00 2018-07-01 18:56:00 60.00
RestingHeartRate Connect count/min 2019-09-07 00:00:00 2019-09-07 00:00:00 45.00
StepCount Connect count 2018-07-01 19:30:00 2018-07-01 19:30:00 33.00

I have to process my food data separately, since it doesn't come from Apple Health. The information was just in a Google Sheet.

df_food = (pd.read_csv("food_data.csv", sep='\t')
             .astype(dtype = {"date":"datetime64[ns]", "food": str})
             .assign(food = lambda df: df["food"].str.split(',', expand=True)[0] )
             .loc[lambda r: ~r.food.isin({"chocolate clusters", "cheese weirds", "skinny almonds", "asparagus"})]
             .sort_values("date")
            )
df_food.sample(4)
date food
2020-03-03 14:10:00 egg spinach on brioche
2020-03-11 18:52:00 lentil soup and bread
2020-02-27 06:10:00 bulletproof coffee
2020-03-02 06:09:00 bulletproof coffee

For inference purposes, I am rounding everything to the nearest 5 minutes. This creates the three arrays the model needs.

_df_food = (df_food
              .assign(rounded_date = lambda df: df['date'].dt.round('5min'))
              .set_index('rounded_date'))

_df_hr = (df.loc[lambda r: r.type == 'HeartRate']
            .assign(rounded_date = lambda df: df['date'].dt.round('5min'))
            .groupby('rounded_date')
            .mean())

def get_food_at_datetime(datetime):
    food = (_df_food
              .loc[datetime - pd.Timedelta(minutes=1) : datetime + pd.Timedelta(minutes=1)]
              ['food'])

    if any(food):
        return sorted(set(food))[0]
    else:
        return None

def get_hr_at_datetime(datetime):
    hr = (_df_hr
            .loc[datetime - pd.Timedelta(minutes=1) : datetime + pd.Timedelta(minutes=1)]
            ['value'])
    if any(hr):
        return sorted(set(hr))[0]
    else:
        return None


df_5min = \
(df.loc[lambda r: r.type == "BloodGlucose"]
   .loc[lambda r: r.date >= pd.Timestamp(2020, 2, 25)]
   .assign(rounded_date = lambda df: df['date'].dt.round('5min'),
           day = lambda df: df['date'].dt.date,
           time = lambda df: df['date'].dt.round('5min').dt.time,
           food = lambda df: df['rounded_date'].apply(get_food_at_datetime),
           hr = lambda df: df['rounded_date'].apply(get_hr_at_datetime)
          )
)


bg_5min_real = onp.array(df_5min.pivot_table(index='day', columns='time', values='value').interpolate('linear'))
hr_5min_real = onp.array(df_5min.pivot_table(index='day', columns='time', values='hr').interpolate('linear'))
fd_5min_real = onp.array(df_5min.fillna('').pivot_table(index='day', columns='time', values='food', aggfunc=lambda x:x).fillna('')).astype(str)
assert bg_5min_real.shape == hr_5min_real.shape == fd_5min_real.shape, "array size mismatch"

print(f"Real data (len {bg_5min_real.shape}):")
print("bg", bg_5min_real[0][:6])
print("hr", hr_5min_real[0][:6])
print("food", fd_5min_real[0][:6])
Real data (len (30, 288)):
bg [102. 112. 114. 113. 112. 111.]
hr [50.33333333 42.5        42.66666667 43.         42.66666667 42.5       ]
food ['' '' '' '' '' '']

numpyro model

The model is necessarily pretty simple, given the amount of data I have to work with. Note that I use "g" as a shorthand for mg/dL, which may be confusing. The priors are:

  • baseline_mu: I have a uniform prior on my "baseline" (fasting) mean blood glucose level. If I haven't eaten, I expect my blood glucose to be distributed around this value. This is a uniform prior from 80–110 mg/dL.
  • all_food_g_5min_mu: Each food deposits a certain number of mg/dL of sugar into my blood every 5 minutes. This is a uniform prior from 8–16 mg/dL per 5 minutes.
  • all_food_delay_mu: Each food only starts depositing glucose after a delay for digestion and absorption. This is a uniform prior from 45–100 minutes (9–20 five-minute increments).
  • all_food_duration_mu: Food keeps depositing sugar into my blood for some time. Different foods may be quick spikes of glucose (sugary) or slow release (fatty). This is a uniform prior from 25–50 minutes (5–10 five-minute increments).
  • regression_rate: The body regulates glucose by releasing insulin, which pulls blood glucose back to baseline at a certain rate. I allow for three different rates, depending on my heart rate zone. Using this simple model, the higher the blood glucose, the more it regresses, so it produces a somewhat bell-shaped curve.

Using uniform priors is really not best practice, but it makes it easier to keep values in a reasonable range. I did experiment with other priors, but settled on these after failing to keep my posteriors to reasonable values. As usual, explaining the priors to the model is like asking for a wish from an obtuse genie. This is a common problem with probabilistic models, at least for me.

The observed blood glucose is then a function that uses this approximate logic:

for t in all_5_minute_increments:
  blood_glucose[t] = baseline[t] # blood glucose with no food
  for food in foods:
    if t - time_eaten[food] > food_delays[food]: # then the food has started entering the bloodstream
      if t - (time_eaten[food] + food_delays[food]) < food_duration[food]: # then food is still entering the bloodstream
        blood_glucose[t] += food_g_5min[food]

  blood_glucose[t] = blood_glucose[t] - (blood_glucose[t] - baseline[t]) * regression_rate[heart_rate_zone_at_time(t)]
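To show the shape this produces, here is a tiny standalone forward simulation of a single food with fixed made-up parameters (plain Python, not the numpyro model itself):

baseline, rate = 90.0, 0.9            # regression retains 90% of the excess each 5-min step
delay, duration, g_5min = 6, 8, 12.0  # delay/duration in 5-min increments, deposit in mg/dL
bg, excess = [], 0.0
for t in range(40):
    deposit = g_5min if delay < t <= delay + duration else 0.0
    excess = excess * rate + deposit  # regression toward baseline plus any new glucose
    bg.append(baseline + excess)
print([round(x) for x in bg])         # rises after the delay, then decays back to baseline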

This model has some nice properties:

  • every food has only three parameters: the delay, the duration, and the mg/dL per 5 minutes. The total amount of glucose (loosely defined) can be easily calculated by multiplying food_duration * food_g_5min.
  • the model only has two other parameters: the baseline glucose, and regression rate
MAX_TIMEPOINTS = 70 # effects are modeled 350 mins into the future only
DL = 288 # * 5 minutes = 1 day
SUB = '' # or a number, to subset
DAYS = [11]

%%time
def model(bg_5min, hr_5min, fd_5min, fd_5min_key):

    def zone(hr):
        if hr < 110: return "1"
        elif hr < 140: return "2"
        else: return "3"

    assert fd_5min_key[0] == '', f"{fd_5min_key[0]}"
    all_foods = fd_5min_key[1:]

    # ----------------------------------------------

    baseline_mu = numpyro.sample('baseline_mu', dist.Uniform(80, 110))

    all_food_g_5min_mu = numpyro.sample("all_food_g_5min_mu", dist.Uniform(8, 16))
    food_g_5mins = {food : numpyro.sample(f"food_g_5min_{food}",
                                          dist.Uniform(all_food_g_5min_mu-8, all_food_g_5min_mu+8))
                    for food in all_foods}

    all_food_duration_mu = numpyro.sample(f"all_food_duration_mu", dist.Uniform(5, 10))
    food_durations = {food : numpyro.sample(f"food_duration_{food}",
                                            dist.Uniform(all_food_duration_mu-3, all_food_duration_mu+3))
                      for food in all_foods}

    regression_rate = {zone : numpyro.sample(f"regression_rate_{zone}", dist.Uniform(0.7, 0.96))
                       for zone in "123"}

    all_food_delay_mu = numpyro.sample("all_food_delay_mu", dist.Uniform(9, 20))
    food_delays = {food : numpyro.sample(f"food_delay_{food}", dist.Uniform(all_food_delay_mu-4, all_food_delay_mu+4))
                   for food in all_foods}

    food_g_totals = {food : numpyro.deterministic(f"food_g_total_{food}", food_durations[food] * food_g_5mins[food])
                     for food in all_foods}

    result_mu = [0 for _ in range(len(bg_5min))]
    for j in range(1, len(result_mu)):
        if fd_5min[j] == 0:
            continue

        food = fd_5min_key[fd_5min[j]]
        for _j in range(min(MAX_TIMEPOINTS, len(result_mu)-j)):
            result_mu[j+_j] = numpyro.deterministic(
                f"add_{food}_{j}_{_j}",
                (result_mu[j+_j-1] * regression_rate[zone(hr_5min[j+_j])]
                 + np.where(_j > food_delays[food],
                            np.where(food_delays[food] + food_durations[food] - _j < 0, 0, 1) * food_g_5mins[food],
                            0)
                )
            )


    for j in range(len(result_mu)):
        result_mu[j] = numpyro.deterministic(f"result_mu_{j}", baseline_mu + result_mu[j])

    # observations
    obs = [numpyro.sample(f'result_{i}', dist.Normal(result_mu[i], 5), obs=bg_5min[i])
           for i in range(len(result_mu))]

def make_model(bg_5min, hr_5min, fd_5min, fd_5min_key):
    from functools import partial
    return partial(model, bg_5min, hr_5min, fd_5min, fd_5min_key)

Sample and store results

Because the process is slow, I usually only run one day at a time, and store all the samples in json files. It would be much better to run inference on all days simultaneously, especially since I have data for the same food on different days. Unfortunately, this turned out to take way too long.

assert len(bg_5min_real.ravel())/DL == 30, len(bg_5min_real.ravel())/DL

for d in DAYS:
    print(f"day {d}")
    bg_5min = bg_5min_real.ravel()[int(DL*d):int(DL*(d+1))]
    hr_5min = hr_5min_real.ravel()[int(DL*d):int(DL*(d+1))]
    _fd_5min = fd_5min_real.ravel()[int(DL*d):int(DL*(d+1))]
    fd_5min, fd_5min_key = str_to_enum(_fd_5min)

    #
    # subset it for testing
    #
    if SUB != '':
        bg_5min, hr_5min, fd_5min = bg_5min[-SUB*3:-SUB*1], hr_5min[-SUB*3:-SUB*1], fd_5min[-SUB*3:-SUB*1]

    open(f"bg_5min_{d}{SUB}.json",'w').write(json.dumps(list(bg_5min)))
    open(f"hr_5min_{d}{SUB}.json",'w').write(json.dumps(list(hr_5min)))
    open(f"fd_5min_key_{d}{SUB}.json",'w').write(json.dumps(fd_5min_key))
    open(f"fd_5min_{d}{SUB}.json",'w').write(json.dumps([int(ix) for ix in fd_5min]))

    rng_key = random.PRNGKey(0)
    rng_key, rng_key_ = random.split(rng_key)
    num_warmup, num_samples = 500, 5500

    kernel = NUTS(make_model(bg_5min, hr_5min, fd_5min, fd_5min_key))
    mcmc = MCMC(kernel, num_warmup=num_warmup, num_samples=num_samples)
    mcmc.run(rng_key_)
    mcmc.print_summary()
    samples_1 = mcmc.get_samples()
    print(d, sorted([(float(samples_1[food_type].mean()), food_type) for food_type in samples_1 if "total" in food_type], reverse=True))
    print(d, sorted([(float(samples_1[food_type].mean()), food_type) for food_type in samples_1 if "delay" in food_type], reverse=True))
    print(d, sorted([(float(samples_1[food_type].mean()), food_type) for food_type in samples_1 if "duration" in food_type], reverse=True))

    open(f"samples_{d}{SUB}.json", 'w').write(json.dumps({k:[round(float(v),4) for v in vs] for k,vs in samples_1.items()}))

    print_io = io.StringIO()
    with redirect_stdout(print_io):
        mcmc.print_summary()
    open(f"summary_{d}{SUB}.json", 'w').write(print_io.getvalue())
day 11


sample: 100%|██████████| 6000/6000 [48:58<00:00,  2.04it/s, 1023 steps of size 3.29e-04. acc. prob=0.81]



                                             mean       std    median      5.0%     95.0%     n_eff     r_hat
                      all_food_delay_mu     10.39      0.29     10.40      9.95     10.86     10.32      1.04
                   all_food_duration_mu      9.22      0.20      9.22      8.93      9.51      2.77      2.28
                     all_food_g_5min_mu      9.81      0.37      9.78      9.16     10.32      5.55      1.44
                            baseline_mu     89.49      0.37     89.47     88.89     90.09      9.89      1.00
          food_delay_bulletproof coffee      7.59      0.86      7.51      6.37      8.98     23.44      1.02
      food_delay_egg spinach on brioche     13.75      0.41     13.92     13.04     14.26      4.81      1.08
              food_delay_milk chocolate      6.70      0.21      6.74      6.38      7.00     17.39      1.00
     food_delay_vegetables and potatoes     12.03      0.93     12.15     10.52     13.44     16.36      1.08
       food_duration_bulletproof coffee      9.92      1.36      9.72      8.27     12.09      2.72      2.68
   food_duration_egg spinach on brioche     11.42      0.34     11.42     10.94     11.93      2.51      2.59
           food_duration_milk chocolate      7.01      0.22      6.98      6.58      7.34     16.01      1.00
  food_duration_vegetables and potatoes      8.56      1.40      8.52      6.55     10.43      4.89      1.37
         food_g_5min_bulletproof coffee      2.45      0.38      2.41      1.86      3.01      7.07      1.58
     food_g_5min_egg spinach on brioche      7.38      0.33      7.39      6.86      7.92      9.65      1.36
             food_g_5min_milk chocolate     11.38      0.38     11.40     10.70     11.94     21.69      1.06
    food_g_5min_vegetables and potatoes     14.46      2.99     15.55      9.90     18.17     10.31      1.17
                      regression_rate_1      0.93      0.00      0.93      0.92      0.93     24.68      1.04
                      regression_rate_2      0.74      0.03      0.73      0.71      0.78      3.68      1.69
                      regression_rate_3      0.78      0.03      0.77      0.73      0.82      4.28      1.60

Plot results

Plotting the data in a sensible way is a little bit tricky. Hopefully the plot is mostly self-explanatory, but note that baseline is set to zero, the y-axis is on a log scale, and the x-axis is midnight to midnight, in 5 minute increments.

def get_plot_sums(samples, bg_5min, hr_5min, fd_5min_key):
    samples = {k: onp.array(v) for k,v in samples.items()}
    bg_5min = onp.array(bg_5min)
    hr_5min = onp.array(hr_5min)

    means = []
    for n in tqdm(range(len(bg_5min))):
        means.append({"baseline": np.mean(samples[f"baseline_mu"])})
        for _j in range(MAX_TIMEPOINTS):
            for food in fd_5min_key:
                if f"add_{food}_{n-_j}_{_j}" in samples:
                    means[n][food] = np.mean(samples[f"add_{food}_{n-_j}_{_j}"])

    plot_data = []
    ordered_foods = ['baseline'] + sorted({k for d in means for k, v in d.items() if k != 'baseline'})
    for bg, d in zip(bg_5min, means):
        tot = 0
        plot_data.append([])
        for food in ordered_foods:
            if food in d and d[food] > 0.1:
                plot_data[-1].append(round(tot + float(d[food]), 2))
                if food != 'baseline':
                    tot += float(d[food])
            else:
                plot_data[-1].append(0)
    print("ordered_foods", ordered_foods)
    return pd.DataFrame(plot_data, columns=ordered_foods)
for n in DAYS:
    print(open(f"summary_{n}{SUB}.json").read())
    samples = json.load(open(f"samples_{n}{SUB}.json"))
    bg_5min = json.load(open(f"bg_5min_{n}{SUB}.json"))
    hr_5min = json.load(open(f"hr_5min_{n}{SUB}.json"))
    fd_5min = json.load(open(f"fd_5min_{n}{SUB}.json"))
    fd_5min_key = json.load(open(f"fd_5min_key_{n}.json"))

    print("fd_5min_key", fd_5min_key)
    plot_data = get_plot_sums(samples, bg_5min, hr_5min, fd_5min_key)
    plot_data['real'] = bg_5min
    plot_data['heartrate'] = [hr for hr in hr_5min]

    baseline = {int(i) for i in plot_data['baseline']}
    assert len(baseline) == 1
    baseline = list(baseline)[0]
    log_plot_data = plot_data.drop('baseline', axis=1)
    log_plot_data['real'] -= plot_data['baseline']

    display(log_plot_data);

    f, ax = plt.subplots(figsize=(16,6));
    ax = sns.lineplot(data=log_plot_data.rename(columns={"heartrate":"hr"})
                                        .drop('hr', axis=1),
                      dashes=False,
                      ax=ax);
    ax.set_ylim(-50, 120);
    ax.set_title(f"day {n} baseline {baseline}");
    ax.lines[len(log_plot_data.columns)-2].set_linestyle("--");
    f.savefig(f'food_model_{n}{SUB}.png');
    f.savefig(f'food_model_{n}{SUB}.svg');

[plot: per-food contributions to blood glucose vs the real trace for day 11]

Total blood glucose

Finally, I can rank each food by its total blood glucose contribution. Usually, this matches expectations, and occasionally it's way off. Here it looks pretty reasonable: the largest meal by far contributed the largest total amount, and the milk chocolate had the shortest delay.

from pprint import pprint
print("### total grams of glucose")
pprint(sorted([(round(float(samples_1[food_type].mean()), 2), food_type) for food_type in samples_1 if "total" in food_type], reverse=True))
print("### food delays")
pprint(sorted([(round(float(samples_1[food_type].mean()), 2), food_type) for food_type in samples_1 if "delay" in food_type], reverse=True))
print("### food durations")
pprint(sorted([(round(float(samples_1[food_type].mean()), 2), food_type) for food_type in samples_1 if "duration" in food_type], reverse=True))
### total blood glucose contribution
[(123.52, 'food_g_total_vegetables and potatoes'),
 (84.24, 'food_g_total_egg spinach on brioche'),
 (79.69, 'food_g_total_milk chocolate'),
 (23.88, 'food_g_total_bulletproof coffee')]
### food delays
[(13.75, 'food_delay_egg spinach on brioche'),
 (12.03, 'food_delay_vegetables and potatoes'),
 (10.39, 'all_food_delay_mu'),
 (7.59, 'food_delay_bulletproof coffee'),
 (6.7, 'food_delay_milk chocolate')]
### food durations
[(11.42, 'food_duration_egg spinach on brioche'),
 (9.92, 'food_duration_bulletproof coffee'),
 (9.22, 'all_food_duration_mu'),
 (8.56, 'food_duration_vegetables and potatoes'),
 (7.01, 'food_duration_milk chocolate')]

Conclusions

It sort of works. For example, day 11 looks pretty good! Unfortunately, when it's wrong, it can be pretty far off, so I am not sure how useful it is without making the model more robust. (For example, note that the r_hat is often out of range.)

The parameterization of the model is pretty unsatisfying. To make the model hang together I have to constrain the priors to specific ranges. I could probably grid-search the whole space pretty quickly to get to a maximum posterior here.

I think if I had a year of data instead of a month, this could get pretty interesting. I would also need to figure out how to sample quicker, which should be possible given the simplicity of the model. With the current implementation I am unable to sample even a month of data at once.

Finally, here is the complete set of plots I generated, to see the range of results.

[plots: one per day, in the same format as the day 11 plot above]

Brian Naughton // Sat 04 July 2020 // Filed under sequencing // Tags metrology dnasequencing genetics dna aqi

Using a GeneQuant to measure water purity

Brian Naughton // Sun 19 April 2020 // Filed under health // Tags quantifiedself health cgm

Using a CGM to test the effect of foods on blood glucose.

Brian Naughton // Sat 27 July 2019 // Filed under biotech // Tags sequencing flongle nanopore homelab biotech

DNA sequencing at home using Oxford Nanopore's new flongle sequencer.

Brian Naughton // Sun 02 June 2019 // Filed under datascience // Tags datascience cloud gcp aws bioinformatics

A brief comparison of three bioinformatics workflow tools: nextflow, snakemake, reflow.

Brian Naughton // Sat 23 March 2019 // Filed under datascience // Tags datascience cloud gcp aws

A list of tools for working remotely on a cloud instance.

Brian Naughton // Sun 20 January 2019 // Filed under deeplearning // Tags deeplearning datascience neuralnet

Using several neural nets to create an instagram celebrity, Jake Dangerback.

Brian Naughton // Sun 11 November 2018 // Filed under sequencing // Tags biotech sequencing dna

Sequencing gap

