Brian Naughton | Sun 03 October 2021 | datascience | statistics datascience

I really like resampling methods for data analysis, and I think they are under-utilized. The main thing these techniques have in common is that instead of an analytical solution, they make a computer repeatedly analyze variations of the data. Examples of resampling methods include permutation tests, the bootstrap, and cross-validation.

Cross-validation in particular seems to have really taken over model selection over the last ten years or so (e.g., compared to the more "statistical" AIC or BIC). There may be an argument here that this is due to non-statisticians needing statistical tests (e.g., in machine learning) so they naturally gravitate to using computational power instead of statistical expertise.

In biology, there are a few tests that come up a lot, but the main one is probably testing whether distribution A is different from distribution B (e.g., experiment vs control). This is usually done with a t-test, but can also be done — with some advantages — by a permutation test. I looked for a post on this topic but could not find one that covered exactly what I was interested in, so this article shows some data on it.


The t-test

I would guess most biologists use t-tests occasionally, as the simplest way to check if an experimental result differs from a control. I would also argue that most people who use the t-test don't really know the main assumptions:

  • the distributions are normal
  • the variances of the distributions are equal
  • the sample size is large

If you violate any of these assumptions, you are supposed to consider another test.

If distributions are not normal

We usually assume that if a distribution is not normal, it's some unknown distribution. In that case you could use a non-parametric test, like the Mann–Whitney U test. Of course, no real data is perfectly normally distributed, so it's a bit unclear where to draw the line here.

If variances are unequal

If the variances are unequal, Welch's t-test is recommended. Welch's t-test is "an adaptation of Student's t-test, and is more reliable when the two samples have unequal variances and/or unequal sample sizes." I'm not sure I've ever seen this test used in a biology paper, though variances are very often unequal. For example, any time you are counting things (colonies, etc.), the data likely follow a Poisson distribution, where the variance equals the mean.
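For illustration, here is a minimal sketch (with made-up Poisson "colony counts") of how scipy exposes both tests via ttest_ind's equal_var flag:

from scipy.stats import poisson, ttest_ind

# hypothetical colony counts: for Poisson data the variance equals the mean,
# so the two groups have unequal variances
control = poisson(5).rvs(30)
treated = poisson(20).rvs(30)

print(ttest_ind(control, treated).pvalue)                   # Student's t-test (assumes equal variances)
print(ttest_ind(control, treated, equal_var=False).pvalue)  # Welch's t-test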

If sample size is small

It's a bit of a grey area how few samples count as "small", but some sources say below 30. That is actually a pretty large number for many biological experiments!

For example, if I do a quick review of the first four papers that pop up for the search "xenograft tumor treatment t-test":

In order to publish, the authors of these papers need to be able to report a p-value, and the t-test is the most convenient and accepted tool.

Violations

It's possible that the above violations of the assumptions are just fine, and the stats work out anyway. Still, the lack of clarity on what counts as a violation seems to defeat the purpose of having the rules, and a permutation test might offer a safer approach.

Permutation Test

The general problem below is to determine whether two distributions, with means of 0 (red) and diff (yellow), are different.

from scipy.stats import norm, laplace, ttest_ind, median_test, mannwhitneyu, pearsonr, linregress
from matplotlib import pyplot as plt
from IPython.display import SVG
import seaborn as sns
import pandas as pd
import numpy as np

from tqdm.auto import tqdm
tqdm.pandas()
f, ax = plt.subplots(figsize=(12,4))
# two normal samples of size 100, with means 0 and diff
diff = 1
dist1 = norm(0, 1).rvs(100)
dist2 = norm(diff, 1).rvs(100)
sns.histplot(dist1, color='r', bins=40);
sns.histplot(dist2, color='y', bins=40);

[figure: overlaid histograms of dist1 (red, mean 0) and dist2 (yellow, mean diff)]

This is the code for a permutation test, for use in place of a one- or two-sided t-test, Welch's t-test, etc. In all examples below I will use a two-sided test, since this is more commonly used than one-sided. (Although to me, one-sided makes more sense most of the time!)

def ptest_ind(dist1, dist2, fn=np.mean, N=30000, alternative="two-sided"):
    """permutation test, matching scipy.ttest_ind syntax"""
    from collections import namedtuple
    from warnings import warn

    PTResult = namedtuple("PTResult", ["stat", "pvalue"])

    new_dists = np.concatenate([dist1, dist2])
    test = fn(dist1) - fn(dist2)

    res = []
    for _ in range(N):
        np.random.shuffle(new_dists)
        ndist1 = new_dists[:len(dist1)]
        ndist2 = new_dists[len(dist1):]
        res.append(test > fn(ndist1) - fn(ndist2))

    stat = sum(res) / N

    if alternative == "less":
        pvalue = stat
    elif alternative == "greater":
        pvalue = 1 - stat
    elif alternative == "two-sided":
        pvalue = 2*min(stat, 1-stat)
    else:
        raise ValueError(f"{alternative=}")

    if pvalue == 0:
        warn(f"permutation test sample limit reached: <1/{N}")
        pvalue = 1.0/N

    return PTResult(stat, pvalue)
def plot_corr(df, x, y, min_xy):
    """Plot the correlation between two tests, and report Pearson's r"""
    lx, ly = f"log_{x}", f"log_{y}"
    _df = (df
             .replace([np.inf, -np.inf], np.nan)
             .dropna()
             .assign(**{lx: lambda df: df[x].apply(np.log10), ly: lambda df: df[y].apply(np.log10)}))

    res = linregress(_df[ly], _df[lx])

    f, ax = plt.subplots(figsize=(6,6))
    sns.scatterplot(data=_df, x=lx, y=ly, ax=ax);
    sns.lineplot(x=np.array([0, min_xy]), y=res.intercept + res.slope*np.array([0, min_xy]), color='r', linewidth=0.5, ax=ax)
    ax.set_xlim(min_xy, 0);
    ax.set_ylim(min_xy, 0);
    ax.axline((min_xy, min_xy), (0, 0), linewidth=0.5, color='g');

    return res.rvalue

t-test with normal distributions

This is the simplest case: comparing two equal variance, high sample size, normal distributions. Here we see an almost perfect correspondence between the results of the t-test and permutation test. This is good, since it shows the permutation test is not reporting biased (e.g., too low) p-values.

sample_size = 30

rows = []
for diff in tqdm(np.tile(np.arange(0, 0.4, 0.02), 1)):
    dist1 = norm(0, 1).rvs(sample_size)
    dist2 = norm(diff, 1).rvs(sample_size)

    tt = ttest_ind(dist1, dist2, alternative="two-sided").pvalue
    pt = ptest_ind(dist1, dist2, alternative="two-sided").pvalue

    rows.append((tt, pt))

df_rows = pd.DataFrame(rows, columns=["tt", "pt"])
print(f't-test vs permutation r: {plot_corr(df_rows, "tt", "pt", -3):.3f}')
t-test vs permutation r: 1.000

[figure: log10 p-values, t-test vs permutation test, with regression and diagonal reference lines]

t-test with normal distributions, but low sample size

What happens to the results of a t-test when the number of samples is low? Here, the low sample size results in some noise that decreases the correlation, but it still seems to work very similarly to the permutation test, even with just 3 samples (despite the regression line below).

sample_size = 3

rows = []
for diff in tqdm(np.tile(np.arange(0, 0.6, 0.03), 5)):
    dist1 = norm(0, 1).rvs(sample_size)
    dist2 = norm(diff, 1).rvs(sample_size)

    tt = ttest_ind(dist1, dist2, alternative="two-sided").pvalue
    pt = ptest_ind(dist1, dist2, alternative="two-sided").pvalue

    rows.append((tt, pt))

df_rows = pd.DataFrame(rows, columns=["tt", "pt"])
print(f't-test vs permutation r: {plot_corr(df_rows, "tt", "pt", -2):.3f}')
/tmp/ipykernel_2790/1233266942.py:30: UserWarning: permutation test sample limit reached: <1/30000
  warn(f"permutation test sample limit reached: <1/{N}")


t-test vs permutation r: 0.760

[figure: log10 p-values, t-test vs permutation test, sample size 3]

t-test with normal distributions, but unequal variances

scipy's ttest_ind has an option to use Welch's t-test instead of Student's t-test. Again, the results are almost identical between Welch's t-test and the permutation test.

sample_size = 30

rows = []
for diff in tqdm(np.tile(np.arange(0, 1.5, 0.075), 5)):
    dist1 = norm(0, 1).rvs(sample_size)
    dist2 = norm(diff, 2).rvs(sample_size)

    tt = ttest_ind(dist1, dist2, alternative='two-sided', equal_var=False).pvalue
    pt = ptest_ind(dist1, dist2, alternative="two-sided").pvalue

    rows.append((tt, pt))

df_rows = pd.DataFrame(rows, columns=["tt", "pt"])
print(f't-test vs permutation r: {plot_corr(df_rows, "tt", "pt", -3):.3f}')
/tmp/ipykernel_2790/1233266942.py:30: UserWarning: permutation test sample limit reached: <1/30000
  warn(f"permutation test sample limit reached: <1/{N}")


t-test vs permutation r: 0.967

[figure: log10 p-values, Welch's t-test vs permutation test]

t-test with non-normal distributions

In this case, the Mann-Whitney U test is usually recommended instead of the t-test. The Mann-Whitney U test "is a nonparametric test of the null hypothesis that, for randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X." However, two distributions can have equal medians but still differ in shape, in which case Mann-Whitney may return a significant p-value.

Another option is to directly test for a difference in medians, with Mood's median test.

I'm not sure why the permutation test and median test results don't match more closely here, though the median test only produces a few possible p-values. Still, the results are quite close, and appear unbiased. The correlation between the permutation tests and Mann-Whitney U test is also very high.

sample_size = 30

rows = []
for diff in tqdm(np.tile(np.arange(0, 0.5, 0.025), 5)):
    dist1 = laplace(0, 1).rvs(sample_size)
    dist2 = laplace(diff, 1).rvs(sample_size)

    mt = median_test(dist1, dist2, ties="ignore")[1] # ugh, inconsistent scipy syntax
    ut = mannwhitneyu(dist1, dist2, alternative="two-sided")[1] # ugh, inconsistent scipy syntax
    pt = ptest_ind(dist1, dist2, fn=np.median, alternative="two-sided").pvalue

    rows.append((mt, ut, pt))

df_rows = pd.DataFrame(rows, columns=["mt", "ut", "pt"])
print(f'median test vs permutation r:  {plot_corr(df_rows, "mt", "pt", -3):.3f}')
print(f'mann-whitney vs permutation r: {plot_corr(df_rows, "ut", "pt", -3):.3f}')
print(f'mann-whitney vs median test r: {plot_corr(df_rows, "ut", "mt", -3):.3f}')
median test vs permutation r:  0.863
mann-whitney vs permutation r: 0.920
mann-whitney vs median test r: 0.873

[figures: pairwise log10 p-value comparisons for the median test, Mann-Whitney, and permutation test]

t-test with normal distributions, but one outlier sample

A common problem with real-world datasets is outlier datapoints that greatly skew the distribution and hence skew p-values. Removing or accounting for outliers increases the robustness of a test, i.e., makes the result less sensitive to an erroneous reading.

It's interesting that the t-test is reasonably well correlated with the other two tests, but way off the diagonal.

sample_size = 30

rows = []
for diff in tqdm(np.tile(np.arange(0, 0.5, 0.025), 5)):
    dist1 = np.concatenate([norm(0, 1).rvs(sample_size), np.array([0])])
    dist2 = np.concatenate([norm(diff, 1).rvs(sample_size), np.array([100])])

    mt = mannwhitneyu(dist1, dist2, alternative='two-sided').pvalue
    tt = ttest_ind(dist1, dist2, alternative="two-sided").pvalue
    pt = ptest_ind(dist1, dist2, fn=np.median, alternative="two-sided").pvalue

    rows.append((mt, tt, pt))

df_rows = pd.DataFrame(rows, columns=["mt", "tt", "pt"])

print(f'mann-whitney vs permutation r: {plot_corr(df_rows, "mt", "pt", -3):.3f}')
print(f't-test vs mann-whitney r:      {plot_corr(df_rows, "tt", "mt", -3):.3f}')
print(f't-test vs permutation r:       {plot_corr(df_rows, "tt", "pt", -3):.3f}')
mann-whitney vs permutation r: 0.923
t-test vs mann-whitney r:      0.926
t-test vs permutation r:       0.835

[figures: pairwise log10 p-value comparisons for Mann-Whitney, t-test, and permutation test with an outlier present]

Conclusion

In general, the permutation test gives you the same answer you would get if you chose the "right" test for your task, but you have to spend less effort figuring out what the right test is.

There is a very direct connection between the test you want and the calculation performed. By contrast, with the Mann-Whitney U test it's a little unclear what is actually being tested, beyond being "the non-parametric alternative to a t-test".
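For example, here is a quick sketch (reusing the ptest_ind function defined above) of testing for a difference in 75th percentiles rather than means, just by swapping the statistic:

import numpy as np
from scipy.stats import norm

dist1 = norm(0, 1).rvs(30)
dist2 = norm(0.5, 1).rvs(30)

# same permutation machinery, different statistic
print(ptest_ind(dist1, dist2, fn=lambda d: np.percentile(d, 75)).pvalue)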

The permutation test is also robust in a way that serves most use-cases. Outliers are handled conservatively, and there are few or no assumptions to violate.

Disadvantages

There are some disadvantages too, though I think they are relatively minor.

  • Regular t-tests do pretty well in my tests above, even with only 3 samples! The only major problem is with outliers, which can imply extremely wide distributions if taken at face-value.
  • Permutation tests can be slow, so if you are doing thousands of tests, or you need precise, very low p-values, then they will likely not be suitable.
  • You have to know precisely what you are testing. For example, "are the medians of these two distributions identical?". This is mostly an advantage in that it's more explicit.
  • You get more statistical power if you use the test that includes the correct assumptions. This seems to be a very minor effect in my tests above.

Overall, you can't go too far wrong with permutation tests since they are so simple. If nothing else, they are a good way to sanity check results from other tests with more assumptions.

Brian Naughton | Sat 02 January 2021 | datascience | metrology datascience

YouTuber Tesla Bjorn has a great channel reviewing electric cars. Unusually, he tests the range of the cars in a consistent and rigorous way, and posts all his results to a nicely organized Google Sheet.

One interesting thing I picked up from his channel is that the Porsche Taycan has much worse range than a Tesla with its stock tires, but excellent range of 360mi/580km with smaller, more efficient tires (and better weather). One obvious question I had was: how much do tire size and temperature affect range? Since the data was all there, I decided to take a look.

This was also a good excuse to play with brms, a Bayesian multilevel modeling tool for R. I won't explain multilevel modeling here, since others have done a much better job at this than I could, but at least in my mind it's like regular regression, except that it can incorporate information on subgroups and subgroup relationships. McElreath argues it should just be the default way to do regression, since the downsides are minimal. This blogpost makes a similar case, and helpfully notes that there is evidence you should generally "use the 'maximal' random effects structure supported by the design of the experiment". I take this to mean we should not worry too much about overfitting the model.


Bambi vs brms

Even though this exercise was an excuse to try brms, I ended up mostly using Bambi. They are very similar tools, except Bambi is built on Python/PyMC3 and brms is built on R/Stan (pybrms also exists but I could not get it to work). I am assuming Bambi was influenced by brms too.

They both follow very similar syntax to widely-used R regression tools like lme4. Using Bambi also lets me use the excellent Arviz plotting package, and I can still use the more battle-tested brms to confirm my results.

Simulating data

Simulated data is always useful for these models. It really helps you get comfortable with the process generating the data, and it lets you make sure the model can produce reasonable results under ideal conditions.

My model is an attempt to find the "base efficiency" of each car. Each car has a total tested efficiency, measured in Wh/km, which you can easily calculate based on the range (km) and the battery size (Wh).
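For example (illustrative numbers, not values from the spreadsheet):

# tested efficiency from range and battery size; these numbers are made up
battery_Wh = 93_000            # hypothetical 93 kWh usable battery
range_km = 580
Whk = battery_Wh / range_km    # ~160 Wh/km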

In my model, a car's measured Wh/km (Whk) is the sum of effects due to:

  • the car's base efficiency (car)
  • the weight of the car in tons (ton)
  • the ambient temperature (dgC)
  • the tire size in inches (tir)
  • the surface being wet or dry (wet)
  • the tested speed (spd — Tesla Bjorn tests every vehicle at 90km/h and 120km/h).

The simulated data below match the real data pretty well.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

num_cars = 10
mu_Whk = 100
mu_dgC = 2
mu_wet = 10
mu_tir = 4
mu_spd = 50
mu_wgt = 2
mu_ton = 50

# parameters of each car
has_Whk = np.arange(50, 150, (150-50)/num_cars)
has_ton = np.random.normal(mu_wgt, mu_wgt/5, num_cars).clip(mu_wgt/2)
per_ton = np.random.normal(mu_ton, mu_ton/5, num_cars).clip(mu_ton/5)
per_dgC = np.random.normal(mu_dgC, mu_dgC/5, num_cars).clip(mu_dgC/5) * -1
per_wet = np.random.normal(mu_wet, mu_wet/5, num_cars).clip(mu_wet/5)
per_tir = np.random.normal(mu_tir, mu_tir/5, num_cars).clip(mu_tir/5)
per_spd = np.random.normal(mu_spd, mu_spd/5, num_cars).clip(mu_spd/5)

num_samples = 100
data = []
for _ in range(num_samples):
    car = np.random.choice(num_cars)
    dgC = np.random.normal(12, 4)
    wet = int(np.random.random() > 0.75)
    tir = max(0, int(np.random.normal(18,1)) - 16)
    for spd in [0, 1]:
        Whk = (has_Whk[car] + per_ton[car] * has_ton[car] + per_tir[car] * tir +
               (per_spd[car] if spd else 0) + (per_wet[car] if wet else 0))
        data.append([f"{car:02d}", has_ton[car], dgC, wet, tir, spd, Whk, has_Whk[car]])

df_sim = pd.DataFrame(data, columns=['car', 'ton', 'dgC', 'wet', 'tir', 'spd', 'Whk', 'has_Whk'])
assert len(set(df_sim.car)) == num_cars, "unlucky sample"
df_sim.to_csv("/tmp/df_ev_range_sim.csv")
df_sim.head()
    car ton         dgC         wet tir spd Whk         has_Whk
0   12  2.386572    15.096731   0   2   0   229.106858  110.0
1   12  2.386572    15.096731   0   2   1   293.139807  110.0
2   10  2.083366    12.632114   0   1   0   162.626350  100.0
3   10  2.083366    12.632114   0   1   1   221.408095  100.0
4   19  1.848348    13.203525   0   2   0   256.221089  145.0

Now I can build my model. The model is pretty simple: all priors are just T distributions (a bit fatter-tailed than normal). I originally bounded some of the priors (since e.g., a car cannot use <0 Wh/km), but since I couldn't get that to work with brms, I removed the bounds to make comparisons easier. (The bounds did not materially change the results).

I have a group per car (1|car), and although I could also have groups for (1|spd) and (1|wet), I opted not to, since it didn't change the results but made the interpretation a little messier. I set the intercept to 0 so that the 1|car intercept can represent the base efficiency by itself.

import arviz as az
from bambi import Model, Prior

prior_car = Prior('StudentT', nu=3, mu=100, sigma=50)
prior_ton = Prior('StudentT', nu=3, mu=50, sigma=10)
prior_dgC = Prior('StudentT', nu=3, mu=-1, sigma=1)
prior_wet = Prior('StudentT', nu=3, mu=10, sigma=20)
prior_tir = Prior('StudentT', nu=3, mu=2, sigma=5)
prior_spd = Prior('StudentT', nu=3, mu=50, sigma=20)

model_sim = Model(df_sim)
print(f"{len(df_sim)} samples")
res_sim = model_sim.fit(
    'Whk ~ 0 + ton + dgC + tir + spd + wet',
    group_specific=["1|car"],
    categorical=['car'],
    priors = {'car': prior_car, "ton": prior_ton, "dgC": prior_dgC,
              "wet": prior_wet, "tir": prior_tir, "spd": prior_spd},
    draws=1000, chains=4
)
az.plot_trace(res_sim)
az.summary(res_sim)

The graph below shows that I can recover the base efficiency fairly well, though obviously there are limits to what can be done.

Real data

Now that this all looks good, I can try to model the real data.

import pandas as pd

def get_car(carstr):
    if carstr.lower().startswith('model'):
        carstr = 'Tesla ' + carstr
    if carstr.lower().startswith('tesla'):
        return '_'.join(carstr.split(' ')[:3]).replace('-','').lower()
    else:
        return '_'.join(carstr.split(' ')[:2]).replace('-','').lower()

# using google sheets data downloaded to tsv
df_wt = (pd.read_csv("/tmp/ev_weight.tsv", sep='\t'))
df_real = (pd.read_csv("/tmp/ev_range.tsv", sep='\t')
             .merge(df_wt, on='Car')
             .rename(columns={'Total':'weight'})
             .assign(tire_size=lambda df: df["Wheel front"].str.extract(r'\-(\d\d)'))
             .dropna(subset=['Wh/km', 'tire_size', 'weight'])
             .assign(tire_size=lambda df: df.tire_size.astype(int))
             .assign(Temp=lambda df: df.Temp.str.replace(u"\u2212",'-').astype(float))
          )

df_clean = (
df_real
    .assign(car = lambda df: df.Car.apply(get_car))
    .assign(Whk = lambda df: df['Wh/km'])
    .assign(ton = lambda df: df.weight/1000)
    .assign(wet = lambda df: (df.Surface=="Wet").astype(int))
    .assign(dgC = lambda df: df.Temp)
    .assign(tir = lambda df: (df.tire_size-14).astype(float))
    .assign(spd = lambda df: (df.Speed==120).astype(int))
    .loc[lambda df: ~df.car.str.startswith('mitsubishi')]
    .loc[lambda df: ~df.car.str.startswith('maxus')]
    [["Whk", "car", "tir", "ton", "wet", "spd", "dgC"]]
    .dropna()
)
df_real.to_csv("/tmp/df_ev_range.csv")
df_clean.to_csv("/tmp/df_ev_range_clean.csv")
prior_car = Prior('StudentT', nu=3, mu=100, sigma=50)
prior_ton = Prior('StudentT', nu=3, mu=50, sigma=20)
prior_dgC = Prior('StudentT', nu=3, mu=-1, sigma=2)
prior_wet = Prior('StudentT', nu=3, mu=10, sigma=20)
prior_tir = Prior('StudentT', nu=3, mu=2, sigma=5)
prior_spd = Prior('StudentT', nu=3, mu=50, sigma=30)

model_clean = Model(df_clean)
print(f"{len(df_clean)} samples")
res_clean = model_clean.fit(
    'Whk ~ 0 + ton + dgC + tir + spd + wet',
    group_specific=["1|car"],
    categorical=['car', 'wet', 'spd'],
    priors = {'car': prior_car, "ton": prior_ton, "dgC": prior_dgC,
              "wet": prior_wet, "tir": prior_tir, "spd": prior_spd},
    draws=2000, chains=4
)

car_means = res_clean.posterior['1|car'].mean(dim=("chain", "draw"))
res_clean.posterior = res_clean.posterior.sortby(-car_means)
g = az.plot_forest(res_clean.posterior, kind='ridgeplot', var_names=['1|car'], combined=True,
                   colors='#eeeeee', hdi_prob=0.999);
g[0].set_xlabel('Wh/km');

Results

There is plenty of overlap in these posteriors, but they do show that e.g., Tesla, Hyundai and Porsche seem to have good base efficiency, while Jaguar, Audi, and Volvo appear to have poor efficiency.

We can also look at the estimated range for each car at 90km/h given dry weather, 18" tires, 25C and a 100kWh battery. (Caveat: I don't know how to adjust the range for additional battery weight, so there is an unfair advantage for cars with small batteries.) Still, these results show that once battery density increases a bit, several cars may be able to hit the 400mi/644km range.
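The conversion from efficiency to range is simple; here is a sketch with made-up numbers, assuming the 100kWh battery above:

# range from predicted efficiency and battery size; the efficiency value is hypothetical
battery_Wh = 100_000                   # the 100 kWh battery assumed above
predicted_Whk = 160                    # e.g., a posterior mean for some car
range_km = battery_Wh / predicted_Whk  # 625 km, ~388 mi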

It was also interesting to discover the contributions of other factors, though the numbers below are just +/-1 standard deviation and should be taken with plenty of salt:

  • traveling at 120km/h adds 60–70 Wh/km compared to 90km/h;
  • each degree C colder adds 0.5–1.25 Wh/km;
  • each ton weight adds 20–40 Wh/km;
  • each inch of tire adds 2–6 Wh/km;
  • wet weather adds 10–20 Wh/km.

My main takeaways were that speed is extremely important (drag); tire size was less important than I had guessed based on the Porsche example; and wet weather was more important than I had guessed (maybe equivalent to carrying an extra ton??).

As always, there are things I could improve in the model, especially with more data: better chosen, bounded priors; more prior and posterior predictive checks (I had some issues here); checking for interactions (e.g., wet*dgC); perhaps some multiplicative effects (e.g., wet weather reduces efficiency by 5% instead of subtracting 10 Wh/km). I could also maybe predict the optimal conditions for maximum range, though the current model would presumably say the optimum is on the surface of the sun with 0-inch tires.

I might come back to it when the database has grown a bit, but I am pretty satisfied with these results for now.

data / brms code

If anyone wants to rerun/improve upon the above, I put all the data here: ev_range.tsv, ev_weight.tsv, df_ev_range.csv, df_ev_range_clean.csv, df_ev_range_sim.csv. This could be a pretty nice dataset for teaching.

To validate (or at least reproduce) the model above, I ran the following R code. The results were virtually identical.

library('brms')
df_clean = read.csv("/tmp/df_ev_range_clean.csv")
priors = c(set_prior("student_t(3,100,50)", class = "b", coef = paste("car", sort(unique(df_clean$car)), sep="")),
           set_prior("student_t(3,50,20)", class = "b", coef = "ton"),
           set_prior("student_t(3,-1,2)", class = "b", coef = "dgC"),
           set_prior("student_t(3,10,20)", class = "b", coef = "wet"),
           set_prior("student_t(3,2,5)", class = "b", coef = "tir"),
           set_prior("student_t(3,50,30)", class = "b", coef = "spd")
          )
res_clean = brm(
    Whk ~ 0 + ton + dgC + tir + car + spd + wet + (1|car),
    data=df_clean,
    chains = 4,
    iter = 3000,
    cores = 4,
    prior=priors,
    control = list(adapt_delta = 0.95, max_treedepth = 20))

There are a lot of bioinformatics workflow languages and systems these days. Albert Vilella maintains a comprehensive spreadsheet and it's massive. A little confusingly, some of these are actual languages (CWL), some are engines for running workflows (Cromwell), and some are self-contained systems (reflow, nextflow, snakemake).

The point of these tools is to take a multi-part process — usually a combination of scripts and binaries — and run it many times, keeping a log of what succeeded and what failed, and managing the resources required.

Among the most popular (or at least ones I've heard of people using) are snakemake, toil, nextflow, luigi, bpipe, scipipe, and Cromwell/WDL.

For a comprehensive review, see this 2017 paper by Jeremy Leipzig. Here, I'm just going to compare an example script from the nextflow website to the equivalent in snakemake and reflow. These are the three workflow systems I am most familiar with, and I think the differences between them are pretty interesting.

Nextflow

Nextflow is probably the most popular workflow system right now. It is powerful, simple to get started with, and very well supported by Paolo Di Tommasso, with plenty of example scripts. It supports running pipelines locally, on a local cluster, or on the cloud. It can use containers (Docker or Singularity) for reproducibility. Nextflow uses AWS Batch as its main cloud backend (apparently everyone is these days).

The basic script below, from the nextflow site, is pretty self-explanatory.

script

#!/usr/bin/env nextflow

params.in = "$baseDir/data/sample.fa"
sequences = file(params.in)

/*
 * split a fasta file in multiple files
 */
process splitSequences {
    input:
    file 'input.fa' from sequences

    output:
    file 'seq_*' into records

    """
    awk '/^>/{f="seq_"++d} {print > f}' < input.fa
    """
}

/*
 * Simple reverse the sequences
 */
process reverse {
    input:
    file x from records

    output:
    stdout result

    """
    cat $x | rev
    """
}

/*
 * print the channel content
 */
result.subscribe { println it }

output

N E X T F L O W  ~  version 19.04.1
Launching basic.rf [hungry_celsius] - revision: c3e3b933c7
[warm up] executor > local
executor >  local (1)
[ea/b4bcb4] process > splitSequences [  0%] 0 of 1
1FDSA>
ACCAAACACAGTA
2FDSA>
ACCAAACACAGTA
3FDSA>
ACCAAACACAGTA
executor >  local (2)
[ea/b4bcb4] process > splitSequences [100%] 1 of 1 
[ef/0377ba] process > reverse        [100%] 1 of 1 
Completed at: 19-May-2019 10:17:39
Duration    : 951ms
CPU hours   : (a few seconds)
Succeeded   : 2

The output is quite comprehensive and easy to understand.

Snakemake

Snakemake, developed by Johannes Köster (also the founder of the bioconda project!), is the oldest of the three workflow systems here. As the name implies, it's sort of like make, but in Python.

Snakemake existed before containers and cloud computing were common. It is filename-focused, which makes it simple to get started with, and pretty easy to debug. In contrast, nextflow and reflow try to abstract away the names of files, sometimes at the cost of simplicity.

Even though I have used snakemake quite a bit, I found it pretty difficult to port the nextflow script above, mainly because it involves splitting one file into multiple. I think there are at least three ways to handle file splitting in snakemake: dynamic files (soon to be deprecated), the new checkpoints (I did not look into this much), and just using directories.

The directory option is the simplest to me, but the result is probably far from ideal snakemake. In previous snakemake pipelines I often relied on something similar: when one step of the pipeline is done, instead of relying on the appearance of a specific filename, I would just output a "sentinel" file, e.g., echo done > {output.sentinel}, and use that as the trigger for the next step.
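A minimal sketch of that sentinel pattern (the rule names and scripts here are hypothetical):

rule heavy_step:
    input: "input.fa"
    output: sentinel="heavy_step.done"
    shell: "run_heavy_step.sh {input} && echo done > {output.sentinel}"

rule downstream_step:
    input: "heavy_step.done"
    output: "downstream.txt"
    shell: "collect_results.sh > {output}"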

Since Snakemake is based on Python, it is easy to extend and integrate with other scripts. To copy the nextflow script, though, I just use bash here.

script

rule all:
    input: "rev"

rule split_fasta:
    input: "input.fa"
    output: directory("split")
    run: shell("""awk '/^>/{{f="{output}/seq_"++d}} {{print > f}}' < {input}""")

rule reverse_fasta:
    input: "split"
    output: directory("rev")
    run: shell("""for f in {input}/*; do cat $f |rev > {output}/rev_$(basename $f); done""")

output

Snakemake outputs a directory, rev, containing the files rev_seq_1, rev_seq_2, rev_seq_3.

Reflow

Reflow is a relatively new workflow language developed at Grail by @marius and colleagues. I've been using Reflow for a year or so, and it's the system I am most familiar with.

One big difference between reflow and the other systems is how constrained it is: simplifying a bit, every function takes (a) a docker container, (b) a file or directory stored in s3, and (c) some bash; and outputs another file or directory stored in s3.

Everything gets run on AWS. You cannot run it on a regular cluster. You can run a reflow pipeline locally but it's mostly for debugging, and feels more like emulating the real AWS pipeline.

script

val files = make("$/files")
val dirs = make("$/dirs")

input := file("s3://booleanbiotech/input.fa")

func Split(input file) dir =
    exec(image := "ubuntu") (out dir) {"
        echo Split {{input}};
        awk '/^>/{f="{{out}}/seq_"++d} {print > f}' < {{input}}
    "}

func Reverse(input file) file =
    exec(image := "ubuntu") (out file) {"
        echo Reverse {{input}};
        cat {{input}} | rev > {{out}}
    "}

split_dir := Split(input)
rev_files := [Reverse(f) | f <- dirs.Files(split_dir)]

val Main = rev_files

output

[file(sha256=sha256:15a48aae8b19682a84dcf39f9260901e0b8331e8f9f974efd11a9a15c950320f, size=16),
 file(sha256=sha256:adad5843ce952755cd5b5b71fd92c7a67be3063d047350cf7fffdd6031ee5c2e, size=16),
 file(sha256=sha256:441ff0551203c350251abc5782698371116cca12a735fa2dbed66de4820b9201, size=16)]

There is logging too, but overall the output of reflow is pretty opaque!

To see the contents of the generated files, you have to use reflow cat 15a48aae8b19682a84dcf39f9260901e0b8331e8f9f974efd11a9a15c950320f. Generally, the last step in your pipeline should copy the outputs to S3, but debugging this can be tough.

Comparing the Three

Snakemake has been great for local pipelines. Its "reports" — html output from stages of the pipeline — are especially nice. It also now supports containers (and conda), so it is far from stagnant. However, I haven't tried to run snakemake pipelines on the cloud, and I suspect since it is so file-focused it might be less suited to this.

I only seriously experimented with nextflow before it had AWS Batch support, which made it unsuitable for my purposes. At that time running pipelines on the cloud required quite a lot of setup, but now this issue seems to be solved. Nextflow development also seems to be the most active of the three. There are even periodic Nextflow meetings.

Reflow is the least popular, least flexible, and least documented of the three. Nevertheless, as a user who exclusively runs pipelines on the cloud, I still like the reflow model best. It gets closest to my ideal for a workflow system, which is basically just annotating code with the resources required to run it.

This is because to reflow, everything is a hash: containers, code, inputs, outputs. For example, if you have genome A (sha256(genome_sequence_bytes)=<hash1>), and you run adapter trimming function B (sha256(script_bytes)=<hash2>), using container C (sha256(docker_container_bytes)=<hash3>), then you "know" the output is the hash of sha256(<hash1>+<hash2>+<hash3>)=<hash4>. If the file s3://my-reflow-cache/<hash4> exists, even if it's from an old workflow, you don't need to recalculate! (The above is the gist of how it works, anyway. As usual with caches, this system can also result in tricky bugs...)
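A minimal sketch of that caching idea in Python (my own illustration, not reflow's actual implementation):

import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def cache_key(container_hash: str, code_hash: str, input_hashes: list) -> str:
    # the key for a step is a hash over the hashes of its container, code, and inputs
    combined = (container_hash + code_hash + "".join(input_hashes)).encode()
    return sha256_hex(combined)

# if an object named <cache_key> already exists in the cache (e.g., an s3 bucket),
# the step's output can be reused without recomputation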

For some technical details on differences between nextflow and reflow, this Twitter thread may also be of interest, since it includes some discussion from the authors of the tools.

Conclusions

If I were starting today, I would probably choose nextflow as the safe option, mainly due to its new AWS Batch support. Snakemake still works great, though it may be showing its age a bit, while reflow is excellent if you are AWS-exclusive and can deal with its lack of documentation and support.

I think my ideal workflow language would have reflow-style decorators on Python functions (or a subset of python). I'll be interested to see if the reflow model inspires some new workflow systems like this.


