Last week, Biogen released some Phase Ib data for their Alzheimer's drug, Aducanumab (BIIB037). I don't know too much about how these issuances work, but it's bizarre to me that you can show a chart at a conference, and not the actual data, and yet the information increases Biogen's value by billions of dollars. Maybe someone has access to the raw data, but I can't find it --- that sounds illegal for a public company anyway.
Since there is a lot of money at stake, there are people on Wall Street reanalyzing the data, though not publicly. I found an interesting example from Morgan Stanley here: http://www.investorvillage.com/smbd.asp?mb=2793&mn=414&pt=msg&mid=14782583
On Twitter, there was some skepticism about the data that piqued my interest. Someone said they didn't trust the data because just a few patients changing tack could change the results. It sounds plausible, but it's not obvious how to quantify that. Others said the experiment was underpowered; with about 20 patients per arm of the study, that also sounds plausible.
I decided to take a look at the data to see if there is anything more quantitative to say.
# This cell is just setup from __future__ import division, print_function from IPython.display import Image, SVG import numpy as np import pandas as pd import random from pprint import pprint from matplotlib import pyplot as plt import seaborn as sns sns.set_style("whitegrid") SVG_HEAD = """<?xml version="1.0" standalone="no"?><!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">""" def uprint(astr): print("{}\n".format(astr) + "-"*len(astr)) %matplotlib inline %config InlineBackend.figure_format = 'retina'
Parsing data from slides
I found some slides from Biogen's talk here (http://www.biospace.com/News/biogen-idec-slides-for-alzheimers-drug-wow/369620/). They look like they were taken by someone in the audience with a cellphone.
I thought the data from a slide on reduction in SUVR (a measure of plaque volume in the brain) looked the meatiest. They also showed data from MMSE and CDR-SB (paper-based tests), and a couple of other slides on plaque volume. For my purposes, I think it's reasonable just to test the assumption that the drug reduces plaque volume, not that it improves cognition.
I used Photoshop's awesome perspective crop tool to correct for the perspective on these charts. Then I used SVG to manually determine the values in the chart. Not terribly sophisticated, but I think it works as well as anything else out there; it's flexible and only a few lines of code.
IWH = (805,596) # image dimensions after crop origin = (162,242) totalwh = (604,238) totaly = -0.3 bw = 46 # week, dose, num, xy, wh, se data = [(26, 0, 34, (35,0), (bw,5), 15), (26, 1, 26, (81,0), (bw,26), 15), (26, 3, 27, (127,0), (bw,69), 15), (26, 6, 23, (173,0), (bw,112), 17), (26, 10, 27, (219,0), (bw,162), 16), (54, 0, 21, (334,0), (bw,0), 17), (54, 1, 21, (380,0), (bw,44), 18), (54, 3, 26, (426,0), (bw,110), 16), (54, 6, 0, (472,0), (bw,0), 0), (54, 10, 21, (518,0), (bw,210), 17) ] def bg(xy,wh): return '''<rect x="{}" y="{}" width="{}" height="{}" fill="#ffff00" fill-opacity="0.05" /> '''.format(xy[0],xy[1],wh[0],wh[1]) def box(xy,wh,sd): return '''<rect x="{}" y="{}" width="{}" height="{}" stroke="#ff0000" fill="none" /> <line x1="{}" y1="{}" x2="{}" y2="{}" stroke="#00ffff" /> '''.format(origin[0] + xy[0], origin[1] + xy[1], wh[0], wh[1], origin[0] + xy[0] + wh[0]/2, origin[1] + xy[1] + wh[1] - sd, origin[0] + xy[0] + wh[0]/2, origin[1] + xy[1] + wh[1] + sd) svg = SVG_HEAD + """<svg width="{:d}px" height="{:d}px" version="1.1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink= "http://www.w3.org/1999/xlink"> <image xlink:href="biospace-news-biogen-idec-2.png" x="0" y="0" width="{:d}px" height="{:d}px"/> {} {}</svg>""".format(IWH[0], IWH[1], IWH[0], IWH[1], bg(origin, totalwh), ''.join([box(dat[3],dat[4],dat[5]) for dat in data])) SVG(svg)
Data
Now I can map these hand-drawn rectangles onto numbers, using the transparent yellow box as a reference. Unfortunately, there is no data for Week 54 / 6 mg/kg.
Based on these means and standard deviations, I can generate example values, maximizing the entropy since I want these values to represent only what I know from the chart. Here, this just means assuming a normal distribution. In reality, it's unlikely to be normal, since there are obviously important covariates like APOE e4 status.
def format_data(data): return "Week\tDose\tNum\tMean\tSD\n" + \ "\n".join('{}\t{}\t{}\t{:.3f}\t{:.3f}'.format(d[0], d[1], d[2], d[3] ,d[4]) for d in data) def transform(data): tdata = [] for dat in data: y = dat[4][1] * (totaly/totalwh[1]) sd = dat[2]**.5 * abs(dat[5] * (totaly/totalwh[1])) tdata.append((dat[0], dat[1], dat[2], y, sd)) return tdata print(format_data(transform(data)))
Week Dose Num Mean SD 26 0 34 -0.006 0.110 26 1 26 -0.033 0.096 26 3 27 -0.087 0.098 26 6 23 -0.141 0.103 26 10 27 -0.204 0.105 54 0 21 -0.000 0.098 54 1 21 -0.055 0.104 54 3 26 -0.139 0.103 54 6 0 -0.000 0.000 54 10 21 -0.265 0.098
tdata = transform(data) def _gen_random(loc, scale, size): while True: yield np.random.normal(loc=loc, scale=scale, size=size) e_vals = {} example_vals = {} for d in tdata: if d[4] > 0: # sd != 0 e_vals[(d[0],d[1])] = _gen_random(d[3], d[4], d[2]) example_vals[(d[0],d[1])] = e_vals[(d[0],d[1])].next() uprint("Example values generated from chart data") pprint({k:v[:4] for k,v in example_vals.items()})
Example values generated from chart data ---------------------------------------- {(26, 0): array([-0.19925843, -0.17029101, 0.11315593, -0.01056504]), (26, 1): array([ 0.03167159, -0.07239358, -0.00892414, 0.07187854]), (26, 3): array([-0.10161264, -0.01548505, -0.02418133, -0.00441314]), (26, 6): array([-0.14738112, -0.29103252, -0.13790777, 0.11396582]), (26, 10): array([-0.11370946, -0.04119348, -0.25958697, 0.06817597]), (54, 0): array([ 0.12597653, -0.01024324, 0.08792991, -0.03493747]), (54, 1): array([ 0.01815586, -0.18449709, -0.11815812, -0.14284047]), (54, 3): array([-0.46736124, -0.15305449, -0.23322862, -0.20419422]), (54, 10): array([-0.34687603, -0.23872551, -0.33569562, -0.11804706])}
t-test
Now that I have some values, I can run t-tests and see if my p-values agree with the p-values in the chart. The chart only indicates whether or not their statistical test produced p-values of less than 0.05 (*), 0.01 (**), or 0.001 (***).
The results generally agree pretty well with the chart. In fact, the first time I ran this code, they agreed completely. Again, covariates are bound to matter here.
At the bottom of the slide it explains that the data were analyzed with an ANCOVA (like ANOVA with covariates). Each dose of the drug is also compared pair-wise to the placebo to get a p-value per dose. The statistical test they used to get their p-values is not mentioned, but if were from a two-group ANCOVA, it should produce similar results to a t-test.
The big caveats are that I don't have access to the listed covariates and I don't know the true distribution of effect sizes.
from scipy.stats import ttest_ind from itertools import permutations def _stars(pval): stars = [(.001,"***"), (.01,"**"), (.05,"*"), (np.inf,"")] return next(s[1] for s in stars if pval <= s[0]) uprint("Pairwise t-test p-values") for wk in [26,54]: uprint("Week {}".format(wk)) for dose1,dose2 in ((a,b) for a,b in permutations([0,1,3,6,10], 2) if a<b): if wk == 54 and 6 in (dose1,dose2): continue # no data for 54/6 _pvals = [] for _ in range(100): v1, v2 = e_vals[(wk,dose1)].next(), e_vals[(wk,dose2)].next() _pvals.append( ttest_ind(v1, v2)[1] ) pval = np.mean(_pvals) # arithmetic mean print("{} vs {}\t{:.4f}\t{}".format(dose1, dose2, pval, _stars(pval)))
Pairwise t-test p-values ------------------------ Week 26 ------- 0 vs 1 0.3751 0 vs 3 0.0497 * 0 vs 6 0.0022 ** 0 vs 10 0.0000 *** 1 vs 3 0.1364 1 vs 6 0.0132 * 1 vs 10 0.0000 *** 3 vs 6 0.1619 3 vs 10 0.0032 ** 6 vs 10 0.1066 Week 54 ------- 0 vs 1 0.1441 0 vs 3 0.0013 ** 0 vs 10 0.0000 *** 1 vs 3 0.0411 * 1 vs 10 0.0000 *** 3 vs 10 0.0087 **
Robustness
How sensitive are these results to changes in the data? It's not obvious to me what the best way to test this is, especially without the real data. One simple way to test robustness is by zeroing some fraction of the data. In other words, for some fraction of patients, make it so that the drug has zero effect.
Here, I artificially zero from 0 to 100% of the plaque volume values, and plot the p-values for each dose.
from itertools import product zeros = np.arange(0,1.01,0.05) wks, doses = [26,54], [1,3,6,10] pvals = {} for wk,dose in product(wks, doses): if wk == 54 and dose == 6: continue for zero in zeros: pvals[(wk,dose,zero)] = [] for _ in range(100): v0, vd = e_vals[(wk,0)].next(), e_vals[(wk,dose)].next() vd = [v if random.random()>zero else 0 for v in vd] pvals[(wk,dose,zero)].append( ttest_ind(v0, vd)[1] ) df = pd.DataFrame.from_dict(pvals) color_map = {1:"indianred", 3:"darkseagreen", 6:"steelblue", 10:"purple"} for wk in wks: f, ax = plt.subplots(1, figsize=(12, 4)) plt.title("Week {}".format(wk)) for dose in doses: if wk == 54 and dose == 6: continue _ = sns.tsplot(np.log(df[wk][dose].values), time=df[wk][dose].columns, condition=dose, color=color_map) plt.ylim(-10) plt.xlabel("fraction zeroed") plt.ylabel("log(pvalue)") plt.axhline(y=np.log(.05), linestyle='--', color="gray", label="0.05") _ = plt.text(1.01,np.log(.05),"p=0.05",fontsize=14,color="gray") plt.axhline(y=np.log(.01), linestyle='--', color="gray", label="0.01") _ = plt.text(1.01,np.log(.01),"p=0.01",fontsize=14,color="gray") plt.axhline(y=np.log(.001), linestyle='--', color="gray", label="0.001") _ = plt.text(1.01,np.log(.001),"p=0.001",fontsize=14,color="gray")
Results
The results graph shows that you would need to zero out about 20-40% of the data for the results to move down a p-value category (e.g., from 0.01 to 0.05).
Another way to think about that is you would have to have 4-10 patients in an arm of the study who, instead of responding, did not respond at all.
While that doesn't seem implausible, if it happened, only one of the five significant p-values in the graph would be affected. Hence, I think the results as presented are pretty robust to patients switching to non-responders.
Power and priors
Is the study underpowered? Since it's a phase Ib, it's almost by definition underpowered to detect treatment effects. However, it also depends a lot on your prior for an Alzheimer's drug working. If these data were not for Alzheimer's, but something more innocuous, it might not be called underpowered.
The prior could include the general lack of success of Alzheimer's drugs, but also intangibles such as potential data-dredging, unconscious biases, generally bad science (see Ioannidis), the huge amount of money at stake, and maybe (as someone suggested) even the age of the CEO.
The prior could also include investor confidence. A JPM survey gave this drug about a 35% chance of approval (https://twitter.com/TylerHCanalyst/status/578219628054347776/photo/1), where the general probability of any drug being approved after phase I is about 15% (http://pharmamarketresearchconference.com/files/Clinical_Development_Success_Rates_for_Investigational_Drugs.pdf).
Should I buy?
Obviously, I have no idea. The stock is already up about 30% since January though (when preliminary data appeared) and about 10% this week, so maybe only if you think the probability of approval is well above 35%.
Image("biib037_survey.png")