
In this post I'll describe how to sequence a human genome at home, something that's only recently become possible.

The protocol described here is not necessarily the best way to do this, but it's what has worked best for me. It costs a few thousand dollars in equipment to get started, but the (low-coverage) sequencing itself requires only $150, a couple of hours of work, and almost no lab skills.

What does it mean to sequence a human genome?

First, it's useful to explain some terms: specifically, to differentiate a regular reference-based genome assembly from a de novo genome assembly.

Twenty years or so ago, the Human Genome Project resulted in one complete human genome sequence (actually combining data from several humans). Since all humans are 99.9%+ identical genetically, this reference genome can be used as a template for any human. The easiest way to sequence a human genome is to generate millions of short reads (100-300 base-pairs) and align them to this reference.
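
To make that concrete, here's a minimal sketch of reference-based alignment using mappy, the Python bindings for minimap2. The file names are placeholders, and a real pipeline would follow alignment with a proper variant caller:

import mappy as mp

# Index the reference genome (placeholder file names)
aligner = mp.Aligner("GRCh38.fa", preset="sr")  # "sr" = minimap2's short-read preset
if not aligner:
    raise RuntimeError("failed to load or build the index")

# Align each read and print where it landed on the reference
for name, seq, qual in mp.fastx_read("reads.fastq"):
    for hit in aligner.map(seq):
        print(name, hit.ctg, hit.r_st, hit.r_en, hit.mapq)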

The alternative to this reference-based assembly is a de novo assembly, where you figure out the genome sequence without using the reference, by stitching together overlapping sequencing reads. This is much more difficult computationally (and actually impossible if your reads are too short), but the advantage is that you can potentially see large differences compared to the reference. For example, it's not uncommon for a genome to contain sequence that is simply absent from the reference genome; these large differences are called structural variants.
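
As a toy illustration of the stitching idea, here is a greedy overlap-merge in pure Python. Real assemblers (overlap-layout-consensus or de Bruijn graph based) are far more sophisticated, and this sketch ignores sequencing errors and reverse complements entirely:

def merge(a, b, min_overlap=3):
    """Extend a with b if a suffix of a matches a prefix of b."""
    for k in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a.endswith(b[:k]):
            return a + b[k:]
    return None

# Three overlapping "reads" stitched into one contig
reads = ["GATTACAGG", "ACAGGTTT", "GTTTCCC"]
contig = reads[0]
for read in reads[1:]:
    contig = merge(contig, read) or contig  # skip reads with no overlap
print(contig)  # GATTACAGGTTTCCC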

There are also gaps in the reference genome, especially at the ends (telomeres) and middles (centromeres) of chromosomes, due to highly repetitive sequence. In fact, the first full end-to-end human chromosome was only sequenced last year, thanks to ultra-long nanopore reads.

For non-human species, genome assembly is usually de novo, either because the genomes are small and non-repetitive (bacteria), or there is no reference (newly sequenced species).

SNP chip

The cheapest way to get human genome sequence data is with a SNP chip, like the 23andMe chip. These chips work by measuring variation at specific, pre-determined positions in the genome. Since we know the positions that commonly vary in the human genome, we can just examine a few hundred thousand of those positions to see most of the variation. You can also accurately impute tons of additional variants not on the chip. The reason this is "genotyping" and not "sequencing" is that you don't get a contiguous sequence of As, Cs, Gs, and Ts. The major disadvantage of SNP chips is that you cannot directly measure variants not on the chip, so you miss things, especially rare and novel variants. On the other hand, the accuracy for a specific variant of interest (e.g., a recessive disease variant like cystic fibrosis ΔF508) is probably higher than from a sequenced genome.
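
For a sense of what chip data looks like: 23andMe's raw download is just a tab-separated table of genotypes at known positions, easy to parse in a few lines (a sketch; the file name is a placeholder):

genotypes = {}
with open("23andme_raw_data.txt") as f:
    for line in f:
        if line.startswith("#"):            # skip header comments
            continue
        rsid, chrom, pos, genotype = line.split()
        genotypes[rsid] = genotype          # e.g., "AG": one allele per chromosome

print(len(genotypes), "genotyped positions")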

Short-read sequencing

Short-read sequencing is almost always done with Illumina sequencers, though other short-read technologies are emerging. These machines output millions or billions of 100-300 base-pair reads that you can align to the reference human genome. Generally, people like to have on average 30X coverage of the human genome (~100 gigabases) to ensure high accuracy across the genome.
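
The coverage arithmetic is easy to sanity-check (assuming a ~3.1 gigabase haploid genome and 150bp reads):

genome_size = 3.1e9          # haploid human genome, ~3.1 Gb
coverage = 30
read_length = 150            # a typical Illumina read

bases_needed = genome_size * coverage
print(f"{bases_needed / 1e9:.0f} Gb = ~{bases_needed / read_length / 1e6:.0f} million {read_length}bp reads")
# 93 Gb = ~620 million 150bp reads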

Although you can read variants not present on a SNP chip, this is still not a complete genome: coverage is not equal across the genome, so some regions will likely have too low coverage to call variants; the reference genome is incomplete; some structural variants (insertions, inversions, repetitive regions) cannot be detected with short reads.

Long-read sequencing

The past few years have seen single-molecule long-read sequencing develop into an essential complement, and sometimes a credible alternative, to Illumina. The two players, Pacific Biosciences and Oxford Nanopore (ONT), both now offer mature technologies.

The big advantage of these technologies is that you get reads much longer than 300bp — from low thousands up to megabases on ONT in extreme examples — so assembly is much easier. This enables de novo assembly, and is especially helpful with repetitive sequence. For this reason, long-read sequencing is almost essential for sequencing new species, especially highly repetitive plant genomes.

Sounds great! Why do people still use Illumina then? The per-base accuracy and per-base cost of Illumina are still considerably better than these competitors' (though ONT's PromethION is getting close on price).

One huge advantage that ONT has over competitors is that the instrument is a fairly simple solid-state device that reads electrical signals from the nanopores. Since most of the technology is in the consumable "flow-cell" of pores, the instruments can be tiny and almost free to buy.

Instead of spending $50k-1M on a complex machine that requires a service contract, etc., you can get a stapler-sized MinION sequencer for almost nothing, and you can use it almost anywhere. ONT have also done a great job driving the cost per experiment down, especially by releasing a lower-output flow-cell adapter called the flongle. Flongle flow-cells cost only $90 each, and produce 100 megabases to >1 gigabase of sequence.

There is a great primer on how nanopore sequencing works at nanoporetech.com.


Nanopore Sequencing Equipment

(Note, to make this article stand alone, I copied text from my previous home lab blogpost.)

Surprisingly, you don't actually need much expensive equipment to do nanopore sequencing.

In my home lab, I have the following:

Optional equipment:

  • A wireless fridge thermometer. This was only $25 and it works great! It's useful to be able to keep track of temperatures in your fridge or freezer. Some fridges can get cold enough to freeze, which is deadly for flow-cells.
  • A GeneQuant to check the quality of DNA extractions. A 20-year-old machine cost me about $150 on eBay. It's a useful tool, but does require quite a lot of sample (I use 200 µl). I wrote a bit more about it here.

*The lab during a sequencing run. MinION running and used flow-cell on the desk in front*

*(a) Eppendorf 5415C centrifuge. (b) Anova sous vide in a Costco coffee can*

Protocol Part One: extracting DNA

The first step in sequencing is DNA extraction (i.e., isolating DNA from biological material). I use a Zymo Quick-DNA Microprep Plus Kit, costing $132. It's 50 preps, so a little under $3 per prep. There are other kits out there, like NEB's Monarch, but these are harder to buy (requiring a P.O., or business address).

The Zymo kit takes "20 minutes" (it takes me about 40 minutes including setting up). It is very versatile: it can work with "cell culture, solid tissue, saliva, and any biological fluid sample". This prep is pretty easy to do, and all the reagents except Proteinase K are just stored at room temperature. They claim it can recover >50kb fragments, and anecdotally, this is the maximum length I have seen. That is far from the megabase-long "whale" reads some labs can achieve, but those preps are much more complex and time-consuming. Generally speaking, 10kb fragments are more than long enough for most use-cases.

Protocol Part Two: library prep

Library prep is the process of preparing the DNA for sequencing, for example by attaching the "motor protein" that ratchets the DNA through the pore one base at a time. The rapid library prep (RAD-004) is the simplest and quickest library prep method available, at $600 for 12 preps ($50 per prep).

Library prep is about as difficult as DNA extraction, and takes around 30 minutes. There are some very low volumes involved (down to 0.5µl, which is as low as my pipettes go), and you need to use two water bath temperatures, but overall it's pretty straightforward.

The total time from acquiring a sample to beginning sequencing could be as little as 60-90 minutes. You do pay for this convenience in lower read lengths and lower throughput though.

The Data

The amount of data you can get from ONT/nanopore varies quite a lot. There is a fundamental difference between Illumina and nanopore in that nanopore is single-molecule sequencing. With nanopore, each read represents a single DNA molecule traversing the pore. With Illumina, a read is an aggregated signal from many DNA molecules (which contributes to the accuracy).

So, nanopore is really working with the raw material you put in. If there are contaminants, then they can jam up the pores. If there are mostly short DNA fragments in the sample, you will get mostly short reads. Over time, the pores degrade, so you won't get as much data from a months-old flow-cell as a new one.

Using the protocol above, I have been able to get around 100-200 megabases of data from one flongle ($1 per megabase!). There are probably a few factors contributing to this relatively low throughput: the rapid kit does not work as well as the more complex ligation kit; I don't do a lot of sequencing, so the protocol is certainly executed imperfectly; my flow-cells are not always fresh.
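
That dollar-per-megabase figure roughly checks out from the numbers above:

run_cost = 90 + 50 + 3      # flongle + rapid prep + DNA extraction, in dollars
yield_mb = 150              # midpoint of my typical 100-200 megabases
print(f"${run_cost / yield_mb:.2f} per megabase")   # ~$0.95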

For a human sample, 100 megabases is less than a 0.1X genome, which raises the fair question: why would you want to do that? Today, the answer is mainly just because you can. You could definitely do some interesting ancestry analyses, but it would be difficult to validate without a reference database. Gencove also has several good population-level use-cases for low-pass sequencing.

The next step up from a flongle is a full-size MinION flow-cell, which runs on the same equipment and uses the same protocol, but costs around $900, and in theory can produce up to 42 gigabases of sequence. This would be a "thousand dollar genome", though the accuracy is probably below what you would want for diagnostic purposes. In a year or two, I may be able to generate a diagnostic-quality human genome at home for around $1000, perhaps even a decent de novo assembly.


It's been a dream of mine for a long time to be able to do sequencing at home — just take whatever stuff I want: microbiome, viral/bacterial infections, insects, fungi, foods, sourdough, sauerkraut, and sequence it. Now at last, with the debut of Oxford Nanopore's flongle, it's possible!

So, a few months back, I bought some flongles (basically on launch day) and set up a home sequencing lab. In this post I'll describe what's in the lab and my first sequencing experiments.

What is a flongle?

As a refresher, Oxford Nanopore's MinION sequencer is a hand-held, single molecule nanopore sequencer. As DNA passes through a pore, the obstruction modulates the current across the pore in a pattern that can be mapped onto a sequence of nucleotides.
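
A toy illustration of the principle (with made-up current values): older basecallers used per-k-mer current models like this, while modern basecallers use neural networks over the raw signal:

# Map measured current levels back to k-mers via a lookup table.
# The current values here are invented for illustration only.
kmer_current = {"AAA": 83.0, "AAC": 77.1, "ACA": 71.5}

def nearest_kmer(level):
    return min(kmer_current, key=lambda k: abs(kmer_current[k] - level))

signal = [82.7, 76.9, 71.8]                 # one measured level per k-mer step
print([nearest_kmer(s) for s in signal])    # ['AAA', 'AAC', 'ACA']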

DNA going through a pore, from an ONT video

The MinION device itself is a fairly simple container for the nanopore-containing flow-cells. The standard MinION flow-cell contains 512 channels (each of which has one active pore at a time), plus the ASIC chip that reads the changes in current. There's a great explainer on the Oxford Nanopore website.

Oxford Nanopore's newest flow-cell, the flongle ("flow-cell dongle"), is basically a cheaper, more disposable version of the standard MinION flow-cell. The flongle snaps into an adapter that includes the ASIC you usually find in a regular flow-cell.

(a) MinION sequencer with loaded flow-cell. (b) Flongle and adapter.

Whereas a 512 channel flow-cell costs $475-900, 126 channel flongles cost about $90-150 each. At the time of writing, you have to buy at least a starter pack of 12 for $1860, which includes the adapter.

Each pore can output several megabases of sequence, so the total amount of sequence can be in the hundreds of megabases (the biggest flongle run I know of is ~2Gb, in the hands of an experienced lab.)

For now, you have to spend quite a lot to get access to flongles, and supply is severely limited. I have received only four so far, out of the 48 I bought! (That was the minimum order size at launch.) So, there are definitely beta program issues here. Still, $100 NGS runs!

Home Lab Equipment

Surprisingly, you don't actually need that much expensive equipment to do nanopore sequencing. For my home lab, I bought the following:

The lab during a sequencing run. MinION running and used flongle on the desk in front

(a) Eppendorf 5415C centrifuge. (b) Anova sous vide in a Costco coffee can

DNA Extraction

The first step in sequencing is DNA extraction. I bought a Zymo Quick-DNA Microprep Plus Kit for $132 (50 preps, so a little under $3 per prep).

This kit takes "20 minutes" (it takes me about 40 minutes including setting up). It can work with "cell culture, solid tissue, saliva, and any biological fluid sample". This prep is very easy to do, and almost all the reagents are just stored at room temperature. They claim it can recover >50kb fragments, which is very respectable. This is far from the megabase-long "whale" reads some labs work to achieve, but those preps are much more complex and time-consuming. Generally speaking, 10kb reads are more than long enough for most use-cases, and even 100bp-1kb reads can still be used for species ID.

I am lucky to have access to a nanodrop spectrophotometer at work, so I can check my DNA quality. (Nanodrops cost thousands of dollars, even second-hand.) I think this wouldn't matter if I were sequencing saliva repeatedly: that seems to work the same every time. However, it matters quite a bit when experimenting with sequencing different sample types.

Library prep

Library prep is the process of preparing the DNA for sequencing, for example by attaching the "motor protein" that ratchets the DNA through the pore one base at a time. Not being an experimentalist, I like to stick to the simplest possible protocols. That means rapid DNA extraction and ONT's rapid library prep (RAD-004), which costs $600 for 12 preps ($50 per prep).

Library prep is a little harder than DNA extraction, but still only takes around 40 minutes. There are some very low volumes involved (0.5µl, which is as low as my pipettes go!), and you need two water bath temperatures, but overall it's pretty straightforward.

The total time from acquiring a sample to beginning sequencing is maybe 1.5 hours. You definitely pay for this convenience in read length and throughput, but the tradeoff is not too bad.

Admittedly, the cost is more like $150 than $100 per run, but with the nuclease wash protocol now available to rejuvenate flow-cells, I think it's ok to round down...

Experiment one: saliva

This is probably the simplest possible experiment: extract human and bacterial DNA from saliva, and sequence it. Saliva has lots of human DNA — surprisingly, most of it is from white blood cells — and plenty of oral microbiome bacteria, and it's easy to get as much as you want. However, since bacterial genomes are about 0.001X the size of a human genome, you'd need 1000 bacterial genomes for every human genome if you want equal coverage of both.
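
The back-of-the-envelope version of that imbalance (assuming a rough ~3 megabase bacterial genome):

human_genome = 3.1e9       # bases
oral_bacterium = 3e6       # a typical bacterial genome size, roughly

# For equal coverage from one run, the sample would need about this many
# bacterial genome copies per human genome copy:
print(f"{human_genome / oral_bacterium:.0f}")   # ~1033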

Experiment one: (a) DNA quantification from saliva. (b) A decent read length distribution, topping out at 60kb.

This experiment generated a pretty respectable 100 megabases of sequence in 24 hours, which is basically what I was hoping for.

As soon as the DNA is loaded, reads start to get written to disk. After a minute, you have reads you can feed into BLAST to see if everything is working as expected. The instant access to data is a great reward for doing the boring prep work.

First sequencing run at home: pores sequencing, 34 minutes and 10 megabases in!

There are a few ways to analyze the data. There are several metagenome analysis tools, like Centrifuge and Kraken. I spent a couple of days(!) downloading the Centrifuge databases — which are massive since they need reference sequence data from bacteria, fungi, viruses etc. — only to have the build fail right afterward.

Luckily, Oxford Nanopore has some convenient tools online for analysis. It turns out that one of these, What's In My Pot (WIMP) is based on Centrifuge so it's convenient to just run that.

Experiment one: WIMP results for unfiltered saliva

As we can see, >99% of the reads are human or Escherichia. Upon closer inspection, the reads labeled "Escherichia coli" and "Escherichia virus Lambda" are all lambda DNA. As a QC step, I spiked lambda DNA (provided by ONT for QC purposes) into my DNA library at approximately 13% by volume. About 12% of my reads are lambda, so I know the molarity of my input sample is not too far off the reference lambda DNA.
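
Checking the spike-in is just counting classified reads. A minimal sketch, with a mock classification dict standing in for the real WIMP output (with the real data, this came out around 12%):

# Mock per-read classifications; in practice these come from WIMP/Centrifuge
labels = {
    "read1": "Homo sapiens",
    "read2": "Escherichia virus Lambda",
    "read3": "Streptococcus mitis",
    "read4": "Escherichia coli",
}
lambda_like = {"Escherichia coli", "Escherichia virus Lambda"}

frac = sum(lab in lambda_like for lab in labels.values()) / len(labels)
print(f"{frac:.0%} lambda reads vs ~13% spiked in by volume")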

After you get past the human and lambda DNA, the vast majority of reads map to known oral microbiome bacteria. Without anything to compare to, I can't point to any specific trends here yet.

Human DNA

What can you do with 80 megabases of human DNA? I know from just BLASTing reads that the accuracy is consistently 89-91%. Since a hundred megabases is only a 0.03X genome, it's not very useful for any human genetics tasks except maybe ancestry assignment, Gencove-style.

One thing I can do is intersect these reads with my 23andMe data, and see how often they are concordant (the 23andMe data is diploid and these are single-molecule reads, so it's not quite simple matching). Doing this intersection using bcftools and including only high-quality reads resulted in only a few hundred SNPs. I did not find any variants that disagreed, which was surprising but nice to see.
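
A rough sketch of that concordance check, assuming the variant calls are in a VCF (read with pysam) and the chip genotypes are in a placeholder dict. Since each nanopore read is a single molecule, a call counts as concordant if it matches either chip allele:

from pysam import VariantFile

# Hypothetical: (chrom, pos) -> diploid chip genotype, parsed from the
# 23andMe raw file (watch chromosome naming: "1" vs "chr1")
chip_genotypes = {("1", 1234567): "AG"}

concordant = discordant = 0
for rec in VariantFile("nanopore_calls.vcf.gz"):
    chip = chip_genotypes.get((rec.chrom, rec.pos))
    if chip is None or not rec.alts:
        continue
    # one molecule = one allele: concordant if the called alt allele
    # matches either of the two chip alleles
    if rec.alts[0] in chip:
        concordant += 1
    else:
        discordant += 1

print(concordant, "concordant,", discordant, "discordant")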

Experiments two and three: failing to filter saliva

Obviously, it's a waste to generate so many human reads. Since I don't need my genome sequenced again (ok, I only have exome data), especially 0.03X at a time, I wanted to try to enrich for oral bacteria. There are host depletion kits that apparently work well, but that's kind of expensive, so I wanted to see what would happen if I just tried to physically filter saliva.

We know that human cells are usually >10µm and bacterial cells are usually <10µm, so that's a pretty simple threshold to filter by. I bought a "10µm" paper filter on Amazon, and just filtered saliva through it.

Experiments two and three produced almost identical results. The only differences were that after experiment two failed I tried to eliminate contamination from the paper filter by pre-washing it, and I quantified the DNA with a nanodrop, a step I skipped in experiment two. After multiple rounds of filtering–centrifugation–pouring off, I only managed to get 10ng/µl of DNA, which is very low. However, I knew that my first 32ng/µl run worked fine, so I convinced myself it must reach the recommended minimum of 5 femtomoles of DNA (that's only 3 billion molecules!), especially since the 260/280 was not that bad.
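
For reference, the femtomole-to-molecule conversion is just Avogadro's number:

avogadro = 6.022e23
femtomoles = 5
print(f"{femtomoles * 1e-15 * avogadro:.1e} molecules")   # ~3.0e+09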

The experiment worked as planned, in the sense that instead of 99+% human DNA, I got 50% human DNA and 50% bacterial. However, instead of 100 megabases, I only got 2, and most were low quality!

Experiment three: (a) DNA quantification from filtered saliva. (b) WIMP on filtered saliva

My best guess here is that somehow the paper contaminated the DNA, since the pores apparently got clogged after just a couple of hours. I should at least have made sure I had a lot of DNA, though I don't have great ideas on how to do that beyond just spitting for an hour... It's likely I'll just need to use a proper microbiome prep kit next time.

Experiment four: wasp sequencing

Amazingly, despite having pretty small genomes (100s of megabases), most insects have never been sequenced! It's not clear to me that you can create a high quality genome assembly from only flongle reads, but if you can get 100 megabases of DNA, that's definitely a good start.

We have a wasp trap in our back yard. It caught a wasp but we were not sure what kind. It could be the most common type of wasp in the area, the western yellowjacket. It looks exactly like one, which is a bit of a clue.

Distribution of western yellowjacket vs common aerial yellowjacket, according to iNaturalist

But eyes can deceive. The only real way to know for sure what this is would be to sequence its genome, or at least it would be if there were a reference genome to compare against. Surprisingly, there is no genome for the western yellowjacket, or for the other likely species, the common aerial yellowjacket.

(a) Before mushing, with an aphid and other tiny insects in the second tube. (b) After mushing, which was pretty gross.

I took the wasp, plus an aphid that looked freshly caught in a spiderweb, and a few other tiny insects scurrying around nearby. Then I mushed them up and used the Zymo solid tissue protocol to extract DNA.

Experiment four: (a) DNA quantification and (b) read-length distribution

This time the DNA extraction was of good quantity and quality. The total amount of sequence generated was again 100 megabases. However, the average read length is extremely short. A general rule for nanopore sequencing is that you get out what you put in. In retrospect the problem should have been pretty obvious: although it looked ok, the wasp was not fresh enough, so its DNA was very degraded.

Interestingly, there are quite a few long fragments (>5kb) in here too, and these map imperfectly to various aphid genomes (indicating that this particular aphid has also not been sequenced) and bacteria including possible wasp endosymbionts like Pantoea agglomerans. This is expected if the aphid and bacteria are fresh.

Experiment four: (a) WIMP of wasp and aphid reads. (b) BLASTing reads produces better results

I also ran WIMP but it turned out not to be useful, since this is not a "real" metagenomics run (i.e., it's not mainly a mixture of microbes and viruses). The closest matches are just misleading.

It would have been nice to be the first to assemble the western yellowjacket genome, or even a commensal bacterial genome, but I would have needed a lot more reads. Wasp genomes are around 200 megabases, so to get a decent quality genome I'd need at least 10 gigabases of sequence (50X coverage). That means a MinION run (or several), perhaps polished with some Illumina data. The commensal bacteria are probably under 5 megabases, so it would be easier to create a reference genome, assuming any could be grown outside the wasp...
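
The throughput math, using my typical flongle yield as the unit:

wasp_genome = 200e6        # ~200 megabases
coverage = 50
flongle_yield = 100e6      # ~100 megabases per flongle, in my hands

bases_needed = wasp_genome * coverage
print(f"{bases_needed / 1e9:.0f} Gb needed, or ~{bases_needed / flongle_yield:.0f} flongle runs")
# 10 Gb needed, or ~100 flongle runs: hence a MinION flow-cell (or several)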

Next steps

Four flongles in, I am still pretty amazed that I can generate a hundred megabases of sequence, at home, for so little money and equipment.

I can almost run small genome projects at home, and submit the assembled genomes to NCBI. (I still need more sequencing throughput to do this in earnest.) Like the western yellowjacket, there are tons of genomes that should be sequenced but haven't been. In general, plants and more complex eukaryotes will be too difficult, but bacteria, fungi, and insects should all be possible at home.

Preserving the DNA sequences of species could become an extremely important step in conservation and even de-extinction. The only group I know of doing work in this area is Revive & Restore. One of their projects is to try to bring back the woolly mammoth by bootstrapping from elephant to elephant–mammoth hybrid, and eventually to full mammoth. Of course this would not be possible without the mammoth genome sequence. The list of endangered species is very long, so there's a lot to do.

Brian Naughton | Sun 11 November 2018

I took a look at the data in Albert Vilella's very useful NGS specs spreadsheet using Google's slick colab notebook. (If you have yet to try colab it's worth a look.)

Doing this in colab was a bit trickier than normal, so I include the code here for reference.

First, I need the gspread lib to parse Google Sheets data, and the id of the sheet itself.

!pip install --upgrade -q gspread
sheet_id = "1GMMfhyLK0-q8XkIo3YxlWaZA5vVMuhU1kg41g4xLkXc"

Then I authorize myself with Google (a bit awkward but it works).

from google.colab import auth
auth.authenticate_user()

import gspread
from oauth2client.client import GoogleCredentials

gc = gspread.authorize(GoogleCredentials.get_application_default())

I parse the data into a pandas DataFrame.

sheet = gc.open_by_key(sheet_id)

import pandas as pd
rows = sheet.worksheet("T").get_all_values()
df = pd.DataFrame.from_records([r[:10] for r in rows if r[3] != ''])

I have to clean up the data a bit so that all the sequencing rates are Gb/day numbers.

# Keep just the Platform and rate columns, and drop missing values
dfr = (df.rename(columns=df.iloc[0])
         .drop(index=0)
         .rename(columns={"Rate: (Gb/d) ": "Rate: (Gb/d)"})
         .set_index("Platform")["Rate: (Gb/d)"])
dfr = dfr[(dfr != "--") & (dfr != "TBC")]

# Replace ranges like "1-2" with their midpoint
for n, val in enumerate(dfr):
    if "-" in val:
        lo, hi = (float(x) for x in val.split("-"))
        dfr.iloc[n] = (lo + hi) / 2

dfr = pd.DataFrame(dfr.astype(float)).reset_index()

I tacked on some data I think is representative of Sanger throughput, if not 100% comparable to the NGS data.

A large ABI 3730XL can apparently output up to 1-2 Mb of data a day in total (across thousands of samples). A lower-throughput ABI SeqStudio can output 1-100kb (maybe more).

dfr_x = pd.concat([dfr, 
                   pd.DataFrame.from_records([{"Platform":"ABI 3730xl", "Rate: (Gb/d)":.001}, 
                                              {"Platform": "ABI SeqStudio", "Rate: (Gb/d)":.0001}])])

dfr_x["Rate: (Mb/d)"] = dfr_x["Rate: (Gb/d)"] * 1000

If I plot the data there's a pretty striking, three-orders-of-magnitude gap from 1Mb-1Gb. Maybe there's not enough demand for this range, but I think it's actually just an artifact of how these technologies evolved, and especially how quickly Illumina's technology scaled up.

import seaborn as sns
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(16,8))
fax = sns.stripplot(data=dfr_x, y="Platform", x="Rate: (Mb/d)", size=8, ax=ax);
fax.set(xscale="log");
fax.set(xlim=(.01, None));

sequencing gap plot

Getting a single 1kb sequencing reaction done by a service in a day for a couple of dollars is easy, so the very low throughput end is pretty well catered for.

However, if you are a small lab or biotech doing any of:

  • microbial genomics: low or high coverage WGS
  • synthetic biology: high coverage plasmid sequencing
  • disease surveillance: pathogen detection, assembly
  • human genetics: HLA sequencing, immune repertoire sequencing, PGx or other panels
  • CRISPR edits: validating your edit, checking for large deletions

you could probably use a few megabases of sequence now and then without having to multiplex 96X.

If it's cheap enough, I think this is an interesting market that Nanopore's new Flongle can take on, and for now there's no competition at all.


