MashMap: approximate long-read mapping using minimizers and MinHash

May 22, 2017

Chirag Jain recently presented a paper at RECOMB’17 titled “A fast approximate algorithm for mapping long reads to large reference databases” (preprint | proceedings). This paper describes the algorithms behind MashMap, which is our new tool designed for approximate read mapping. Chirag joined the lab last year as a summer fellow, and I asked him to write a new read mapper. (How else does one learn bioinformatics?) He clearly lived up to the challenge, and I think the paper contains some useful ideas for the looming “long-read” era. I wanted to summarize those ideas here for anyone who missed RECOMB.

Assembling the Cliveome

April 28, 2017

We recently participated in a collaborative effort to sequence, assemble, and analyze a human genome (GM12878) using the Oxford Nanopore MinION (Jain et al. 2017). As part of that project, Josh Quick and Nick Loman developed a nanopore sequencing protocol capable of generating “ultra-long” reads of 100 kb and longer. In the paper we predict that reads of such length could enable the most continuous human assemblies to date, with NG50 contig sizes exceeding 30 Mbp. Thus far, we have collected only 5x coverage using the ultra-long read protocol and cannot fully test this prediction. However, another human dataset, the “Cliveome”, lets us compare the effects of read length and coverage on nanopore assembly. Here we present a brief analysis of that assembly, which achieved a remarkable contig NG50 of 24.5 Mbp.
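
For readers unfamiliar with the metric, NG50 is the contig length L such that contigs of length L or longer together cover at least half of the estimated genome size (N50 uses the total assembly size instead). The snippet below is only a minimal sketch of that definition; the function name `ng50` and the toy contig lengths are illustrative assumptions, not code or data from the post.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly, or 0 if it spans less than half the genome.

    contig_lengths: iterable of contig lengths (any consistent unit)
    genome_size: estimated genome size in the same unit
    """
    half = genome_size / 2
    running_total = 0
    # Walk contigs from longest to shortest, accumulating their lengths.
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return 0  # the assembly covers less than half of the genome


# Toy example (hypothetical lengths): genome size 120, contigs 40 + 30 reach 70 >= 60.
toy_contigs = [40, 30, 20, 10]
toy_ng50 = ng50(toy_contigs, genome_size=120)
print(toy_ng50)
```

Because the threshold is tied to the genome size rather than the assembly size, NG50 values stay comparable across assemblies of the same genome.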

Fast and highly accurate HLA typing by linearly-seeded graph alignment

March 22, 2017

HLA*PRG:LA approximates the graph alignment process by starting with linear sequence alignments. It reduces the per-sample resource requirements for HLA typing to 30 GB of RAM and 30 CPU hours, and produces highly accurate calls.

Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly

March 6, 2017

After a long and fruitful collaboration with our friends at the USDA, we have finally published the goat genome in Nature Genetics. “We demonstrate current state of the art for de novo assembly using the domestic goat (Capra hircus) based on long reads for contig formation, short reads for consensus validation, and scaffolding by optical and chromatin interaction mapping. These combined technologies produced what is, to our knowledge, the most continuous de novo mammalian assembly to date, with chromosome-length scaffolds and only 649 gaps.” If you hit the paywall, you can read the preprint for free on bioRxiv, and NHGRI did a nice piece on the backstory. Huge credit to co-first authors Derek Bickhart, Ben Rosen, and Sergey Koren for years of hard work on this project.

Assembly of a human genome from nanopore sequencing data

January 8, 2017

An international consortium recently released ~30x coverage of an immortalized human cell line (NA12878) sequenced using Oxford Nanopore MinION instruments. Release 3 of this dataset included 39 flowcells, which generated 14,183,584 reads and 91,240,120,433 bases, mostly using the 1D ligation prep, but with a few rapid kit runs as well. Our friends Nick Loman and Jared Simpson asked if we could assemble this data with Canu. Of course we said yes.
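
As a quick sanity check on those figures, the read and base counts can be turned into a mean read length and an approximate depth of coverage. This is a back-of-the-envelope sketch: the genome size of 3.1 Gbp is an assumed round number for the human genome, not a value given in the post.

```python
# Counts quoted for release 3 of the dataset.
reads = 14_183_584
bases = 91_240_120_433

# Assumption: approximate human genome size of 3.1 Gbp.
genome_size = 3_100_000_000

mean_read_length = bases / reads   # average bases per read
coverage = bases / genome_size     # average depth of coverage

print(f"mean read length: {mean_read_length:,.0f} bp")
print(f"approximate coverage: {coverage:.1f}x")
```

Under that assumption the numbers work out to a mean read length of roughly 6.4 kb and a little under 30x coverage, consistent with the "~30x" quoted for the release.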