The T2T-CHM13v2.0 release includes complete assemblies for all 24 human chromosomes. Chromosomes 1-22 and X are from the CHM13hTERT cell line and chromosome Y is from NIST HG002 (aka PGP huAA53E0). The CHM13 chromosomes are described in the publications linked below, and we expect to release a preprint on the HG002 chrY within the coming month. This data is released into the public domain without restriction. We politely request that you cite Nurk et al. “The complete sequence of a human genome” when using it.
Science special issue “Completing the human genome”
T2T companion papers at Science, Nature Methods, and Genome Research
UCSC Genome Browser (chm13)
NCBI GenBank Record (GCA_009914755.4)
Ancillary data (GitHub)
In addition to the NCBI assembly record, we are providing the following FASTA files for convenience.
chm13v2.0.fa.gz
CHM13v2.0 reference with repeats soft-masked and sequence names converted to the UCSC style
chm13v2.0_noY.fa.gz
CHM13v2.0 excluding the Y chromosome
chm13v2.0_maskedY.fa.gz
CHM13v2.0 with the Y pseudoautosomal region (PAR) hard masked with Ns
chm13v2.0_maskedY_rCRS.fa.gz
CHM13v2.0 with the Y pseudoautosomal region (PAR) hard masked with Ns and the CHM13 mitochondrial genome replaced with the rCRS reference
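For anyone wanting to reproduce a derived file like chm13v2.0_maskedY.fa.gz, hard masking simply overwrites an interval with Ns. Below is a minimal Python sketch of the operation; the interval shown is a placeholder, not the actual PAR coordinates, and a production workflow would use an indexed FASTA library rather than loading the genome into memory:

# Minimal sketch of hard-masking a FASTA interval with Ns.
# The chrY interval below is a placeholder, NOT the real PAR coordinates.

def read_fasta(path):
    """Parse a FASTA file into a dict of name -> sequence."""
    seqs, name, chunks = {}, None, []
    for line in open(path):
        line = line.rstrip()
        if line.startswith(">"):
            if name:
                seqs[name] = "".join(chunks)
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name:
        seqs[name] = "".join(chunks)
    return seqs

seqs = read_fasta("chm13v2.0.fa")
start, end = 0, 2_000_000  # placeholder 0-based, half-open interval
y = seqs["chrY"]
seqs["chrY"] = y[:start] + "N" * (end - start) + y[end:]

with open("chm13v2.0_maskedY.fa", "w") as out:
    for name, seq in seqs.items():
        out.write(">" + name + "\n")
        for i in range(0, len(seq), 60):  # wrap at 60 columns
            out.write(seq[i:i + 60] + "\n")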
A huge thank you and congratulations to all the members of the T2T consortium! There are 100 co-authors of the assembly paper, and many more who have contributed to our understanding of the first complete human genome. I cannot show them all here, but hopefully this montage will give you a feel for the incredible team that made this dream a reality.
Multiple positions are available under the supervision of Dr. Adam Phillippy at the NIH’s National Human Genome Research Institute (NHGRI). Dr. Phillippy leads the Genome Informatics Section, which develops and applies computational methods for the analysis of massive genomics datasets with a focus on problems related to genome sequencing. Members of the section have developed many widely used bioinformatics methods (e.g. MUMmer, Mash, Canu), and are leaders in the field of long-read DNA sequencing.
The section is currently seeking applicants with an interest in developing and/or applying computational methods for genome assembly, sequence alignment, structural variant detection, metagenomics, and information visualization. Current projects in the lab include our efforts to finish the human reference genome (Telomere-to-Telomere Consortium), sequence and explore the human pan-genome (Human Pangenome Project), and assemble the genomes of many diverse vertebrate species (Vertebrate Genomes Project). Future work includes democratizing genomics with real-time nanopore sequencing and analysis. Applicants must possess strong English, programming, and analytical skills. Past experience in computational genomics is preferred but not required.
The NHGRI Intramural Research Program is located on NIH’s main campus in Bethesda, Maryland and offers a wide array of training and collaboration opportunities for early-career scientists. The funding for these positions is stable and offers wide latitude in the design and pursuit of bioinformatics research. The successful candidate will have access to extensive high-performance computing resources (BioWulf), the NIH intramural sequencing center (NISC), NHGRI core facilities, and the NIH Clinical Center. The typical stipend for a computational postdoc is approximately $70k per year and includes family health insurance. Software engineering salaries are commensurate with experience and competitive with industry. Answers to some frequently asked questions can be found here: NIH Postdoc FAQs. PhD training can also be arranged through most nearby universities, such as the University of Maryland, via the NIH Graduate Partnerships Program.
To apply: Interested applicants should submit their CV, a brief personal statement, and the names of three references to: adam.phillippy@nih.gov
The NIH is dedicated to building a diverse community in its training and employment programs.
Roughly twenty years ago, the International Human Genome Sequencing Consortium published an “Initial Sequencing and Analysis of the Human Genome” simultaneously with “The Sequence of the Human Genome” from Celera Genomics. Although the public consortium chose a more humble title that suggested some work was left to be done, amidst all the pomp it was easy to miss the fact that the human genome had not actually been finished. The key caveat appears early on in both papers, e.g. “A 2.91-billion base pair (bp) consensus sequence of the euchromatic portion of the human genome was generated”. Heterochromatin, excluded from these initial drafts because of the difficulty of mapping, cloning, and assembling such sequences, makes up upwards of 10% of the genome, and that missing fraction has been underappreciated ever since. Today, the latest human genome reference (GRCh38) still contains 161 Mbp of “unknown” sequence constituting 5% of the genome.
Now, twenty years later, we are finally able to fill in the blanks thanks to a confluence of new sequencing technologies from PacBio and Oxford Nanopore. Within the past year, the T2T consortium assembled the first complete human chromosomes, Chromosome X and Chromosome 8, using Nanopore ultra-long (UL) sequencing as a backbone and polishing that sequence with PacBio and Illumina. However, the recent release of PacBio’s HiFi technology led us to revise our recipe. In the assembly presented here, we first constructed a highly-accurate assembly graph using PacBio HiFi reads and then resolved any structural ambiguities with the help of Nanopore UL reads. The following image shows a Bandage visualization of our HiFi string graph for the CHM13 genome, with most chromosomes resolved as individual components.
After resolution of the remaining “tangles” with the help of Nanopore UL reads, the sequence of each complete chromosome was obtained via a consensus of HiFi reads taken from the corresponding traversal of the graph. This approach allowed us to reach T2T continuity for all chromosomes, while retaining the accuracy of PacBio HiFi throughout. Nanopore data also helped by patching a few regions of the genome that PacBio failed to sequence due to an apparent sequencing bias in GA-rich repeats. Similar graph-based hybrid approaches have previously been used to combine Illumina and long-read sequencing data for microbial genomes, but this has not been possible for human genomes due to the larger genome size and higher repeat complexity. However, because HiFi reads are both long and accurate, the complexity of the resulting HiFi graph is tremendously reduced compared to Illumina, allowing for complete resolution of the remaining tangles via Hamiltonian paths and, in the worst cases, Nanopore threading.
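For intuition only, the small Python sketch below enumerates Hamiltonian paths (paths visiting every node exactly once) through a toy tangle by backtracking. The two candidate traversals it prints are exactly the kind of ambiguity that Nanopore UL threading resolves; this is not the consortium’s actual code, and the real graphs are string graphs carrying orientation and overlap information:

def hamiltonian_paths(graph, path):
    """Yield every extension of `path` that visits all nodes exactly once."""
    if len(path) == len(graph):
        yield list(path)
        return
    for nxt in graph[path[-1]]:
        if nxt not in path:
            path.append(nxt)
            yield from hamiltonian_paths(graph, path)
            path.pop()

# A toy tangle: unique flank A enters repeat node R, which can exit to B or C.
graph = {"A": ["R"], "R": ["B", "C"], "B": ["C"], "C": ["B"]}

for start in graph:
    for p in hamiltonian_paths(graph, [start]):
        print(" -> ".join(p))  # prints A -> R -> B -> C and A -> R -> C -> B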
We estimate that the consensus quality of our HiFi-based assembly exceeds Q60 (less than 1 error per million bases), with most remaining errors localized to homopolymers, which is a known issue for both PacBio and Nanopore sequencing. To correct homopolymer errors and further improve quality, we mapped the raw PacBio, Nanopore, and Illumina reads to our initial assembly and called variants using DeepVariant and Sniffles. Considering only the most confident variant calls (likely to be assembly errors, rather than heterozygosity), we made a total of 4 structural corrections and 993 small variant corrections. We estimate that the quality of our polished assembly approaches Q70 (1 error per 10 million bases), with no known structural errors. We plan to continue scrutinizing the assembly over the coming months and will generate updated versions to correct any errors identified.
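For reference, these Q values are Phred-scaled error rates, so the figures quoted above follow directly from the definition:

import math

def phred(errors, bases):
    """Phred scale: QV = -10 * log10(error rate)."""
    return -10 * math.log10(errors / bases)

print(phred(1, 1_000_000))   # 60.0, i.e. Q60 is one error per million bases
print(phred(1, 10_000_000))  # 70.0, i.e. Q70 is one error per ten million bases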
Both the v0.9 pre-polished and v1.0 polished assemblies are now freely available for download. The CHM13 cell line possesses a 46,XX karyotype with almost complete homozygosity between chromosome pairs. As such, the v1.0 assembly contains 23 chromosomes (no ChrY) and 1 mitochondrial genome totaling 3,045,441,522 bp of assembled sequence. A small number of heterozygous variants were observed in the genome and will be fully cataloged in a future release. Only the 5 rDNA arrays remain unfinished. Some full and partial rDNA units are assembled on each of the 5 acrocentric p-arms, but the centers of these arrays are currently represented by a total of 11.5 Mbp of Ns in the assembly (on Chromosomes 13, 14, 15, 21, 22). Because the rDNA arrays are near-identical tandem repeats, the content of these arrays is known and only the contained variants remain to be determined. We expect to finish these arrays and add a Chromosome Y (from a different cell line) in the coming year.
Due to our rapid release timeline, many details of our methods and validation have been omitted here. We plan to post a preprint fully describing this project in the coming months, and our consortium is in the process of characterizing these newly uncovered regions of the genome. We have freely released all of our raw data and assemblies without restriction, but ask as a courtesy that you contact us if you would like to contribute analyses prior to our initial publication of results in approximately 6 months. The T2T is an open consortium and all are welcome to join our effort to generate the first truly complete assembly of a human genome. If you would like to join us, please contact T2T co-chairs Adam Phillippy adam.phillippy@nih.gov and Karen Miga khmiga@ucsc.edu to be added to our mailing list.
Who is the T2T consortium? You can find a continually updated list of contributors on our members page. I would like to sincerely thank everyone involved for their dedication over these past few months. An incredible amount of work was accomplished in a very short time. Thanks also to our industry partners at PacBio, Oxford Nanopore, Arima Genomics, Amazon Web Services, and DNAnexus who helped enable this work.
I would especially like to thank the working group co-chairs who helped me organize this summer’s finishing workshop: Sergey Nurk (assembly), Karen Miga (satellite DNAs), Arang Rhie (polishing and validation), Mark Diekhans (browser and annotation), Mitchell Vollger (segmental duplications), and Justin Zook (variant calling). Sergey Nurk deserves special credit for developing the HiFi assembly methods that enabled such rapid progress this year.
Lastly, I would like to acknowledge an abbreviated list of software that was critical to the success of this project: HiCanu, Miniasm, Winnowmap, Minimap, GraphAligner, MUMmer, CentroFlye, TandemTools, StringDecomposer, Bandage, GFA, IGV, DeepVariant, Merqury, and SDA. In particular, the contributions of Heng Li in terms of both tools and formats have been invaluable. Never forget that the field of genomics is entirely dependent on free software and the developers of these tools deserve endless thanks and support!
Stay tuned for much more to come from the T2T consortium this year as we begin to analyze these new sequences of the human genome. For now I will just leave you with a teaser dot plot of all ≥50 bp exact repeats within a novel 3 Mbp region on the short arm of human Chromosome 14, which I think looks quite beautiful (fwd in purple, revcomp in blue).
We started by re-calling the raw data using Albacore v2.1 with default parameters. The total coverage increased from 37X to 41X. An extra 4X of coverage for free from a software update, not bad! The average read length also increased from 7.3 to 8.1 kbp. Re-assembling with Canu 1.6 gave an improved NG50 of 10.2 Mbp and required approximately 150k cpu hours, as in the paper. We also assembled the genome using a combination of Canu 1.7 read correction with WTDBG contigging. This further increased the NG50 to 12.4 Mbp due to improved read correction in Canu 1.7, and required only 30k cpu hours thanks to the speed of WTDBG. There is still room for improvement with more coverage or longer reads, since the most continuous human assemblies are over 20 Mbp. The Cliveome is currently the most continuous, with an NG50 of 29 Mbp from a high-coverage combination of MinION and GridION data.
The Canu + WTDBG assembly is more continuous than either a Miniasm assembly (NG50=6.7 Mbp, 8k cpu hours with Racon) or a WTDBG-only assembly (NG50=8.7 Mbp, 0.5k cpu hours), neither of which performs read correction. Thus, Canu’s read correction does appear to benefit assembly and we plan to make this Canu + WTDBG pipeline available in a future Canu release. In the meantime, users can run the Canu correction module and feed its output directly into WTDBG for a faster assembly option.
We also evaluated the base quality of the new Canu + WTDBG assembly. The identity is 98.94%, up from the 95.94% we previously reported. This also improves upon the 97.80% identity we reported for a chromosome 20 assembly of Scrappie-called reads (Jain et al. 2018). After two rounds of Nanopolish with its new “CpG methylation” mode, the Canu + WTDBG assembly identity improved to 99.76%. We still see a deletion bias and a high fraction of short indels, but a peak for Alu indels is clearly visible now, whereas it was not before. We are encouraged by these improvements.
However, 99.76% is still a bit disappointing. It is much better than 95.94%, but represents an error about every 400 bp. Commonly, Illumina data is used to polish away this remaining 0.25% of error, but this has its own issues, especially for heterozygous genomes and complex repetitive regions where short-read mapping is a challenge.
Because parental data is available for GM12878, we also attempted complete haplotype assembly using our recently described trio binning approach (Koren et al. 2018). Using TrioCanu, we classified the nanopore reads for GM12878 into maternal and paternal haplotype bins prior to assembly. Despite the lower coverage (40x Nanopore vs. 70x PacBio), we saw a similar classification rate (85% of bases classified by haplotype) and similar NG50 stats for the assembled haplotigs (1.3 Mbp paternal, 1.2 Mbp maternal). The identity for both haplotypes after two rounds of CpG Nanopolish was 99.24%. Consistent with our results on other datasets, we would expect both larger NG50s and higher identity with more coverage. In this case, Nanopolish is dealing with less than 20x coverage per haplotype.
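Conceptually, the binning step assigns each read to whichever parent contributes more of its haplotype-specific k-mers. A toy Python sketch of the idea follows; TrioCanu’s actual implementation adds filtering and scoring details beyond this:

K = 21  # typical k-mer size for classification

def kmers(seq, k=K):
    """All k-mers of a sequence, as a set."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(read, maternal_only, paternal_only, k=K):
    """Assign a read to the haplotype with more diagnostic k-mer hits."""
    km = kmers(read, k)
    m = len(km & maternal_only)
    p = len(km & paternal_only)
    if m > p:
        return "maternal"
    if p > m:
        return "paternal"
    return "unclassified"  # no signal either way

# Tiny demo with k=3; real runs derive the parent-specific sets from
# Illumina data (maternal_kmers - paternal_kmers, and vice versa).
maternal_only = {"ACG", "CGT"}
paternal_only = {"TTA", "TAC"}
print(classify("AACGTTT", maternal_only, paternal_only, k=3))  # maternal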
We also aligned the two nanopore haplotypes to one another to call structural variants and compared these results to the same analysis performed with PacBio:
There are again more short indels than expected in the Nanopore assembly (presumably due to base-calling biases and the lower depth of coverage). In comparison, the PacBio haplotypes show more concentrated SV peaks at the typical Alu and LINE sizes (300 bp and 6 kbp). Repeating the MHC analysis from our trio binning preprint, the HLA typing genes in both nanopore haplotypes are correct at G-level resolution and properly phased, with an average edit distance of 1 bp per gene (12 errors total). This is better than the 10x Genomics result but worse than the PacBio result presented in the preprint.
Highlighting the difficulty of polishing complex regions with Illumina data, attempting to run Pilon on each nanopore haplotype using the parental Illumina data actually reduced the quality and introduced additional errors in several MHC genes. However, restricting Pilon to only correct indels did fix all typing gene errors and yielded a final consensus accuracy of 99.92%. (Note that this experiment was only for evaluative purposes, and naive polishing with parental data runs the risk of masking de novo variants in the child.)
Oxford Nanopore continues to make impressive strides. Recent software improvements (primarily for base calling) have almost doubled the assembly NG50 size for the GM12878 assembly, without the addition of any new data. In parallel, we are continuing to work on Canu performance and in the meantime recommend Canu + WTDBG as a good compromise between speed and accuracy. Extremely continuous assemblies are clearly possible using long-read nanopore data alone. The last remaining hurdle is final consensus accuracy, which continues to come up short of PacBio and requires additional polishing using complementary data (e.g. Illumina) to reach “reference” quality.
Nanopore data also appears well-suited for trio binning. For long-read reference genome assembly, we now suggest collecting parental Illumina data wherever possible. This approach greatly simplifies the assembly of heterozygous genomes and yields accurate representations of both haplotypes. If performed on a hybrid cross, this method will yield two reference genomes from a single sequencing project, as we recently demonstrated using a cattle F1 cross. Human trios can be harder to come by, but the method is applicable and works well for reconstructing heterozygous structural variation.
Finally, a note of caution on Illumina polishing with Pilon. While it can improve consensus statistics overall, it can worsen the assembly in some regions, especially complex repetitive sequence like the MHC. If using Pilon, we recommend limiting the allowable edits and focusing on the primary nanopore error mode (indels). Using a purpose-built variant caller instead, such as FreeBayes, is also advisable but requires custom editing of the consensus. A more sophisticated approach to hybrid Nanopore + Illumina assembly would be ideal, but we have not yet seen a satisfactory solution. Ultimately, resolving the last bit of nanopore error without an additional technology would be best, and it will be exciting to see if Oxford Nanopore can achieve this with future updates to sequencing and/or base calling methods.
You can find a description of our assembly methods in the Jain et al. 2017 paper and the trio binning paper Koren et al. 2018. Sergey will be presenting this work at London Calling this week.
Update 2018-05-24: The basecalled data is available from the NA12878 consortium.
Update 2018-05-29: We’ve made the basecalled data split into maternal, paternal, and unclassified bins available.
Update 2018-10-28: We’ve made the Canu + Nanopolish + Pilon + Racon assembly, which is 99.99% identity, available.
Download the Canu 1.7 + WTDBG + Nanopolish assembly or the maternal and paternal assemblies.
We have been working on the containment problem since this spring, but were holding back a new Mash release until we could get the accompanying paper written. However, David Koslicki and Hooman Zabeti recently posted a preprint with similar ideas, so we decided to make the new Mash release available and write this short blog in the interim. The following is a brief description of “containment” and the available techniques for computing it.
Consider two k-mer sets A and B with the above Venn diagram. Biologically, this could represent a plasmid A contained in a genome B, or a genome A contained in a metagenome B. In either case, the resemblance of these two sets is low because there is a large amount of B that is not in A, yet A is perfectly contained in B. This distinction is reflected in the denominators of the respective formulas:

resemblance(A, B) = |A ∩ B| / |A ∪ B|
containment(A, B) = |A ∩ B| / |A|
Thus, containment reports what fraction of A’s k-mers also appear in B. As we describe below, this measure can be estimated very efficiently and has some obvious applications in metagenomics.
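As a toy Python example of the difference (hypothetical 3-mers standing in for real k-mer sets):

def resemblance(a, b):
    """Jaccard index: |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b)

def containment(a, b):
    """Fraction of A's k-mers also present in B: |A intersect B| / |A|."""
    return len(a & b) / len(a)

plasmid = {"ACG", "CGT", "GTA"}                  # toy k-mer set A
genome = plasmid | {"TAC", "ACC", "CCG", "CGG"}  # B fully contains A

print(resemblance(plasmid, genome))  # 3/7 ~ 0.43: low resemblance
print(containment(plasmid, genome))  # 3/3 = 1.0: perfect containment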
Mash uses a MinHash bottom sketch to rapidly estimate the resemblance of two genomes (or metagenomes). A bottom sketch is simply a set of the s smallest hash values seen after hashing all k-mers in a genome. Given the mathematical similarity between resemblance and containment, it is tempting to use the same structure to estimate both. However, as Broder noted in his original paper, a bottom sketch is poorly suited to estimate containment. The reason is illustrated here with three genomes {a,b,c} that are components of a metagenomic mixture:
A bottom sketch of four elements is shown (in red) for the three toy genomes and the mixture. In this case we would miss the matching ‘10’ from genome c because it is not included in the bottom sketch of the mixture. Smaller sets, like c, tend to have a wider range of values in their bottom sketch since there are fewer hashes to choose minimums from. Because all the sketches are of a fixed size, these larger hash values get bumped out of the larger mixture sketch. To account for this, Broder originally proposed using modulo operations to build sketches meant for containment estimation. In this way, individual sketches can grow with the size of the sets. For example, using modulo 2 would build sketches of only even hashes (in red):
With a sketch of modulo 2, the match of ‘10’ between c and the mixture would be recovered. However, now the sketch sizes are no longer fixed and grow linearly with genome size. This sacrifices much of the memory and storage savings of the MinHash technique.
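To make the two constructions concrete, here is a small Python sketch using the toy hash values from the examples above (real implementations would first hash the k-mers; Mash uses MurmurHash3):

def bottom_sketch(hashes, s):
    """Fixed-size sketch: keep only the s smallest hash values."""
    return set(sorted(hashes)[:s])

def modulo_sketch(hashes, m):
    """Variable-size sketch: keep every hash divisible by m; grows with the set."""
    return {h for h in hashes if h % m == 0}

genome_c = {3, 7, 10, 18}
mixture = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 18}

print(bottom_sketch(genome_c, 4) & bottom_sketch(mixture, 4))  # {3}: the matching 10 is lost
print(modulo_sketch(genome_c, 2) & modulo_sketch(mixture, 2))  # {10, 18}: recovered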
The Koslicki and Zabeti preprint includes a nice exposition of why MinHash sketches are problematic for estimating containment, and provides examples of how the relative error of this technique explodes when the size of A is much smaller than B. However, as Broder stated in his 1997 paper, MinHash was never intended for this purpose (some words substituted to match the terminology of this post):
The [MinHash sketch] has the advantage that it has a fixed size, but it allows only the estimation of resemblance. The size of [the modulo sketch] grows as [the set] grows, but allows estimate of both resemblance and containment … The disadvantage of this approach is that the estimation of the containment of very short documents into subtantially [sic] larger ones is rather error prone due to the paucity of samples.
This last point is why the modulo approach is problematic for metagenomic applications (e.g. finding a virus in a metagenome). A small modulus would be required to detect the virus, and as a result the sketch of the metagenome would be huge. Instead of the modulo approach, Koslicki and Zabeti take a MinHash sketch of the k-mers from genome A and a bloom filter of k-mers from (meta)genome B. To estimate the containment of A in B, one simply looks up all of the hash values from the sketch of A in the bloom filter of B. From this, one can estimate containment after accounting for false positives arising from the bloom filter membership query. Here it would have been nice to see a comparison versus Broder’s modulo approach. To ensure a low false-positive rate, bloom filters can require a large amount of space. For the same error bounds, does the bloom filter save substantial space over the modulo approach? Unfortunately, the preprint does not reference the modulo technique.
Screen is a new command offered by the Mash toolkit that also answers the containment question. Many of our users wanted to use Mash to quickly determine the composition of their sequencing runs (e.g. for contamination screening), but we knew that the standard MinHash approach was not ideal for this. Like Koslicki, we observed that it was possible to repurpose the reference genome sketches to answer this question, and we already had a sketch database for all of RefSeq. However, rather than use a bloom filter, we use an exact, streaming method to identify which sketch values are found in the sample. Since each sketch is effectively a random sample of the k-mers in a genome, the containment of each genome A in B is simply the fraction of matched values in the sketch of A.
Implementing the streaming approach is straightforward. Because the sketches themselves are quite small (the sketch database for all of RefSeq is only around 100 MB), it is possible to store a hash table of all sketch values in memory. Then, a set of sequencing reads can be streamed as input and every k-mer quickly checked against this hash table. Each time a k-mer is seen that maps to a sketch element, a counter is incremented using atomic types to support multi-threading. The resulting count table allows us to estimate the containment of every genome in the database, and also provides a rough depth of coverage estimate for each. One advantage of this approach is that it is “online”, meaning the containment and coverage values can be continuously updated during a real-time sequencing run.
Overview of Mash Screen. (A) A set of reference genomes is processed to produce a (B) sketch database. (C) A hash table of all sketch elements is used to count occurrences in (D) a streaming sequencing mixture. For each genome, (E) the fraction of the sketch observed in the mixture produces (F) a containment estimate.
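A minimal single-threaded Python sketch of this streaming lookup follows (Mash itself is C++, hashes canonical k-mers with MurmurHash3, and uses atomic counters for multi-threading; SHA-1 stands in here, and multiplicity tracking is omitted for brevity):

import hashlib
from collections import defaultdict

K = 21

def kmer_hash(kmer):
    """Stand-in for MurmurHash3 of the canonical k-mer."""
    return hashlib.sha1(kmer.encode()).digest()

def screen(sketches, reads, k=K):
    """sketches: genome name -> set of sketch hash values (s per genome).
    Returns genome name -> fraction of its sketch observed in the reads."""
    lookup = defaultdict(list)  # hash value -> genomes whose sketch holds it
    for name, sketch in sketches.items():
        for v in sketch:
            lookup[v].append(name)
    seen = defaultdict(set)  # genome -> distinct sketch hashes observed
    for read in reads:
        for i in range(len(read) - k + 1):
            v = kmer_hash(read[i:i + k])
            for name in lookup.get(v, ()):
                seen[name].add(v)
    return {name: len(seen[name]) / len(sketch)
            for name, sketch in sketches.items()}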
This new function also includes support for translated blastx-style operations. When calling screen against a protein database, Mash will automatically perform six-frame translation on the input nucleotide sequences. This could be handy for quickly computing containment of viruses or individual genes within metagenomes.
The latest Mash release can be grabbed from here. The new screen operation is compatible with the existing RefSeq sketch database, or a custom database can be created for any collection of sequences (nucleotide or protein). A set of sequencing reads can then be streamed against this database requiring just a few minutes per thread per gigabase of reads:
mash screen RefSeqSketches.msh reads1.fastq reads2.fastq > out
Also check out David Koslicki’s CMash for the bloom filter implementation of containment. A potential advantage of the bloom filter approach is that it could enable indexed search of a bunch of metagenomes. For example, given an indexed database of metagenomes, one could ask the question “In which metagenomes has this new genome been seen before?” Conversely, due to its streaming nature, Mash screen is best suited to answer the question “Which genomes are contained in this new metagenome?” Both tools should be handy for searching large databases, quick contamination checks, and as a pre-filter for read classification. For example, one could run a containment check first and then map reads only to those genomes identified.
Here are the first 10 lines of output for SRA sample SRS1041159 (tongue dorsum) “screened” against all of RefSeq genomic:
0.997007 939/1000 12 0 Human endogenous retrovirus K113 (viruses)
0.995206 904/1000 53 0 Neisseria flavescens (b-proteobacteria)
0.995206 904/1000 27 0 Haemophilus sp. HMSC061E01 (g-proteobacteria)
0.994784 896/1000 28 0 Haemophilus sp. HMSC068C11 (g-proteobacteria)
0.994199 885/1000 27 0 Rothia sp. HMSC061C12 (high GC Gram+)
0.99339 870/1000 27 0 Rothia sp. HMSC065C12 (high GC Gram+)
0.992899 861/1000 29 0 Rothia sp. HMSC065B04 (high GC Gram+)
0.992844 860/1000 28 0 Haemophilus parainfluenzae (g-proteobacteria)
0.992844 860/1000 25 0 Haemophilus parainfluenzae (g-proteobacteria)
0.992789 859/1000 30 0 Rothia sp. HMSC066G02 (high GC Gram+)
The output columns are [identity, shared-hashes, median-multiplicity, p-value, query-ID]. There are many more lines following, including human further down the list:
0.955446 384/1000 1 0 GCF_000001405.36_GRCh38.p10
A couple of things to note. First, the identity score is not the true identity of a genome versus what is in your sample, but rather what fraction of bases are shared between the genome and your sequencing reads (estimated from the fraction of shared k-mers). Sequencing errors and gaps in coverage will reduce the identity estimate. For example, since the human genome is in the sample at low coverage, not all human k-mers are found and the corresponding identity score is reduced.
Second, Mash Screen is not a metagenomic profiler in the traditional sense. When using a comprehensive sketch database there is typically a lot of redundancy in the output. The tool is simply reporting every genome in the database that shares a high fraction of k-mers with the sample. In this example, some microbial genomes are at high coverage, so hits pop up for all similar strains (e.g. multiple Haemophilus genomes). It is very unlikely that all reported strains are actually present, but Mash is not yet attempting to answer which ones are. It would be great to see additional methods developed to process containment scores, reduce the output redundancy, and report accurate compositional estimates for metagenomes. One easy approach is a “winner take all” model, like sourmash implements. This is now available in Mash as an option to the containment function, but much better methods are possible and left for future work. Check out MetaPalette for some possible inspiration.
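For intuition, a winner-take-all pass might look like the toy Python sketch below (not Mash’s or sourmash’s actual implementation): each observed hash is credited only to the best-supported genome containing it, so near-identical strains stop double counting.

def winner_take_all(shared):
    """shared: genome name -> set of sketch hashes observed in the sample."""
    order = sorted(shared, key=lambda g: len(shared[g]), reverse=True)
    claimed = set()   # hashes already awarded to a better-supported genome
    scores = {}
    for g in order:
        mine = shared[g] - claimed
        claimed |= mine
        scores[g] = len(mine)
    return scores

shared = {"strainA": {1, 2, 3, 4}, "strainB": {2, 3, 4}, "virus": {9}}
print(winner_take_all(shared))  # {'strainA': 4, 'strainB': 0, 'virus': 1}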
We hope you will find these tools useful!