Human Genome Reference Program
The human genome reference is used by essentially all researchers who need to align and assemble experimental or patient genome sequence data. It also serves as a consensus coordinate system for reporting results.
Participants and Structure
Human Genome Reference Center
- Washington University, St. Louis
Principal Investigators (PI): Ting Wang (Contact), Paul Flicek, Ira Hall, Benedict Paten
The Human Genome Reference Center at Washington University in St. Louis serves as the coordinating center. They maintain and update the reference sequence; support state-of-the-art reference representations; and educate and coordinate with the research community (including clinicians and basic research scientists).
High Quality Reference Genomes
- University of California, Santa Cruz
Principal Investigators (PI): David Haussler (Contact), Evan Eichler, Ira Hall
The High-Quality Human Reference Genomes Center at the University of California, Santa Cruz collects additional DNA samples from populations not represented in the current reference, including the creation of cell lines. They will generate at least 350 high-quality reference genome sequences, a subset of which will be finished, telomere-to-telomere genome sequences. The center also disseminates the data and works closely with the other Human Genome Reference Program components.
Genome Reference Representations
- Dana-Farber Cancer Institute
Principal Investigators (PI): Heng Li (Contact), Benedict Paten
Project Title: The construction and utility of reference pan-genome graphs
- University of Southern California
Principal Investigators (PI): Mark Chaisson (Contact), Evan Eichler, Tobias Marschall
Project Title: Representing structural haplotypes and complex genetic variation in pan-genome graphs
- Stanford University
Principal Investigators (PI): Hanlee Ji (Contact), Tsachy Weissman
Project Title: K-mer indexing for pan-genome reference annotation
The Genome Reference Representations (GRR) projects support research and development for a next-generation genome reference representation that can capture all human genome variation and support research on the full diversity of populations.
Informatics Tools for the Pangenome
- Purpose: To develop informatics tools that can apply the new pangenome representation for analysis and enable use of the high-quality genome reference by clinical and basic researchers.
Technology Development for Complete Genome Sequencing
- NHGRI will accept applications for Technology Development for Complete Genome Sequencing on an ongoing basis (see NOT-HG-19-011)
- Purpose: Develop technologies for complete de novo sequencing of phased diploid human genomes.
- Washington University, St. Louis
Ever since the human reference originated with the completion of the International Human Genome Project, there has been a need to maintain and improve it and to make it available to the community. This work has included resolving error reports, adding information to the reference from new high-quality genomes as they became available, and developing ways to represent alternative haplotype information derived from them. Improved or updated reference versions are curated and released to the community.
On March 1, 2018, NHGRI convened a web meeting of over 65 basic research, clinical, and bioinformatic scientists to discuss scientific opportunities for the genome reference. The meeting addressed key research and resource opportunities for improving the human reference; activities necessary to keep the reference relevant and useful; clinical and research community needs (including education); related resources; and collaborations.
The high-level conclusion of the meeting was that the current version of the human reference does not adequately represent human haplotype variation, that the existing tools to include alternative haplotype information in analyses are not well-used, and that there is an opportunity to significantly improve the human reference by developing it into a “pan-genome”. One goal of a pan-genome reference is to represent as much of human haplotype variation as possible, so that any newly sequenced experimental or patient haplotype will be readily alignable to the reference. Such a reference would include the multiple types of human genomic variation, phased in chromosomal regions. Building it would require adding many more high-quality human genome assemblies chosen to maximize haplotype diversity, for instance by incorporating samples collected under 1000 Genomes. It would also require adopting better ways of representing the data (e.g., as a genome graph) and developing new informatics tools to make use of the new reference.
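To make the genome graph idea concrete, the following is a minimal sketch, not the program's actual data model. It assumes a toy representation in which nodes hold short sequences and each haplotype is stored as a path through the nodes. The class name `SequenceGraph`, its methods, and the example sequences are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceGraph:
    """Toy sequence graph (hypothetical): nodes hold sequence, haplotypes are paths."""
    nodes: dict = field(default_factory=dict)   # node id -> DNA sequence
    edges: set = field(default_factory=set)     # (from node id, to node id)
    paths: dict = field(default_factory=dict)   # haplotype name -> list of node ids

    def add_node(self, node_id: int, sequence: str) -> None:
        self.nodes[node_id] = sequence

    def add_path(self, name: str, node_ids: list) -> None:
        """Register a haplotype as a walk; edges are created along the walk."""
        for a, b in zip(node_ids, node_ids[1:]):
            self.edges.add((a, b))
        self.paths[name] = list(node_ids)

    def spell(self, name: str) -> str:
        """Return the linear sequence of a haplotype by concatenating its nodes."""
        return "".join(self.nodes[n] for n in self.paths[name])


# Made-up example: two haplotypes that share most nodes but differ by
# one single-nucleotide variant and one insertion.
graph = SequenceGraph()
graph.add_node(1, "ACGT")
graph.add_node(2, "A")      # allele carried by the first haplotype
graph.add_node(3, "G")      # alternative allele at the same site
graph.add_node(4, "TTC")
graph.add_node(5, "GGGG")   # insertion present in only one haplotype
graph.add_node(6, "CA")

graph.add_path("hap_ref", [1, 2, 4, 6])
graph.add_path("hap_alt", [1, 3, 4, 5, 6])

print(graph.spell("hap_ref"))
print(graph.spell("hap_alt"))
```

In this sketch shared sequence is stored once, and each additional haplotype adds only the nodes and edges where it differs. That is the property that lets a graph hold many assemblies without favoring a single linear sequence.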
As a result of these discussions, NHGRI will re-organize and re-focus its contribution to the genome reference to create a multi-component Human Genome Reference Program (HGRP) intended to enable an improved human genome reference for the community, and to foster its long-term sustainability and improvement.
Based on the Concept for this program presented to the National Council on Human Genome Research, the components will be:
- A Human Genome Reference Center (HGRC; RFA-HG-19-004)
- High Quality Human Reference Genomes (HGRQ; RFA-HG-19-002)
- Genome Reference Representations (GRR; RFA-HG-19-003)
- Informatics tools for use of the human genome reference (see Concept documents)
- Technology development for complete sequencing of genomes (NOT-HG-19-011)
NHGRI manages the HGRP as a consortium. Grantees for the Human Genome Reference Center, High Quality Reference Genomes, and Genome Reference Representations components interact closely on several aspects of the program such as prioritizing new samples, resolving reference errors or ambiguities, establishing quality metrics, transitioning to graph representations or new reference “builds”, and others.
NHGRI believes that the human reference will be more broadly useful if it can be integrated with, or is part of an effective ecosystem with, other existing databases and resources that present human variation information in different contexts (e.g., ClinVar, EGA, Human Genome Structural Variation Consortium, gnomAD, Bravo).
Data Release and Access Policies
NHGRI data release policies for genome sequence data evolved from the original Bermuda and Ft. Lauderdale policies, which were suited for the Human Genome Project data and organismal sequence data. With the advent of projects involving large numbers of samples from human subjects, this area is under continuous evaluation, much of it at the NIH level rather than the NHGRI level.
See: NOT-OD-13-119 for a discussion of the latest NIH policy proposals in this area.
Select Working Groups
| Working Group | Chair | Role |
| --- | --- | --- |
| Assembly Team | Evan Eichler | Generate “high quality production grade” assemblies; generate “finished” T2T assemblies; QC and validate assemblies; develop methods and pipelines |
| Pangenomes | Ira Hall | Variant calling; pangenome framework, construction, and tools |
| Resource Improvement and Maintenance | Paul Flicek | Functional annotation; handling error reports; resolving errors through targeted re-assembly and/or sequencing |
| Resource Sharing and Outreach | Ting Wang | Resource sharing; outreach & education; browsers |
| Samples | Eimear Kenny | Collect, identify, and prioritize samples for inclusion in the project |
| Technology and Production | Bob Fulton | Coordinate data production across sites; develop, optimize, troubleshoot, and share protocols; engage with technology companies; test and adopt new technologies and protocols |
Last updated: June 10, 2022