National Institutes of Health U.S. Department of Health and Human Services
NHGRI Sequencing Goals
NOTE: The complete five-year plan was published in the October 23, 1998 issue of Science. The excerpt below of Goal 1 of the five-year plan is essentially identical to the same section that appeared in the draft version of the plan released at the open NHGRI Council meeting that occurred on 9/14/98, and is supplied as a necessary reference document for applicants to RFA HG-98-002.
RFA Research Network for Large-Scale Sequencing of the Human Genome
GOAL 1 - The Human DNA Sequence
Providing a complete, high-quality sequence of human genomic DNA to the research community as a publicly available resource continues to be the Human Genome Project's (HGP) highest priority goal. The enormous value of the human genome sequence to scientists, and the considerable savings in research costs that its widespread availability will allow, are compelling arguments for advancing the timetable for completion. Recent technological developments and experience with large-scale sequencing provide increasing confidence that it will be possible to complete an accurate, high-quality sequence of the human genome by the end of 2003, two years sooner than previously predicted. The National Institutes of Health (NIH) and the Department of Energy (DOE) expect to contribute 60 to 70 percent of this sequence, with the remainder coming from the Wellcome Trust-funded effort at the Sanger Centre and other international partners.
This is a highly ambitious, even audacious, goal, given that only about 6 percent of the human genome sequence has been completed thus far. Sequence completion by the end of 2003 is a major challenge, but within reach and well worth the risks and effort. Realizing the goal will require an intense and dedicated effort, and a continuation and expansion of the collaborative spirit of the international sequencing community. Only sequence of high accuracy and long-range contiguity will allow a full interpretation of all the information encoded in the human genome. However, in the course of finishing the first human genome sequence by the end of 2003, a working draft covering the vast majority of the genome can be produced even sooner, within the next three years. Though that sequence will be of lower accuracy and contiguity, it will nevertheless be very useful, especially for finding genes, exons and other features through sequence searches. These uses will assist a large number of current and future scientific projects and bring them to fruition much sooner, resulting in significant time and cost savings. However, because this sequence will have gaps, it will not be as useful as finished sequence for studying DNA features that span large regions or require high sequence accuracy over long stretches.
Availability of the human sequence will not end the need for large-scale sequencing. Full interpretation of that sequence will require much more sequence information from many other organisms, as well as information about sequence variation in humans. Thus, the development of sustainable, long-term sequencing capacity is a critical objective of the HGP. Achieving the goals below will require a capacity of at least 500 megabases of finished sequence per year by the end of 2003.
Finish the complete human genome sequence by the end of 2003. The year 2003 is the 50th anniversary of the discovery of the double helix structure of DNA by James Watson and Francis Crick. There could hardly be a more fitting tribute to this momentous event in biology than the completion of the first human genome sequence in this anniversary year. The technology to do so is at hand, although further improvements in efficiency and cost effectiveness will be needed, and more research is needed on approaches to sequencing structurally difficult regions [1]. Current sequencing capacity will have to be expanded two- to threefold, but this should be within the capability of the sequencing community.
Reaching this goal will significantly stress the capabilities of the publicly funded project and will require continued enthusiastic support from the administration and the U.S. Congress. But the value of the complete, highly accurate, fully assembled sequence of the human genome is so great that it merits this kind of investment.
Finish one-third of the human DNA sequence by the end of 2001. With the anticipated scale-up of sequencing capacity, it should be possible to expand finished sequence production to achieve completion of 1 gigabase of human sequence by the worldwide Human Genome Project by the end of 2001. As more than half of the genes are predicted to lie in the gene-rich third of the genome, the finishing effort during the next three years should focus on such regions if this can be done without incurring significant additional costs. A convenient, but not the only, strategy would be to finish BAC clones detected by cDNA or EST sequences. In addition, a rapid peer-review process should be established immediately for prioritizing specific regions to be finished, based on the needs of the international scientific community. This process must be impartial and must minimize disruptions to the large-scale sequencing laboratories.
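The milestones above imply a rough production schedule. The Python sketch below is a back-of-the-envelope calculation, not part of the plan itself. It assumes a genome of about 3 gigabases (implied by equating one-third of the sequence with 1 gigabase), takes the 6 percent completion figure as the position at the end of 1998, and spreads production evenly between milestones. The function name and the even-spread assumption are illustrative only.

```python
# Back-of-the-envelope schedule implied by the plan's milestones.
# Assumptions (not stated in the plan): genome size of about 3,000 Mb,
# inferred from "one-third" = 1 gigabase; 6 percent finished at the end
# of 1998; production spread evenly over the years between milestones.

GENOME_MB = 3000.0


def average_rate_mb_per_year(start_mb, target_mb, years):
    """Average finished-sequence output (Mb/year) needed to go from start to target."""
    if years <= 0:
        raise ValueError("years must be positive")
    return (target_mb - start_mb) / years


finished_1998_mb = 0.06 * GENOME_MB  # about 180 Mb finished so far
rate_1999_2001 = average_rate_mb_per_year(finished_1998_mb, 1000.0, 3)
rate_2002_2003 = average_rate_mb_per_year(1000.0, GENOME_MB, 2)

print(f"Finished at end of 1998: {finished_1998_mb:.0f} Mb")
print(f"Needed 1999-2001: {rate_1999_2001:.0f} Mb/year worldwide")
print(f"Needed 2002-2003: {rate_2002_2003:.0f} Mb/year worldwide")
```

Under these assumptions the worldwide effort would need to average roughly 270 Mb per year through 2001 and about 1,000 Mb per year afterwards. Because the finished sequence covers only the portion of the genome that can be stably cloned and sequenced, these figures are best read as upper estimates.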
To best meet the needs of the scientific community, the finished human DNA sequence must be a faithful representation of the genome, with high base-pair accuracy and long-range contiguity. Specific quality standards that balance cost and utility have already been established. One of the most important uses for the human sequence will be comparison with other human and non-human sequences. The sequence differences identified in such comparisons should, in nearly all cases, reflect real biological differences rather than errors or incomplete sequence. Consequently, the current standard for accuracy, an error rate of no more than 1 base in 10,000, remains appropriate. While production of contiguous sequence without gaps is the goal, any irreducible gaps must be annotated as to size and position. To ensure that long-range contiguity of the sequence will be achievable, several contigs of 20 Mb or more should be generated by the end of 2001. These quality standards should be re-examined periodically; as experience in using sequence data is gained, the appropriate standards for sequence quality may change.
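To make the accuracy standard concrete, the following minimal Python sketch computes the largest number of erroneous bases the 1-in-10,000 standard would permit in a stretch of a given length. The rate and the 20 Mb contig size come from the text above; the function and variable names are illustrative.

```python
# Upper bound on erroneous bases under the finished-sequence standard
# (no more than 1 error per 10,000 bases, as stated in the text).

FINISHED_BASES_PER_ERROR = 10_000


def max_expected_errors(length_bp, bases_per_error=FINISHED_BASES_PER_ERROR):
    """Largest expected number of wrong bases in a stretch of length_bp."""
    return length_bp / bases_per_error


contig_bp = 20_000_000       # one 20 Mb contig
milestone_bp = 1_000_000_000  # the 1 gigabase milestone for 2001

print(max_expected_errors(contig_bp))     # 2000.0
print(max_expected_errors(milestone_bp))  # 100000.0
```

A 20 Mb contig that just meets the standard could therefore contain up to about 2,000 incorrect bases.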
Achieve coverage of at least 90 percent of the genome in a working draft based on mapped clones by the end of 2001. The current public sequencing strategy is based on mapped clones and occurs in two phases. The first, or shotgun, phase involves random determination of most of the sequence from a mapped clone of interest. Methods for doing this are now highly automated and efficient. Mapped shotgun data are assembled into a product ("working draft" sequence) that covers most of the region of interest but may still contain gaps and ambiguities. In the second, or finishing, phase, the gaps are filled and discrepancies resolved. At present, the finishing phase is more labor intensive than the shotgun phase. Already, partially finished, working-draft sequence is accumulating in public databases at about twice the rate of finished sequence.
Based on recent experience, the rate of production of working-draft sequence can be further increased. By continuing to scale up the production of finished sequence at a realistic rate, and further scaling up the production of working-draft sequence, the combined total of working-draft plus finished sequence will cover at least 90 percent of the genome at an accuracy of at least 99 percent by the end of 2001. Some areas of the genome are likely to be difficult to clone or not amenable to automated assembly because of highly repetitive sequence; thus, coverage is expected to fall short of 100 percent at this stage. If increased resources are available and/or technology improves, greater than 90 percent coverage may be possible.
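The difference between the working-draft target and the finished-sequence standard can be illustrated with a short calculation. This Python sketch takes the 90 percent coverage, 99 percent accuracy, and 1-in-10,000 figures from the plan and assumes a genome of about 3 gigabases; the names are hypothetical and the results are worst-case bounds, not predictions.

```python
# Working-draft target versus finished-sequence standard.
# From the plan: the draft covers at least 90 percent of the genome at an
# accuracy of at least 99 percent; finished sequence has at most 1 error
# per 10,000 bases.
# Assumption (not stated in the plan): genome size of about 3 gigabases.

GENOME_BP = 3_000_000_000


def worst_case_errors(covered_bp, bases_per_error):
    """Maximum erroneous bases when one base in bases_per_error is wrong."""
    return covered_bp // bases_per_error


draft_covered_bp = GENOME_BP * 90 // 100                      # 2.7 billion bases
draft_errors = worst_case_errors(draft_covered_bp, 100)       # 99 percent accurate
finished_errors = worst_case_errors(draft_covered_bp, 10_000)  # finished standard

print(draft_covered_bp, draft_errors, finished_errors)
print(draft_errors // finished_errors)  # ratio of the two error bounds
```

Over the same span, a draft that only just meets the 99 percent target could carry 100 times as many errors as finished sequence, which is why the draft is described as useful for finding genes but less suited to analyses that need high accuracy over long stretches.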
The individual sequence reads used to generate the working draft will be held to the same high-quality standards as those used for the finished genome sequence. Assembly of the working draft should neither reduce efficiency nor increase overall cost.
Recently, two private ventures announced initiatives to sequence a major fraction of the human genome, using strategies that differ fundamentally from the publicly funded approach. One of these ventures is based upon a whole genome shotgun strategy, which may present significant assembly problems [2]. The stated intention of this venture to release data on a quarterly basis creates the possibility of synergy with the public effort. If this privately funded data set and the public one can be merged, the combined depth of coverage of the working-draft sequence will be greater, and the mapping information provided by the public data set will provide critically needed anchoring for the private data. The NIH and DOE welcome such initiatives and look forward to cooperating with all parties that can contribute to more rapid public availability of the human genome sequence.
Make the sequence totally and freely accessible. The Human Genome Project was initiated because its proponents believed the human sequence is such a precious scientific resource that it must be made totally and publicly available to all who want to use it. Only the wide availability of this unique resource will maximally stimulate the research that will eventually improve human health. Public funding of the Human Genome Project is predicated on the belief that public availability of the human sequence at the earliest possible time will lead to the greatest public good. Therefore, NIH and DOE continue to endorse strongly the policy for human sequence data release adopted by the international sequencing community in February 1996 [3] and confirmed and expanded to include genomic sequence of all organisms in 1998 [4]. This policy states that sequence assemblies 1 to 2 kilobases in size should be released into public databases within 24 hours of generation, and that finished sequence should be released on a similarly rapid time scale.
[1] The finished genome sequence refers to the portion of human DNA that can be stably cloned and sequenced by current technology. The small proportion of highly repeated sequence represented by the centromeres and other constitutive heterochromatic regions of the genome may not be finished by 2003. In addition, it is possible that a small fraction of other parts of the genome may present unanticipated and serious challenges. Such regions are expected to be rare.
[2] J. C. Venter et al., Science 280, 1540 (1998). A whole genome shotgun strategy has been proposed previously (J. Weber and E.W. Myers, Genome Research 7, 401, 1997), but major concerns have been raised (P. Green, Genome Research 7, 410, 1997) about the difficulties expected in obtaining correct long-range contig assemblies. It will not be possible to evaluate the feasibility, impact, or quality of the product of this approach until more data are available, which is not expected for about 12 to 18 months.