National Institutes of Health U.S. Department of Health and Human Services
Large-Scale Sequencing Workshop Report
September 3-4, 1998
Strategy for accelerated production of a 'rough draft' sequence of human DNA
Introduction: the summer sequencing exercise
Data gathered during the summer sequencing exercise
Discussion and conclusions
Approaches for cost analysis for large-scale sequencing
Prioritizing regions of biological interest
1. Strategy for accelerated production of a rough draft sequence of human DNA
The Summer Sequencing Exercise
At the National Human Genome Research Institute-Department of Energy (NHGRI-DOE) five-year planning meeting (May 1998), participants recommended rapid production of a draft version of the human genome sequence, based on mapped, large-insert clones. During the past summer, the NHGRI-supported sequencing centers assessed strategies to generate a shotgun product that would constitute a 'working draft' of the human genome. Sequencing centers produced several versions of possible working drafts, using a sample of four BAC clones containing inserts whose finished sequence is already known and annotated. The sequence data from these four clones were evaluated by six scientists, who analyzed them from the point of view of the community of scientists who use genomic data. With the actual sequence available, the fraction of information that could be extracted from the various working draft versions could be determined. Several of the sequence producers also retrospectively analyzed data from previously completed projects. Presentation and discussion of these results were the focus of this P.I. meeting.
Data gathered during the summer sequencing exercise
Summary of reports of sequence producers
Results from the different centers agreed closely on how coverage relates to representation and to the number of contigs.
Results on representation agreed with theoretical expectations. At 3X coverage, at least 70-80 percent of the bases in contigs >1 kb are of high quality (the best reported at 3X was over 95 percent). At 4-6X coverage, 90-99 percent of the bases in contigs are of high quality. The definition of coverage should be standardized.
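The report does not say which theoretical expectation was used. One commonly used model is the Lander-Waterman (Poisson) approximation, in which the expected fraction of bases hit by at least one read at coverage c is 1 - e^(-c). The sketch below assumes that model; the function name is illustrative. Note that it counts a base as represented if any read covers it, which is a looser criterion than the "high quality" bases reported by the centers.

```python
import math

def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases covered by at least one read,
    under the Poisson (Lander-Waterman) approximation (an assumption here)."""
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)

# Coverage depths discussed in the report
for c in (3, 4, 5, 6):
    print(f"{c}X coverage: {expected_fraction_covered(c):.4f} of bases covered")
```

Under this model roughly 95 percent of bases are covered at 3X and more than 99 percent at 5X and above, which is in line with the upper end of the ranges the centers reported.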
The number of sequence contigs generally decreased with increased coverage, up to a point: at least two groups found that the number of contigs decreased as coverage increased from 3X to 6X, but that higher coverage brought little further decrease.
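The flattening of contig number at higher coverage can be illustrated with the same Lander-Waterman approximation, in which the expected number of contigs is N * e^(-c * sigma), where N is the number of reads and sigma = 1 - T/L for read length L and minimum detectable overlap T. This is a sketch under assumptions: the clone length, read length, and overlap values below are hypothetical, not figures from the workshop, and the model ignores repeats and cloning bias, so real assemblies will differ.

```python
import math

def expected_contigs(clone_len: int, read_len: int,
                     coverage: float, min_overlap: int = 0) -> float:
    """Expected contig count under the Lander-Waterman approximation.

    clone_len, read_len and min_overlap are in bases; all default or
    example values are illustrative assumptions.
    """
    n_reads = coverage * clone_len / read_len      # N = c * G / L
    sigma = 1.0 - min_overlap / read_len           # usable fraction of a read
    return n_reads * math.exp(-coverage * sigma)

# Hypothetical 150 kb BAC, 500-base reads, 50-base minimum overlap
for c in (3, 4, 5, 6, 8):
    print(f"{c}X: ~{expected_contigs(150_000, 500, c, 50):.1f} contigs expected")
```

With these assumed parameters, most of the reduction in contig number happens between 3X and 6X, and the absolute gain from going beyond 6X is small, which is the same qualitative pattern the groups described.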
Determining the order and orientation of the contigs appeared to depend on several factors, including the depth of coverage, the precise ratio of plasmid forward and reverse reads, and the size of the sublibrary insert (up to a point, the larger the better). One group reported that at 3-3.5X coverage, the clones for several hundred Mb could be stored in one or two freezers and maintained for later finishing.
Analysis of the data
There were many misassemblies in the intermediate products, most of them in Alu repeat regions. The optimum tradeoff between contig number and sequence quality was hard to determine. It was noted that smaller contigs (such as those obtained from 3X coverage) are useful for finding exons but would be much less useful for finding complete genes.
Gene content was examined in two ways: by using BLAST to compare the draft sequence to the finished, annotated sequence, and by analyzing the draft sequence with a gene-finding program (Genscan). It was found that 4-5X coverage is enough to give the maximum ability to find exons in the draft sequence (~90 percent of exons were found in the BLAST comparison, ~70-80 percent with Genscan). It was pointed out that ESTs can help with ordering of contigs.
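The report does not describe how an exon was scored as "found." As a minimal sketch of one possible scoring rule, the example below counts an annotated exon as found when it lies entirely within an interval of the finished sequence that is matched by draft sequence (for example, an interval derived from alignment hits). The function name, the containment criterion, and the toy coordinates are all assumptions made for illustration.

```python
def exon_recovery(exons, draft_hits):
    """Fraction of annotated exons fully contained in some draft-matched interval.

    exons and draft_hits are lists of (start, end) coordinates on the
    finished sequence. The containment rule is an assumed criterion.
    """
    if not exons:
        raise ValueError("at least one annotated exon is required")
    found = 0
    for e_start, e_end in exons:
        # An exon counts as found if a single matched interval spans it.
        if any(h_start <= e_start and e_end <= h_end
               for h_start, h_end in draft_hits):
            found += 1
    return found / len(exons)

# Hypothetical toy data, not taken from the workshop clones
exons = [(100, 200), (500, 650), (900, 1000), (1500, 1600)]
draft_hits = [(0, 700), (850, 1200)]
print(f"exons recovered: {exon_recovery(exons, draft_hits):.0%}")
```

A looser rule (any overlap rather than full containment) would report higher recovery, so the criterion should be stated whenever such percentages are compared across centers.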
Several of the discussants emphasized that coding sequences or genes are by no means the only sequence features that will be studied. It was clear that the utility of draft sequence will depend on the uses to which it is put, in particular on the size of the sequence feature (e.g., exons vs. long-range duplications) one is interested in studying.
It will be important for the data used to generate the draft sequence to be annotated as to quality, especially as quality tends to be worse at contig ends.
No file format exists to annotate forward/reverse reads, or the position of the read. This is needed to allow users of the information to gain the most from a working draft. (The current version of the Phrap assembly program cannot incorporate double-stranded reads into assemblies, nor can it assemble shotgun data reliably at low coverage.) It was pointed out that reverse reads could help resolve misassemblies.
Finally, it was noted that it is often sufficient for users to have information about an exon or small portion of a gene as a clue; information about the complete gene can be obtained by further experimentation.
Quality standards for the draft sequence
While quality standards for draft sequence should be established, caution must be used to avoid a situation in which meeting over-defined standards becomes a significant distraction from finishing, violating one of the principles set forth for an intermediate product.
Discussion and conclusions
After discussion of the point above, Dr. Collins appointed a subgroup to draft a proposal based on the discussions up to then.
The subgroup proposal:
1. Any combination of intermediate and finished product should maximize output of finished sequence. The NHGRI contribution of finished sequence will be ~600 Mb by the end of 2001. (This will require ~14 million reads.)
2. Focus finishing efforts on gene-rich regions.
3. Solve the long-range contiguity problem for finished sequence in 2 or 3 years. (Produce several contigs greater than 20 Mb.)
4. Achieve intermediate product coverage of the remainder of the genome by the end of 2001. For NHGRI, this will be 1.2 Gb. Obtaining a tiling path is unsolved, and is a critical problem.
5. Minimum standards for an intermediate product:
   (a) >90 percent of sequence present
   (b) >99 percent accuracy (proposed surrogate: 3X coverage in Phred ~20 bases)
   (c) The product is not a throw-away; read quality should be the same as for a project to be finished. (This will require a minimum of 10 million reads.)
6. Cost analysis is critical.
7. Must achieve 24 million reads to accomplish goals 1, 4 and 5.
8. NHGRI must provide incentives for collaboration/collegiality/sharing of expertise.
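Two pieces of arithmetic sit behind the proposal above. First, a Phred quality score Q corresponds to an estimated per-base error probability of 10^(-Q/10), so Phred 20 corresponds to 99 percent accuracy, matching the proposed surrogate. Second, the 24 million read target is the sum of the ~14 million reads for finished sequence and the minimum of 10 million reads for the intermediate product. The sketch below simply checks both; the variable and function names are illustrative.

```python
def phred_to_error_probability(q: float) -> float:
    """Estimated per-base error probability for Phred quality score q."""
    return 10 ** (-q / 10)

# Phred 20 -> 1 error in 100 bases, i.e. 99 percent accuracy
accuracy_at_q20 = 1.0 - phred_to_error_probability(20)
print(f"Accuracy at Phred 20: {accuracy_at_q20:.2%}")

# Read budget from the proposal, in millions of reads
FINISHED_READS_M = 14   # ~600 Mb of finished sequence
DRAFT_READS_M = 10      # minimum for the intermediate product
total_reads_m = FINISHED_READS_M + DRAFT_READS_M
print(f"Total reads required: {total_reads_m} million")
```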
There was general agreement with this proposal. It was noted that:
The draft should not be a defined target, because of the real concern about its ability to distract from finishing. Rather, the draft should be considered as a useful by-product that will emerge because of an improved strategy for finishing. Thus, the definition of the draft should not be too rigid at this point. However, lack of definition raises a concern that the draft will not be of good quality. To avoid this, the draft initiative will have to be closely managed.
A means of assessing whether the draft sequence is truly "on the path to finishing" rather than a "throwaway" will be to assess the data regularly for 'finishability.'
The cost of the rough draft should be closely linked to the cost of the finished product.
The mapping issues are significant and need to be addressed.
It is likely that only a few big centers are currently capable of producing the large number of lanes of shotgun sequence per year that will be needed to meet the proposed schedule. In general, all the centers will have to work more cooperatively to produce the rough draft sequence.
2. Mapping Resources
The group discussed the benefits of constructing a database of restriction fragment fingerprints for a BAC library. The value of this database will be to reduce the redundancy of the clone sets with which the mappers have to work, thereby simplifying the mapping problem. End sequences from the fingerprinted clones would be very valuable as an additional data set. There is no evidence that fingerprints are a true measure of clone fidelity.
The question was raised whether further investment should be made in generating and mapping additional random STSs to assist in long-range mapping. It was agreed that additional markers will be needed but that the most useful ones will be those generated from the ends of contigs, rather than random markers. There was some agreement that there may be a need for rapid RH mapping of such directed markers in a year or so. More importantly, there was general agreement that a long-range mapping plan is needed.
3. Cost Analysis
An NHGRI consultant has visited five sequencing centers to discuss sequencing costs. The consultant reported a number of observations:
The average (weighted) sequencing costs have dropped about three-fold from 1996 to what is projected for the coming year.
Cumulative sequence production during this time has risen from 16 Mb to a predicted total of >180 Mb by July 1999.
The current average (weighted) cost per lane is $6.77; each such lane contributes about 15 finished base pairs to GenBank.
In the next funding period, 265 kb of finished sequence will be produced per FTE, averaged over the entire NHGRI-supported sequencing effort. Each FTE costs about $135,000.
The consultant emphasized that these numbers are weighted averages. The consultant proposed a cost collection tool for use by the centers (and by the committee that will review cooperative agreement applications) to assess costs.
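The consultant's figures imply a cost per finished base that can be derived two ways: from the per-lane numbers and from the per-FTE numbers. The sketch below is a back-of-the-envelope calculation only. It assumes 265 kb means 265,000 base pairs, and the two estimates refer to different periods (current versus the next funding period) and may not include the same cost components, so they are not strictly comparable.

```python
COST_PER_LANE_USD = 6.77        # current weighted average cost per lane
FINISHED_BP_PER_LANE = 15       # finished base pairs per lane
COST_PER_FTE_USD = 135_000      # approximate cost per FTE
FINISHED_BP_PER_FTE = 265_000   # 265 kb per FTE, next funding period

def cost_per_finished_base(cost_usd: float, finished_bp: float) -> float:
    """Cost in dollars per finished base pair."""
    return cost_usd / finished_bp

lane_based = cost_per_finished_base(COST_PER_LANE_USD, FINISHED_BP_PER_LANE)
fte_based = cost_per_finished_base(COST_PER_FTE_USD, FINISHED_BP_PER_FTE)

print(f"Lane-based estimate: ${lane_based:.2f} per finished base")
print(f"FTE-based estimate:  ${fte_based:.2f} per finished base")
```

Both estimates come out at roughly half a dollar per finished base (about $0.45 and $0.51), which suggests the two sets of figures are broadly consistent with each other.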
4. Prioritizing Sequencing for Biologically Interesting Regions
At the NHGRI-DOE five-year planning meeting, it was suggested that a process be developed to enable biologists to request that regions in which they are interested be sequenced on a priority basis by the large-scale sequencing centers. NHGRI presented a plan for how such requests could be handled at the funding agency level, with the intent of generating a priority list for regions to be sequenced. There was general support for this plan and it was agreed that it should now be discussed with the international partners.