Human Genome Sequence Quality Standards
October 2, 2002
A critical aspect of the Human Genome Project's large-scale sequencing program has been the production of very high quality data. This has required the establishment of quality standards for genomic DNA sequence. This document, in three sections, describes: I) the most up-to-date quality standards and definitions for genomic sequence; II) an account of the rationale for, and history of, the establishment of those quality standards; and III) the descriptions and results of sequence quality assessment exercises that have been done to determine the quality of the data produced by the public sequencing community so far.
I. Current Quality Standards and Definitions
The goal of the Human Genome Project is to produce a sequence of human DNA that is of sufficiently high accuracy and long-range contiguity to allow a full interpretation of all the information encoded in the human genome, and that will stand the test of time. The HGP sequencing effort has also adopted an intermediate product for human genomic sequence termed "working draft" sequence because, although it cannot be used for all conceivable methods of DNA analysis, draft quality sequence can be very useful in identifying a number of features in DNA, e.g. coding sequences and repeats. In establishing different sequence "products," it was also important to develop quality standards appropriate to each. In addition, some of the quality standards that have been developed are specific for the particular strategy employed in generating the sequence data. As different strategies are implemented to improve large-scale DNA sequencing (for example, for future target genomes), it is likely that additional or different quality standards will need to be developed. Accordingly, the HGP regularly revisits quality standards to ensure that they remain relevant to the overall sequencing program.
A. Finished Sequence
"Finished Sequence" is a term used to refer to a region of DNA sequence on which, after a best-faith effort by the sequencing laboratory to resolve all difficult regions and to generate a high quality, completely continuous representation, no further sequencing will be done. The actual definition is provided by the following set of criteria. These standards were derived for the mapped BAC-by-BAC strategy undertaken for human sequencing by the International HGP, but are readily adaptable to finished sequence obtained by other strategies.
- The sequence should be no less than 99.99% accurate (an error rate of no more than 1/10,000); ambiguities (places where bases cannot be called unambiguously even though sequence reads are available) are to be counted as errors.
- The sequence should ideally contain no gaps; gaps are regions for which no sequence reads can be obtained for biological reasons (as opposed to lack of interest on the part of the sequencing laboratory). In regions where gaps cannot be closed, the database entry should be annotated as to the size of the gap, the orientation of flanking regions, and the efforts that have been made to try to close the gap.
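As a rough illustration of the accuracy criterion, the sketch below computes an error rate in which ambiguous base calls are counted as errors, as the first criterion requires. The function names are hypothetical, and treating any character other than A, C, G, or T as an ambiguity is an assumption made for the example, not part of the standard.

```python
MAX_ERROR_RATE = 1 / 10_000  # from the 99.99% accuracy criterion


def finished_error_rate(sequence: str, confirmed_errors: int) -> float:
    """Return errors per base, counting ambiguous calls as errors."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    # Assumption: any symbol other than A/C/G/T is an ambiguous call.
    ambiguities = sum(1 for base in sequence.upper() if base not in "ACGT")
    return (confirmed_errors + ambiguities) / len(sequence)


def meets_accuracy_standard(sequence: str, confirmed_errors: int) -> bool:
    """True if the error rate is no more than 1 in 10,000."""
    return finished_error_rate(sequence, confirmed_errors) <= MAX_ERROR_RATE
```

Under these assumptions, a 20,000-base sequence with one confirmed error passes, while the same sequence with two confirmed errors plus one ambiguous call does not.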
B. Finished Chromosomes
Since the standards above refer primarily to the individual finished BAC projects used in a BAC-by-BAC strategy rather than a large assembly, it is also important to have standards that define a finished chromosome. Note that these standards generally apply to the euchromatic portions of the chromosome, as cloning of heterochromatic regions is still technically unreliable. The biology of the target organism should be considered in applying these guidelines. For example, C. elegans is holocentric, which affects the ease of obtaining long-range contiguity.
- The sequence assembly across the euchromatic portions of each chromosome arm must be contiguous. If complete contiguity cannot be obtained for biological reasons (e.g., if the appropriate clone cannot be recovered despite all efforts to do so), all remaining gaps must be sized, oriented, and annotated. The minimum effort that must be applied to declare a gap to be unclosable is that 30X coverage of available BAC libraries must have been screened with appropriate probes.
- At least 95% of the chromosome must be represented in contiguous sequence.
C. Working Draft Sequence (Human)
The current standards for working draft were developed for the mapped BAC-by-BAC sequencing strategy.
- The number of phred 20 bases divided by the total project length (calculated both as a sum of fingerprint fragments and as a sum of sequence contigs over 1 kb) should be at least 4 as an average of all of a producer's working draft BAC projects and no lower than 3 for any individual BAC project.
- Contamination of the target sequence with sequence from other organisms (e.g., E. coli or vector DNA) should be as low as possible.
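The coverage criterion above can be sketched as a simple check. This is a minimal illustration, not an official procedure: the function names are hypothetical, and the producer-wide average is assumed here to be the unweighted mean of per-project ratios, which the standard does not spell out.

```python
def q20_coverage(q20_bases: int, project_length: int) -> float:
    """Phred 20 bases divided by total project length."""
    return q20_bases / project_length


def check_producer(projects):
    """Check a producer's draft projects against the coverage standard.

    projects: list of (q20_bases, project_length) tuples.
    Returns (passes, average_coverage, indexes_below_floor).
    """
    coverages = [q20_coverage(q, n) for q, n in projects]
    # Assumption: unweighted mean of the per-project coverage values.
    average = sum(coverages) / len(coverages)
    below_floor = [i for i, c in enumerate(coverages) if c < 3]
    return (average >= 4 and not below_floor), average, below_floor
```

For example, `check_producer([(800_000, 160_000), (600_000, 150_000)])` reports a pass, since the two projects have coverages of 5 and 4.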
Additional information to be included with human draft accessions:
- Human working draft database entries corresponding to each project should be labeled with an "HTGS_Draft" tag.
- Quality scores should be included in the database entry.
- The end sequences of the insert in each BAC clone should be identified.
D. Whole Genome Shotgun and Hybrid Strategies
It is anticipated that, after the human, large genome sequences are likely to be determined by a mixed mapped BAC/whole genome shotgun (WGS) strategy. The standards for finished sequence obtained by this strategy are the same as those above for finished sequence determined by the mapped BAC approach. The quality standards for a working draft sequence that has been determined by such a hybrid strategy have not yet been finalized. However, the working draft standards will necessarily have to be modified from those established for the human working draft (among other reasons, it will not be appropriate to assess coverage on a per-BAC clone basis for this strategy, at least until the assembly is undertaken). Nonetheless, the overall coverage standard for working draft sequence obtained by WGS or hybrid methods of 4X in Q20 bases is an appropriate one at this time. Variables that should be considered in setting a draft coverage standard include the state-of-the-art in genome assembly programs, and the biological utility of the information at different levels of coverage, versus cost. Precisely how this standard will be implemented and assessed will need to be resolved in further discussion. In the interim, it may be appropriate to set standards for the quality of elements that lead to the desired product such as read quality, data tracking, and assembly quality.
II. Rationale and History
The quality of genomic sequence produced by the HGP was understood to be important at the beginning of the program. In its 1988 report, the NRC's Committee on Mapping and Sequencing the Human Genome stated "A mechanism of quality control is needed for the groups that are contributing sequencing information". As DNA sequencing began to move into a production mode in the mid-1990s, specific quality standards were defined. At the February 1996 meeting of the National Advisory Council for Human Genome Research, the Council recommended that, in developing means to address the productivity of the pilot project program for large-scale human DNA sequencing, the quality of the DNA sequence data produced be treated as an important criterion. Shortly thereafter, the issues were again discussed at the First International Strategy Meeting on Human Genome Sequencing, where the necessity of developing clear standards and effective means for measuring quality was re-affirmed. It was also suggested that it was critical to identify regions of sequence known to be of relatively lower quality.
To follow up these calls for attention to quality assessment, the then-NCHGR held a workshop on DNA Sequence Validation on April 15, 1996. The discussion related specifically to finished sequence (the only sequence product being pursued at that time), and was critical in establishing and inculcating quality standards and routine assessment into production sequencing in the public effort. The major conclusions at this workshop were:
- the then-NCHGR-funded pilot DNA sequencing projects should strive for an error rate not to exceed 1 in 10,000 bases
- the pilot projects should participate in a validation study to assess the quality of individual base-calls and sequence assembly. To the extent possible, methods for validation were to be independent from those used in initial sequence determination
- it would be desirable to check sufficient data from each grantee to corroborate their representation of data quality
- the minimum criterion for demonstrating fidelity of clones used as sequencing templates should be clear evidence that the genomic region is represented by the same restriction digest pattern in at least two clones derived from independent transformation events. Greater depth of coverage is highly desirable, and is expected to be achievable in most cases
- to support validation and for use by the community, the public databases should store quality measures on each base pair, along with map data, including depth of cloned coverage available and identification of map landmarks included in the sequence.
Subsequent to that workshop, the NHGRI continued to discuss the formulation of standards based on the recommendations above, along with input from the community. Other discussions, both within and outside of NHGRI, continued to emphasize the importance of establishing quality standards and measurements. At its February 1997 meeting, the National Advisory Council for Human Genome Research agreed that it would be important for the Institute to establish a sequencing standard. The Council agreed that the goal of 99.99% accurate sequence was an appropriate quality standard for human genomic sequence. However, the Council recognized that the means for measuring sequence quality were inadequate to determine whether that standard was being met. The Council instructed NHGRI staff to develop a metric, however interim, for measuring sequence quality.
In March, 1997, the attendees at the Second International Strategy Meeting for Large Scale Sequencing agreed that the goal for sequencing standards should be an accuracy level of 99.99%, and that there were no good metrics that an outside group could use to confirm the quality of sequence. A second criterion was defined at this meeting, that the sequence should have no gaps. Recognizing, however, that this goal was far from being achieved, there was agreement that when a group was unable to fill a gap, information about the size of the gap, the orientation of its flanking regions and the effort made to close it, should be included in the sequence annotation in the public database. The attendees suggested that it might be possible to assess data quality if sequencing groups were to exchange sequence data files and attempt to reassemble the completed sequence from that information. Based on these several discussions, NHGRI adopted its current sequencing standard for grantees engaged in large-scale sequencing of human DNA, and began to conduct regular quality assessment exercises.
Sequence quality issues have also been discussed at two meetings of the Principal Investigators of the NHGRI-sponsored large-scale sequencing grants, which took place in July and December of 1997. These investigators affirmed that contiguity is an important quality parameter for NHGRI to monitor. At the July 1997 meeting, the participants agreed that a contig should be at least 30 kb in size to be reported as finished sequence to NHGRI. They further agreed that this minimum size should increase over time, to a standard of having an average contig length of 500 kb.
Gaps and Ambiguities
- The definitions of gaps and ambiguities have also been refined at these meetings. At the July 1997 PI meeting, the discussion concluded that ambiguities should be considered as gaps. At the December 1997 PI meeting, it became apparent that there should be a distinction between ambiguities and gaps, but this distinction should not lessen the incentive to produce sequence without ambiguities or gaps. It was proposed that gaps should be defined as regions where there were no sequence reads, whereas ambiguities were where such reads had occurred but the base could not be called unambiguously. Ambiguities are to be counted as errors, against the standard of an error rate of no more than 1 in 10,000.
B. Finished Chromosomes
Before the completion of human chromosome 22, discussion at the Sixth International Strategy Meeting for Large Scale Sequencing adopted the working standard above (I.2) for what should be considered an "essentially completed" chromosome.
As the genome was moving towards completion, additional measures were established to assess quality of the underlying chromosome clone maps. At the May, 2001 International Meeting, and as refined in subsequent International Meetings, a tracking system for map gap closure was instituted. This relied on two reports produced by the chromosome coordinators (individuals at the sequencing centers responsible for the chromosome): the Tiling Path Format (TPF) file and the gap report. The TPF was a spreadsheet denoting each clone in the tiling path in order (regardless of state of completion) and any gaps in that tiling path. The gap report listed and annotated (with regard to size, flanking clones, and closure efforts) all remaining map gaps according to the following definitions:
Type 1: a gap known to be anchored by sequence on either side. For Type 1 gaps there should be a clone(s) in the pipeline that will close it.
Type 2: a gap without identified spanning sequences, but with a spanning clone (or clones) in the map.
Type 3: a gap without a spanning clone of any type (between fingerprint contigs).
Type 4: a gap without a spanning clone of any type and for which all known libraries have been screened. Libraries include all publicly available BAC or PAC libraries. Probes must be screened to at least 30X depth of BAC libraries before a Type 3 gap can be declared to be Type 4.
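The four gap definitions above can be read as a decision sequence. The sketch below is one possible encoding; the `MapGap` fields and the `gap_type` function are hypothetical names introduced for illustration, and the actual gap reports were annotated by the chromosome coordinators rather than derived from a structure like this.

```python
from dataclasses import dataclass


@dataclass
class MapGap:
    anchored_by_sequence: bool    # sequence known on either side of the gap
    has_spanning_clone: bool      # a spanning clone exists in the map
    all_libraries_screened: bool  # all public BAC/PAC libraries screened
    screen_depth: float           # depth of BAC libraries screened, e.g. 30.0


def gap_type(gap: MapGap) -> int:
    """Classify a map gap as Type 1-4 following the definitions above."""
    if gap.anchored_by_sequence:
        return 1
    if gap.has_spanning_clone:
        return 2
    # No spanning clone of any type: Type 3 unless screening is exhausted.
    if gap.all_libraries_screened and gap.screen_depth >= 30:
        return 4
    return 3
```

Note that a gap with no spanning clone stays Type 3 until both conditions hold: every known library has been screened, and screening has reached at least 30X depth.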
This reporting system, updated at regular (bi-monthly or monthly) intervals, allowed continuous assessment of many aspects of the quality of the underlying maps as the genome progressed towards closure, including number of remaining gaps, finishing/draft status of individual clones, missing markers from other maps, and assessment of quality of the joins between individual clones.
C. Working Draft Sequence
During 1998, NHGRI and DOE held a number of workshops to discuss goals for an updated Five-year Plan. In the course of these workshops, the value and importance of finished sequence was reconfirmed. However, there was also discussion of the high value of producing a "working draft" sequence, which was defined as an intermediate sequence product that would cover the vast majority of the genome, be available to the community in two years or less, and in the end contribute to the goal of producing a finished sequence. This sequence was envisioned to be of lower accuracy and contiguity than finished sequence, but it would nevertheless be very useful, especially for finding genes, exons, and other features through sequence searches. In order to evaluate the utility of such a working draft sequence, a set of computational analyses was undertaken by several biologists. The results were discussed at a meeting in September, 1998 and indicated that a working draft sequence would be of enormous immediate benefit to those searching for genes and other features in the human genome sequence. The participants suggested that quality standards be set for the working draft, but they urged caution to avoid a situation in which meeting over-defined standards would become a significant distraction from finishing.
The participants proposed the following minimum standards for an intermediate product:
>90% of sequence present
>99% accuracy (proposed surrogate: 3X coverage in Phred 20 bases)
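For readers unfamiliar with the "Phred 20" unit used in the proposed surrogate, a Phred quality value Q corresponds to an estimated per-base error probability p through Q = -10 log10(p). The helper names below are illustrative only; the sketch simply shows that a Phred 20 base call has an estimated error probability of 1 in 100.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Estimated probability that a base call with quality q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred quality value corresponding to error probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


# A Phred 20 base has an estimated error probability of about 0.01,
# and a Phred 40 base about 0.0001 (the finished-sequence error rate).
p20 = phred_to_error_probability(20)
p40 = phred_to_error_probability(40)
```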
In March, 1999 the five sequencing groups with the largest capacities met to discuss undertaking a project to produce a working draft of the sequence by Spring, 2000. These groups had successfully scaled their centers over the previous six months to the point that their combined projected capacities could accomplish this feat. As the group's plan developed, it was discussed with the International community at the May 1999 International Strategy Meeting. This group of sixteen sequencing centers agreed to participate in the production of the working draft. During the summer of that year, the five largest centers met again to discuss a quality standard, which was then presented to and approved by the International group in September 1999.
The quality standard agreed upon and reported in the final report of the September 1999 International meeting was: "The total number of Phred-20 bases in contigs greater than 1 kb must be at least four times the size of the clone insert. This is the bulk expectation of a center's output, but should be reported on a clone by clone basis. It was also suggested that centers should report the fraction of BACs that fall below this standard in their working draft, with the expectation that this fraction will be small. Although there was not a consensus agreement that this should be done, centers who are interested in doing this are encouraged to do so."
III. Sequence Quality Assessment Exercises
One outcome of the discussions summarized in Section II of this document was that, by 1997, NHGRI had decided to conduct quality assessment (QA) of sequence data produced by the centers it funds, to ensure standards are met. Several of these QA exercises have now occurred for both finished and working draft sequence. Each of these assessments was designed in consultation with the principal investigators of either the NHGRI-funded sequencing centers or of the international sequencing centers. In most of these assessment exercises, randomly selected sequencing projects and corresponding trace data from one center were assessed by another (either two other sequencing centers among the international consortium, or an outside group). For human working draft, it was possible to assess the quality of the data computationally, with reference to the publicly deposited information only.
A. First QA Exercise (Spring 1997)
Following the discussion at the Second International Strategy Meeting, NHGRI initiated a short experiment to determine whether reanalysis of raw sequencing data would be useful to reveal weaknesses in finished sequence and to estimate the quality of the data. In this experiment, each of the NHGRI-funded sequencing groups submitted raw sequence data and ancillary information representing two completed clones. These data were analyzed by two other groups, which each used their own methods to reassemble the sequence. The results were reported to NHGRI and distributed to all of the participants. Then, on May 13-14th, 1997, the participants met to discuss the results.
In general, the exercise was considered to have been very useful and successful in finding discrepancies and potential weak areas in the data. Because each group used different assembly tools, it was not surprising that the results from the two checkers were not identical. But it was reassuring that the checkers' data basically agreed on which sequence data were stronger and which were weaker. The participants endorsed an expansion of the exercise to improve the process and assess the quality of the sequence being generated by NHGRI grantees over the following year. The following outline was agreed upon for the expanded exercise:
- Clones to be checked should be chosen at random from sequence deposited in the public database over the entire output of the sequencing group. At least 4 clones from each, representing at least 200 kb, should be checked.
- Each clone should be checked by two other groups. The checker's workload should be indexed according to the group's production level, but should be no less than 4 clones.
- In order to standardize data exchange, each center must define its naming convention for clones. Eventually it would be helpful to have all the data in the same file format.
- In addition to the raw sequence trace files, DNA representing the clone and a bacterial culture containing the clone should be exchanged.
- The checking strategy should include biochemical analysis, unless the error rate in the sequence were found to be greater than 5 x 10^-4 (after initial biochemical analysis), in which case the checker need go no further. If the sequence quality were better than 5 x 10^-4, the checker should proceed to attempt to resolve the ambiguities. The assembly should be checked by restriction analysis.
- After the checking is complete, the data should be sent to NHGRI, who would send it to the original producer of the data. That investigator would then have the opportunity to talk with the checkers, and the checkers would also be able to compare their data.
- It was estimated that five months would be needed to complete this exercise.
- Sequencers must disclose the criteria they use for assessing the fidelity to the genome of the clones that have been sequenced.
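The stopping rule in the checking strategy above (an error rate of 5 x 10^-4, i.e. 1 in 2,000) can be sketched as follows. The function name and return labels are hypothetical, and the outline does not say what to do when the rate is exactly at the threshold; this sketch assumes the checker proceeds in that case.

```python
STOP_THRESHOLD = 5e-4  # 1 discrepancy per 2,000 bases


def checker_next_step(discrepancies: int, bases_checked: int) -> str:
    """Decide whether a checker should continue after the initial analysis."""
    if bases_checked <= 0:
        raise ValueError("bases_checked must be positive")
    rate = discrepancies / bases_checked
    if rate > STOP_THRESHOLD:
        # Error rate worse than 1 in 2,000: the checker need go no further.
        return "stop"
    # Otherwise the checker attempts to resolve the discrepancies.
    return "resolve"
```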
B. Second QA Exercise (data produced before September 1997)
NHGRI selected four finished clones, at random, totaling 200 kb, from each participating sequencing group (all NHGRI human plus D. melanogaster). Data eligible for being checked was selected from that deposited as 'finished' as of September, 1997.
NHGRI assigned each set of four clones to two checkers chosen from among the participants; groups exchanged data files and bacterial isolates/DNA. Checkers re-assembled files and analyzed the data. If the error rate was better than 1 in 2000, checkers resolved discrepancies by further analysis (resequencing). Most groups also checked assembly by restriction analysis, although this was not in the original instructions.
Each group was given the opportunity to respond to the checkers' reports.
Total number of clones available for checking as of 9/97: 420
Total number of clones selected for the exercise: 37 (a total of 1.7 Mb tested)
TABLE 1: Single-base discrepancies: number of clones at each error rate (see notes below)
- These numbers are based on the higher error rate between the two checkers' reports for each individual clone; they do not take into account the producers' responses.
- For 7 out of the 10 clones in this category, one of the two checkers actually evaluated those clones as having fewer than 1 in 10,000 errors.
Total number of single-base discrepancies (conservative aggregate of two checkers): 230/1.7 Mb.
Total excluding the clones worse than 1 in 2000: 120/1.59 Mb
About 2/3rds (133) of the single-base discrepancies were substitutions, 1/3rd (73) were insertions or deletions, based on 206 cases of single-base errors where precise information was provided.
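To relate these totals to the 1 in 10,000 standard, the aggregate counts can be restated as "1 error in N bases". The short calculation below uses only the figures quoted above and assumes 1 Mb means 1,000,000 bases; the function name is illustrative.

```python
def bases_per_discrepancy(discrepancies: int, bases: int) -> float:
    """Express an error count as 'one discrepancy per N bases'."""
    return bases / discrepancies


all_clones = bases_per_discrepancy(230, 1_700_000)        # about 1 in 7,391
excluding_worst = bases_per_discrepancy(120, 1_590_000)   # 1 in 13,250

# Breakdown of the 206 single-base errors with precise information.
substitution_fraction = 133 / 206  # about 0.65
indel_fraction = 73 / 206          # about 0.35
```

On this conservative aggregate, the full set of clones falls somewhat short of 1 in 10,000, while the set excluding the clones worse than 1 in 2,000 exceeds it.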
Other errors (not exclusive of single-base errors):
- 4 misassemblies, some likely to be due to small deletions (~250-1900 bp) in the large-insert clone
- 1 annotated gap closed (75 bp)
- 1 wrong clone sent (clone tracking error)
Caveats: Variability due to sampling; variability in checking
Most groups are sequencing at or very close to standards: Most groups were achieving 1 in 10,000 or better, summed over all clones. Numbers in the table are conservative and do not include the producers' responses, consideration of which will improve the error rates. However, most of the producers' responses agree with the checkers' reports.
Good concordance between checkers' reports: For single-base errors, both checkers agreed on the general quality of the project (according to the bins in Table 1) 28 of 37 times, and were very close in all other cases. In 11 of 19 clones where error type and location appear in the report, there is at least a 50% overlap in the precise identified errors. But there were still some puzzling differences between the identified errors in an individual clone, especially when there were many errors or the trace data were considered poor by the checker. For other types of error (deletions, etc.), both checkers agreed in all but one case.
The exercise revealed useful information about the kinds of error: Clone instabilities (small deletions) were a small but significant problem; small deletions may be hard to detect with routine protocols. (Note that this exercise included cosmids as well as BAC and PAC clones; several of the small deletions were in cosmids.) Single-base errors often occur in regions where sequence data quality is good; more than half could be resolved unambiguously by re-editing the original data without need to re-sequence (36/53 errors; some of this was confirmed by resequencing).
C. Third QA Exercise (data produced before November 1998)
This exercise used the same protocol as the previous exercise: 4-5 projects from each group totaling ~200 kb were tested; each set of clones checked by two checkers; reassemble data; find discrepancies with GenBank entry; resolve by resequencing or re-editing; confirm assembly with restriction analysis. The exercise focused on recently produced data (within the previous year). There were 17 participants including the Berkeley Drosophila genome sequencing center, the DOE sequencing labs, the Sanger Centre, and others within and outside the NHGRI sequencing program.
TABLE 1. What was checked
ESTIMATE of percentage of human clones checked:
Amount of finished human data in GenBank as of Nov 3, 1998: ~210 Mb. At ~110 kb per clone, =~1900 clones.
At 52 clones checked, ~2.7% of total human finished projects in GenBank as of October 1998 were checked.
At 2.6 Mb checked, ~1.2% of total human finished sequence in GenBank as of October, 1998 was checked.
Projects were checked against the finished GenBank or EMBL versions as of Nov 3, 1998.
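The estimate above follows from simple arithmetic on the quoted figures. The sketch below reproduces it; the variable names are illustrative, and the ~110 kb average clone size is the approximation stated in the text.

```python
finished_human_mb = 210   # finished human data in GenBank, Nov 3, 1998
average_clone_kb = 110    # approximate size per clone
clones_checked = 52
mb_checked = 2.6

# 210 Mb at ~110 kb per clone gives roughly 1,900 clones.
estimated_clones = finished_human_mb * 1000 / average_clone_kb

percent_projects_checked = 100 * clones_checked / estimated_clones
percent_sequence_checked = 100 * mb_checked / finished_human_mb
```

Rounded to one decimal place, these give about 2.7% of projects and about 1.2% of sequence, matching the figures quoted.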
TABLE 2. Combined summary of results for all participants
1 The numbers after the "plus" sign correspond to clones (cosmids) that were less than 50 kb, but contained no errors.
TABLE 3. Combined summary of results for human data only
Note: These data are an aggregate of the results of human data being produced by NHGRI-funded centers, the DOE, and the Sanger Centre. They also include human data from other producers.
1 The numbers after the "plus" sign correspond to clones (cosmids) that were less than 50 kb, but contained no errors.
- There was a marked improvement in data quality compared to that seen in the second QA assessment. NHGRI-funded centers, the Sanger Centre, and the DOE routinely exceeded the finished sequence quality standard by at least a factor of 10, attaining single-base error rates of 1/100,000. Most other groups assessed also met or exceeded the quality standard.
- Poor quality underlying data did not necessarily mean that the final sequence was wrong; conversely, apparently good quality data could contain errors. Resequencing appeared to be valuable in quality assessment of finished data, in addition to computational checking.
- It was important to confirm apparent small deletions in BACs by PCR, but labor intensive; there were several cases where preliminary suspicions of a misassembly were not confirmed by PCR or resequencing, and a case where the origin of a small deletion still remains to be determined.
- The version of Phrap available at the time of the assessment consistently overestimated error rates; there were multiple examples of Phred error estimates exceeding actual errors by 10-fold and more.
- Proportionally fewer errors were detected by re-editing alone than in the previous QA.
- Less variability in checking strategies was seen than in the previous QA. Most resequencing was done by PCR-directed reads on weak and/or discrepant regions. It appears that an effective check, where data quality is generally good and poor regions are isolated, is to reassemble and then resequence weak regions and high quality discrepancies by PCR-directed sequencing. But if the project has extensive problems, it may be better to add shotgun reads to the assembly before comparing with the GenBank entry.
D. QA of Human Working Draft Sequence
Two quality assessments of human working draft sequence have been conducted so far.
1. Independent Check (November 1999)
Trace files from 10 recent (Fall, 1999) projects from each of the three largest NHGRI-funded centers (those producing the bulk of the working draft sequence) were transferred to an outside laboratory (Dr. Richard Myers and Mr. Jeremy Schmutz, Stanford University) for re-assembly and assessment against the coverage standard established for human working draft sequence. Read data quality and assembly were also assessed (even though no standards for these had been established). In all cases, the submitted data met or exceeded standards.
1 P20 Insert refers to usable Phred 20 bases divided by estimated insert size; P20 Assem is the usable Phred 20 bases divided by assembled bases. These correspond to the two methods for calculating working draft coverage specified in Section I of this document.
2 Number of contigs >1 kb.
3 This value is an average of the values for two centers, since an estimation of insert size was not available for one center's data.
2. Computational Check (data of May 24, 2000)
Dr. John Bouck (Baylor College of Medicine) computationally assessed a majority (nearly 9,000) of human projects in the public database that were labeled "HTGS_draft" on May 24, 2000. These accessions contain associated quality data enabling the assessment of coverage in Q20 bases by purely computational means.
- Coverage in Q20 bases: 5.01 for 8878 projects tested.
- If the analysis excluded projects with a coverage of greater than 6, average coverage was 4.4 (7339 projects).
- There were 26 draft projects that had reported coverage less than 3.
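The three summary figures above (overall mean coverage, mean coverage excluding projects above 6, and the count below 3) can be computed from per-project Q20 coverage values along the following lines. This is a sketch with hypothetical names and made-up input values; it is not the method or the data used in the actual assessment.

```python
from statistics import mean


def summarize_draft_coverage(coverages, cap=6.0, floor=3.0):
    """Summarize per-project Q20 coverage values.

    cap:   projects with coverage greater than this are excluded
           from the second average.
    floor: projects with coverage below this are counted separately.
    """
    at_or_below_cap = [c for c in coverages if c <= cap]
    return {
        "projects": len(coverages),
        "mean_coverage": mean(coverages),
        "projects_at_or_below_cap": len(at_or_below_cap),
        "mean_at_or_below_cap": mean(at_or_below_cap),
        "below_floor": sum(1 for c in coverages if c < floor),
    }


# Hypothetical coverage values, for illustration only.
example = summarize_draft_coverage([2.5, 4.0, 4.5, 5.0, 9.0])
```

Excluding high-coverage projects guards against a few deeply sequenced clones inflating the overall average.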
Additional computational assessments of the human working draft sequence will be carried out leading up to publication of the results (projected for Fall, 2000). Those responsible for the 26 draft projects with coverage <3 were asked to improve or withdraw them from the HTGS_draft category.
To complement these computational assessments, NHGRI and GenBank monitored HTGS_draft depositions for contaminating sequences (e.g., vector, E. coli) so that these could be removed. It was clear from these and other informal assessments that it would be useful to monitor contamination in HTGS_draft accessions on a continuing basis. GenBank maintains an updated list, available by FTP, of projects it has identified as being contaminated. Centers are responsible for checking this list and correcting their own contaminated projects. NHGRI and GenBank also monitor depositions to ensure that the standard information accompanies each accession (quality scores, end sequences identified).
Last Updated: March 9, 2012