NHGRI Workshop on DNA Sequence Validation

National Human Genome Research Institute

National Institutes of Health
U.S. Department of Health and Human Services



Natcher Conference Center
National Institutes of Health

April 15, 1996

Introduction and Synopsis

The National Advisory Council for Human Genome Research (NACHGR), at its January 1996 meeting, considered several applications to initiate pilot projects to scale up human DNA sequencing. NACHGR recommended that the monitoring of grants supporting the pilot projects should include an evaluation of the quality (completeness and accuracy) of the DNA sequence product. Subsequently, the National Human Genome Research Institute (NHGRI) convened a workshop to consider scientific approaches to the assessment of DNA sequence quality for these grants. This report summarizes the workshop discussion and incorporates recommendations made at the subsequent (May 1996) meeting of NACHGR.

The sequence quality issues that bear on the management of the pilot project grants may be different from longer-term needs for monitoring human DNA sequence data quality. Therefore, the discussions focused on the near-term issues. The major conclusions were:

The workshop established principles for sequence validation but did not reach specific conclusions on the methods or mechanisms to be used. NHGRI staff will use the insights developed through these discussions to formulate a validation plan for the pilot sequencing grants. NHGRI plans to obtain continuing feedback on many of the issues discussed at the workshop.

NHGRI's efforts to validate grantees' data should be viewed as a test of the quality of choices made by each pilot DNA sequencing group. At the same time, the groups need to be assured that the validation study will be only one aspect of the review process. Given the fact that some clones are harder to sequence than others, the validation study should assess the product not only relative to the 1:10,000 standard, but also to the submitter's own stated confidence in the data.

Background

NHGRI has awarded six grants to support projects that were submitted in response to Request for Application HG-95-005, "Pilot Projects for Sequencing of the Human Genome." The purpose of the RFA was to solicit applications for pilot projects to test strategies that can scale up to sequence the human genome.

In its second-level review of the applications, the NACHGR considered several issues related to the monitoring of the grants supporting the pilot projects. NACHGR recommended that, about two years after the initial award date, NHGRI should examine carefully the progress made by the grantees. The purpose of the two-year evaluation is to facilitate decisions about the distribution of third-year funds for the pilot-project program. Information obtained in assessing the progress of the pilot projects will also inform the subsequent decision, approximately three years from the initial award date, as to whether scale-up to truly large-scale, human DNA sequencing would be appropriate at that time.

Two important factors in making those assessments will be the amount of sequence produced and its cost. Assessing these factors would be meaningless without also objectively measuring the quality of the data, because productivity, cost and quality are so closely interrelated. Additionally, obtaining such a measure of quality will provide the evidence needed for the public, the research community and the NHGRI to consider the broader question of the appropriateness of investment in large-scale human DNA sequencing in early 1999.

Data Quality

At the workshop, it was generally agreed that, at present, an error rate not to exceed 1:10,000 in all sequence is a reasonable target for the pilot projects. There may be genomic regions for which that level of accuracy cannot be obtained, but lowering the bar for such regions should be considered only after more is learned about the problems involved in reaching this level.

What types of data should be validated and what criteria can be used in the study?

The quality of both the individual base-calls and of the sequence assembly should be assessed. Most of the participants thought that both base-calls and assembly should, to the extent possible, be validated by means that are independent from those that were used to generate the data.

All of the pilot sequencing groups will be using software that assigns error probability values to individual base-calls. Use of such software has already been found to reduce the amount of re-sequencing that needs to be done to achieve high-confidence sequence data. However, even with software of this sort in use, selective re-sequencing may be an important mechanism to use for validating the per-base error probability measures generated by specific hardware-software combinations because (1) even if several labs are using the same software, they may be applying it differently; (2) none of the software systems being used is perfect and the different applications will reveal the strengths and weaknesses of particular software packages, leading to software improvement; (3) some of the software is tuned to specific reaction/gel conditions, but the user may not be using those conditions; and, importantly, (4) independent sequencing using a different library/vector/chemistry and/or DNA strand will provide a complementary dataset that will fill potential weak points in the initial experiment.
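The relationship between per-base error probabilities and the 1:10,000 target can be made concrete with a short illustrative sketch. The logarithmic (phred-style) quality scale shown here is a common convention for summarizing such probabilities, not a method prescribed by the workshop:

```python
import math

def quality_score(p_error):
    """Convert a per-base error probability to a phred-style score,
    Q = -10 * log10(p): Q20 is 1 error in 100 bases, Q40 is 1 error
    in 10,000 bases (the workshop's 1:10,000 target rate)."""
    return -10.0 * math.log10(p_error)

def expected_errors(error_probs):
    """Expected number of erroneous base-calls in a sequence, given one
    error probability per base (simply the sum of the probabilities)."""
    return sum(error_probs)

# A 10 kb sequence whose bases are all called at p = 1e-4 (Q40)
# meets the 1:10,000 target exactly: one expected error.
probs = [1e-4] * 10_000
```

On this scale, a software-assigned probability for every base lets a laboratory estimate the expected error count of an entire submission, which is what selective re-sequencing would then check.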

Restriction mapping was not considered to be a useful approach to validating base-calls. While it would theoretically be possible to employ a large enough number of restriction enzymes to query the majority of the nucleotides in the sequence, this would be costly, and many difficult regions (e.g., homopolymer repeats) would not be interrogated by this method.

However, comparison of restriction fragment pattern data to the sequence is an attractive approach for checking sequence assembly. It has the advantage that the methodology is relatively simple and completely independent from that used to collect the sequence data. Even if groups are already using this method as a component of their internal quality checking procedures, additional enzymes could be used for validation. Furthermore, different groups will be using different hardware and software for restriction analysis, and will be applying different stringency to the analysis of this type of data. The stringency needed for the validation study needs further consideration.
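A comparison of this kind rests on an in-silico digest: predicting fragment lengths from the assembled sequence and matching them against the observed gel pattern. A minimal sketch follows; the enzyme shown (EcoRI, G^AATTC) and the assumption of a linear molecule are illustrative choices, not workshop prescriptions:

```python
def predicted_fragments(sequence, site="GAATTC", cut_offset=1):
    """Predicted restriction fragment lengths for a linear clone.
    `site` is the enzyme's recognition sequence (EcoRI shown) and
    `cut_offset` the position within the site where it cuts."""
    cuts = []
    start = 0
    while True:
        i = sequence.find(site, start)
        if i == -1:
            break
        cuts.append(i + cut_offset)
        start = i + 1
    # Fragment lengths are the gaps between successive cut positions,
    # including the two ends of the linear molecule.
    edges = [0] + cuts + [len(sequence)]
    return [b - a for a, b in zip(edges, edges[1:])]
```

An incorrect assembly (for example, a collapsed repeat) would shift or delete predicted fragments relative to the observed pattern, which is why the method probes assembly rather than individual base-calls.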

Computational re-analysis of all of the data available for a clone provides an additional way to assess the quality of sequence assembly. There are issues of data formats and their compatibility with analysis tools as implemented by any particular group (e.g., the point in the datastream at which data is captured from the sequencing machine or archived, filenaming conventions used in scripts, computer platform [Mac or Unix] used by different groups at different stages in the process) that would have to be solved in order to implement this type of validation study.

The possibility of limiting the validation analysis to computational re-analysis of the original data was discussed, but such an approach was considered not sufficient. Biases that result from the original data could influence the data validator in the same way that they influenced the original production group; discrepancies that might result in alternative assemblies may not be identified in the absence of some re-sequencing.

The use of the polymerase chain reaction (PCR) for checking assembly was also discussed, but this approach was not considered to be reliable as a validation tool. The primary difficulty with PCR checking would be how to interpret a negative PCR reaction - as being due to an incorrect assembly or to technical problems with the reaction (which are known to occur with a finite frequency).

The workshop participants concluded that, at present, there is no single method or approach which is clearly superior for validating DNA sequence data. Re-sequencing is clearly the most informative but is very expensive to do on any scale. Restriction fragment analysis and computational re-examination of data each have advantages and disadvantages. NHGRI will actively monitor progress in the development of such methods and, to the extent possible, incorporate any new information into the grants-monitoring process. Additionally, the participants strongly recommended that each pilot project develop, implement, and publicly document its criteria and methodology for data validation.

How much DNA should be checked?

Participants noted that enough of each group's DNA sequence needs to be checked to make sure that both easy and hard clones are included. Rather than defining the amount of DNA to check as a proportion of a sequencing group's total production, one might define the error rate, or deviation from 1:10,000, that one wishes to be able to detect. Sequential sampling statistical analysis may help to determine how much DNA should be checked to achieve this. To keep costs down, one might conduct electronic rechecking of a relatively large amount of data, and select apparently problematic regions for de novo re-sequencing.
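As an illustration of the kind of calculation involved, the sketch below estimates how many bases must be re-checked to detect a given deviation from the target rate. It uses a simple fixed-sample binomial approximation rather than the sequential sampling analysis mentioned above, and the significance and power levels are assumptions chosen for the example:

```python
import math

def bases_to_check(p0, p1):
    """Approximate number of bases to re-check in order to detect a
    true error rate p1 when the target rate is p0, using the standard
    normal-approximation sample size for a one-sided binomial test."""
    z_alpha = 1.645  # one-sided test at 5% significance (assumed)
    z_beta = 1.282   # 90% power (assumed)
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1)))
    return math.ceil((numerator / (p1 - p0)) ** 2)

# Detecting a doubling of the error rate (2:10,000 vs. the 1:10,000
# target) under these assumptions requires re-checking on the order
# of 120 kb of sequence.
n = bases_to_check(1e-4, 2e-4)
```

A sequential design, as suggested by the participants, would typically allow checking to stop early when the data clearly meet or clearly miss the target, reducing the average amount of re-sequencing below this fixed-sample figure.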

Who will conduct the validation exercise?

This question presents a serious challenge. At the workshop, participants were unable to identify any group not already involved in large-scale genomic sequencing that they knew to have the experience needed to be able to conduct the entire validation task. They discussed the possibility of separating the validation task into different components, such as identifying possible errors and resolving discrepancies, or electronic (computational) approaches compared with wet lab approaches. While it may be possible to engage different entities which have the appropriate expertise for different aspects of a validation study, this was not considered to be the most productive approach.

As an alternative, an approach that employed the pilot sequencing laboratories themselves to check each other's data was considered. Each of these labs will already be set up to sequence and to validate their own data, so this would be relatively efficient. Further, each clone that is checked could be checked by two labs, reducing the potential burden of revealing errors and also providing a way of judging which result is most likely to be correct when there is a discrepancy. Although there are some potential problems with this approach, such as the appearance of conflict of interest on the part of the pilot project laboratories, the workshop participants concluded that appropriate safeguards against the problems could be developed, and all else considered, this would be the best option, unless an outside group with appropriate capabilities emerged and could be engaged at a reasonable cost.

What are the criteria for determining that a cloned fragment represents genomic DNA?

Most practical current and foreseeable sequencing methods use cloned genomic fragments as a source of template for DNA sequencing. One problem with this practice is that the cloning process and subsequent clone growth can introduce DNA deletions and rearrangements. It is currently possible to verify, by PCR and restriction enzyme experiments on genomic DNA, that sequence determined from clones corresponds to that of the genome. However, these experiments are complex and expensive. It is, therefore, important to develop better strategies to demonstrate that the clones faithfully represent the genome.

On the other hand, experience shows that when the members of a collection of overlapping clones all contain the same DNA, then that DNA has a high probability of faithfully representing the genome. Certainly, use of this criterion would reveal cloning artifacts that arise from large deletions (probably the most significant artifact to be guarded against), although it will not give information about point mutations or small rearrangements that arise during cloning. Therefore, the participants agreed that, before a clone is selected for sequencing, the DNA content of that clone should be demonstrated to be consistent with the content of other, independent, overlapping clones. It would be preferable to have deep coverage (10-fold) of clones having consistent fragment patterns. However, libraries that could provide such coverage may not be available any time soon, and in some regions it may be difficult to achieve this coverage even in deep libraries. Therefore, the workshop participants recommended that the minimum requirement for proceeding to sequencing should be that one can show at least two-fold coverage in clones that are derived from different bacterial transformation events (1) and that have consistent restriction fragment patterns. The stringency with which this determination should be made (number of enzymes, resolution) was not discussed. In regions in which it is impossible to achieve two-fold coverage, but that are needed to maintain contiguity, additional effort will be required to demonstrate that the clone represents the genome (e.g., by using PCR). In all cases, the database entry for the clone should include map information (depth of coverage in independent clones having consistent fragment patterns, and information on the clone sources). As stated above, each pilot project grantee should develop, and make publicly available, the criteria it uses to demonstrate clone fidelity. These criteria should be evaluated during subsequent reviews of the projects.

Adding value to the validation test

It was suggested that, by re-sequencing DNA from a different library/individual, validation could be combined with a search for polymorphisms. This would offer added scientific value to the validation study, both in terms of revealing polymorphisms and in checking the sequence against the genome, because it would necessarily involve an independent conversion of genomic DNA to cloned DNA. While scientifically attractive, however, this would introduce numerous complications to the validation study, and the proposal was ultimately rejected by the workshop participants. It was also suggested that validation data could be obtained by sequencing full length cDNAs, or by sequencing mouse DNA. Again, however, each of these approaches would also be costly and complex in terms of achieving the immediate goal of validation of the sequence products of the pilot projects. In summary, there was agreement not to try to build in added value coincident with this attempt to evaluate sequence quality.

What annotation is required to support validation?

NHGRI is conducting a parallel workshop to discuss the annotation items that should accompany DNA sequence in the public databases. The validation workshop consultants discussed annotation items that are needed for assessing sequence quality. They recommended that these should include quality measures on each base-call and mapping information (e.g., depth of coverage, identification of map landmarks). The base-call quality measures should be represented in the databases as they are generated by the submitter. For some users, it may be helpful if the databases offer filtering of the quality representation (e.g., reducing various statistical measures to simple terms such as high, medium or low quality), but unfiltered data should reside in the public databases.

The participants also strongly supported the notion that quality includes contiguity. They recommended that, if the sequence submitted to the database is not contiguous, it should be accompanied by map information that orders the contigs. But if the data represent contiguous sequence, it is not necessary to show detailed underlying map information because that can be generated from the sequence. In the latter case, as noted above, information on the depth of clone coverage with consistent fingerprints should be provided.

Advice on related issues

NHGRI was encouraged to support the development of additional DNA sequence assembly software. To the extent possible, such software should include interfaces that allow efficient comparison of results from different assembly tools.

There is also a need for software that can incorporate mapping data (e.g., bidirectional reads, contig size, restriction fragment patterns, etc.) into the assembly process. Optimally, one should be able to turn this feature on/off so the assembly can be done based on sequence overlaps alone, and then checked against the independent map data.



Footnotes

(1) The requirement for clones from different transformations is meant to ensure definitively that the clones are from independent cloning events. If sufficient data exist to show clearly that two clones from the same library were derived from different transformations, this would be sufficient. Otherwise, the clones should be from different libraries.

PARTICIPANTS

MODERATOR
Leroy E. Hood, M.D., Ph.D.

Department of Molecular Biotechnology
Univ. of Washington School of Medicine
Box 357730
4909 25th Avenue, NE GJ 10
Seattle, WA 98195-7730

Mark D. Adams, Ph.D.
The Institute for Genomic Research
9712 Medical Center Drive
Rockville, MD 20850

David R. Cox, M.D., Ph.D.
Stanford University School of Medicine
Department of Genetics
Stanford, CA 94305-5120

Glen A. Evans, M.D., Ph.D.
Univ. of Texas Southwestern Medical School
Department of Internal Medicine
5323 Harry Hines Boulevard
Dallas, TX 75235-8591

Richard Gibbs, Ph.D.
Institute of Molecular Genetics
Department of Molecular and Human Genetics
Baylor College of Medicine
One Baylor Plaza
Houston, TX 77030

Philip Green, Ph.D.
Department of Molecular Biotechnology
University of Washington
Box 357730
Seattle, WA 98195-7730

Trevor Hawkins, Ph.D.
Whitehead Institute for Biomedical Research
MIT Center for Genome Research
One Kendall Square, Building 300
Cambridge, MA 02139

LaDeana Hillier, Ph.D.
Genome Sequencing Center
Washington University School of Medicine
Box 8501
4444 Forest Park Boulevard
St. Louis, MO 63108

Christopher Martin, Ph.D.
Human Genome Center
Lawrence Berkeley National Laboratory
1 Cyclotron Road
Building 74- 3110E
Berkeley, CA 94720

Bruce A. Roe, Ph.D.
Department of Chemistry and Biochemistry
Oklahoma University
Norman OK 73019

Robert H. Waterston, M.D.
Genome Sequencing Center
Washington University School of Medicine
Box 8501
4444 Forest Park Boulevard
St. Louis, MO 63108

NHGRI Staff

Francis Collins, M.D., Ph.D.
Elke Jordan, Ph.D.
Mark Guyer, Ph.D.
David Benton, Ph.D.
Carol Dahl, Ph.D.
Linda Engel, M.S.
Elise Feingold, Ph.D.
Bettie Graham, Ph.D.
Kenji Nakamura, Ph.D.
Jane Peterson, Ph.D.
Rudy Pozzatti, Ph.D.
Jeffery Schloss, Ph.D.
Robert Strausberg, Ph.D.

DOE Staff

Marvin Stodolsky, Ph.D.
Jay Snoddy, Ph.D.

Wellcome Trust Representative

Michael J. Morgan
The Wellcome Trust
183 Euston Road
London, NW1 2BE
Great Britain


Last Reviewed: December 2005