Workshop on Human DNA Sequence Variation

March 31-April 1, 1997

The National Human Genome Research Institute (NHGRI) convened a Workshop on Human DNA Sequence Variation on March 31 and April 1, 1997, on the campus of the National Institutes of Health (NIH), to address scientific issues concerning how the reference human DNA sequence might be annotated by information on DNA sequence variation. This was the initial workshop in a series of planning meetings that will be organized over the next year by the NHGRI to help NHGRI identify the most important scientific questions for future investigation and investment in anticipation of the completion of the first reference human DNA sequence by the year 2005, and the need to consider making investments now to maximize the beneficial consequences of genome research.

The agenda for the workshop, the questions discussed, and a list of participants are attached. In summary, the workshop discussions led to the following conclusions:

There is a critical and immediate need for NIH to stimulate and support pilot projects investigating a number of important questions in population genetics. Research needs include quantifying DNA variation, understanding how it varies across the genome, and understanding how DNA variation (deleterious, advantageous and neutral) arises and is maintained in human populations.
A defined resource of cell lines and/or DNA, that would be appropriate for studying a variety of questions pertaining to human sequence variation, and would be generally available to the scientific community, would potentially be of very great value. There are a number of important scientific and ethical questions that must be addressed before such a resource is developed, and further discussion of this possibility is enthusiastically encouraged.
Further research into efficient methods of detecting DNA sequence variation and of genotyping, particularly research aimed at increasing the sample throughput and decreasing the cost of analysis, is necessary.

As progress is made toward the determination of the complete genomic DNA sequences of the human and of several non-human organisms, interest in ways to augment information about the genome sequence is increasing. Although sequence data can be annotated in several ways, two obvious areas of interest are the determination of sequence variation and the effect of such variation on functions encoded within the genome.

Human genetics is critically concerned with how variation in gene sequences is related to variation in the function of genes and gene products. Neutral variation, which does not affect gene function, provides information on human population structure and the history of chromosomes. Thus, the identification, classification, quantification and analysis of sequence variation are expected to constitute one of the most powerful, and direct, approaches to the study of a wide range of important biological questions. The reference genome sequence will provide scientists with the basis for measuring sequence variation and for assessing how variation in specific genes is associated with complex phenotypes (and common diseases), how sequence variation affects gene function and biochemical pathways, and how human genetic variation has been shaped by biological processes of natural selection and evolution. At present, there are neither significant data on the nature and extent of DNA sequence variation in the human genome, nor much discussion of the gamut of biological studies that could benefit from broad knowledge of genomic variation. Thus, the workshop was organized to discuss three primary scientific issues: (1) the kinds of research that require or benefit from information about genetic variation, (2) the characterization of DNA sequence variation in the human, and (3) the technologies necessary for determining, assaying, and interpreting variation across the genome.

The study of variation in human genes, either through the analysis of phenotypes known to have a genetic basis or through known gene products (blood groups, serum proteins, enzymes), has a long history. This has led to our current understanding of the nature of genetic variation in humans. However, there are many inherent problems in such classical studies: they have largely involved genes already known to be variable in human (often European) populations; only a small number of genes have been studied; and only some types of genomic changes have been monitored. Moreover, as these studies were largely concerned with variations that affect protein composition, they cannot accurately reflect the degree and nature of genetic variation at the DNA sequence level.

The development of DNA sequencing technology and the initiation of large-scale DNA sequencing projects have now led to the ability to directly measure variation in genomic DNA. This means that the entire genome, and not simply the recognized coding sequences, is accessible for analysis. The initial results of genomic sequence analysis indicate that, on average, there is one variant nucleotide (nt) per 1000 nt (one kilobase, or kb) screened, confirming the results of many studies performed in the 1980s using RFLPs. For example, one laboratory found one single nucleotide polymorphism (SNP) per 973 nt (comparing approximately 300 kb in three individuals by one method; similar numbers came from a study comparing one megabase in eight individuals by a different technique). In this study, the evidence for clustering of variation was, at most, slight. However, another laboratory reported considerably more difference in the local frequency of variation, which ranged from one difference per 860 nt to as little as one difference per 10 kb. As more human DNA sequence is determined, and as methods for resequencing the same region from several individuals and populations are improved, more data of this type will be collected.
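
The figures quoted above are easier to compare once they are put on a common scale. The following is a minimal sketch, not part of the workshop report, that converts a "one variant per N nt" figure into variants per kilobase and into an expected count for a region of a given size. The function names are hypothetical, and the calculation assumes variation is spread uniformly, which the report itself notes may not hold locally.

```python
def variants_per_kb(nt_per_variant):
    """Convert 'one variant per N nt' into variants per kilobase."""
    return 1000 / nt_per_variant


def expected_variants(region_nt, nt_per_variant):
    """Expected number of variant sites in a region, assuming uniform density."""
    return region_nt / nt_per_variant


# Figures taken from the text: 1 SNP per 973 nt over roughly 300 kb,
# and a reported local range of 1 per 860 nt down to 1 per 10 kb.
print(round(variants_per_kb(973), 2))           # 1.03
print(round(expected_variants(300_000, 973)))   # 308
print(round(variants_per_kb(860), 2))           # 1.16
print(round(variants_per_kb(10_000), 2))        # 0.1
```

On this scale the local range reported by the second laboratory spans roughly a tenfold difference in density, from about 1.16 to 0.1 variants per kb.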

Much of the discussion at the workshop focused on the use of sequence variation information in the identification of the loci underlying multigenic traits, in particular, complex diseases. Specifically, the usefulness of generating a high resolution ("third generation") map of sequence variation at single nucleotide positions for the identification of the genes underlying such traits was a topic of primary interest. The discussion of this subject focused on questions such as the marker density necessary for the maps to be useful in such studies, the types of human populations that should be studied for complex disease mapping, the extent of haplotype information and the types of family structures that might be used. The issue of marker density is unresolved and requires additional theoretical study.

An important issue in complex disease genetics, about which little is currently known, is the nature of the genetic variation that underlies such phenotypes. Are these diseases like rare Mendelian phenotypes, in that the multiple loci harbor rare and generally new mutations of large to intermediate allelic effects? Or are they due to a combination of common genetic variants, each of which has a small allelic effect? If the latter alternative is true, then it would be advantageous to screen the human population and catalogue common variants (as has been previously done for the HLA system); however, this will not be a practical approach until information about full-length coding sequences for many genes is available.

Another unanswered question relates to whether common variants, known to be susceptibility factors in some complex diseases, are likely to be recurrent or to have one or a few origins. If gene variants have few origins, then for each variant a considerable segment of DNA surrounding the variant allele (haplotype) will be shared between individuals harboring the variant. For gene mapping, then, the question of the required marker density centers on the size of these haplotypes, or "ancestral blocks" (contiguous regions of DNA that have largely been inherited without recombination in human evolution), and on the ability to recognize them. As the functional information is contained within these blocks, association studies can be used to correlate haplotypes with specific phenotypes. It was suggested that approximately 10 to 20 single nucleotide polymorphisms (SNPs) per block would be sufficient for characterizing the human genome. This could translate to a map of 30,000 to 60,000 markers to analyze blocks of a megabase in size, which might be the case in studying a unique population with a recent ancestry. More typically, for an outbred population (with a lot of mixing and old mutations), the blocks will be considerably smaller, requiring a correspondingly larger number of markers. Block size is critically dependent on the population under study. Such dense SNP maps would enable the study of diseases from appropriate patient samples by association, without the necessity of family samples. This would reduce the cost of disease studies and, more importantly, genetic studies could be better designed with respect to the phenotype.
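
The marker-count arithmetic in the preceding paragraph can be sketched as follows. This is an illustrative calculation only: the genome size of roughly 3,000 megabases is an assumption not stated in the report (it is the value implied by the 30,000 to 60,000 figure), and the function name and the 100 kb block size used for the outbred case are hypothetical.

```python
# Assumption (not stated in the report): a genome of about 3,000 Mb,
# expressed here in kilobases so that the arithmetic stays exact.
GENOME_SIZE_KB = 3_000_000


def markers_needed(block_size_kb, snps_per_block, genome_size_kb=GENOME_SIZE_KB):
    """Markers required to place snps_per_block SNPs in every ancestral block."""
    n_blocks = genome_size_kb / block_size_kb
    return round(n_blocks * snps_per_block)


# Megabase-sized blocks with 10 to 20 SNPs each, as suggested at the workshop.
print(markers_needed(1000, 10))   # 30000
print(markers_needed(1000, 20))   # 60000

# Hypothetical smaller blocks (100 kb) for an outbred population.
print(markers_needed(100, 10))    # 300000
```

Because the marker count scales inversely with block size, a tenfold reduction in block size requires a tenfold denser map, which is why block size in the study population dominates the design.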

Human disease studies are best performed in populations in which genetic heterogeneity, both locus and allelic heterogeneity, can be minimized, and populations that satisfy these criteria need to be identified. Unfortunately, these characteristics for the specific disease loci cannot be readily measured. In general, culturally and geographically isolated populations satisfy this rule, but often do not have a sufficient number of cases for disease mapping to proceed with precision. Better methods and better parameters for characterizing human populations are necessary. As at other meetings, the issue of complex disease mapping and gene identification generated many opinions, but it was clear that there is not enough information available to answer the crucial questions conclusively. Thus, there is a critical need for NIH to stimulate and support pilot projects investigating a number of important issues, including: how to make SNP maps; how to survey common sequence variants (How do we detect all common variants? Is there benefit to restricting analysis to only those in coding sequences or regulatory regions?); how to explore the power of different "populations" to identify disease genes (What is the effect of a population's size and age on its usefulness in detecting disease genes?); how to survey ancestral haplotypes (What specific populations should be used in such studies?); and how to develop a bioassay for population heterogeneity.

A second idea that emerged from the workshop was the potential usefulness of a reference set of samples that could be used by scientists with diverse research interests as a resource to characterize and study human sequence variation. It was suggested that a collection of 500 trios (i.e., sampling both parents and one child) comprised of a relatively small number of "groups" (five) constructed in such a way that it could properly "represent" the U.S. population (if this were to be developed by a U.S. funding agency) would be extremely valuable. It was recognized that there are a number of scientific questions that need to be considered in developing such a resource and, beyond the scientific questions, there are very important ELSI issues that must also be addressed in the construction of such a collection. Further discussion of this concept is clearly needed.

The workshop discussed a variety of existing technologies that can be applied to genetic variation studies on a genome-wide scale. These methods, in their current implementation, are efficient either for the detection of genetic variation or for assaying specific variants in multiple samples, but not for both purposes. Further research into efficient detection and genotyping methods, particularly research aimed at increasing the sample throughput and decreasing the cost, is critically necessary.

PARTICIPANTS

Aravinda Chakravarti, Ph.D. - Organizer
Case Western Reserve University
Cleveland, OH 44106

Christopher Becker, Ph.D.
GeneTrace, Inc.
Menlo Park, CA 94025

Dr. David R. Bentley
The Sanger Centre
Wellcome Trust Campus
Hinxton, Cambridge CB10 1SA

Michael Boehnke, Ph.D.
University of Michigan
Ann Arbor, MI 48109-2029

Kenneth Buetow, Ph.D.
Fox Chase Cancer Center
Philadelphia, PA 19111-2412

Mark Chee, Ph.D.
Affymetrix
Santa Clara, CA 95051

John Clegg, Ph.D.
Institute of Molecular Medicine
University of Oxford
John Radcliffe Hospital
Headington OXFORD OX3 9DU

Francis S. Collins, M.D., Ph.D.
National Human Genome Research Institute
Bethesda, MD 20892

David Cox, M.D., Ph.D.
Stanford University
School of Medicine - M336
Stanford, CA 94305

Anna Di Rienzo, Ph.D.
University of Chicago
Chicago, IL 60637

Nicholas Dracopoli, Ph.D.
Sequana Therapeutics, Inc.
La Jolla, CA 92037

Georgia M. Dunston, Ph.D.
Howard University
College of Medicine
Washington, DC 20059

Geoffrey Duyk, M.D., Ph.D.
Millennium Pharmaceuticals
Cambridge, MA 02139-4815

Nelson Freimer, M.D.
University of California
San Francisco, CA 94113-0984

Richard Gibbs, Ph.D.
Baylor College of Medicine
Houston, TX 77030

Dr. Michel George
University of Liege
B4000 Liege, Belgium

Jody Hey, Ph.D.
Rutgers University
Nelson Labs
Piscataway, NJ 08855-1059

Leroy Hood, M.D., Ph.D.
University of Washington
School of Medicine
Seattle, WA 98195-7730

Richard Hudson, Ph.D.
University of California
Dept. Ecology/Evol Biology
Irvine, CA 92717-0001

Dr. Michael James
The Wellcome Trust Centre for Human Genetics
University of Oxford
Oxford OX3 7BN

Dr. Martin Kuiper
Keygene n.v.
The Netherlands

Eric Lander, Ph.D.
Whitehead Institute
Cambridge, MA 02139-1561

Kenneth Lange, Ph.D.
University of Michigan
School of Public Health
Ann Arbor, MI 48109

Charles H. Langley, Ph.D.
Center for Population Biology, UCD
Davis, CA 95616

Wen-Hsiung Li, Ph.D.
Univ Texas Health Science Center
Houston, TX 77225-0334

Kenneth Morgan, Ph.D.
Montreal General Hospital
Montreal, PQ H3G 1A4
Canada

Deborah Nickerson, Ph.D.
University of Washington
Seattle, WA 98185

Peter Oefner, Ph.D.
Stanford University
Stanford, CA 94305-5307

Val Sheffield, M.D.
University of Iowa
Iowa City, IA 52242

M. Anne Spence, Ph.D.
University of California
Orange, CA 92668

Dr. Jean Weissenbach
Genethon
Human Genome Research Centre
91002 Evry Cedex, France

Alexander Wilson, Ph.D.
National Human Genome Research Institute
Baltimore, MD 21224

Bettie Graham, Ph.D. - Co-organizer
National Human Genome Research Institute
Bethesda, MD 20892

Mark Guyer, Ph.D., - Co-organizer
National Human Genome Research Institute
Bethesda, MD 20892

INSTITUTE/AGENCY PARTICIPANTS

Lisa Brooks, Ph.D.
National Science Foundation

Dr. Felix de la Cruz
National Institute of Child Health and Human Development

Irene Eckstrand, Ph.D.
National Institute of General Medical Sciences

Elise Feingold, Ph.D.
National Human Genome Research Institute

Marvin Frazier, Ph.D.
Department of Energy

Soumitra Ghosh, M.D., Ph.D.
National Human Genome Research Institute

Judith Greenberg, Ph.D.
National Institute of General Medical Sciences

Dr. Richard Hodes, Director
National Institute on Aging

Elke Jordan, Ph.D.
National Human Genome Research Institute

Robert Karp, Ph.D.
National Institute on Alcohol Abuse and Alcoholism

Eric Meslin, Ph.D.
National Human Genome Research Institute

Dr. Stephen C. Mockrin
National Heart, Lung and Blood Institute

Kenji Nakamura, Ph.D.
National Human Genome Research Institute

Jane Peterson, Ph.D.
National Human Genome Research Institute

Jerry Roberts, Ph.D.
National Human Genome Research Institute

Jeffery Schloss, Ph.D.
National Human Genome Research Institute

Jack Taylor, Ph.D.
National Institute of Environmental Health Sciences

Ms. Elizabeth Thompson
National Human Genome Research Institute

Last updated: August 01, 2005