National Human Genome Research Institute Workshop on
Developing Guidelines for Choosing New Genomic Sequencing Targets
July 9-10, 2001
Introduction and Summary
The large-scale DNA sequencing component of the Human Genome Project (HGP) has made enormous strides in the past several years. Complete sequences of the yeast S. cerevisiae, the roundworm C. elegans, the fruit fly D. melanogaster, and many bacteria and archaea have been determined and published. A draft sequence of the human genome has been published, and the sequencing of the mouse and rat genomes is well under way. The availability of these genome sequence data is transforming the ways in which computational and experimental biology using these organisms are being done. The demand for the genomic DNA sequence from still other organisms is growing by leaps and bounds, and has already outgrown even the prodigious public sector sequencing capacity (currently estimated at more than 50 billion "raw" bases per year in the United States and the United Kingdom combined).
While decisions about sequencing targets, during the initial years of the HGP, were made by the HGP collaborators, with advice from formal advisory committees, the NHGRI believes that it is now appropriate to move toward a process in which investigator-initiated proposals, peer review, and competition will play a larger role in determining the use of sequencing resources. If this shift in decision making is to be successful, the community of interested investigators, the scientists who are running the large-scale sequencing laboratories, and those who make the decisions about distribution of resources will all need to have input into the decisions, as well as clear guidelines on which to base and evaluate the scientific arguments; in other words, we need a systematic process for selecting genomes for sequencing, one that focuses on the scientific issues that can be addressed by having sequence data from a new organism. Such a process will allow all investigators who wish to propose genomes to sequence for specific, well-defined purposes to take advantage of the rigor and robustness provided by scientific discussion and peer evaluation.
To address this issue, the National Human Genome Research Institute (NHGRI) convened a workshop, entitled "Developing Guidelines for Choosing New Genomic Sequencing Targets." The purposes of the workshop were to 1) discuss broadly the opportunities available from comparative genome analysis using whole genome data sets and 2) develop a set of guidelines that investigators proposing to use NHGRI funds for large-scale sequencing of a genome should address in their proposals.
The participants in the workshop (attendance list attached) ratified the premise that future choices of sequencing targets should be based on the scientific opportunities that would arise from having the sequence of the genomes of additional organisms. They also identified a number of reasons that would justify the selection of a new organism for genomic sequencing, and noted that these rationales fell into two distinct groups. One set was biological and addressed the ways in which the resulting sequence information could contribute to future biomedical and biological research. The other was strategic, involving pragmatic issues, particularly the feasibility of obtaining the sequence and the size of the research community available to use the information. The workshop participants agreed that both sets of issues needed to be addressed so that the priority of sequencing a given organism could be established on a cost-benefit basis. Thus, the workshop recommended that all who propose new organisms for genomic sequencing must address both sets of issues.
The extraordinary progress that has been made in genomic research in the past few years, particularly in genomic DNA sequencing, has created both new opportunities to acquire the genome sequences of many organisms and the need to choose sensibly among the many candidate genomes. During the initial years of the HGP, decisions about sequencing targets were made by the HGP collaborators, with advice from formal advisory committees. However, the NHGRI believes that it is now appropriate to move toward a process in which investigator-initiated proposals, peer review, and competition will play a larger role in determining the use of the NHGRI-supported sequencing capacity. A systematic process that focuses on the scientific issues that can be addressed by sequence data from a new organism will allow all investigators, the sequencers, and the NHGRI to select additional organisms for genomic sequencing on the basis of specific, well-defined goals, taking advantage of the rigor and robustness provided by scientific discussion and peer evaluation.
To address this issue, the NHGRI convened a workshop, entitled "Developing Guidelines for Choosing New Genomic Sequencing Targets." The purposes of the workshop were to 1) broadly discuss the opportunities available from comparative genome analysis using whole genome data sets, and 2) develop a set of guidelines that investigators proposing to use NHGRI funds for large-scale sequencing of a genome would have to address in their proposals.
The workshop attendees quickly agreed on several key, underlying points.
- The initial sequencing goals of the HGP (the complete sequences of E. coli, S. cerevisiae, C. elegans, D. melanogaster, and the human) are almost assuredly going to be reached and even exceeded (i.e., with the addition of the mouse, rat, and zebrafish sequences) within the next few years.
- NHGRI should continue an active large-scale sequencing program because medical benefits will ultimately accrue from the availability of a deep, robust genomic sequence data set.
- Complete genomic sequences from many more organisms are needed.
Sequence information about additional organisms will be of interest to the NHGRI and the NIH in two ways.
- In addition to those organisms whose genomic DNA sequence is already, or will shortly be, known, many others are of intrinsic scientific and/or medical significance. Certain organisms are ideal systems for understanding a basic biological process with high relevance to human biology (the role of studies of C. elegans in the analysis of apoptosis and of Drosophila in elucidation of the development of the body plan exemplify this point), while other organisms are of great medical significance in that they provide a robust entry into disease gene identification or are themselves pathogenic (examples include the contributions of yeast genetic research to identification of the genes involved in the human peroxisomal biogenesis disorders and the sequencing, currently under way, of the Plasmodium organisms responsible for malaria). For many of the other organisms whose genomes have not yet been sequenced, sequence data will provide additional insight into important medical and scientific issues.
- In other cases, the genomic sequence of certain organisms will be of substantial value in comparative studies with those of organisms for which the genomic sequence is already known. Among other applications, this appears to be the most powerful experimental strategy to identify small segments of human genomic sequence that regulate gene expression.
Much of biological and biomedical research has always been based on comparative analyses of different types, and thus has always included the study of a variety of non-human organisms. Sequence analysis is no different. Having multiple sequences with which to do different kinds of comparisons is essential for the investigation of different questions. For example, a question such as "in what pathway does a certain gene act?" might be best analyzed based on a comparison of the sequences of the human, nematode and/or fruit fly versions of that gene, whereas a question such as "how did alterations in the expression of that gene evolve?" might require comparison of the sequence of the human gene with that of the mouse or baboon gene.
Sequence comparison is becoming an increasingly valuable tool for contemporary biological and biomedical research because it is currently the most effective technique available for identifying candidate functional regions in genomic DNA. The range of "function" associated with DNA sequence is broad, and includes specification of proteins and non-protein-encoding RNAs, control of the timing and location of the expression of transcribed sequences, control of chromosome structure and chromosome mechanics and, potentially, other biological properties as well. Different comparisons will be required to gain insight into different functions. Comparing genome sequences across a variety of evolutionary distances, e.g. within a species, between close species, and between more distant species, will be necessary to gain a full understanding of the information content of human genomic DNA.
The techniques for performing sequence comparisons are still at an early stage of development, and two critical issues must be kept in mind as this approach is applied and is developed further. First, the computational problems differ with comparison of sequences at different evolutionary distances. Second, differential rates of evolution of species and of different genomic regions within the same species must be considered. The species that are most informative for explicating the human sequence will thus differ with the locus being analyzed.
There are already many examples in which comparative sequence analysis has provided important clues to understanding sequence function, and some generalizations are slowly emerging. For example, only small sequence blocks are conserved in regulatory regions, and the ordering and spacing of those conserved sequences are themselves much less conserved. However, experimental analyses have demonstrated the significance of such conserved sequences in a number of instances, including transgenic studies that show that genes from one species can be regulated properly in another. From these initial results, it is quite clear that there is significant value in having the sequence of a second, closely related species for comparison, and there is a growing agreement that sequence from one (or more) other, more distantly related species will add further value.
The acquisition of DNA sequence data is still relatively expensive (ca. $0.10 per base for high-accuracy "finished" sequence to $0.025 per base for 4-fold coverage, or "draft," sequence). Thus, to approach the use of comparative sequence information effectively, we have to make intelligent choices about the priorities of sequencing projects in the next few years. In the mid to long term, sequencing technology will undoubtedly improve and the efficiency of determining DNA sequence will increase so that, as time passes, more sequence data will be generated with a given amount of funding. However, for the near term, it remains important to identify those organisms that would be the most experimentally interesting and useful and to be clear about how we would use the genomic information about those organisms. For instance, what kind of genomic information, and how much of it, will be necessary to answer the questions that need to be studied? And how can we envision using genomic information in the most productive way?
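To make the scale of these per-base figures concrete, the arithmetic can be sketched as follows. This is only an illustration under stated assumptions: the two cost constants are the approximate 2001 figures quoted above, the draft figure is read as a cost per base of genome at 4-fold coverage (the text does not spell this out), and the genome sizes and function names are hypothetical round examples rather than anything discussed at the workshop.

```python
# Approximate per-base figures quoted in the text (circa 2001, US dollars).
FINISHED_COST_PER_BASE = 0.10   # high-accuracy "finished" sequence
DRAFT_COST_PER_BASE = 0.025     # 4-fold coverage "draft" sequence (assumed per genome base)
DRAFT_COVERAGE = 4              # fold coverage associated with a draft


def sequencing_cost(genome_size_bases: int, cost_per_base: float) -> float:
    """Estimated cost in dollars of sequencing a genome of the given size."""
    return genome_size_bases * cost_per_base


def raw_bases_needed(genome_size_bases: int, coverage: int = DRAFT_COVERAGE) -> int:
    """Number of raw bases that must be read to reach the given fold coverage."""
    return genome_size_bases * coverage


if __name__ == "__main__":
    # Hypothetical genome sizes, chosen only to illustrate the arithmetic.
    examples = [("100 Mb genome", 100_000_000), ("3 Gb genome", 3_000_000_000)]
    for label, size in examples:
        finished = sequencing_cost(size, FINISHED_COST_PER_BASE)
        draft = sequencing_cost(size, DRAFT_COST_PER_BASE)
        raw = raw_bases_needed(size)
        print(f"{label}: finished ${finished:,.0f}, draft ${draft:,.0f}, "
              f"raw bases for {DRAFT_COVERAGE}-fold draft: {raw:,}")
```

Under these assumptions, a hypothetical 3-billion-base genome would cost roughly $300 million to finish or $75 million to draft, and a 4-fold draft would require about 12 billion raw bases, which is why the choice of targets has to be made on a cost-benefit basis.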
The discussions at the workshop assumed that there would be so much interest in generating sequence data from so many more organisms that the sequencing capacity that the NHGRI currently supports will be insufficient to meet the demand. Given that, the participants agreed that all scientists interested in the additional genomic sequence should have competitive access to the sequencing capacity supported by the NHGRI and that the decisions of which genomes to sequence should be based on priorities determined by a peer review process. Another factor that is relevant to the development of a policy for determination of sequencing targets is that the duration of any particular sequencing project will be considerably shorter than the five-year project period for the grants that support the sequencing centers. As a result, NHGRI staff have proposed that the peer review that will be involved in establishing a priority order for sequencing be dissociated from that involved in the evaluation of the grant applications for support of the sequencing centers.
Because of the rapidity with which centers will be able to complete sequencing projects (from as little as a month for small genomes to one to two years for larger genomes) and because the selection of an organism will not involve funding decisions, the standard NIH study section model for peer review does not seem appropriate for the process of selecting new sequencing targets. Rather, a process involving the National Advisory Council for Human Genome Research (NACHGR), acting in its program advisory capacity, seems preferable. At the workshop, the primary subject of discussion was the criteria that such a peer review process should use to evaluate the proposals.
The participants identified a number of factors that could influence the selection of a new organism for genomic sequencing, and noted that these fell into two distinct groups. One set was biological, concerning the ways in which the resulting sequence information would contribute to future biomedical and biological research. The other set was strategic and concerned a number of pragmatic issues relevant to sequence acquisition, such as the feasibility of obtaining the sequence and the size of the research community available to use the information. It was agreed that both sets of factors would have to be addressed so that the priority of obtaining the sequence of an organism could be established on a cost-benefit basis.
- Specific biological rationales for the utility of sequence data from new organisms.
  - Informing human biology. (How will the genomic sequence of a particular organism lead to a better understanding of biological functions in the human?)
  - Informing the human sequence. (How will the genomic sequence of a particular organism lead to a better description of the functions of specific sequence features of the human genome?)
  - Informing the sequences of non-human organisms ("model organisms") used in the study of human biology. (How will the genomic sequence of a particular organism lead to a better description of the functions of specific sequence features of the genomes of particular model organisms?)
  - Providing a better connection between the sequences of non-human organisms and the human sequence. (How will the genomic sequence of a particular organism increase our ability to identify orthologs in the sequences of well-studied model organisms and how will that deepen our understanding of the human sequence?)
  - Facilitating the ability to do experiments, e.g. "direct" genetics or positional mapping, in additional organisms.
  - Expanding our understanding of basic biological processes relevant to human health, e.g. developmental biology, neurobiology, cancer biology, stem cell biology.
  - Expanding our understanding of evolutionary processes (biological innovation, selection) in general, and human evolution in particular.
  - Providing additional surrogate systems for human experimentation, e.g. new disease models, improved opportunities for drug testing, or other medical procedures, such as transplantation.
- Strategic issues involved in the acquisition of sequence data from new organisms.
  - The demand for the new sequence data. What is the size of the research community that will use it? What is the community's enthusiasm for having the sequence? Will the availability of the new sequence data affect the size of the research community using that organism and, if so, how?
  - The suitability of the organism for experimentation. How will the new sequence data enhance the experimental use of the organism? What genomic resources and technologies (e.g. gene transfer, ability to go from molecule to mutation) are available that will allow the new sequence information to be effectively used?
  - The rationale for the complete sequence of the organism. Why would the complete sequence be more useful than the sequences of specific regions, or only the coding sequences, or only ESTs? Are there alternative ways to get the necessary information?
  - The cost of sequencing the genome and the state of readiness of the organism's DNA for sequencing. What is the size of the genome? What quality of sequence product is needed (finished sequence, draft, or full shotgun)? What sequencing strategy will be used? Is suitable DNA readily available?
The workshop participants recommended that all proposals of new organisms for genomic sequencing by the NHGRI-supported sequencing centers must address both sets of issues.
In the course of the workshop, a number of different organisms were discussed as illustrations of specific points. By the end of the workshop, a consensus had emerged among the participants that sequencing at least one organism at each of the important branch points in vertebrate evolution is very important. Among the candidates in each phylogenetic class, an organism with a small genome would have an advantage as a sequencing target, as would one with tractable genetics. As noted above, it was also generally agreed that it will be useful to have the sequences of some closely related organisms and some that are distantly related, to cover genomic regions with different rates of evolutionary change. It was also recognized that DNA sequences from organisms with genomes that show relatively slow rates of evolutionary change would be more useful in defining evolutionary trees than would sequences from organisms showing relatively high rates of evolutionary change.
As generic mechanisms for proposing an organism, the participants suggested reports from meetings, conferences and other public discussion forums, pilot projects, and published white papers. Not only would these documents be suitable for presentation to the peer review system, they would also help to raise the level of awareness across the scientific community, including investigators who might become interested in participating in the projects, the sequencing centers, and National Institutes of Health (NIH) administrators. To be most useful, the reports should address what we know about an organism, biological phenomenon, or research question, what we don't know, and how the sequence of a specific organism would inform us. It was strongly agreed that no investigator or community of investigators should expect that the DNA of an organism would be sequenced without a well-thought-out rationale being presented.
It was also noted that NHGRI (and other NIH institutes) could play an important nurturing role for this process by providing support for travel, meetings, and even for the generation of white papers.
Last Reviewed: April 2006