National Institutes of Health U.S. Department of Health and Human Services
The Comprehensive Extraction of Biological Information from Genomic Sequence
National Human Genome Research Institute
July 23-24, 2002
Purpose of the Workshop: This workshop was held to discuss a National Human Genome Research Institute (NHGRI) proposal to initiate a new, highly interactive public research consortium to carry out a pilot project for testing and comparing existing and new methods to identify functional sequences in DNA. The premise for the workshop was as follows:
A number of computational and experimental "wet bench" approaches have been developed to identify and characterize the encoded functional elements in genomic DNA sequence. However, no single strategy is yet able to identify all of the coding sequences in the human genome, and different approaches actually identify different, only partially overlapping sets of coding sequences. As for other sequence-based functional elements (e.g., regulatory regions, non-coding RNA sequences, elements involved in chromosome structure and function), few if any comprehensive approaches have even been devised for testing.
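As an illustration of what "partially overlapping sets of coding sequences" can mean in practice, the following minimal Python sketch compares the bases covered by two sets of predicted coding intervals on the same sequence. The function names, the half-open (start, end) coordinate convention and the example coordinates are all assumptions made for illustration; they are not taken from the workshop or from any particular annotation tool.

```python
def covered_bases(intervals):
    """Return the set of base positions covered by half-open (start, end) intervals."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end))
    return bases


def overlap_summary(pred_a, pred_b):
    """Compare two sets of predicted coding intervals on the same sequence."""
    a = covered_bases(pred_a)
    b = covered_bases(pred_b)
    shared = a & b
    union = a | b
    return {
        "only_a": len(a - b),   # bases called coding by method A alone
        "only_b": len(b - a),   # bases called coding by method B alone
        "shared": len(shared),  # bases called coding by both methods
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical predictions from two methods (coordinates are made up).
method_a = [(100, 200), (500, 650)]
method_b = [(150, 250), (500, 600), (900, 950)]
summary = overlap_summary(method_a, method_b)
print(summary)
```

A set of positions is adequate for a small pilot region; for whole-genome comparisons an interval-based representation would likely be preferable.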
NHGRI's intent in proposing the new consortium is to encourage discussion and comparison of existing computational and experimental approaches and to stimulate the development of new ones. The institute hopes that, by working together in a highly cooperative effort to rigorously analyze a defined portion of the human genome sequence, investigators with diverse backgrounds and expertise will be able to evaluate the relative merits of each of a diverse set of techniques, technologies and strategies in identifying all the functional elements in human genomic sequence, to identify gaps in our ability to annotate genomic sequence, and to consider the abilities of such methods to be scaled up for analyzing the entire human genome.
The consortium, as envisioned, would be open to all academic, government and private sector scientists interested in participating in an open process to facilitate the comprehensive interpretation of the human genome sequence. By initially concentrating on a limited region of the human genome, the institute hopes that all of those who have experience and insight into the problem will be willing to participate, whether or not their approaches are proprietary or have already generated proprietary data. In this way, the activities of the consortium could be influential in helping to guide the planning for a complete public elucidation of functional elements within the entire human genome.
Summary of Workshop Discussion: The workshop began with a series of presentations on computational annotation and experimental approaches to biological confirmation of functional elements in the genomes of both model organisms and the human. Subsequent to those discussions, NHGRI outlined its proposal for a pilot project to exhaustively determine all functional elements in a small fraction (~1 percent) of the human genome, which the workshop participants strongly endorsed. The workshop participants then discussed a number of issues that need to be addressed in order to implement the proposed pilot project. The following outlines the highlights of these discussions.
Initial Inventory of Functional Elements to Identify: The participants recommended that both protein-coding genes and non-protein-coding genes be identified. For each of these, the complete (full-length) coding sequence and all variants, as well as the transcriptional regulatory elements (e.g., promoters and enhancers) and post-transcriptional regulatory elements (e.g., cis-acting RNA elements), should be described. All pseudogenes should be identified. A number of global sequence features, such as sites of methylation, sequence variation, evolutionary history of sequence blocks and repetitive elements, were suggested for inclusion, as were a number of chromosomal elements, such as origins of replication, nuclease hypersensitive sites, matrix attachment sites and histone modifications.
Initial List of Technologies/Approaches that Could be Utilized: Several "wet bench" technologies and resources were discussed. These included DNA array studies, RT-PCR/cDNAs, in situ hybridization, chromatin immunoprecipitation, RNAi, knockout mice, and antibody analysis of protein function. A broad range of computational approaches were also considered to be critical for inclusion. These included both comparative sequence analysis of multiple genomic sequences to identify conserved elements and automated prediction of functional elements, including coding sequences, promoters, alternative splice variants and other highly conserved regions. The importance of ensuring close collaboration between experimental and computational approaches was stressed.
Process for Selection of Genomic Targets: The establishment of a working group charged with selecting the target sequences for analysis was recommended. The following factors were suggested as criteria for target selection: regional size, gene density, GC content, repeat content, evolutionary rate, recombination frequency, cytogenetic location and long-range chromosomal structure. Some sequence from the X chromosome should be considered for inclusion and the availability of existing data sets should be considered.
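Two of the suggested selection criteria, GC content and repeat content, can be computed directly from sequence. The minimal Python sketch below shows one way this might be done. It assumes that repeats have been soft-masked (written in lowercase), which is a common convention but is not something the workshop specified. The function names and the example sequence are hypothetical.

```python
def gc_content(seq):
    """Fraction of G or C among unambiguous bases (A, C, G, T); N and others are ignored."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def soft_masked_fraction(seq):
    """Fraction of letters in lowercase, assuming repeats are soft-masked as lowercase."""
    letters = [c for c in seq if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.islower() for c in letters) / len(letters)


# Hypothetical candidate region (made-up sequence, far shorter than a real target).
candidate = "ATGCGCGCatatatNN"
print(f"GC content:      {gc_content(candidate):.3f}")
print(f"Repeat fraction: {soft_masked_fraction(candidate):.3f}")
```

The other criteria listed above, such as gene density, evolutionary rate and recombination frequency, require annotation or comparative data and cannot be derived from a single sequence in this way.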
Criteria for Participation: To participate in the consortium, investigators must agree to analyze the entire set of target regions, offer a substantial contribution, share all results and participate in group activities. Participation in the consortium should not be determined by funding source. Those who already have funding, or obtain funding from a non-NHGRI source, should be eligible; it is also reasonable to expect that NHGRI will provide funding for participation in the consortium. The timeframe for an NHGRI-sponsored, open competition needs to be worked out; however, there is interest in getting this project started right away and not waiting for a whole funding cycle to go by.
Organizational Issues: A steering committee (composed of project participants) and an advisory committee (composed of non-participants) should be organized. Working groups that are needed to rapidly address certain issues, such as data management, should also be established.
Several suggestions were made to publicize the project and to help recruit participants to the project. Names of individuals with appropriate expertise were solicited from meeting participants and should be solicited from others; those names should be sent to Mark Guyer. NHGRI should develop a project Web site, and should also put a notice in Science with the Web address.
Data Management: A working group, with both computational and experimental expertise, should be established to begin consideration of several issues. A database needs to be established to allow data display and distribution. The use of a controlled vocabulary and the development of a Sequence Ontology (SO), analogous to the Gene Ontology, should be encouraged. Data should be fully open so that others can redisplay and reanalyze them. Open source software should be used for data management. There should be early coupling of experimental and computational data.
The point was made that the project will involve large sets of complex data, so priority setting will be critical. Guidance on how to prioritize the data is needed even before the project is initiated. As the data will be richer than can be accommodated by just placing it on a sequence alignment, it is likely that more than one data management effort will be involved. This will raise the need for coordination among the different groups.
Data Release: All data generated by consortium members must be immediately available to all consortium members. In addition, the algorithms and source code for software developed for the project must be disclosed for scientific evaluation. Early access to ongoing results by the entire research community is also desirable, while at the same time it will be important to preserve the opportunity for participants to publish their analyses. In other words, a balance must be struck among the issues of data release, appropriate credit and publication opportunities.
Other Issues: A process for evaluating the progress of the effort will be needed; this will involve some kind of independent group and/or the advisory committee. Some concern was expressed that the time allowed before evaluation should not be too short, since evaluating too early can be somewhat risky. Quality control issues will need to be addressed. It is likely that access to a common set of reagents (e.g., arrays) will become an issue.