On April 2 and 3, 1998, the Department of Energy's Office of Biological and Environmental Research (DOE/OBER) and the National Institutes of Health's National Human Genome Research Institute (NIH/NHGRI) convened a workshop to identify informatics needs and goals that could be part of the next genome five-year plan and to begin crafting a vision for genome informatics over the next five years and beyond. In attendance were 46 invited informatics and genomics experts, along with staff members from DOE (six), NHGRI (eight), the National Institute of General Medical Sciences (NIGMS; two), and the National Science Foundation (NSF; one). The meeting was held at the Dulles Hilton in Herndon, VA.
Conclusions of the Meeting
A reference genome map and sequence database. The sequence data should be assembled into continuous sequence, with links to the maps. The sequence should be annotated and the information should be structured so that all sorts of queries can be run on the database. The data should be updated and curated by sets of editors rather than by anybody who wishes to correct or annotate it.
Integrated and linked databases
Variation database - organized by individual genotype and haplotype and by population.
The genetic variation database should include or link to information on individual phenotypic variation.
Functional/expression database, including pathway/regulatory databases (e.g. WIT, KEGG, EcoCyc).
Comprehensive data capture - raw data and the summary or processed data should be captured in standard formats. The data should be well-structured using controlled vocabularies.
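As a purely illustrative sketch of what capturing well-structured data with a controlled vocabulary might look like, consider the variation record below. The field names, the vocabulary terms, and the grouping helper are all hypothetical; the workshop did not specify any schema or vocabulary.

```python
from dataclasses import dataclass

# Hypothetical controlled vocabulary; the workshop did not define these terms.
VARIATION_TYPES = {"SNP", "insertion", "deletion", "microsatellite"}


@dataclass(frozen=True)
class VariationRecord:
    """One observed variant in one individual (illustrative fields only)."""
    individual_id: str
    population: str
    locus: str
    variation_type: str
    genotype: str
    raw_data_ref: str  # pointer back to the raw data, e.g. a trace file name

    def __post_init__(self):
        # Reject terms that are not in the controlled vocabulary.
        if self.variation_type not in VARIATION_TYPES:
            raise ValueError(f"unknown variation type: {self.variation_type!r}")


def by_population(records):
    """Group records by population, one of the organizing axes listed above."""
    groups = {}
    for rec in records:
        groups.setdefault(rec.population, []).append(rec)
    return groups
```

Keeping a reference to the raw data on every record is one way to let a later user go back and check a questionable call, and rejecting unknown terms at capture time keeps the vocabulary from drifting.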
The breakout groups had been asked to address four sets of issues, and their conclusions on these and some other issues are summarized:
Queries: Users want to be able to ask everything conceivable about sequences, genes, markers, regions, relationships, maps, proteins, functions, interactions, regulatory pathways, variation, phenotypes, and inter-species comparisons. They also want to know how the data were derived, under what experimental conditions, and by whom; to see the raw data (ABI traces, gel lanes, etc.); and to know what methods were used to process the raw data into database entries (e.g. sequence) and what QA/QC measures were applied - everything! It should be possible to answer all queries that could be supported by the data.
The need for all the underlying data arises especially for individual phenotypic data. Given the expense of phenotyping, it is important to be able to go back and check whether a particular SNP is really there. The ABI traces are not needed for the reference sequence since questionable regions can be sequenced again.
Tools: DNA sequencing has a bottleneck at finishing; tools to speed up this process are a critical need. Other needs are production tools, research tools (for analysis, for visualization, etc.), access tools (for visualizing data objects, for extracting objects from different databases, etc.), annotation tools, data capture tools, functional genomics tools, and data mining tools. Also needed are development and hardening of tools to promote easier dissemination, finishing, and exporting; QA/QC of the different tools; tools that are interoperable; map integration tools; and outreach tools. A web site that collects and annotates these tools would be very useful.
Standards: There was strong support for intelligent standards that the various constituencies of the genome project (academic, government, and industry) could join in defining and implementing. These include a variety of controlled vocabularies for the objects that would be entered into the appropriate databases. Today, industry standards are very distinct from the few that exist in the Human Genome Project (HGP) (e.g. Phred/phrap for sequence QA/QC). A current group (the OMG, Object Management Group) is composed mostly of industry representatives, but should involve academic and government representatives. Explicit object definitions and access methods are desperately needed. Component-oriented software standards (e.g. CORBA) would promote systems integration, interoperability, flexibility, and responsiveness to change. It was recognized that there is a balance between having standards and allowing change and flexibility.
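To make "explicit object definitions and access methods" concrete, the following is a minimal sketch of a shared access interface that several independent data sources could implement. It is a loose Python analogue of the idea, not CORBA IDL and not any OMG definition; the class and function names (SequenceSource, InMemorySource, fetch_first) and the accession strings are hypothetical.

```python
from abc import ABC, abstractmethod


class SequenceSource(ABC):
    """Hypothetical access interface that any sequence database could implement."""

    @abstractmethod
    def get_sequence(self, accession: str) -> str:
        """Return the sequence for an accession, or raise KeyError if unknown."""


class InMemorySource(SequenceSource):
    """Toy implementation backed by a dictionary, standing in for a real database."""

    def __init__(self, entries):
        self._entries = dict(entries)

    def get_sequence(self, accession: str) -> str:
        return self._entries[accession]


def fetch_first(sources, accession):
    """Query each source in turn through the shared interface."""
    for source in sources:
        try:
            return source.get_sequence(accession)
        except KeyError:
            continue
    raise KeyError(accession)
```

Because the caller depends only on the interface, a source can be replaced or added without changing client code, which is the kind of interoperability and responsiveness to change that the participants were asking for.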
Annotation: Automated annotation analyses should be done using clearly defined standard operating procedures, consistent application, and sufficient documentation. Automated annotation is a good place for biologists to start for more detailed understanding of particular chromosome regions. Human participation in the annotation process is still important, however, for getting the most out of genomic information.
Quality checks: There were suggestions that the databases be subject to regular checks of quality. Users are frustrated by incorrect data and the unwillingness or inability of database providers to correct these mistakes. Official editors who curate information could resolve errors and improve the data quality. The success of the quality assessment exercise for sequence centers provided a model for the usefulness of database quality assessments.
Training/Environment issues: NSF Science and Technology (S&T) centers are models for needed genome informatics centers. Three to five such centers were proposed, where there would be a critical mass to allow interactions among various disciplines and training of students.
The workshop closed with some policy recommendations:
There should be open competition for supplying most database/informatics needs.
No one database can be expected to do everything for everybody; however, users should feel that they are interacting with only one entity. Data submission should be uniform.
Existing frameworks (database schema, submission tools, etc.) should be used where possible.
There should be continued support for the model organism databases.
Raw data should be captured to the maximum extent possible before it is irretrievably lost.
There should be investments made in hardening and exporting software tools from genome centers.
DOE/NIH Genome Informatics Meeting
Dulles Hilton Hotel
April 2-3, 1998
To discuss the types of queries that will be important in genome informatics, and what types of data, tools, and databases will be needed to address them. The emphasis is on setting priorities for current and future user needs. The results of this meeting will contribute to the five-year plan for the HGP that DOE and NIH are formulating. The results will also influence the agencies' plans for informatics projects and funding.
Questions to address in talks and breakout groups:
Queries: What scientific questions will you want to answer? What types of data will you need to answer these questions? Which of these data types are permanent, which are temporary but important, and which will need to be regularly updated? What uses will you have for genomic sequence data in the next 5 years?
Tools: What protocols and tools for data submission, viewing, analysis, annotation, curation, comparison, and manipulation will you need to make maximal use of the data? What sorts of links among datasets will be useful?
Infrastructure: What critical infrastructures will be needed to support the queries you want to perform and what attributes should these infrastructures have? In what ways should they be flexible, and how should they stay current? How should they be maintained?
Standards: What kind of community-agreed standards are needed, e.g. controlled vocabularies, datatypes, annotations, and structures? How should these be defined and established?
First afternoon breakout groups (the first name is the moderator):
Sequencing, mapping for sequencing, gene maps:
Raju Kucherlapati, LaDeana Hillier, Eric Green, David Lipman, Takashi Gojobori, Peter Schad, Elbert Branscomb, Ray Gesteland, David Smith, Peter Cartwright, Rainer Fuchs, Peter Weinberger
Gene finding, OMIM, variation:
Ken Buetow, David Nelson, Anne Spence, Jim Ostell, Aravinda Chakravarti, David Valle, Bob Cottingham, Bruce Weir, Deborah Nickerson, Chuck Langley, Stan Letovsky
Chris Overton, Roger Brent, Martin Ringwald, Joanna Amberger, Mark Boguski, Manfred Zorn, Ed Uberbacher, Temple Smith, Richard Mural, David Balaban, Dixon Butler, Barbara Wold, Randall Smith
Carol Bult, Michael Cherry, Tony Kerlavage, Jean-Francois Tomb, Terry Gaasterland, Frederique Galisson, Reinhold Mann, Janan Eppig, Bill Gelbart, Katie Thompson, Paul Gilna
Thursday, April 2
Aristides Patrinos, Associate Director, (unable to attend)
Office of Biological and Environmental Research, DOE
Francis Collins, Director,
National Human Genome Research Institute, NIH
Moderator: Aravinda Chakravarti
David Thomassen, DOE
Aravinda Chakravarti, Chair of the NHGRI Planning Subcommittee
LaDeana Hillier, Large-Scale Sequencing
Takashi Gojobori, DNA Data Base of Japan
Anne Spence, Medical Genetics
Deborah Nickerson, Genetic Variation
Roger Brent, Functional Genomics
Rainer Fuchs, Industry
Bettie Graham, Training
Adjourn for day
Friday, April 3
Moderator: Aravinda Chakravarti
Reports from the four breakout groups
David Lipman, National Center for Biotechnology Information