Resources for assembling the sequence: We need to annotate clones and contigs as the sequence is being assembled. Much of this need will disappear when the genome sequence is finished, although researchers will still need information about individual sequenced clones so they can select particular ones to study.
International Gene Index: A major immediate need is an authoritative list of human genes. This should be a collaborative effort of NCBI and EBI, which should produce the one central list providing canonical gene models. It is essential to have standard names for genes and other genomic objects. There can be aliases, but everybody must be able to use the standard name. (These names are for tracking the genes. Functionally informative names can be provided later.) NCBI and EBI can set up a process to compare gene predictions by contig, and produce a correspondence table:
IGI name, NCBI id, EBI id, Evidence (cDNA, EST, gene-predicting program, etc.).
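The correspondence table described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the process NCBI and EBI will adopt: the `Prediction` and `IgiRow` types, the `build_table` function, the rule that two predictions correspond when their coordinates overlap on the same contig, and the `IGI000001`-style names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Prediction:
    """One gene prediction from a single group (hypothetical layout)."""
    source_id: str
    contig: str
    start: int
    end: int
    evidence: Tuple[str, ...]  # e.g. ("cDNA", "EST")


@dataclass(frozen=True)
class IgiRow:
    """One row of the table: IGI name, NCBI id, EBI id, Evidence."""
    igi_name: str
    ncbi_id: Optional[str]
    ebi_id: Optional[str]
    evidence: Tuple[str, ...]


def overlaps(a: Prediction, b: Prediction) -> bool:
    # Assumed matching rule: same contig and overlapping coordinates.
    return a.contig == b.contig and a.start <= b.end and b.start <= a.end


def build_table(ncbi: List[Prediction], ebi: List[Prediction]) -> List[IgiRow]:
    """Pair overlapping predictions; keep unmatched ones with an empty partner."""
    pairs = []
    matched_ebi = set()
    for n in ncbi:
        partner = next(
            (e for e in ebi
             if e.source_id not in matched_ebi and overlaps(n, e)),
            None,
        )
        if partner is None:
            pairs.append((n.source_id, None, tuple(sorted(set(n.evidence)))))
        else:
            matched_ebi.add(partner.source_id)
            merged = tuple(sorted(set(n.evidence) | set(partner.evidence)))
            pairs.append((n.source_id, partner.source_id, merged))
    for e in ebi:
        if e.source_id not in matched_ebi:
            pairs.append((None, e.source_id, tuple(sorted(set(e.evidence)))))
    # The tracking names are placeholders, not a proposed naming scheme.
    return [
        IgiRow(f"IGI{i:06d}", n_id, e_id, ev)
        for i, (n_id, e_id, ev) in enumerate(pairs, start=1)
    ]
```

A real reconciliation would have to compare exon structure and evidence quality rather than simple coordinate overlap; the sketch only shows the shape of the resulting table.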
NCBI and EBI should announce this project soon. They expect to have the process for reconciling gene models set up by April 1, and full production efforts a month or so later. At the May CSHL Genome meeting they will describe the process and the first version across the genome. The draft sequence can be used, although we do not want a draft IGI. There should be two-way links to the lists of other groups.
Computational methods of gene-finding are quite error-prone. Some exons of a given gene can typically be identified, but the likelihood of finding all exons and stitching them together into the correct predicted coding sequence or mRNA is low. Thus, the gene models will change over time, and a robust nomenclature and versioning system needs to be in place so that users can move between old and new datasets. For example, the genes used on a chip should be trackable five years later. As the list is updated, we need to be able to transfer the biological information that has accumulated.
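One way to picture such a versioning system is a registry in which each standard name keeps its history of gene models, and a retired name points to its successor. The sketch below is a minimal illustration only; the `GeneRegistry` class, its method names, the `name.version` display form, and the example identifiers are all assumptions made for this example, not a proposed design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class GeneRecord:
    name: str
    versions: List[str] = field(default_factory=list)     # one model per version
    annotations: List[str] = field(default_factory=list)  # accumulated biology
    replaced_by: Optional[str] = None                      # set when retired


class GeneRegistry:
    """Hypothetical registry that keeps old names resolvable."""

    def __init__(self) -> None:
        self._records: Dict[str, GeneRecord] = {}

    def add(self, name: str, model: str) -> None:
        if name in self._records:
            raise ValueError(f"{name} already exists")
        self._records[name] = GeneRecord(name, [model])

    def revise(self, name: str, model: str) -> int:
        """Store a new gene model and return its version number."""
        record = self._records[name]
        record.versions.append(model)
        return len(record.versions)

    def annotate(self, name: str, note: str) -> None:
        self._records[name].annotations.append(note)

    def annotations(self, name: str) -> List[str]:
        return list(self._records[self.resolve(name)].annotations)

    def merge(self, old: str, new: str) -> None:
        """Retire `old` in favour of `new`, carrying its annotations across."""
        target = self.resolve(new)
        if target == self.resolve(old):
            raise ValueError("cannot merge a record into itself")
        old_record = self._records[old]
        old_record.replaced_by = target
        self._records[target].annotations.extend(old_record.annotations)

    def resolve(self, name: str) -> str:
        """Follow successor links from an old name to the current one."""
        while self._records[name].replaced_by is not None:
            name = self._records[name].replaced_by
        return name

    def current(self, name: str) -> str:
        record = self._records[self.resolve(name)]
        return f"{record.name}.{len(record.versions)}"
```

With a structure like this, a name recorded for a chip years earlier can still be resolved to the current model, and annotations attached to a retired name move to its successor rather than being lost.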
NCBI will work with its advisory committee. The IGI needs its own advisory group, with academic, industry, and international representatives.
Definitive sequence: We need one definitive sequence, with annotation. The international sequence databases and the genome sequencing centers need to work together to produce it. The sequence needs to be computable; researchers need to be able to access it so they can query the sequence in many ways. Some types of information can be pre-computed, and there should be views to answer popular questions. The sequence and annotation data have to be freely available in an easily downloadable form.
Automatic annotation: We need to have one authoritative view, for users who want just one view, provided by the international sequence databases. Many other users will be interested in various views, in order to validate models, to understand differences among models, and to choose particular useful views. Various methods of viewing the data will have different advantages. It will be important that the international sequence databases link to and from views provided by other databases, to make them easily accessible. Annotation needs to be done uniformly across the genome, and the annotation must be updateable.
Function annotation: The hard part will be keeping track of gene function. Some function information is predicted computationally; some is verified experimentally. Some researchers will be interested in any information on a gene; others will want only verified information. There is much interest in proteins, gene structure, and regulatory regions.
Biological curatorial annotation: We need to determine what the community needs and the best ways of meeting those needs. Models in use or proposed include:
Database editor: An editor of a database summarizes what is known about a gene. This model is used by OMIM (Online Mendelian Inheritance in Man), FlyBase, and the Saccharomyces Genome Database (SGD) for yeast. The editors need to be Ph.D.-level biologists who can synthesize the literature. In yeast, the community contacts the editors about problems in these entries. In humans, the community does not do a good job of pointing out mistakes.
Annotation meeting: The Celera annotation jamboree of the completed Drosophila sequence is an interesting model of how to get the community involved in annotation. The software and analytical tools for doing the annotation need to be provided. Some of the information coming out of the meeting was put directly into the database; some of the information was more complex biology that will be published.
Sequence-linked reviews program: NHGRI could have a small competitive grants program where PIs write book-chapter-type reviews on genomic topics. Such reviews would summarize the biology, and would be closely linked to the database. This approach allows flexibility, PI initiation of topics, PI credit, diversity of views, and scalability.
Controlled vocabularies for function: FlyBase, SGD, and the Mouse Genome Database (MGD) are working out a common controlled vocabulary for cellular location, biochemical function, and biological process. The vocabularies should be extended to include information on humans and other organisms.
Last Reviewed: May 2006