July 25, 2000
The purpose of this workshop was to develop recommendations on the need for and feasibility of establishing a sequence trace repository. There were two components to this discussion, and the participants were encouraged to maintain the appropriate distinctions between them: 1) establishment of an archive for sequence traces from new projects, such as the sequencing of the mouse genome, and 2) establishment of an archive for "legacy" data, i.e., data already generated and archived at the centers as part of the human sequencing project.

1. Data From New Projects

John Bouck began the discussion with an overview of the findings of the Mouse Trace Warehousing Working Group. The group had been organized last winter and charged with considering the mouse genome sequencing project's need for a trace archive, developing specifications for such an archive, and estimating the cost of setting it up and running it for mouse data. The working group's final report* focused on an archive whose primary function would be to serve the sequencing centers as a resource for assembly and finishing activities. The needs identified for such a resource included large withdrawals of Bacterial Artificial Chromosome (BAC)-sized data sets, regular requests, and staged withdrawals, which would allow a near-line system. The working group also considered a secondary role for the archive in meeting the needs of a limited number of investigators from the wider biological community. In this mode the requirements would differ: the resource would be expected to handle small as well as large withdrawals, accept irregular requests, and respond rapidly, which would require an on-line system. The working group recommended that incoming data be in SCF3 format (to be as close to the original data as possible), with ancillary information such as lab, vector used, and sequencing chemistry included in an associated file. It also recommended a number of searching capabilities and outgoing data formats for the system. Unresolved considerations raised by the group included: 1) the response time needed for the different requests, 2) the longevity of the archive, 3) search specifications, and 4) open-source policies for the algorithms.

Discussion Points:

Mike Zody reported on the progress of the recently convened Nomenclature Working Group, made up of representatives from the G5 labs. In anticipation of the Trace Archive Workshop, this group discussed how reads are named in the different centers and how the differences would affect submission to an archive. The group compiled a list of current naming conventions and concluded that the read names should be treated only as a unique identifier; each trace should also have a separate, attached file, with a defined format, containing the necessary ancillary data about the trace. The group recommended that the archive be capable of accepting multiple trace file formats and of converting them to a standard format. A draft format* for the ancillary data had been formulated and circulated to the working group; it was also made available to the workshop participants. It was agreed that another iteration is needed for the working group to finalize the report, which should then be distributed to participating large-scale sequencing centers for additional comment. This should be done in the next two weeks.
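The recommendation to treat read names purely as unique identifiers, with descriptive information carried in a separate ancillary record, can be illustrated with a minimal sketch. The class names, field names, and example values below are hypothetical and are not taken from the working group's draft format; only the kinds of ancillary data (lab, vector, sequencing chemistry) come from the discussion.

```python
from dataclasses import dataclass, field


@dataclass
class TraceRecord:
    """One archived trace. All field names here are illustrative assumptions."""
    read_name: str      # opaque unique identifier; never parsed for meaning
    trace_format: str   # format of the submitted trace file, e.g. "SCF3"
    ancillary: dict = field(default_factory=dict)  # lab, vector, chemistry, ...


class TraceArchive:
    """Toy in-memory archive keyed on read name (hypothetical design)."""

    def __init__(self):
        self._records = {}

    def submit(self, record):
        # The read name only has to be unique; its internal structure is ignored.
        if record.read_name in self._records:
            raise ValueError(f"duplicate read name: {record.read_name}")
        self._records[record.read_name] = record

    def find(self, **criteria):
        # Searches run against the ancillary data, not the read name.
        return [
            r for r in self._records.values()
            if all(r.ancillary.get(k) == v for k, v in criteria.items())
        ]


archive = TraceArchive()
archive.submit(TraceRecord("read-0001", "SCF3",
                           {"lab": "center_a", "vector": "vector_x", "chemistry": "chem_1"}))
archive.submit(TraceRecord("xyz.b1", "SCF3",
                           {"lab": "center_b", "vector": "vector_y", "chemistry": "chem_1"}))

from_center_a = archive.find(lab="center_a")
same_chemistry = archive.find(chemistry="chem_1")
```

Because nothing is inferred from the name itself, centers with different naming conventions can submit to the same archive without renaming their reads, provided the names do not collide.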

Discussion Points:

Jean Thierry-Mieg presented a proposal from the National Center for Biotechnology Information (NCBI) for a trace repository. He offered the following reasons why a repository is needed and why it should archive traces:

The proposed archive would store compressed SCF files and accept multiple formats via FTP or tape from the sequencers. Setup would cost roughly $250,000 per 30 million sequence reads, with additional costs for ongoing operations. The archive would be part of NCBI's existing resources, and multiple query tools and export formats would be developed so that the community could use it.
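As a rough illustration of what the quoted setup figure implies, the short sketch below scales it to other read counts. It assumes the cost grows linearly with the number of reads, which the proposal does not state, and it leaves out the ongoing operating costs, for which no figure was given. The function name is hypothetical.

```python
SETUP_COST_USD = 250_000        # setup figure quoted in the proposal
READS_PER_UNIT = 30_000_000     # number of reads that figure covers


def estimated_setup_cost(reads):
    """Setup cost in USD for `reads` traces, assuming linear scaling (an assumption)."""
    return SETUP_COST_USD * reads / READS_PER_UNIT


cost_per_read = estimated_setup_cost(1)              # a little under one cent per read
cost_for_60_million = estimated_setup_cost(60_000_000)
```

Under that assumption, doubling the number of archived reads to 60 million would double the setup cost to about $500,000.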

As a guide to thinking about the design of a trace repository, the following table was constructed of the foreseeable uses, types of data sets and response time needed for different requests that could be made for whole genome shotgun and BAC by BAC sequence data from mouse, rat and other organisms.

Use/Users | Type of Data | Size (of demand) | Response Needed
Human Annotation | Now: FASTA files w/Quality Scores | all | Batch data with delay
Large Scale | Later: piecemeal access to traces | 3% - 5% | Nearly real time, with local caching of data
Small Scale | Traces | small | Rapid access
Sequence Variation | | |
Large Scale | Now: FASTA files w/Quality Scores | all | Batch data with delay
 | Later: access to traces | small | Nearly real time, with local caching of data
Small Scale | Later: All traces | all | Batch data with delay
 | Traces | small | Rapid access
Large Scale | Traces | large | Batch data with delay
Small Scale | Traces | small | Rapid access
Development of WG Assembly Methods | Traces | all | Batch data with delay

Discussion Points:

The above table generally applies to user needs for the human trace data as well as the mouse data, but the human sequence data also raise unique near-term issues. The most urgent of these is the possibility of collecting and using the trace data to improve the Golden Path in time to meet publication deadlines. It was agreed that such data would be valuable for this purpose, but the issue is one of feasibility. Representatives of each of the G5 labs estimated that it would take a month or two to de-archive and prepare the legacy human draft data for submission to an archive. However, such efforts would involve some of the same staff, and would therefore compete with the ongoing effort within each center to clean up the data going into the Golden Path. The workshop participants agreed that the latter effort was of higher priority, but referred the issue to the G5 principal investigators (PIs) for further discussion. It will probably also need to be taken to the G16.

Beyond the whole genome shotgun data, the participants agreed that EST and BAC-end sequence reads should also be included in the archive. Each of these read types will have some unique properties (in terms of ancillary data), and the format for collecting these data will have to be worked out. Because these reads can be added to the archive later, the establishment of the archive should not be delayed to work out this issue. NCBI will formulate a proposal for dealing with EST reads that have already been accessioned and provide feedback to the nomenclature group.

The participants also raised the question of archiving finishing reads, which are unique and need to be clearly identified as such. This issue was referred back to the Finishing Working Group, which will develop a proposal for archiving these reads.

Major Action Items

*The two documents referred to above can be made available to anyone interested. Please contact Kris Wetterstrand to request a copy.

Rick Myers, Chair
Stanford University

John Bouck
Baylor College of Medicine

Asif Chinwalla
University of Washington Genome Sequencing Center

Eric Green
NIH Intramural Sequencing Center (NISC)

Phil Green
University of Washington

David Lipman
National Center for Biotechnology Information, NIH

Dick McCombie
Cold Spring Harbor Laboratory

Jill Mesirov
Whitehead Institute for Biomedical Research, MIT

Chad Nusbaum
Whitehead Institute for Biomedical Research, MIT

Jim Ostell
National Center for Biotechnology Information, NIH

Marc Rubenfield
Genome Therapeutics Corp

Greg Schuler
National Center for Biotechnology Information, NIH

Lincoln Stein
Cold Spring Harbor Laboratory

Jean Thierry-Mieg
National Center for Biotechnology Information, NIH

George Weinstock
University of Texas Medical School

Mike Zody
Whitehead Institute for Biomedical Research, MIT


Ewan Birney
The European Bioinformatics Institute

Jane Rogers
The Sanger Center


Francis Collins, Director
Elke Jordan, Deputy Director
Mark Guyer, Assistant Director for Scientific Coordination
Jane Peterson, Program Director, Large-Scale Sequencing
Adam Felsenfeld, Program Director, Large-Scale Sequencing
Lisa Brooks, Program Director, Informatics
Kris Wetterstrand, Scientific Program Analyst

