NHGRI Policy for Release and Database Deposition of Sequence Data

December 21, 2000

The National Human Genome Research Institute's (NHGRI) policy for release and deposition of DNA sequence data was devised to make sequence data available to the research community as soon as possible for free, unfettered use. To achieve this objective, NHGRI adopted as policy a practice that the sequencing community imposed on itself, that data were to be deposited in a public database within 24 hours of generating a sequence assembly of 2 kb or larger. Data release according to this practice is far more rapid than the standard scientific practice of releasing data only upon publication.

In general, this practice has been enormously successful and has achieved its objective: the number of individual research projects that have used genomic sequence data generated by the public sequencing effort is already very large, even though a paper describing the entire genomic sequence has yet to be published. However, the policy now needs to be updated for two reasons: as originally stated, it does not address certain issues, and sequencing practices have advanced beyond the specific scope it addressed. Therefore, the NHGRI statement of policy for release and deposition of sequence data is being updated.

  1. Continued applicability of current policy. The current NHGRI policy on sequence data release (March 7, 1997) was developed early in the sequencing of the human genome and, as written, applies only to early-stage data. Thus, the policy only addresses the release of the initial sequence assemblies, calling for the submission of sequence of 2 kb or longer to GenBank within 24 hours of assembly. This current policy will remain in effect, but is extended as described under item 2.

  2. Extension of sequence data release policy.

    1. Data generated during finishing of working draft sequence. During the upgrade of working draft to finished sequence, both additional shotgun data and "finishing reads" will be acquired and assembled with the working draft data to produce finished sequence. As the additional data are incorporated, the new assemblies will often contain only minor changes from the existing working draft. It does not necessarily make sense to require a new submission within 24 hours every time a new assembly is done. At the International Strategy Meeting on Human Genome Sequencing at Cold Spring Harbor in May 2000, the participants agreed that it would be sufficient for the policy to call for updating accessions within 24 hours of a significant change, with the decision as to what a "significant" change is to be left to the sequence producer. Some examples of significant changes include achievement of full shotgun coverage of a clone, definitive closure of a sequence gap with concomitant reduction in the number of contigs, and finishing the sequence. NHGRI concurs with this recommendation and adopts it as part of its Data Release policy.

    2. Whole genome shotgun data. Increasingly, sequencing a large genome will involve a strategy that combines whole genome shotgun sequencing data with map-based framework data. In this approach, data producers will not start to assemble sequence until a significant amount of data has been collected -- this will likely be several months after data collection begins, and may be as much as a year later. However, the individual sequence reads or read pairs will be immediately useful to biologists for many purposes, e.g. in the annotation of the human sequence and in studying other genomes. Making these data publicly available prior to their incorporation into sequence assemblies would be consistent with the objectives of the NHGRI approach to data release.

      However, such very early release must also recognize the widely accepted ethic in the scientific community that those who generate the primary data freely should have both the right and responsibility to publish the work in a peer-reviewed journal. NHGRI believes that a reasonable approach is to recognize the opportunity and responsibility for sequence producers to publish the sequence assembly and large-scale analyses, while not restricting the opportunities of other scientists to use the data freely as the basis for publication of all other analyses, e.g. of individual genes, gene families and other projects at a more limited scale. To date, in many cases, the sequencing laboratory that has produced the data involved in a particular analysis was acknowledged and was actually a collaborator on some projects. This is a reasonable practice and NHGRI encourages its continuation.

In summary, to achieve a balance between the interests of the scientific community and those of the sequence producers, NHGRI adopts the following policy:

Sequence trace data, and all ancillary information specified in a standard format provided by the database, should be released weekly into the NCBI Trace Repository. The information deposited will consist of the sequence trace and ancillary data. The submissions to the Trace Repository will carry the following notice:

"As a public service to the biological research community, these data are being made available by the sequence producers before assembly and before scientific publication. Once deposited, but prior to the publication of the complete sequence of the relevant genome, the data are available to all as follows:

  1. The data may be freely downloaded by all users, for use in all types of analyses (with the single exception described in item 4).

  2. The data may be repackaged in other databases, provided that appropriate acknowledgement is given.

  3. Users are free to use the data for publication in scientific papers analyzing particular genes and regions; the source of the DNA sequence data should be appropriately acknowledged.

  4. The producing laboratories intend to publish the sequence of the genome and certain large-scale analyses of the sequence in a timely manner upon the completion of sequence data acquisition. Therefore, the sole exception to the unrestricted use of these unpublished data is that the data may not be used for the initial publication of the complete genome sequence assembly or other large-scale analyses. In this context, "large-scale" refers to regions the size of the whole genome or individual chromosomes and examples of "large-scale analyses" include identification of regions of evolutionary conservation across an entire genome and identification of complete sets of genomic features such as genes, repeat structures, GC content, etc. The producing laboratories will, however, be open to the possibility of collaboration on such assemblies or analyses."

  5. Any redistribution of the data should carry this notice.

Current NHGRI Policy for Release and Database Deposition of Sequence Data

March 7, 1997

At the Second International Strategy Meeting on Human Genome Sequencing (Bermuda, 1997), attendees affirmed the principle that was set out at the First (1996) International Strategy meeting, that primary genomic sequence should be rapidly released. Specifically, the report of the first meeting stated that "sequence assemblies should be released as soon as possible; in some centres, assemblies of greater than 1 kb would be released automatically on a daily basis." The discussions at the 1997 meeting confirmed NHGRI's conclusions that it is extremely important for its large-scale sequencing program to be functioning in a manner consistent with this principle, that such rapid release is technically feasible, and that such unfinished DNA sequence data have already been found to be useful by the larger scientific community. NHGRI has determined, therefore, that its grantees engaged in large-scale genomic DNA sequencing should now be automatically releasing sequence assemblies of 2 kb or larger within 24 hours of their generation. (The trigger for data release is 2 kb, instead of 1 kb, in order to ensure that the released sequence comprises at least two sequence reads. Investigators who wish to release smaller assemblies may do so.) Any laboratory funded by NHGRI for large-scale human genomic sequencing must develop and submit to NHGRI a plan to implement such a data release program, which must be implemented within one month of its being approved by NHGRI. No non-competing or competing renewal will be funded until an acceptable plan has been approved. Mandatory data release as described above will be made a condition of the award for any grant funded by NHGRI for large-scale human sequencing.
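
For illustration only, the release trigger described above (assemblies of 2 kb or larger, released within 24 hours of generation) can be sketched as a simple check. This is not an NHGRI tool; the function name `release_deadline`, the constant names, and the reading of 2 kb as 2,000 base pairs are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumption: 2 kb is read as 2,000 base pairs.
RELEASE_THRESHOLD_BP = 2_000
# The policy calls for release within 24 hours of generating the assembly.
RELEASE_WINDOW = timedelta(hours=24)


def release_deadline(assembly_length_bp, generated_at):
    """Return the latest release time for an assembly under the 2 kb rule.

    Returns None when the assembly is below the trigger size, in which
    case release is optional rather than mandatory.
    (Hypothetical helper, named for this sketch only.)
    """
    if assembly_length_bp >= RELEASE_THRESHOLD_BP:
        return generated_at + RELEASE_WINDOW
    return None


if __name__ == "__main__":
    generated = datetime(1997, 3, 7, 12, 0)
    print(release_deadline(2_500, generated))  # mandatory: deadline 24 hours later
    print(release_deadline(1_500, generated))  # below trigger: no mandatory deadline
```

An assembly below the trigger size yields no deadline, which matches the statement that investigators may release smaller assemblies but are not required to.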

