Last updated: October 01, 2012
Reaffirmation and Extension of NHGRI Rapid Data Release Policies:
Large-scale Sequencing and Other Community Resource Projects
A guiding principle of the Human Genome Project has been that the data and resources it has generated should rapidly be made available to the entire scientific community. In practical terms, this has involved the release of data and materials prior to publication, i.e. much more rapidly than is traditional in the scientific community.
- In 1991, the National Human Genome Research Institute (NHGRI) and the Department of Energy (DOE) jointly developed a data release policy that called for release of data and materials within 6 months of their generation.
- In 1996, the International Human Genome Sequencing Consortium adopted the "Bermuda Principles" that expressly called for the automatic, rapid release of sequence assemblies of 1-2 kb or greater to the public domain. To implement the Bermuda Principles, in April 1997 the NHGRI adopted a data release policy that called upon those of its grantees engaged in large-scale genomic DNA sequencing to release DNA sequence assemblies of 2 kb or greater within 24 hours of their generation.
- By the late 1990s, the NHGRI recognized that the April 1997 policy was no longer comprehensive because it did not pertain to randomly generated whole genome shotgun data. Such data sets typically are not assembled until late in a project, so tying data release to assembly would not, in this instance, have ensured rapid access to the underlying data set. Accordingly, in December 2000, the NHGRI extended its data release policy to include weekly submission of the raw sequence traces to a public sequence trace repository (a new type of public sequence database established by the National Center for Biotechnology Information (NCBI) [ncbi.nlm.nih.gov] and the European Bioinformatics Institute (EBI) [www.ebi.ac.uk] specifically for this purpose). At the same time, the NHGRI recognized that this very early data release model could potentially jeopardize the standard scientific practice that the investigators who generate primary data should have both the right and responsibility to publish the work in a peer-reviewed journal. Therefore, NHGRI agreed to the inclusion of a statement on the sequence trace data permitting the scientific community to use these unpublished data for all purposes, with the sole exception of publication of the results of a complete genome sequence assembly or other large-scale analyses in advance of the sequence producer's initial publication.
This restriction attracted little attention until early 2002, when a community debate began about the merits of any limitation on the use of whole genome assemblies that have been submitted to the public sequence databanks (GenBank, EMBL and DDBJ). To discuss the issue and try to resolve the differences of opinion, the Wellcome Trust convened an international group of data producers, users, database personnel, journal editors and funding agency representatives in Fort Lauderdale, Fla., in January 2003.
The meeting attendees unanimously agreed that pre-publication release of large-scale genome sequence data has been of tremendous benefit to the scientific research community, and that it is very important to ensure that such rapid release of sequence data continues. The group therefore reaffirmed the Bermuda Principles and recommended that they be extended to all types of sequence data.
Furthermore, the attendees at the meeting recognized that other large efforts, designated as "community resource projects," would increasingly be generating data and other resources that should also be rapidly released to the community in an unrestricted manner (a "community resource project" was defined as a research project specifically devised and implemented to create a set of data, reagents or other material whose primary utility will be as a resource for the broad scientific community). To ensure the continuing effectiveness of the system of rapid, pre-publication release of data from community resource projects, the meeting attendees concluded that each of the three stakeholders in the system - data producers, data users and funding agencies - has an active role to play in promulgating this tradition of open and rapid data release.
In response to the recommendations of the Fort Lauderdale meeting, the NHGRI is proposing to modify its data release policy to implement the system of "tripartite responsibility."
Proposed Update of the NHGRI Policy for the Release of Large-Scale Genomic DNA Sequence Data
The NHGRI reaffirms and extends its commitment to the Bermuda Principles for all types of large-scale DNA sequence data sets, including those that were not considered when the Bermuda Principles were originally devised.
- Large-insert clone-based projects: DNA sequence assemblies of 2 kb or greater are to be deposited in a public nucleotide sequence database (GenBank, EMBL or DDBJ) within 24 hours of generation. Sequence traces from these projects are to be deposited in a trace archive (NCBI Trace Repository or Ensembl Trace Server) within one week of production.
- Whole genome shotgun projects: Sequence traces from whole genome shotgun projects are to be deposited in a trace archive (NCBI Trace Repository or Ensembl Trace Server) within one week of production. Whole genome assemblies are to be deposited in a public nucleotide sequence database as soon as possible after the assembled sequence has met a set of quality evaluation criteria.
The deposited data should be available for all to use without restriction.
The NHGRI recognizes that the successful maintenance of the system of rapid, unrestricted, pre-publication data release requires constructive behavior on the part of both sequence producers and users. Sequence producers are in a unique, central position. The community is dependent on the success of their efforts and they typically face relatively little direct competition. However, it is not possible to guarantee them the standard scientific incentive of publishing the initial analysis of the data they generate without applying restrictions that might inhibit the broadest possible use of the data by the scientific community. Accordingly, the sequence producers must recognize that even if the sequence data are occasionally used in ways that violate normal standards of scientific etiquette, unconditional release of sequence data from large-scale sequence production centers is a necessary risk set against the considerable benefits of immediate data release.
Sequence users, in turn, must accept that they have significant responsibilities consistent with standard scientific norms. Users of unpublished genomic sequence data are expected to acknowledge the source of the sequence data through the use of appropriate citations. Users also need to recognize that the sequence producers have a legitimate interest in publishing peer-reviewed reports describing and analyzing the sequence that they have produced. Data depositions in the public sequence databanks are not the equivalent of such publications. The entire scientific community can also help ensure that the system works fairly for all participants through the peer review systems of both journals and funding agencies.
The NHGRI will encourage the sequence producers to publish a project description for each new genome sequencing project, beginning with new projects initiated in 2003. The purpose of the project description, which will be a new type of scientific publication, is to inform the scientific community about the sequencing project at its inception, and to provide a citation that can be used to reference the source of the sequence. A project description should describe the plans for and scope of the project, as well as any analyses that the sequence producer intends to undertake. It should also include a timeline for sequence production goals and data release. However, the NHGRI does not consider the project description to be the equivalent of the first peer-reviewed published analysis of the results of the sequencing project.
NHGRI strongly encourages the entire scientific community to recognize that the continued success of the system of pre-publication data release requires active community-wide support. There should be no restrictions on the use of the genomic sequence data, but the best interests of the community are served when all act responsibly to promote the highest standards of respect for the scientific contribution of others.
Other Community Resource Projects
Large resource data sets are becoming an increasingly critical component of biomedical and biological research and, as such, will be more frequently produced specifically as community resources. As an integral component of the development of the new community resources it will support, NHGRI will encourage planners and participants to devise appropriate approaches to implement the principle, and achieve the advantages, of rapid pre-publication data release. The development of effective means to achieve the objectives of the community resource concept, while addressing such important considerations as data quality standards, data storage and dissemination modes, protection from parasitic intellectual property claims, and producer and user interests, will maximize the benefit to the entire scientific community and to research.