Guiding Principles and Expectations for Data Sharing

  • All data produced with NHGRI support should be shared with the community rapidly, completely, and in NIH-designated data repositories (e.g., NIH database of Genotypes and Phenotypes (dbGaP), the NHGRI Data Science Analysis, Visualization, and Informatics Lab-space (AnVIL), Sequence Read Archive (SRA), Gene Expression Omnibus (GEO) for expression data, and ClinVar for phenotype and pathogenicity information about genetic variants) or NIH Established Trusted Partnerships, such as the National Cancer Institute's Genomic Data Commons.
  • When NIH data resources are not available, other central data resources may be used (e.g., UniProt, FlyBase, or databases at the European Bioinformatics Institute (EBI) that do not have equivalents at NIH).
     
  • If a central data resource does not exist, then Principal Investigators (PIs) should indicate alternative data sharing approaches (e.g., consortia databases or PI-hosted website for non-human data) that provide appropriate access to the research community and conform to all relevant policies on the distribution of data.
     
  • Expectations for data sharing will be clearly communicated in all NHGRI funding opportunity announcements and within the NHGRI intramural research program. For extramural projects, data sharing plans will be evaluated by NHGRI Program Staff in the Extramural Research divisions, and, if needed, more appropriate plans may be negotiated with applicants before funding is provided; these plans may be factored into funding decisions based on program priorities. For intramural projects, the Scientific Director of the Division of Intramural Research or his delegate will approve the data sharing plans.
     
  • The NHGRI Genomic Data Sharing Program Administrator (GPA) will be the point of contact for all staff regarding the implementation of data sharing plans in the extramural and intramural research programs. The GPA will work with NHGRI leadership on questions regarding implementation of Institute and NIH policies.
     
  • Restrictions may be placed on future year funding for non-compliance with the GDS policy.
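
The order of preference described above (NIH-designated repository first, then another central data resource, then an alternative approach named in the data sharing plan) can be sketched as a simple fallback. This is only an illustrative sketch: the function name, its parameters, and the tier labels are hypothetical and are not part of any NHGRI or NIH tool.

```python
from typing import Optional, Tuple


def choose_sharing_route(
    nih_repository: Optional[str] = None,
    other_central_resource: Optional[str] = None,
    alternative_approach: Optional[str] = None,
) -> Tuple[str, str]:
    """Return (tier, resource) for the first available option, in policy order.

    All names here are hypothetical; the tiers mirror the bullets above.
    """
    options = [
        ("NIH-designated repository", nih_repository),
        ("other central data resource", other_central_resource),
        ("alternative approach named in the data sharing plan", alternative_approach),
    ]
    for tier, resource in options:
        if resource:  # skip tiers with no available resource
            return tier, resource
    raise ValueError("no data sharing route identified")


# Example: an NIH-designated repository takes precedence when one is available.
print(choose_sharing_route("dbGaP", "UniProt"))
```

The sketch covers only the ordering; whether a given route is acceptable is still decided through review of the data sharing plan.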

Consent for and Exceptions to Broad Data Sharing

  • All NHGRI-funded studies involving human subjects should seek consent for future research use and broad sharing of participant data via central databases, such as dbGaP. Participants who do not consent to future use or broad data sharing may still participate in the primary study, if consistent with study design.
     
  • Whenever possible, studies should seek consent for General Research Uses of participant data instead of placing disease-specific or other data use limitations on future use of the data. Similarly, whenever possible, there should be no restrictions on the types of users who may access the data (i.e., allow academic, commercial, and government researchers to use the data).
     
  • There may be existing participant samples collected without explicit consent for broad data sharing or new sample collections where seeking consent for future use and broad sharing is not feasible. In these instances, an exception for broad data sharing may be requested within the data sharing plan for the study. Before NHGRI funds are committed to these studies, requests for exceptions will be evaluated to determine whether the rationale for the exception is sound and the alternative data sharing plans are adequate.
     
  • Examples of research or research-related activities funded by NHGRI which are outside the scope of the NIH GDS Policy include, but are not limited to, projects that do not meet the criteria in the Supplemental Information to the NIH GDS Policy: instrument calibration exercises, statistical or technical methods development, or use of genomic data for control purposes, such as for assay development. Data generated by these types of projects are not expected to be shared.

Extent of Data to be Broadly Shared

  • All final datasets produced by the study should be shared, not just datasets generated to support a publication.
     
  • Large resource projects (e.g., 1000 Genomes or ENCODE) should generally share their initial data (e.g., reads), intermediate data (e.g., assemblies), and final data (e.g., variant calls, genotypes, haplotypes).
     
  • As much phenotypic data (stripped of HIPAA identifiers) as possible should be shared, beyond the variables used for the first study publication.
     
  • All supporting meta-data should be well documented (e.g., data element dictionaries, data collection protocols, study inclusion and exclusion criteria).
     
  • When possible, the use of standard formats and vocabularies/ontologies to describe data elements, such as sequence data, variants, and phenotypic characterization, is encouraged.

Timeline for Data Sharing

  • Some large resource projects are producing data rapidly; these projects will develop project-specific timelines for data release in conjunction with Program Directors or appropriate Intramural leadership.
     
  • PIs should note the following NHGRI data release expectation for non-human genomic data that differs from the NIH expectation. Projects generating non-human data submitted on or after January 25, 2016 should include pre-publication timelines for data submission and release consistent with the expectations for human data (including a possible holding period before data release not to exceed six months). 
     
  • All other studies should, at a minimum, adhere to standard data release timelines outlined in the GDS Policy and further described in the table below.

NHGRI Guidance for Data Submission and Data Release

Level 0
  • General Description of Data Processing: Raw data generated directly from the instrument platform
  • Example Data Types: Instrument image data
  • Data Submission Expectation, human data: Not expected.
  • Data Submission Expectation, non-human data: Not expected.
  • Data Release Timeline, human data: NA.
  • Data Release Timeline, non-human data: NA.

Level 1
  • General Description of Data Processing: Basic data after initial processing of raw input data
  • Example Data Types: DNA sequence reads, ChIP-Seq reads, RNA-Seq reads, SNP array data, array CGH data
  • Data Submission Expectation, human data: Not expected.
  • Data Submission Expectation, non-human data: Not expected, except for de novo sequence data (unless it is included with Level 2 aligned sequence files). Submission of de novo sequence data is expected no later than the time of initial publication.
  • Data Release Timeline, human data: NA.
  • Data Release Timeline, non-human data: No later than the time of initial publication; an earlier release date may be designated for certain data types or NIH projects.

Level 2
  • General Description of Data Processing: Data after an initial round of processing or computation to clean the data and assess basic quality measures
  • Example Data Types: DNA sequence alignments to a reference sequence or de novo assembly, RNA expression profiling
  • Data Submission Expectation, human data: After data cleaning and quality control, which is generally within 3 months after data were generated. Project specific.
  • Data Submission Expectation, non-human data: Data submission expected at the time of initial publication; an earlier submission date may be designated for certain data types or NIH projects.
  • Data Release Timeline, human and non-human data: Up to 6 months after data submission is initiated or at the time of acceptance of initial publication, whichever occurs first.

Level 3
  • General Description of Data Processing: Analysis to identify genetic variants, gene expression patterns, or other features of the dataset
  • Example Data Types: SNP or structural variant calls, genotypes, expression peaks, epigenomic features
  • Data Submission Expectation, human data: After cleaning and quality control, which is generally within 3 months after data have been generated. Project specific.
  • Data Submission Expectation, non-human data: Data submission expected at the time of initial publication; an earlier release date may be designated for certain data types or NIH projects.
  • Data Release Timeline, human and non-human data: Up to 6 months after data submission is initiated or at the time of acceptance of initial publication, whichever occurs first.

Level 4
  • General Description of Data Processing: Final analysis that relates the genomic data to phenotype or other biological states
  • Example Data Types: Genotype-phenotype relationships, relationships of RNA expression, or epigenomic patterns to biological state
  • Data Submission Expectation, human data: Data submitted as analyses are completed.
  • Data Submission Expectation, non-human data: Data submission expected at the time of initial publication.
  • Data Release Timeline, human and non-human data: Data released with publication.

Level 0 Data: These data are the raw images and generally have limited value to secondary data users. NIH policy does not expect submission of these data.
 
Level 1 Data: These data are the initial sequence reads and generally have limited value to secondary data users. NIH policy does not expect submission of these data, except for de novo sequence data from non-human organisms (unless it is included with Level 2 aligned sequence files). Array-based data, such as gene expression, ChIP-chip, array CGH, and SNP array data, can be submitted to GEO as level 1 data, which will not be accessible until a manuscript describing the data is published. If PIs choose to submit level 1 human data to an NIH-designated data repository, it is the submitting institution’s responsibility to protect participant privacy by ensuring that data submission is consistent, as appropriate, with all applicable national, tribal, and state laws and regulations as well as relevant institutional policies, and the GDS Policy.
 
Level 2 Data: These data constitute a computational analysis in the form of higher order assembly or placement of the sequencing reads on a reference template. The level 2 file comprises the reads “piled” on a reference genome. A submission would be a file (e.g., a Binary Alignment/Map (BAM) file) that contains the unmapped reads as well. GWAS and other types of projects (e.g., RNA expression profiling or de novo sequencing) would also generate a level 2 placement or assembly file.
 
Preparation of level 2 data generally requires substantial data cleaning, analysis, and quality checks related to both breadth of coverage of the targeted region and accuracy of assembly. Sufficient time will be allowed to clean the data by removal of extraneous or poor-quality sequence, complete quality-control analyses, and generate the assembly, up to the coverage and quality thresholds specified by a project or investigative team. It is anticipated that this work could generally be completed within three months, and data submission would follow shortly thereafter, but this may vary depending on the data type or specific program design.
 
After submission of human data begins, the data may be held in an exchange area accessible only to the submitting PIs and collaborators for a period not to exceed six months. Following this period of exclusivity, the data will be available for research access without restrictions on publication.
 
Phenotype or clinical data should be submitted to the NIH-designated data repository at the earliest opportunity, but no later than the date of level 2 genomic data submission (or levels 2 and 3 for GWAS datasets), especially for studies in which all phenotype data have already been gathered. For studies in which phenotype data collections are ongoing and/or may be regularly updated, data files should be submitted to NIH-designated data repositories as early as possible considering the practical needs for ensuring data accuracy; generally speaking, this time should not exceed three months after data cleaning begins.
 
Level 3 Data: These data include analyses to identify variants or to elucidate other features of the genomic dataset, such as gene expression patterns in an RNA-seq assay. Level 3 data may be generated from a single level 2 data file (e.g., variant sites versus the human reference genome) but will often derive from a compilation of sequencing assemblies (e.g., in a genome study of a specific cancer type). Data submission expectations for level 3 files will vary substantially by project and therefore will require consultation with NIH program staff.
 
As in level 2 data submission, level 3 files for human data will be date stamped and the data producer may request a period of exclusivity not to exceed six months, after which time the datasets will be released through unrestricted- or controlled-access mechanisms as appropriate and without publication limitations.
 
Level 4 Data: These data constitute the final analysis, relating the genomic datasets to phenotype or other biological states as pertinent to the research objective. Data in this level are the project findings or the publication dataset. PIs should submit these data prior to publication, and the data will be released concurrent with publication.
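
The Level 2 and Level 3 release rule above ("up to 6 months after data submission is initiated or at the time of acceptance of initial publication, whichever occurs first") amounts to taking the earlier of two dates. The sketch below illustrates that arithmetic only. It assumes "six months" means six calendar months, with the day clamped to the end of shorter months; the guidance does not spell this out, and the function and parameter names are hypothetical.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    Assumption: the day is clamped to the last day of the target month
    (e.g., 31 August + 6 months -> 28 or 29 February).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def latest_release_date(
    submission_started: date,
    publication_accepted: Optional[date] = None,
    holding_months: int = 6,
) -> date:
    """Latest release date for Level 2/3 data under the rule quoted above."""
    hold_ends = add_months(submission_started, holding_months)
    if publication_accepted is None:
        return hold_ends
    # "Whichever occurs first": the earlier of the two dates applies.
    return min(hold_ends, publication_accepted)


# Example: submission begins 15 January 2020, publication accepted 1 April 2020.
print(latest_release_date(date(2020, 1, 15), date(2020, 4, 1)))  # 2020-04-01
```

Project-specific timelines agreed with program staff, and earlier release dates designated for certain data types, would override this default calculation.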

Last updated: December 6, 2019