Frequently Asked Questions
Data sharing is an important aspect of the research endeavor that ensures maximal use of participants’ data for scientific discovery, rigor and reproducibility, and more. NHGRI-funded researchers subject to NIH’s data sharing policies are expected to share comprehensive metadata and phenotypic data, and to use standard data collection protocols and standardized notation, to ensure the data are maximally useful to the broader scientific community.
How does NHGRI define metadata and phenotypic data?
NHGRI uses the NIH definition of metadata provided in the Final NIH Policy for Data Management and Sharing:
Metadata are “data that provide additional information intended to make scientific data interpretable and reusable (e.g., date, independent sample and variable construction and description, methodology, data provenance, data transformations, any intermediate or descriptive observational variables).”
Phenotypic data are the observable characteristics or traits of an organism or a cell line (i.e., the physical manifestation of a genotype).
NOT-HG-21-022 articulates NHGRI’s expectations for capturing and sharing both metadata and phenotypic data.
What is the scientific value of sharing metadata?
Providing sufficient and well-structured metadata is a key component of abiding by the FAIR Principles for scientific data management and sharing. Findable, Accessible, Interoperable, and Reusable datasets maximize public investments in biomedical research.
To ensure that data generated with NHGRI funding are maximally useful to the broader research community, the Institute encourages its grantees to comprehensively share both metadata and phenotypic data along with the genomic datasets that are required to be shared, in accordance with the FAIR Principles.
How do the expectations of NOT-HG-21-022 relate to the requirements of the NIH Genomic Data Sharing (GDS) Policy?
The NIH GDS Policy states that NIH-funded researchers generating genomic data are expected to deposit “relevant associated data (e.g., phenotype and exposure data)” to a publicly accessible data repository. However, researchers often share only the minimum metadata and phenotypic data required for submission of the dataset, rather than a comprehensive set of information that would make the shared data more useful to secondary users. NOT-HG-21-022 builds upon the expectation outlined in the NIH GDS Policy and emphasizes the importance of sharing comprehensive metadata and phenotypic data associated with the dataset. Importantly, this Notice applies to all NIH data sharing policies, not just the NIH GDS Policy. This Notice states that NHGRI-funded and supported researchers will be expected to:
- share the metadata and phenotypic data associated with the study.
- use standardized data collection protocols and survey instruments for capturing data, as appropriate.
- use standardized notation for metadata (e.g., controlled vocabularies or ontologies) to enable the harmonization of datasets for secondary research analyses.
NHGRI is working with NHGRI-funded data resources and data coordination centers to ensure that adequate metadata and phenotypic data are deposited by NHGRI-funded researchers.
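As an illustration of how standardized notation enables harmonization, the sketch below maps free-text phenotype labels from two studies onto shared term identifiers. It is a minimal example, not an NHGRI-endorsed method: the vocabulary, its "EX:" identifiers, the synonyms, and the function names are all hypothetical, and a real project would draw its terms from a community ontology.

```python
# Hypothetical controlled vocabulary. The "EX:" identifiers are placeholders
# for illustration only and do not come from any real ontology.
CONTROLLED_VOCABULARY = {
    "EX:0000001": {"label": "seizure", "synonyms": {"seizures", "fits"}},
    "EX:0000002": {"label": "hypertension", "synonyms": {"high blood pressure", "htn"}},
}


def build_lookup(vocabulary):
    """Map every lowercase label and synonym to its term identifier."""
    lookup = {}
    for term_id, entry in vocabulary.items():
        lookup[entry["label"].lower()] = term_id
        for synonym in entry["synonyms"]:
            lookup[synonym.lower()] = term_id
    return lookup


def harmonize(raw_values, vocabulary):
    """Return (mapped, unmapped): term IDs for recognized values, and the rest."""
    lookup = build_lookup(vocabulary)
    mapped, unmapped = {}, []
    for value in raw_values:
        # Normalize case and collapse repeated whitespace before matching.
        key = " ".join(value.lower().split())
        if key in lookup:
            mapped[value] = lookup[key]
        else:
            unmapped.append(value)
    return mapped, unmapped


# Two studies that recorded the same phenotypes with different wording.
study_a = ["Seizures", "High  blood pressure"]
study_b = ["HTN", "fits", "dizzy spells"]

mapped_a, unmapped_a = harmonize(study_a, CONTROLLED_VOCABULARY)
mapped_b, unmapped_b = harmonize(study_b, CONTROLLED_VOCABULARY)
```

Once both studies record the same identifiers, their records can be compared directly. Values that cannot be mapped are returned separately so they can be reviewed by hand rather than silently dropped.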
What types of research does NOT-HG-21-022 apply to?
NOT-HG-21-022 applies to all NHGRI-funded research, including investigator-initiated research projects. Studies that do not result in a publication are also expected to share data and associated metadata.
Where should I describe my plans for sharing metadata and phenotypic data?
In addition to plans to share genomic data, investigators should describe any data vocabularies, ontologies, data models, and dictionaries they plan to use in the Resource Sharing Plan of their grant application(s).
Where should I deposit metadata and phenotypic data?
Metadata and phenotypic data should be submitted along with the dataset to an NIH-designated data repository. The Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) maintains a useful list of NIH-supported data repositories, including NHGRI-supported repositories such as AnVIL and various model organism databases. Metadata and resources such as study protocols, informed consent form templates, results report templates, methodologies used, and bioinformatic tools should, as appropriate, be made available through an open access section of a database such as AnVIL or dbGaP, other public websites, and publication in the scientific literature.
How much metadata and phenotypic data do I need to share?
As stated by Wilkinson et al., data and metadata should be “richly described with a plurality of accurate and relevant attributes.” Statisticians have also published guidelines that may be helpful for data generators to consider when submitting a dataset and its associated metadata to a repository (e.g., How to share data for collaboration, Ellis & Leek). At a minimum, the submitted metadata and phenotypic data should be sufficient for a secondary user to fully replicate the analysis/findings of the original study.
Does NHGRI endorse specific data standards and ontologies?
NHGRI strongly encourages the use of existing data standards and ontologies that are generally endorsed by the community of your research area, although it does not require the use of any particular one. Investigators should use data standard(s) and ontologies that facilitate comparison across similar studies within their research community.
Where should I start?
If you have questions about what data standard(s) or collection protocols to use, or which metadata and phenotypic data to share, contact your NHGRI Program Director.
Here are some useful links with additional information that may help you to get started:
- BioPortal is a repository of biomedical ontologies.
- The PhenX Toolkit is an online catalog of well-established and vetted phenotypic measurement protocols, to promote the collection of comparable data across studies.
- The NIH Common Data Elements (CDE) Repository provides access to structured human- and machine-readable definitions of data elements that have been recommended or required by NIH Institutes and Centers for use in research.
- The Human Genome Variation Society (HGVS) Sequence Variant Nomenclature is the most common standard for describing sequence variants in DNA and protein sequences.
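To give a sense of what standardized variant notation looks like in practice, the sketch below checks whether a string follows the shape of one simple kind of HGVS-style description: a single-nucleotide substitution on a versioned coding DNA reference. The pattern and function name are hypothetical and cover only this narrow case; the full nomenclature includes many other variant types and reference formats, so this is not a substitute for a dedicated validation tool.

```python
import re

# Assumed, deliberately narrow pattern: a versioned reference accession,
# then ":c.", a position, and a single-base substitution such as "C>T".
SIMPLE_SUBSTITUTION = re.compile(
    r"^(?P<reference>[A-Z]{2}_\d+\.\d+):c\.(?P<position>\d+)"
    r"(?P<ref_base>[ACGT])>(?P<alt_base>[ACGT])$"
)


def parse_simple_substitution(description):
    """Return the parts of a simple substitution, or None if it does not match."""
    match = SIMPLE_SUBSTITUTION.match(description)
    if match is None:
        return None
    fields = match.groupdict()
    fields["position"] = int(fields["position"])
    return fields


# An illustrative, well-formed description string.
parsed = parse_simple_substitution("NM_004006.2:c.4375C>T")

# Free text carries the same idea but cannot be parsed reliably.
rejected = parse_simple_substitution("4375 C to T")
```

A description that parses can be split into consistent fields for comparison across datasets, whereas free-text descriptions need manual interpretation.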
Last updated: November 2, 2022