NHGRI-Specific GDS Policy FAQs
How do I know if my NHGRI project will be subject to the NIH GDS Policy?
NHGRI encourages sharing of all genomic data and data types. However, at this time the NIH GDS Policy and NHGRI implementation plans apply particularly to single nucleotide polymorphism (SNP) array data, genome sequence data, transcriptomic data, epigenomic data, or other molecular data produced by array-based technologies or high-throughput sequencing technologies.
Data pertinent to the interpretation of genomic data — such as associated phenotype data (e.g., clinical information relevant to the disease under study), exposure data, and descriptive information (e.g., protocols or methodologies used) — are expected to be shared. All data sets should include the appropriate metadata to allow efficient sharing and integration with other data sets.
Amount of Data/Type of Study Design
NHGRI finds value in and encourages the sharing of data from smaller projects that do not meet the definition of ‘large-scale’ according to the NIH guidance regarding scope of the GDS Policy. Investigators should consult with appropriate NIH Program Officers as early as possible to determine whether the GDS Policy applies to their research study.
How does the NHGRI expectation for non-human data release differ from the NIH GDS Policy Supplemental Information? Why does this expectation differ?
Data sharing plans for NHGRI-funded or -supported projects to generate non-human genomic data proposed after January 25, 2016, should include pre-publication timelines for data submission and release consistent with NIH GDS Policy expectations for human genomic data (including a possible holding period before data release not to exceed six months).
Data sharing progress reports will be expected, consistent with trans-NIH processes, as they are implemented, or through other NHGRI consortia reporting mechanisms, as applicable. Program directors will monitor progress against the timelines established through the data sharing plans.
This expectation is consistent with the institute’s program priorities as it makes data available for access at the earliest appropriate point to promote maximum public benefit from federal investment in genomics research. As a leader in GDS Policy development and implementation, NHGRI supports timely data release through widely accessible data repositories.
How does NHGRI’s expectation for participant consent regarding the secondary use of generated data differ from the NIH GDS Policy expectation?
The NIH GDS Policy stipulates that, “for studies proposing to use genomic data from cell lines or clinical specimens that were created or collected after the effective date of the Policy, NIH expects that informed consent for future research use and broad data sharing will have been obtained even if the cell lines or clinical specimens are de-identified.” The NHGRI expectation goes further: “whenever possible, NHGRI strongly encourages studies involving human data to use data generated from sources with participant consent for unrestricted access or for general research uses through controlled access.”
Furthermore, the NIH GDS Policy states that “for studies initiated after the effective date of the GDS Policy, NIH expects investigators to obtain participants’ consent for their genomic and phenotypic data to be used for future research purposes and to be shared broadly.” NHGRI specifies that consent language should avoid restrictions on the types of users who may access the data in order to ensure the broadest possible sharing.
NHGRI acknowledges that conforming with these two expectations will not always be possible or appropriate. In addition, individual participants who do not consent to future use or broad data sharing may still participate in the primary study, if consistent with study design.
What is the difference between ‘consent for future research use and broad data sharing’ and ‘broad consent’?
Broad consent is a consent model that allows for current and future access and use of samples or data for research without necessarily specifying what the focus of such studies might be. Consent for broad data sharing is specific to the sharing of data and indicates that data can be shared with others, often through databases (which can be open-access or controlled-access). Broad data sharing does not preclude the ability of participants to limit the future research uses. Per the GDS Policy, data submission and subsequent data sharing for research purposes must be consistent with the informed consent of study participants from whom the data were obtained. NHGRI supports the broadest appropriate genomic data sharing with timely data release through widely accessible data repositories.
For a discussion of considerations in developing informed consent processes for genomics research, see the NHGRI Informed Consent Resource and the section on Special Considerations for Genomics Research.
What is an NIH-designated data repository? Does NHGRI have one besides dbGaP?
An NIH-designated data repository is any data repository maintained or supported by NIH either directly or through collaboration. In January of 2019, the NHGRI Analysis, Visualization, and Informatics Lab-Space (AnVIL) became an NHGRI-designated data repository. Currently, this does not change the process for submitting data to the NHGRI.
How do I find a data repository?
The NIH Office of Science Policy has a list of examples of NIH data repositories, NIH-funded databases, and NIH database collaborations; however, this list is not exhaustive.
What happens when a research participant withdraws consent?
Per the NIH GDS Policy, submitting investigators and their institutions may request removal of data on individual participants from NIH-designated data repositories in the event that a research participant withdraws or changes his or her informed consent preferences. If a participant withdraws or changes his or her informed consent preferences, data are removed from any future data releases and a new version of the dataset is released. It’s important to know that some data that have been distributed for research cannot be retrieved.
The NHGRI Informed Consent Resource’s ‘Required Elements of the Consent Form’ webpage contains considerations and sample language addressing the practical limits on participants’ ability to withdraw samples, genomic data, or health information that have been contributed to genomics research. This material can be useful when drafting informed consent documents.
Still have questions?
Investigators who plan to submit a grant application that proposes the generation of genomic data should consult with appropriate NHGRI Program Officers as early as possible.
- For questions about submitting data under the GDS Policy or about NHGRI’s implementation of the GDS Policy, you may consult the NHGRI Genomic Program Administrator (GPA), Jennifer Strasburger.
- For questions about the GDS Policy, contact the NHGRI GDS Policy Analyst, Elena Ghanaim, or the NIH Office of Science Policy’s GDS Mailbox.
- For other questions about NHGRI Genomic Data Sharing Governance or the NHGRI Data Access Committee, see NHGRI’s list of GDS Policy contacts.
Genomic Summary Results (GSR) Update FAQs
What are Genomic Summary Results?
Genomic summary results (GSR) are the output of analyses of genomic data across the many individual participants included within a specific study’s dataset or across many studies. For most studies in NIH-designated data repositories, for example, this means that GSR represent a summary of the information generated from hundreds, or thousands, of research participants. There are two broad classes of GSR information: allele frequency information¹ and association analysis statistics².
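As a rough illustration of the two classes of GSR, the sketch below computes an allele frequency and an odds ratio (one kind of association statistic) from made-up data. The function names, the two-letter genotype encoding, and all counts are assumptions chosen for illustration; they are not part of the GDS Policy or of any repository's tooling.

```python
from collections import Counter


def allele_frequencies(genotypes):
    """Return the proportion of each allele across all participants.

    `genotypes` is a list of two-character strings such as "AG", one per
    participant (a hypothetical, simplified encoding of a diploid genotype
    at a single position).
    """
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def odds_ratio(case_carriers, case_noncarriers,
               control_carriers, control_noncarriers):
    """Odds ratio from a 2x2 table of carrier status by case/control group."""
    return (case_carriers * control_noncarriers) / (
        case_noncarriers * control_carriers
    )


# Made-up individual-level genotypes for four participants at one position.
genotypes = ["AA", "AG", "GG", "AG"]
freqs = allele_frequencies(genotypes)

# Made-up counts: 30 of 100 cases and 15 of 100 controls carry the allele.
ratio = odds_ratio(30, 70, 15, 85)
```

The outputs (`freqs` and `ratio`) describe the group as a whole and contain no per-participant genotype, which is the sense in which GSR are summaries rather than individual-level data.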
How are GSR different from individual-level genomic data?
“Individual-level data” provide the specific DNA sequence for a single research participant and are usually only available through controlled-access pathways. The privacy risks for individual-level data are greater than those for GSR because they refer to the unique pattern in the DNA code of a single participant, rather than calculations about the patterns seen across a group of people.
How are GSR shared or used?
Currently, some GSR are included by investigators in the manuscripts that they publish to share the key findings from their research studies with the scientific community.
After May 1, 2019, GSR from most studies that are shared through NIH-designated data repositories, such as the database of Genotypes and Phenotypes (dbGaP), will be shared through open access (unrestricted) pathways. This means that dbGaP, and other NIH-designated data repositories, may begin to share publicly more of the statistical findings for most of the studies hosted within the repository. This will allow more GSR to be used by the broader scientific community to promote scientific research or health. Investigators requesting access to individual-level data through controlled-access will continue to be able to share GSR calculations that they generate through their research for others to use (e.g., through a publication). However, if investigators wish to disseminate GSR more broadly (e.g., through an online resource), this should be described in a data access request, which will be reviewed by the Data Access Committee.
Accessibility of GSR is beneficial because these analyses can be used to assess the validity and potential significance of results seen in other studies. They can also be useful for assessing the frequency of an individual genomic variant in different populations and for interpreting the possible pathologic importance of specific genomic test results in patients. While publications only share a small number of GSR relevant to the specific research questions discussed, sharing the complete set of GSR across a dataset or many datasets creates the opportunity for the information to be used to answer many different research questions.
Why did the NIH change the way it manages access to GSR?
NIH has considered the risks and benefits of access to GSR carefully since it was first described in 2008 that individuals could potentially be ‘re-identified’ through their use. Specifically, the agency held public workshops and solicited stakeholder comments through requests for information on the risks and benefits of different models of GSR access.
Public input over the years increasingly noted that the benefits of expanded access to GSR from most genomic studies outweighed the potential risks. Respondents highlighted the significant scientific value of GSR and the fact that there would be minimal risk to most participants if GSR were to be moved from controlled-access to an unrestricted access model. Based on this input, NIH changed the data access model for most GSR to make it more proportional to the risks for this type of information. However, because there are some studies where there might be additional privacy concerns, such as those that include populations from isolated geographic areas or with rare or stigmatizing traits, the access model includes a pathway for GSR from some studies to have additional protections.
What are the privacy risks associated with sharing GSR?
GSR can be used to determine whether an individual was in a particular group of a study (e.g., the disease group vs. the control group) but ONLY IF someone already has access to the research participant’s genomic information. While the risk is very low, it is possible that knowing that a person is part of a group (e.g., a disease group) could potentially reveal sensitive information that was not already known from the individual-level genomic information itself.
It is possible that certain study populations may be more vulnerable to this privacy risk if they are from a small or isolated population or have a rare condition or trait. In other cases, the potential stigma of certain conditions or traits included in a study population may also increase privacy concerns.
What are the benefits of sharing GSR through open (unrestricted) access?
Sharing GSR through openly accessible mechanisms means that these summary findings can be used to address many different research questions or to inform the interpretation of clinical test results by health care providers. When GSR are available through unrestricted access, they also become easier for a range of scientists from different fields to use in developing new methods to interpret genomic information and its connection to phenotypes. In addition, since GSR can be used to assess the validity or potential significance of results seen in other studies, the need to request individual-level data from a study will potentially decrease, thereby limiting access to that depth of data about individual participants to only those secondary studies that truly require it.
Who will calculate GSR and how?
Any GSR shared through dbGaP for each individual study will be generated by the study Principal Investigator(s) (PIs). NIH Institutes and Centers currently vary in what summary statistics they expect PIs to submit. dbGaP also plans to calculate allele frequencies across all non-sensitive datasets within that repository (displayed by population) and share them through unrestricted access on dbSNP.
What are the options under the NIH GDS Policy for sharing GSR?
If an institution determines that there are substantive individual privacy or group harm concerns for a particular study population, it may designate the study as “sensitive” when the data sharing plan and Institutional Certification for the study are submitted to NIH during the award process. If the institution designates GSR as “sensitive,” the GSR will only be shared through controlled-access, in conjunction with and under the same terms of access and use as the individual-level data for the study.
For studies that are already submitted to or registered in an NIH-designated data repository, institutions had until May 1, 2019, to notify NIH that 1) they needed additional time to consider whether a study’s dataset should be designated as sensitive, or 2) a sensitive designation had been made and GSR should not be made available through unrestricted access. If the institution did not contact NIH before May 1, 2019, GSR will be moved to unrestricted access.
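The designation rule described above can be sketched as a small decision function. This is only an illustrative model of the text, not an NIH system: the `Study` class, its field names, and the pathway labels are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Hypothetical record for a study in an NIH-designated repository."""
    name: str
    sensitive: bool  # designation made by the submitting institution


def gsr_access_pathway(study):
    """Return the access pathway for a study's GSR.

    Sensitive studies keep GSR under controlled access, on the same terms
    as the individual-level data; all other studies' GSR are shared through
    unrestricted (open) access.
    """
    return "controlled" if study.sensitive else "unrestricted"


# Made-up example studies.
studies = [
    Study(name="isolated-population cohort", sensitive=True),
    Study(name="general cohort", sensitive=False),
]
pathways = {s.name: gsr_access_pathway(s) for s in studies}
```

Note that the sketch changes only the pathway for GSR; individual-level data remain under controlled access in both cases.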
Can approved data users share GSR that they derive from individual-level genomic data in NIH-designated data repositories?
For individual-level human genomic data in NIH-designated data repositories, which are usually only available through controlled-access, a data access request that is reviewed by a Data Access Committee (DAC) is always required. For non-sensitive datasets, data requesters who wish to disseminate GSR more broadly than publication within the scientific literature (where GSR serve as an intrinsic piece of evidence to support a study’s conclusions) can indicate plans to generate and disseminate GSR in their research use statement, and these plans may be approved by a DAC. Requesters do not need to indicate what specific GSR they plan to generate and disseminate.
For datasets that are designated as sensitive, DACs will not approve research use statements that indicate plans to disseminate GSR more broadly than publication within the scientific literature to support a study’s conclusions.
1 An allele frequency is the proportion of a specific allele, or variation in the DNA code, relative to other possible alleles at the same position in the code in a given population, or in some cases, an entire species. Allele frequency information is used in the fields of Genomics, Population Genetics, and Clinical Genetics to help interpret the potential for links between the presence of specific alleles and observed “outcomes”, such as physical traits or disease risks.
2 In genomics, association analysis statistics are the information generated when investigators evaluate the correlation of genotype to phenotype. Phenotypes studied may be diseases (e.g., diabetes), traits (e.g., height), or molecular traits (e.g., mRNA or protein expression levels). Examples of these kinds of statistics are: p-values, beta values in regression, the odds ratio, and effect size.
Process for Submitting and Releasing Data
Clear milestones for the timing of data deposition should be established for each project and included in the Data Sharing Plan to provide a timeline by which to assess progress toward meeting data submission expectations. Milestones should adhere to standard data release timelines outlined in the NIH GDS Policy Supplemental Information and the NHGRI Guidance for Data Submission and Data Release table below, and should be agreed upon prior to the start of research projects. Large resource projects may develop project-specific timelines for data release, in conjunction with program officers or NHGRI intramural leadership, that exceed the minimum expectations specified in the NIH GDS Policy Supplemental Information and the NHGRI Guidance for Data Submission and Data Release table (see table below).
Unless otherwise specified by project funding announcements, analyses by submitting investigators that are conducted subsequent to the initial data submission, final data sets, or any data updates should be submitted for release concurrent with the first publication analyzing the dataset.
Data sharing progress reports will be expected consistent with trans-NIH processes as they are implemented, or through other NHGRI consortia reporting mechanisms, as applicable. Program directors will monitor progress against the timelines established through the data sharing plans.
Last updated: July 11, 2019