Privacy, Confidentiality and Identifiability
in Genomic Research
October 3-4, 2006
William W. Lowrance, Ph.D., Project Leader
Genomic research is now being broadened to include complex population-based studies, and the results of medical sequencing projects are being assembled into databases. While all this holds great promise to further the understanding of health and disease, it also brings potential threats both to the privacy of the people whose genomes are being studied and to public trust in the burgeoning genomic research enterprise.
Concerned about these risks, the National Human Genome Research Institute (NHGRI) asked Dr. William Lowrance, a consultant on health research ethics and policy, to prepare a white paper (Privacy, Confidentiality, and Identifiability in Genomic Research) and to co-chair a workshop with Dr. Francis Collins, the NHGRI director. The workshop brought together an experienced group of scientists, NIH and other government officials and staff, ethicists, lawyers, lay advocates, and leaders of several large research projects to discuss the issues.
The analysis in the white paper was generally accepted (it should be considered part of this summary), and the following conclusions emerged informally.
Increasing amounts of highly detailed genomic data, often linked with health and other data, are being released in both research and non-research settings, and this trend will continue. The chances of identity disclosure, and of possible negative consequences for the data-subjects, researchers, or institutions, are difficult to estimate because they depend heavily on circumstances, but they are probably small. It is clear, however, that access to more data by more diverse accessors for more varied purposes will inevitably increase the risks.
Four Themes Accepted As Granted
- All policies and practices must support pursuit of maximal public benefit, but they
must at the same time protect the privacy and confidentiality of the people whose
genomes are being studied.
- Scientific community data-resource projects and efficient access to data and
biospecimens greatly facilitate important modes of genomic research and must be
- At this stage in the maturation of genomic science, the research community must "bend
over backward" to protect - and be viewed as protecting and respecting - the people
from whom data are derived.
- Responsibility for protecting identifiability, privacy, and confidentiality is shared by
everyone in the chain of data collection, distribution and use.
Approaches by Which Non-Identified Data May Become Identified
Matching: If identified or straightforwardly identifiable reference genotype data are available, matching can be performed with very high reliability.
Linking: When genomic data are associated with such clues as diagnosis, locale, health care or payment information, treatment dates, and so on, there is a possibility that the data can be searched against administrative or other identified data-sets and lead to identification of individuals.
Profiling/Describing: As the phenotypic manifestations of various genes become known, it will increasingly become possible to construct probabilistic descriptions of persons from genomic data. Already a small number of physical attributes and proxies for ethnicity are inferable; soon many chronic-disease susceptibilities will be; and before long some behavioral tendencies will be.
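The linking route described above can be illustrated with a minimal sketch. It assumes that a released research record still carries a few clues (here diagnosis, ZIP code, and treatment date) and that an identified administrative data-set holds the same fields. All field names, function names, and records are hypothetical and chosen only for illustration.

```python
# Hypothetical illustration of linking: clues shared between a "non-identified"
# research release and an identified administrative data-set.
QUASI_IDENTIFIERS = ("diagnosis", "zip_code", "treatment_date")


def link_records(research_records, admin_records, keys=QUASI_IDENTIFIERS):
    """Return {sample_id: [names]} for admin records that agree on every key."""
    # Index the identified data-set by its tuple of clue values.
    index = {}
    for rec in admin_records:
        index.setdefault(tuple(rec[k] for k in keys), []).append(rec["name"])

    # Look up each research record; an empty list means no candidate was found.
    links = {}
    for rec in research_records:
        links[rec["sample_id"]] = index.get(tuple(rec[k] for k in keys), [])
    return links


# Invented example records.
research = [
    {"sample_id": "S1", "diagnosis": "T2D", "zip_code": "20892",
     "treatment_date": "2006-03-14"},
    {"sample_id": "S2", "diagnosis": "asthma", "zip_code": "20892",
     "treatment_date": "2006-05-02"},
]
admin = [
    {"name": "Person A", "diagnosis": "T2D", "zip_code": "20892",
     "treatment_date": "2006-03-14"},
    {"name": "Person B", "diagnosis": "asthma", "zip_code": "20892",
     "treatment_date": "2006-05-02"},
    {"name": "Person C", "diagnosis": "asthma", "zip_code": "20892",
     "treatment_date": "2006-05-02"},
]

links = link_records(research, admin)
```

In this toy data, a sample with exactly one candidate is effectively identified, while a sample with several candidates is only narrowed down. Removing or coarsening the clue fields before release is what breaks the join.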
Approaches to Protecting Identities
Limiting the amount of genomic information released from each sample.
Technically this is easy to do and is often done. Precautions can be taken to make sure that individual genotypes or separate sequence-reads from a sample cannot be reassembled into a dataset that might be unique to the individual. But releasing too few SNPs or too-short snippets of sequence may limit research usefulness.
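To give a feel for why the number of released SNPs matters, the following is an idealized back-of-envelope sketch, not the analysis presented at the workshop. It assumes biallelic SNPs in Hardy-Weinberg proportions, all with the same minor-allele frequency, inherited independently of one another, and unrelated individuals. Real genomes violate these assumptions (nearby SNPs are correlated and relatives share genotypes), so the result is only indicative. The function names are hypothetical.

```python
import math


def genotype_match_probability(maf):
    """Chance that two unrelated people share a genotype at one biallelic SNP,
    assuming Hardy-Weinberg proportions (an idealization)."""
    p, q = maf, 1.0 - maf
    genotype_freqs = (q * q, 2 * p * q, p * p)
    # Two people match if both carry the same one of the three genotypes.
    return sum(f * f for f in genotype_freqs)


def snps_needed(population, maf=0.5, expected_matches=1.0):
    """Smallest number of independent SNPs for which the expected number of
    chance matches in `population` does not exceed `expected_matches`."""
    per_snp = genotype_match_probability(maf)
    # Solve population * per_snp**n <= expected_matches for n.
    n = math.log(expected_matches / population) / math.log(per_snp)
    return math.ceil(n)


# Example call with an assumed population of 300 million.
panel_size = snps_needed(300_000_000, maf=0.5)
```

Under these assumptions the required count grows only logarithmically with population size, which is why even a short panel of independent, common SNPs can be unique to an individual, and why releases must be checked for reassembly.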
Statistically degrading data before releasing.
Techniques such as micro-aggregating ("binning"), scrambling, and masking can be employed, and they may be acceptable for some analyses, but generally these degrade usefulness.
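As an illustration of micro-aggregation, here is a minimal sketch of one simple variant: values of a single numeric attribute are sorted, cut into groups of at least k neighbours, and each value is replaced by its group mean. The attribute (age), the group size, and the function name are assumptions made for the example; production methods are usually multivariate and more careful about information loss.

```python
def microaggregate(values, k=3):
    """Replace each value with the mean of its group of sorted neighbours.

    Groups hold at least k members, except when fewer than k values exist
    in total, in which case all values form a single group.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Positions of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    n = len(order)
    start = 0
    while start < n:
        end = start + k
        # Fold a short tail into this group so no later group is smaller than k.
        if n - end < k:
            end = n
        group = order[start:end]
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            result[i] = mean
        start = end
    return result


# Hypothetical ages attached to released samples.
ages = [34, 71, 36, 68, 35, 70, 90]
binned = microaggregate(ages, k=3)
```

Each released age is now shared by at least three samples, which blunts linking, but analyses that depend on exact ages lose resolution. That is the usefulness trade-off noted above.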
Sequestering identifiers via key-coding.
This is a pivotally important safeguard. It does not totally obviate the possibility of identification via matching or profiling, but that possibility can be much reduced by carefully removing strongly identifying data from data-sets before key-coding and releasing them.
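A minimal sketch of key-coding follows, assuming records are simple dictionaries and that three fields count as direct identifiers. The field names, the code format, and the function name are hypothetical. The point is the separation: the coded data-set is released, while the key that maps codes back to identities is held apart under its own access controls.

```python
import secrets

# Assumed list of direct identifiers; a real project would define its own.
DIRECT_IDENTIFIERS = ("name", "date_of_birth", "medical_record_number")


def key_code(records, identifiers=DIRECT_IDENTIFIERS):
    """Split records into a releasable coded data-set and a separately held key."""
    released, key = [], {}
    for rec in records:
        # Random code, deliberately not derived from any identifier.
        code = secrets.token_hex(8)
        while code in key:  # regenerate in the unlikely event of a collision
            code = secrets.token_hex(8)
        # The key keeps the identifiers; the released record drops them.
        key[code] = {f: rec[f] for f in identifiers if f in rec}
        coded = {f: v for f, v in rec.items() if f not in identifiers}
        coded["code"] = code
        released.append(coded)
    return released, key


# Invented example records.
records = [
    {"name": "Person A", "date_of_birth": "1950-01-01",
     "medical_record_number": "MRN-001", "genotype": "AG", "diagnosis": "T2D"},
    {"name": "Person B", "date_of_birth": "1962-07-15",
     "medical_record_number": "MRN-002", "genotype": "GG", "diagnosis": "asthma"},
]

released, key = key_code(records)
```

The code is random rather than a hash of the identifiers, so it cannot be recomputed by someone who already knows a name or record number. As the text notes, the remaining fields (here genotype and diagnosis) can still support matching or profiling, so key-coding complements rather than replaces the other safeguards.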
Shift Toward Controlled Data-Release
It follows from the preceding points that, although a lot of sequence and related data can still be made freely accessible (for example, by being posted on publicly accessible websites), projects will increasingly have to manage access via controlled-release arrangements in which, among other things, accessors commit to protecting privacy and confidentiality.
Freely open data-release is acceptable only if either: (a) consent to it is ethically and legally legitimate, and granted; or (b) the data are for all practical purposes non-identifiable.
Generally the experience with controlled release has been positive. But the scale and potential international accessibility of many new projects will test the robustness and enforceability of access arrangements.
Possible Follow-On Work
The following are areas of work that individuals, groups, or institutions should be encouraged to pursue further, for the general benefit.
- Pursue technical analyses of the variability within chromosomes and the threshold
"amount" of genome that can be released without leading to identification of persons.
(Preliminary work along these lines was presented at the workshop by Dr. Stephen
Sherry of the National Center for Biotechnology Information.)
- Examine the extent to which consent can legitimate truly open release of person-
unique genetic data, and review the pros and cons of nested and purpose-specific
- In crafting consent processes, continue to explore ways of improving openness and
comprehensibility regarding purposes and risks.
- Review the ethico-legal and operational consequences that increased identifiability
of genomic data and biospecimens may have vis-à-vis construal of "human subject"
under the Common Rule and other regulations.
- Examine the optimal roles for IRB and/or other ethics review (such as by HIPAA
Privacy Boards or special data-use boards) in various data-flow models, and the stages
at which such review can be most constructive.
- Review whether Principal Investigators and secondary-data and biospecimen
distributors are adequately prepared to de-identify data before providing them to others
for research, and if necessary prepare guidance and case examples.
- Develop criteria for deciding whether access to particular data-sets should be by open
or controlled release.
- Continue to improve the terms and procedures of controlled access. As part of this,
evaluate lessons learned from established projects and those being learned by new
initiatives. Consider developing data-release agreement templates.
- Examine the issues relating to release of U.S.-origin genomic data to researchers
outside the U.S., and to the importation of genomic data to the U.S. from elsewhere.
- Design programs and materials to help sensitize researchers to all of these matters.
- Take stock of the views of the general public, various advocates, and researchers on
selected aspects of this puzzle.
- Review the adequacy of existing legislative and regulatory protections against misuse
or abuse of biomedical research data, and consider whether additional protections are
needed; for instance, to:
(a) Prevent inappropriate probing of research databases by law enforcement agencies.
(b) Prevent unjustifiable use of the Freedom of Information Act to force disclosure of identifying information.
(c) Establish sanctions that could be applied if data recipients violate controlled-release undertakings.
Last Reviewed: March 13, 2012