ENCODE: Deciphering Function in the Human Genome

Genome Advance of the Month

Roseanne F. Zhao, Ph.D.
NIH Medical Scientist Training Program Track 3 Scholar

From uncovering the double helix of DNA to sequencing the roughly 3 billion letters of code that make up the complete genetic blueprint of humans, our inward journey of discovery has been filled with historic milestones. Achieving an understanding of the human genome (for example, what information is encoded in it, and how it functions and interacts with the environment) is an exciting scientific undertaking because of its potential to reveal key insights into how our DNA gives rise to all of the proteins required for building a human being. Such knowledge would have broad implications for a myriad of cutting-edge questions in biology and medicine, including gene regulation, natural variation between individuals, disease susceptibility, and human evolution.

However, reading and interpreting the human genome sequence has proven to be very challenging. Scientists have been able to identify approximately 21,000 protein-coding genes, in large part by using the long-ago established genetic code. But these protein-coding regions make up only approximately 1 percent of the human genome, and no similar code exists for the other functional parts of the genome. Evidence has accumulated over the years that at least some of the remaining 99 percent of the genome is important for regulating gene expression, yet we lacked a global view of how much of the genome was functional, where these other functional regions were located, and in what cell types they were active.

To address this gap in our knowledge, the Encyclopedia of DNA Elements (ENCODE) was launched in 2003 as one of the next steps to understanding how to interpret the information locked within our genomes. Funded by the National Human Genome Research Institute (NHGRI), the ENCODE Project set out to systematically identify and catalog all functional elements present in our DNA, that is, the parts of the genetic blueprint that may be crucial in directing how our cells function. Initially established as a pilot project focused on 1 percent of the human genome, ENCODE was scaled to whole-genome analysis in 2007; that same year, a related project named modENCODE was initiated to map all of the functional regions in the worm (C. elegans) and fly (D. melanogaster) genomes. In its scale-up phase, the ENCODE Project was a massive collaborative effort by a consortium of 32 research groups, comprising more than 400 scientists.

The main results of this ambitious effort have now been reported in 30 coordinated papers published in the September 6, 2012, issues of Nature, Genome Research and Genome Biology, along with additional ENCODE-funded papers in Science, Cell and Nucleic Acids Research. Together, they highlight an initial analysis of 15 trillion bytes of raw data, generated from 1640 datasets that involve 147 cell types.

Within this treasure trove of data, researchers found that more than 80 percent of the human genome has at least one biochemical activity. Although it is currently unknown whether all of this DNA contributes to cellular function, the majority can be transcribed into RNA. Furthermore, nearly 20 percent of the genome is associated with DNase hypersensitivity or transcription factor binding, two common features used to identify regulatory regions. Both figures are much higher than previous estimates that 5-10 percent of the genome was functional.

Significantly, more than 4 million regions that appeared to be regulatory regions, or "switches," were identified. These switches are important because they can be used in different combinations to control which genes are turned on and off, as well as when, where and how much they are expressed. Effectively, this provides precise instructions for determining the characteristics and functions of different cell types in the body. Changes in these regulatory switches, especially those regulating critical biological processes, can thus influence the development of disease. The amount of gene-regulatory activity uncovered in the human genome is striking: more of the genome encodes regulatory instructions than protein. This finding prompts an assortment of complex questions on how the genome is involved in health and disease.

As a foundational information resource for biomedical research, the data put forth by the ENCODE Project is openly accessible and available through the ENCODE portal (http://encodeproject.org). More than double the amount of data used in these analyses has now been generated and made available through this portal.

In addition to the individual papers, results have also been organized along "threads" that explore specific scientific themes (www.nature.com/encode). This new approach of incorporating, organizing and presenting data from relevant sections of different papers, in different journals, helps to facilitate better user navigation through the immense amount of data and analyses generated.

The ENCODE results are already influencing the way scientists are thinking about both new and existing data. For example, Thread #12 in the Nature ENCODE site focuses on the impact of functional information in understanding genetic variation within the human genome. Genome-wide association studies (GWAS) have previously been used to comb the genome for regions that are associated with specific human diseases or other traits. By comparing DNA sequences from hundreds to thousands of people either with or without a given disease, researchers have been able to identify regions containing variants that are associated with disease. Interestingly, more than 90 percent of these variants have been found in non-coding regions. However, because genetic variants within a given region may be linked to many other variants within the same region, it has been difficult to determine which variants have a causal contribution to increased disease risk.

But when researchers compared the locations of non-coding functional elements identified by ENCODE with disease-associated genetic variants previously identified by GWAS, they detected a striking correlation between the two: genetic variants associated with diseases or other traits were enriched in regulatory switches within the genome. This is exciting because it provides an overarching framework for looking at many different diseases (including Alzheimer's, diabetes, heart disease, and cancer), and for identifying the numerous genetic variants that cause them, beyond the context of DNA that codes for proteins.
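
The kind of comparison described above can be illustrated with a minimal sketch. It is not the ENCODE consortium's actual method: the function names, the half-open interval convention, and the toy coordinates below are all assumptions made for illustration. The sketch counts how many variant positions fall inside a set of regulatory regions on a single chromosome, then divides the observed fraction by the fraction of the chromosome those regions cover.

```python
from bisect import bisect_right


def count_variants_in_regions(regions, variants):
    """Count variant positions that fall inside any [start, end) region.

    regions:  list of (start, end) half-open intervals on one chromosome,
              assumed not to overlap each other.
    variants: list of integer positions on the same chromosome.
    """
    ordered = sorted(regions)
    starts = [start for start, _ in ordered]
    hits = 0
    for pos in variants:
        # Index of the last region whose start is <= pos (or -1 if none).
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < ordered[i][1]:
            hits += 1
    return hits


def fold_enrichment(regions, variants, chrom_length):
    """Observed fraction of variants inside regions, divided by the
    fraction of the chromosome the regions cover (a naive expectation)."""
    covered = sum(end - start for start, end in regions)
    expected_fraction = covered / chrom_length
    observed_fraction = count_variants_in_regions(regions, variants) / len(variants)
    return observed_fraction / expected_fraction


# Made-up toy data for illustration only; not ENCODE or GWAS results.
toy_regions = [(100, 200), (500, 600), (900, 1000)]  # cover 300 of 1000 bases
toy_variants = [150, 180, 550, 700, 950]             # 4 of 5 fall inside

print(count_variants_in_regions(toy_regions, toy_variants))
print(fold_enrichment(toy_regions, toy_variants, 1000))
```

A ratio above 1 in this toy setting means variants land in the regions more often than the regions' share of the sequence would suggest. A real analysis would need far more care, for instance accounting for the linkage between nearby variants and comparing against suitably matched control variants.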

Even outside its extraordinary scientific contributions, the structural model of the ENCODE Project is fundamentally changing the way large-scale scientific projects are being conducted. Resources such as the ENCODE analysis virtual machines (www.encodeproject.org/ENCODE/analysis.html) provide access to various stages of analysis, including input data sets, methods of analysis and code bundles. ENCODE software tools, data standards, experimental guidelines and quality metrics are all freely available at the ENCODE portal. This allows other researchers to independently assess and reproduce the data and the analyses — with a focus on scientific access, transparency and reproducibility — or to use similar methods to analyze their own data.

To date, 170 publications from labs outside of ENCODE have used ENCODE data in their work on human disease, basic biology, and methods development (see: www.encodeproject.org/ENCODE/pubsOther.html). With a basic reference data set and accompanying analytical resources now established, scientists expect further breakthroughs in the coming years.

However, this is just the beginning, and much work remains to be done before we are able to extract all of the functional and disease-related readouts from a genomic sequence. A glance at the various threads will show that the future challenges are numerous, ranging from computational and analytical problems to uncovering the complex mechanisms of gene regulation. Understanding how the linear, one-dimensional sequence of DNA code correlates with the intricate 3D fractal patterns of folded DNA, which can be important in shaping regulatory network interactions, will also be essential.

The foundations laid down by ENCODE will be invaluable in helping us to figure out how genetic variation influences gene regulation, human health, and disease. To expand and build up a more comprehensive understanding of the human genome, NHGRI has renewed funding for the ENCODE Project for an additional four years to deepen the catalog of functional elements through the study of additional cell types and factors; this build-out phase will also focus on new methods of data analysis (see: www.genome.gov/27550325/2012-release-nih-encode-grants-advance-effort-to-survey-entire-human-instruction-book/). By achieving an improved understanding of genetics in normal and diseased conditions, we will eventually be able to realize the full potential of bringing individualized genome sequencing and personalized, genomic medicine into the clinic.
