Genome Advance of the Month

Multi-tasking DNA: Dual-use codons in the human genome

We know that the human genome is the molecular instruction book for building the human body, but how exactly does it work? In 2003, the Human Genome Project (HGP) reached completion, having sequenced the 3 billion base pairs that make up a full human genome. Yet having the complete human genome sequence did not mean a complete understanding of what all those As, Cs, Ts and Gs mean for our biology.

Researchers have been hard at work understanding how our genome works, how tiny differences account for the wide diversity among us, how a somewhat larger set of differences explains why we differ so much from our closest primate relatives, and how differences in our genomes contribute to health and disease. December's Genome Advance of the Month highlights a paper published by Andrew Stergachis, Ph.D., professor of epidemiology at the University of Washington, and his colleagues in the December 13, 2013, issue of Science. So what's all the fuss about?

A surprising finding from the HGP was just how few genes are in the human genome. Early estimates were much higher than the eventual answer of 21,000 genes. These 21,000 genes produce different proteins at different times to generate the myriad cell types (such as muscle cells, skin cells and brain cells) that make a human being. But the 21,000 genes in the human genome make up only one percent of the overall DNA sequence.

So, what does the other 99 percent do? Scientists also know that approximately five percent of the 3 billion bases of the human genome is highly conserved (left unchanged) by evolution and therefore thought to be very important. Genes are only part of this five percent; but what about the rest?

Before we continue, a little background. Encoded within each gene is information about the sequence of amino acids required to make that gene's protein. Each amino acid is specified by a sequence of three base pairs, called a codon. In addition to the protein-coding information, the genome also contains many DNA sequences (known as functional elements) that control when individual genes are switched on or off.

In most situations, scientists assumed that codons just contained protein coding information and functional elements just contained regions of DNA that regulated gene expression, with each only doing a single job. While scientists observed that some genes contained codons that did both jobs, it wasn't known how widespread this was in the genome.

Dr. Stergachis's study used data generated by NHGRI's Encyclopedia of DNA Elements (ENCODE) Project, an ongoing effort to find and catalog all the different functional elements in our genome. The ENCODE data include the genomic locations of a particular kind of functional element called a transcription factor (TF) recognition site. Transcription factors are proteins that attach to specific sequences of DNA and control whether or not a gene (which may be nearby or many thousands of base pairs away) produces a protein (i.e., is expressed).

Of the roughly 11 million transcription factor binding sites found by Dr. Stergachis's team, more than 200,000 were located within protein-coding sequences. In other words, the same stretches of As, Cs, Ts and Gs can both bind a regulatory protein that controls whether the gene is on or off and specify the amino acids of the protein that is made. By the study's count, approximately 14 percent of all human amino acid-determining bases are also regulatory binding sites, and about 87 percent of genes contain these dual-use codons, called duons. Duons are therefore far more frequent in the human genome than previously thought.
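
To make the idea of "dual-use" bases concrete, here is a minimal Python sketch of how one might count the bases shared between coding regions and binding sites. It is not the study's actual pipeline: the function names and the toy coordinates are invented for illustration, and intervals are assumed to be half-open (start, end) positions on a single chromosome.

```python
def merge_intervals(intervals):
    """Sort half-open (start, end) intervals and merge any that overlap or touch."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def shared_bases(coding, sites):
    """Count bases that fall inside both a coding region and a binding site."""
    coding = merge_intervals(coding)
    sites = merge_intervals(sites)
    i = j = total = 0
    while i < len(coding) and j < len(sites):
        lo = max(coding[i][0], sites[j][0])
        hi = min(coding[i][1], sites[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if coding[i][1] < sites[j][1]:
            i += 1
        else:
            j += 1
    return total


# Hypothetical toy coordinates, not real genome annotations.
coding_regions = [(100, 200), (300, 400)]
tf_sites = [(150, 170), (190, 310), (390, 450)]

coding_total = sum(end - start for start, end in merge_intervals(coding_regions))
dual_use = shared_bases(coding_regions, tf_sites)
fraction = dual_use / coding_total
print(f"{dual_use} of {coding_total} coding bases are dual-use ({fraction:.0%})")
```

On real data the same calculation would run per chromosome over millions of intervals, which is why the sketch sorts and sweeps the two lists once instead of comparing every pair.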

The authors then used the data to look for evidence of evolutionary selection in these duons. With just four different bases, it's theoretically possible to have 64 different codons (4 x 4 x 4), but as we only have 20 different amino acids, some amino acids have multiple codons. That makes it possible for a gene's DNA sequence to mutate without a resulting change in the protein. Evolutionary biologists traditionally treat these kinds of base pair differences between people as "neutral" mutations (neither advantageous nor detrimental to the organism in terms of natural selection). However, if the base pair sequence of duons is also important for transcription factor binding, then these kinds of base pair differences would not be neutral: the protein would still have the same amino acids, but the mutated transcription factor binding site might alter whether that gene is expressed, leading to an evolutionary disadvantage.
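
The codon arithmetic above can be checked with a short Python sketch. It builds the standard genetic code from the usual compact one-letter listing (bases ordered T, C, A, G; "*" marks a stop codon) and tests whether a single-base change leaves the amino acid unchanged. The names CODON_TABLE and is_synonymous are our own, chosen for illustration.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_synonymous(codon_a, codon_b):
    """Return True if both codons specify the same amino acid (or both are stops)."""
    return CODON_TABLE[codon_a.upper()] == CODON_TABLE[codon_b.upper()]


codons_per_amino_acid = Counter(CODON_TABLE.values())
amino_acids = set(CODON_TABLE.values()) - {"*"}

print(len(CODON_TABLE), "codons for", len(amino_acids), "amino acids")
print("GCT -> GCC synonymous?", is_synonymous("GCT", "GCC"))  # both alanine
print("ATG -> ATA synonymous?", is_synonymous("ATG", "ATA"))  # methionine vs. isoleucine
```

A change such as GCT to GCC is exactly the kind of mutation that looks neutral at the protein level yet could still disturb a transcription factor binding site if the codon is a duon.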

Dr. Stergachis's study found that duons are indeed highly conserved (that is, they show a low level of genetic variation) in the human samples the team studied. Hence, the tendency of transcription factor proteins to require certain sequences of DNA for binding leads to high levels of evolutionary conservation.

The researchers observed that different transcription factor proteins avoid or prefer certain regions of genes, such as the beginning of the gene and regions known to generate structurally important amino acid sequences. They also found that transcription factors avoid stop codons both within and outside of gene sequences (i.e., the three-base combinations that signal the end of a gene, TAG, TAA or TGA, are never found in transcription factor recognition sites). In addition, the study showed that the binding behavior of some transcription factors may be associated with patterns of regulatory chemical groups (i.e., epigenetics) attached to DNA in highly active genes. All of these observations illustrate that transcription factor binding behavior and protein production have influenced each other over evolutionary time.
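
As a small illustration of the stop-codon observation, the following Python sketch scans a candidate recognition sequence for the three stop trinucleotides at every offset on the given strand. The function name and the two example sequences are invented for illustration and are not real transcription factor motifs.

```python
STOP_TRINUCLEOTIDES = ("TAA", "TAG", "TGA")


def find_stop_trinucleotides(site):
    """Return (offset, triplet) for every stop trinucleotide in the sequence."""
    site = site.upper()
    hits = []
    for offset in range(len(site) - 2):
        triplet = site[offset:offset + 3]
        if triplet in STOP_TRINUCLEOTIDES:
            hits.append((offset, triplet))
    return hits


# Hypothetical sequences, not real recognition sites.
print(find_stop_trinucleotides("GGCTAGCC"))   # contains TAG
print(find_stop_trinucleotides("GGGCGGGGC"))  # contains no stop trinucleotide
```

The sketch checks every offset, not a single reading frame, because a binding site lying inside a gene could overlap the coding sequence in any of the three frames.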

Another interesting discovery was that 13 percent of the DNA mutations that have been associated with human diseases and traits are located within duons, suggesting those diseases or traits may well be a result of a change in transcription factor activity on the relevant gene rather than a mutated version of that gene's protein.

"The fact that the genetic code can simultaneously write two kinds of information means that many DNA changes that appear to alter protein sequences may actually cause disease by disrupting gene control programs or even both mechanisms simultaneously," said John Stamatoyannopoulos, M.D., Associate Professor of Genome Sciences and Medicine, University of Washington School of Medicine, and senior author on the paper.

Further reading and resources

Stergachis AB, Haugen E, Shafer A, Fu W, Vernot B, Reynolds A, Raubitschek A, Ziegler S, LeProust EM, Akey JM, Stamatoyannopoulos JA. Exonic transcription factor binding directs codon choice and affects protein evolution. Science. 2013 Dec 13;342(6164):1367-72.

Weatheritt RJ, Babu MM. Evolution. The hidden codes that shape protein evolution. Science. 2013 Dec 13;342(6164):1325-6. doi: 10.1126/science.1248425.

September 2012 Genome Advance of the Month - ENCODE: Deciphering Function in the Human Genome


Posted: February 6, 2014