International Human Genome Sequencing Consortium Publishes Sequence and Analysis of the Human Genome

February 12, 2001

WASHINGTON, D.C. - The Human Genome Project international consortium today announced the publication of a draft sequence and initial analysis of the human genome - the genetic blueprint for a human being. The paper appears in the Feb. 15 issue of the journal Nature.

The draft sequence, which covers more than 90 percent of the human genome, represents the exact order of DNA's four chemical bases - commonly abbreviated as A, T, C and G - along the human chromosomes. This DNA text influences everything from eye color and height, to aging and disease.

The consortium's initial analysis of this text represents scientists' first global view of the human genomic landscape, with its extraordinary trove of information about human development, physiology, medicine and evolution.

The results reported in this week's Nature represent major progress for the human genome consortium. On June 26, 2000, the consortium announced that it had collected roughly 90 percent of the letters of the text for the "Book of Life." The consortium's new achievement represents a further compilation of these letters into the first draft of a readable text.

There are small gaps still remaining in this text, but scientists are already getting a good sense of what the genome landscape looks like and the surprising stories it has to tell. Below are highlights:

  • The distribution of genes on mammalian chromosomes is striking. It turns out that our chromosomes have crowded urban centers with many genes in close proximity to one another and also vast expanses of unpopulated desert where only non-coding "junk" DNA can be found. This distribution of genes is in marked contrast to the genomes of many other organisms, such as the mustard weed, the worm and the fly. Their genomes more closely resemble uniform, sprawling suburbs, with genes relatively evenly spaced throughout.

  • Though a definitive count of human genes must await further experimental and computational analysis, scientists now estimate that humans have some 30,000 to 35,000 genes in their genomes. This new estimate indicates that humans have only about twice as many genes as the worm or the fly. How can human complexity be explained by a genome with such a paucity of genes? It turns out that humans are very thrifty with their genes, able to do more with what they have than other species. Instead of producing only one protein per gene, the average human gene produces three different proteins. (See Vignette 2)

  • The full set of proteins (the proteome) encoded by the human genome is more complex than those of invertebrates because humans and other vertebrates have rearranged old protein domains into a rich collection of new architectures. In other words, humans have for the most part achieved innovations by rearranging and expanding tried-and-true strategies from other species, rather than by developing novel strategies of their own. (See Vignette 3)

  • Scientists have identified more than 200 genes in the human genome whose closest relatives are in bacteria. Analogous genes are not found in invertebrates, such as the worm, fly and yeast. This suggests that these genes were acquired in the more recent evolutionary past, perhaps after the emergence of vertebrates. Scientists didn't find any single bacterial source for the transferred genes, indicating that several independent gene transfers from different bacteria occurred. (See Vignette 9)

  • Our junk DNA, characterized by long stretches of repeating sequences, represents a rich fossil record of clues to our evolutionary past. It is possible to date groups of so-called "repeats" to when in the evolutionary process they were "born" and to follow their fates in different regions of the genome and in different species. The HGP scientists used 3 million such repeating elements as dating tools. Based on such "DNA dating," scientists can build family trees of the repeats, showing exactly where they came from and when. These repeats have reshaped the genome by rearranging it, creating entirely new genes, and modifying and reshuffling existing genes. (See Vignette 4)

  • We have a greater percentage of junk DNA in our genomes - 50 percent - than the mustard weed (11 percent), the worm (7 percent) or the fly (3 percent). Also, shockingly, there seems to have been a dramatic decrease in the activity of repeats in the human genome over the past 50 million years - as if the human species decided 50 million years ago to stop collecting junk. In contrast, there seems to be no such decline in repeats in rodents. (See Vignette 6)

  • Ordinarily, repeat elements land in inhospitable regions of the genome - regions that are AT-rich and GC-poor. But mysteriously, one type of repeat, called "SINE elements," has found a way to take up residence in the GC-rich neighborhoods of the genome. Over the years, SINE elements have acquired a bad reputation among scientists for what looked like parasitic behavior. But this bad reputation is unjustified. We now see that SINE elements may be helpful symbionts that earn their keep in the genome. (See Vignette 7)

  • By dating the 3 million repeat elements and examining the pattern of interspersed repeats on the Y chromosome, scientists estimated the relative mutation rates in the X and the Y chromosome and in the male and female germ lines. They found that the ratio of mutations in males versus females is 2:1. Scientists point to several possible reasons for the higher mutation rate in the male germ line, including the fact that there are a greater number of cell divisions involved in the formation of sperm than in the formation of eggs. (See Vignette 8)

  • In a companion volume to the Book of Life, scientists have created a catalogue of 1.4 million single-letter differences, or single nucleotide polymorphisms (SNPs), and specified their exact location in the human genome. This SNP map, the world's largest publicly available catalogue of SNPs, promises to revolutionize both mapping diseases and tracing human history. (See Vignette 10)

The sequence information from the consortium has been immediately and freely released to the world, with no restrictions on its use or redistribution. The information is scanned daily by scientists in academia and industry, as well as by commercial database companies, providing key information services to biotechnologists. Already, many tens of thousands of genes have been identified from the genome sequence, including more than 30 that play a direct role in human disease.

The scientific work reported here will serve as a basis for research and discovery in the coming decades. Such research will have profound long-term consequences for medicine. It will help elucidate the underlying molecular mechanisms of disease. This in turn will allow researchers to design better drugs and therapies for many illnesses.

But, as the authors of the Nature paper write, "the science is only part of the challenge. We must also involve society at large in the work ahead. We must set realistic expectations that the most important benefits will not be reaped overnight. Moreover, understanding and wisdom will be required to ensure that they are implemented broadly and equitably."

"We are standing at an extraordinary moment in scientific history. It's as though we have climbed to the top of the Himalayas. We can for the first time see the breathtaking vista of the human genome," said Eric Lander, director of the Whitehead Institute Center for Genome Research. "For many years to come, we will be exploring the intricate details of the terrain ahead. We've got a long way to go before we will ultimately understand all the secrets that the genome has to tell us."

"This remarkable achievement is a clear testament to the hard work of the hundreds of scientists in the sixteen genome centers that make up the Human Genome Project consortium," said Francis Collins, director of the National Human Genome Research Institute. "These scientists have proved to the world that they can work together toward a common human good. For, with the human genome sequence in hand, we can begin to build the tools we need to conquer the host of illnesses that cause untold human suffering and premature death."

What's Next?

The consortium's ultimate goal is to produce a completely "finished" sequence with no gaps and 99.99 percent accuracy. Although the near-finished version is adequate for most biomedical research, the HGP has made a commitment to filling all gaps and resolving all ambiguities in the sequence by 2003.

Production of genome sequence has skyrocketed over the past year, with more than 90 percent of the sequence having been produced in the past 15 months alone. Because of this increased capacity, the next phase is expected to move much more rapidly than previously expected.

The HGP also plans to sequence the genomes of many other species, because comparing genomes across species will provide researchers key tools for understanding the essential elements that evolution has designated as important to survival. This information will in turn translate into practical knowledge toward developing better therapies in the future.

As the authors of the Nature paper point out, the draft genome sequence has provided an initial look at the human gene content, but many ambiguities remain. One of the HGP's priorities will be to refine the data to accurately reflect every gene and every alternatively spliced form.

Several steps are needed to reach this ambitious goal, they report. Finishing the human sequence will help, but in addition, scientists will need cross-species comparisons to achieve this goal. A newly formed public-private consortium is speeding this effort, producing freely accessible data that can be readily used for cross-species comparison.

Comparative genomics will also offer scientists insights into important regions in the sequence that perform regulatory functions. Also among the future plans for HGP scientists is the sequencing of other large genomes, such as those of primates. Scientists also plan to complete the catalogue of human variations in the population and identify the genes that predispose individuals to risk for common diseases.

Finally, the sequence will serve as a foundation for a broad range of functional genomic tools to help biologists to probe the function of the genes in a more systematic manner. Development of such post-genomic tools will be one of the major thrusts for biologists in the next decade, according to the scientists.

The HGP sequencing consortium used a biocluster from Compaq Computer Corporation that provided one terabyte of secondary storage and assisted with annotation and analysis.

In a related announcement today, the biotech firm Celera Genomics announced that it had published its human genome sequence in the journal Science. The company used a combination of its own data and the consortium's data, available freely online, to assemble its sequence.


Sequencing, which is determining the exact order of DNA's four chemical bases - commonly abbreviated A, T, C and G - has been expedited in the HGP by technological advances in deciphering DNA and the collaborative nature of the effort, which has drawn upon the talents of about 1,000 scientists worldwide.

The Human Genome Sequencing Project aims to determine the sequence of the euchromatic portion of the human genome. The "euchromatic" portion excludes certain regions consisting of long stretches of highly repetitive DNA that encode little genetic information. Such regions are said to be "heterochromatic." (Genomes contain long stretches of highly repetitive DNA. For example, the center of chromosomes, called "centromeres," consists of heterochromatic DNA.)

The international Human Genome Sequencing Consortium includes scientists at 20 institutions located in France, Germany, Japan, China, Great Britain and the United States. The five largest centers are located at: Baylor College of Medicine, Houston, Texas; Joint Genome Institute in Walnut Creek, CA; Sanger Centre near Cambridge, England; Washington University School of Medicine, St. Louis; and Whitehead Institute, Cambridge, Massachusetts.

The project is funded by grants from government agencies and public charities in the various countries. These include the National Human Genome Research Institute at the U.S. National Institutes of Health (NIH), the Wellcome Trust in England, and the U.S. Department of Energy, as well as agencies in Japan, France, Germany and China.

The total cost for Phase One ("working draft") is approximately $300 million worldwide, with roughly half ($150 million) being funded by the NIH.

The HGP is sometimes reported to have a cost of $3 billion. However, this figure refers to the total projected funding over a 15-year period (1990-2005) for a wide range of scientific activities related to genomics. These include studies of human diseases, experimental organisms (such as bacteria, yeast, worms, flies and mice); development of new technologies for biological and medical research; computational methods to analyze genomes; and ethical, legal and social issues related to genetics. Human genome sequencing represents only a small fraction of the overall 15-year budget.

The institutions that form the International Human Genome Sequencing Consortium include:

  1. Whitehead Institute for Biomedical Research, Center for Genome Research, Cambridge, MA, USA
  2. The Sanger Centre, The Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, United Kingdom
  3. Washington University Genome Sequencing Center, St. Louis, MO, USA
  4. US DOE Joint Genome Institute, Walnut Creek, CA, USA
  5. Baylor College of Medicine Human Genome Sequencing Center, Department of Molecular and Human Genetics, Houston, TX, USA
  6. RIKEN Genomic Sciences Center, Yokohama-city, Japan
  7. Genoscope and CNRS UMR-8030, Evry Cedex, France
  8. GTC Sequencing Center, Genome Therapeutics Corporation, Waltham, MA, USA
  9. Department of Genome Analysis, Institute of Molecular Biotechnology, Jena, Germany
  10. Beijing Genomics Institute/Human Genome Center, Institute of Genetics, Chinese Academy of Sciences, Beijing, China
  11. Multimegabase Sequencing Center, The Institute for Systems Biology, Seattle, WA, USA
  12. Stanford Genome Technology Center, Stanford, CA, USA
  13. Stanford Human Genome Center and Department of Genetics, Stanford University School of Medicine, Stanford, CA, USA
  14. University of Washington Genome Center, Seattle, WA, USA
  15. Department of Molecular Biology, Keio University School of Medicine, Tokyo, Japan
  16. University of Texas Southwestern Medical Center at Dallas, Dallas, TX, USA
  17. University of Oklahoma's Advanced Center for Genome Technology, Dept. of Chemistry and Biochemistry, University of Oklahoma, Norman, OK, USA
  18. Max Planck Institute for Molecular Genetics, Berlin, Germany
  19. Cold Spring Harbor Laboratory, Lita Annenberg Hazen Genome Center, Cold Spring Harbor, NY, USA
  20. GBF - German Research Centre for Biotechnology, Braunschweig, Germany

Geoff Spencer
Phone: (301) 402-0911
E-mail: spencerg@mail.nih.gov

Last updated: March 09, 2012