ENCODE Pilot Project

The National Human Genome Research Institute (NHGRI) launched a public research consortium named the Encyclopedia of DNA Elements (ENCODE) in September 2003 to carry out a project to identify all functional elements in the human genome sequence.

Background

In April 2003, the International Human Genome Sequencing Consortium announced the finished sequence of the human genome ("Finishing the euchromatic sequence of the human genome," Nature, October 21, 2004). Although this was a significant achievement, much remains to be done. Before the best use of the information contained in the sequence can be made, the identity and precise location of all of the protein-encoding and non-protein-encoding genes in the human genome will have to be determined, as will the identities and locations of other functional elements, including promoters and other transcriptional regulatory sequences, and determinants of chromosome structure and function, such as origins of replication. To date, much remains unknown about these functional elements in the human genome. A comprehensive encyclopedia of all of these features is needed to fully utilize the sequence to better understand human biology, to predict potential disease risks, and to stimulate the development of new therapies to prevent and treat these diseases.

To encourage discussion and comparison of existing computational and experimental approaches to annotating the human genome, and to stimulate the development of new ones, the NHGRI proposed to create a highly interactive public research consortium to carry out a pilot project for testing and comparing existing and new methods to identify functional sequences in DNA.

On July 23-24, 2002, the NHGRI organized a workshop, the Comprehensive Extraction of Biological Information from Genomic Sequence, to discuss this proposal. The workshop participants resoundingly supported the concept of a pilot project and made a number of recommendations about the project's goals, organization and implementation, which have now been incorporated into NHGRI's plan.

On March 7, 2003, the NHGRI held a meeting to officially launch the ENCODE Pilot Project Research Consortium and to provide information to potential applicants for two RFAs being released. View the meeting webcast.

Pilot Phase

The pilot phase tested and compared existing methods to rigorously analyze a defined portion of the human genome sequence. It was organized as an open consortium (See: ENCODE Pilot Project: Participants and Projects) and brought together investigators with diverse backgrounds and expertise to evaluate the relative merits of each of a diverse set of techniques, technologies and strategies. The concurrent technology development phase of the project aimed to develop new high throughput methods to identify functional elements. The goal of these efforts was to identify a suite of approaches that would allow the comprehensive identification of all the functional elements in the human genome. Through the ENCODE pilot, NHGRI assessed the abilities of different approaches to be scaled up for an effort to analyze the entire human genome and to find gaps in the ability to identify functional elements in genomic sequence.

The ENCODE Pilot Project process involved close interactions between computational and experimental scientists to evaluate a number of methods for annotating the human genome. A set of regions (See: ENCODE Pilot Project: Target Selection below) representing approximately 1 percent (30 Mb) of the human genome was selected as the target for the pilot project and was analyzed by all ENCODE Pilot Project investigators. All data generated by ENCODE participants on these regions was rapidly released into public databases. The ENCODE Pilot Project Consortium was open to all academic, government and private sector scientists who were interested in participating in an open process to facilitate the comprehensive interpretation of the human genome sequence and who agreed to the criteria for participation in the project.

Target Selection

For use in the ENCODE pilot project, defined regions of the human genome - corresponding to 30Mb, roughly 1 percent of the total human genome - have been selected. These regions serve as the foundation on which to test and evaluate the effectiveness and efficiency of a diverse set of methods and technologies for finding various functional elements in human DNA.

Prior to embarking upon the target selection, it was decided that 50 percent of the 30Mb of sequence would be selected manually while the remaining sequence would be selected randomly. The two main criteria for manually selected regions were: 1) the presence of well-studied genes or other known sequence elements, and 2) the existence of a substantial amount of comparative sequence data. A total of 14.82Mb of sequence was manually selected using this approach, consisting of 14 targets that range in size from 500kb to 2Mb.

The remaining 50 percent of the 30Mb of sequence was composed of thirty 500kb regions selected according to a stratified random-sampling strategy based on gene density and level of non-exonic conservation. These particular criteria were chosen to ensure a good sampling of genomic regions varying widely in their content of genes and other functional elements. The human genome was divided into three parts - top 20 percent, middle 30 percent, and bottom 50 percent - along each of two axes: 1) gene density and 2) level of non-exonic conservation with respect to the orthologous mouse genomic sequence (see below), for a total of nine strata. From each stratum, three random regions were chosen for the pilot project. For those strata underrepresented by the manual picks, a fourth region was chosen, resulting in a total of 30 regions. For all strata, a "backup" region was designated for use in the event of unforeseen technical problems.

In greater detail, the stratification criteria were as follows:

Gene density: The gene density score of a region was the percentage of bases covered either by genes in the Ensembl database, or by human mRNA best BLAT (BLAST-like alignment tool) alignments in the UCSC browser database.
Non-exonic conservation: The region was divided into non-overlapping subwindows of 125 bases. Subwindows that showed less than 75 percent base alignment with mouse sequence were discarded. For the remaining subwindows, the percentage with at least 80 percent base identity to mouse, and which did not correspond to Ensembl genes, GenBank mRNA BLASTZ alignments, Fgenesh++ gene predictions, TwinScan gene predictions, spliced EST alignments, or repeats, was used as the non-exonic conservation score.
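
The two scores defined above can be sketched in a few lines of Python. This is only an illustration, not the Consortium's actual pipeline: the function names, the half-open interval convention, and the per-subwindow input tuples are assumptions made for the example, and the conservation score is read here as a percentage of the subwindows that survive the 75 percent alignment filter.

```python
from typing import Iterable, List, Tuple

# Half-open [start, end) base coordinates (an assumed convention).
Interval = Tuple[int, int]
# (fraction of bases aligned to mouse, fraction identical to mouse,
#  overlaps a gene/mRNA/prediction/EST/repeat annotation) - hypothetical input.
Subwindow = Tuple[float, float, bool]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so each base is counted once."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def gene_density_score(window: Interval, annotations: Iterable[Interval]) -> float:
    """Percentage of bases in `window` covered by at least one annotation."""
    w_start, w_end = window
    clipped = [
        (max(s, w_start), min(e, w_end))
        for s, e in annotations
        if s < w_end and e > w_start
    ]
    covered = sum(e - s for s, e in merge_intervals(clipped))
    return 100.0 * covered / (w_end - w_start)


def nonexonic_conservation_score(subwindows: Iterable[Subwindow]) -> float:
    """Percentage of well-aligned 125-base subwindows that are conserved and unannotated."""
    kept = [sw for sw in subwindows if sw[0] >= 0.75]  # discard < 75% aligned
    if not kept:
        return 0.0
    hits = sum(1 for _, identity, annotated in kept
               if identity >= 0.80 and not annotated)
    return 100.0 * hits / len(kept)
```

Merging the annotation intervals before summing matters because Ensembl genes and mRNA alignments frequently overlap; without it, shared bases would be counted twice.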

The above scores were computed within non-overlapping 500 kb windows of finished sequence across the genome, and used to assign each window to a stratum.
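
As a rough illustration of the stratification step, the sketch below ranks windows along each axis, cuts the ranking into top 20 percent, middle 30 percent and bottom 50 percent, and draws up to three windows at random from each of the resulting strata. The function names, the tie-breaking by input order, and the fixed random seed are assumptions for the example; the fourth pick for underrepresented strata and the backup regions are not modeled.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Stratum = Tuple[str, str]  # (gene density tier, non-exonic conservation tier)


def tier(rank: int, n: int) -> str:
    """Tier for the window at 0-based ascending `rank` among `n` windows."""
    if rank * 100 >= 80 * n:   # highest-scoring 20 percent
        return "top"
    if rank * 100 >= 50 * n:   # next 30 percent
        return "middle"
    return "bottom"            # lowest-scoring 50 percent


def tiers(scores: Sequence[float]) -> List[str]:
    """Tier label for every window, in the original window order."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])  # ties keep input order
    labels = [""] * n
    for rank, idx in enumerate(order):
        labels[idx] = tier(rank, n)
    return labels


def assign_strata(gene_density: Sequence[float],
                  conservation: Sequence[float]) -> List[Stratum]:
    """Combine the two per-axis tiers into one of nine strata per window."""
    return list(zip(tiers(gene_density), tiers(conservation)))


def sample_regions(strata: Sequence[Stratum], per_stratum: int = 3,
                   seed: int = 0) -> Dict[Stratum, List[int]]:
    """Randomly pick up to `per_stratum` window indices from each stratum."""
    rng = random.Random(seed)
    by_stratum: Dict[Stratum, List[int]] = defaultdict(list)
    for window_index, stratum in enumerate(strata):
        by_stratum[stratum].append(window_index)
    return {s: sorted(rng.sample(ws, min(per_stratum, len(ws))))
            for s, ws in by_stratum.items()}
```

With three picks from each of nine strata this yields 27 regions; the remaining three of the thirty come from the extra picks described above.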

ENCODE Regions on the UCSC Genome Browser

Consortium Resources

ENCODE Target Sequences

All ENCODE target regions have accessioned sequence entries in RefSeq (NT_*) that are updated after each human genome build is released. A list of the target regions [genome.ucsc.edu] for the different human genome builds is available at the UCSC ENCODE Browser [genome.ucsc.edu]. In addition, homologous sequences from other vertebrate genomes, identified from whole genome shotgun assemblies or obtained by direct sequencing of BAC clones, are submitted and updated continuously.

All sequences can be obtained from the NCBI Entrez search engine. They have the keyword "ENCODE" (although some alternate regions have also been accessioned and they have the word ALTERNATE in the title). The following Entrez query will retrieve all primary ENCODE target sequences:

encode[Keyword] NOT alternate[Title]
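
For scripted access, the same query string could be passed to NCBI's E-utilities `esearch` endpoint. The sketch below only builds the request URL and makes no network call; the choice of the `nuccore` database and the `retmax` value are assumptions that the page does not state, and whether the keyword still returns the original records is not something this example can confirm.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
ENCODE_QUERY = "encode[Keyword] NOT alternate[Title]"  # query quoted above


def encode_esearch_url(retmax: int = 100) -> str:
    """Build an esearch URL for the primary ENCODE target sequences."""
    params = {
        "db": "nuccore",       # assumed: nucleotide database
        "term": ENCODE_QUERY,
        "retmax": retmax,      # assumed: maximum number of IDs to return
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(encode_esearch_url())
```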

To facilitate analysis of the comparative genomics data, the sequence data is frozen for periodic data releases. Sequence and annotation data for the ENCODE regions can be downloaded as FASTA sequence files from: Index of ENCODE Downloads.

BAC Clones for ENCODE Targets

BAC clones for the ENCODE regions for different vertebrate genomes have been identified by Eric Green's group at NHGRI/NISC. The maps of identified BAC clones across the different targets for each different organism are available at the NISC Web site: Summary of BAC Maps. The corresponding BAC clones can be obtained from the BACPAC Resources Center at Children's Hospital Oakland Research Institute in Oakland, California or from the Arizona Genomics Institute.

Cell Lines

Common cell lines were identified to evaluate the performances of experiments, platforms and reagents used by investigators and to ensure that biological variation is not the cause of differences observed between experiments in different groups.

Two cell lines were chosen for their different properties:

HeLa S3, a cervical adenocarcinoma, was chosen because it can be transfected with high efficiency and large quantities of these cells can be easily synchronized in the cell cycle to facilitate studies on DNA replication.
GM06990, an Epstein-Barr virus-transformed B-lymphocyte from the Utah CEPH collection, was chosen as a representative lymphoblastoid cell line. These cells have a normal karyotype and can be stimulated with mitogens to activate signal transduction pathways that involve the activation of well-studied genes in the ENCODE target regions.

Additional common cell lines were identified for use by Consortium members during the ENCODE pilot project. These include BJ-TERT, an immortalized foreskin fibroblast cell line; K562, an erythroblastoid cell line which expresses globin genes; and HepG2, a hepatocarcinoma cell line which expresses lipoproteins. K562 and HepG2 were selected because they express genes of interest that lie within the ENCODE regions.

Antibodies to DNA-Binding Proteins

The Consortium identified four common antibodies to use as controls in ChIP-chip cross-platform comparisons that were performed as part of the ENCODE pilot project. Antibodies that recognize RNA polymerase II and TAFII250, a component of a general transcription factor that initiates the preinitiation complex assembly for RNA polymerase II, should be bound to the promoters of all genes actively transcribed by RNA polymerase II. The RNA polymerase II antibody is available through Covance Research Products Inc. (Catalog #MMS-126R) and the TAFII250 antibody is available through Santa Cruz Biotechnology (Catalog #SC-735).

The third common antibody recognizes STAT-1, a transcription factor induced following treatment of cells with IFN. This protein should only bind to IFN-inducible promoters following stimulation of cells and is available through Santa Cruz Biotechnology (Catalog #SC-345). An antibody against a histone modification - acetylated histone H4 - was also chosen because of its role in cell cycle progression. It is available through Upstate Cell Signaling Solutions (Catalog #06-866).

Genome Tiling Microarrays

Investigators from Affymetrix and NimbleGen Systems, Inc. actively participated in the ENCODE pilot project. In addition, some ENCODE investigators collaborated with Agilent Technologies on tiling microarray experiments. Each company developed specialized microarrays that tile across the ENCODE regions and are useful for ChIP-chip and other studies.

Comparative Sequence Analysis

A component of ENCODE pilot project data production involves the generation of sequencing information from a number of different genomes in order to extract the maximum amount of information about the human genome through comparative analyses. Efforts are already underway at the NHGRI, the University of British Columbia and the NIH Intramural Sequencing Center to identify, map and sequence, respectively, BAC clones for regions syntenic to the human ENCODE targets in additional mammalian species. In addition to these ENCODE-directed efforts, sequence data generated through whole genome sequencing projects will be used in comparative analyses to help scientists better understand the human sequence. ENCODE Participants intend to abide by the Fort Lauderdale recommendations on "Sharing Data from Large-scale Biological Research Projects" when using unpublished sequence data in Project analyses.

HapMap Coordination

The International HapMap Project has decided to focus on 10 of the ENCODE random regions for comprehensive genotyping as part of an in-depth study of human genetic variation. The regions were chosen to represent a range of conservation with the mouse genome and of gene density according to the strata identified during the ENCODE target selection process.

The 10 HapMap-ENCODE regions were resequenced in 48 unrelated individuals (16 Yoruba, 16 CEPH, 8 Han Chinese, and 8 Japanese) using a PCR-based method. A total of 30,000 single nucleotide polymorphisms (SNPs) were identified in the HapMap-ENCODE regions. Some of these were already represented in dbSNP, a database of SNP data that is managed by the National Center for Biotechnology Information (NCBI), while others were discovered during the resequencing. The newly discovered SNPs were added to dbSNP, and the sequence data from the 48 individuals are stored in NCBI's Trace Archive.

Of the 30,000 SNPs identified in the HapMap-ENCODE regions, 10,000 were not analyzed because of failed design or failed genotyping. Genotype data were obtained from the remaining 20,000 SNPs in the HapMap-ENCODE regions of all 270 samples used for the HapMap Project (90 CEPH, 90 Yoruba, 45 Han Chinese, and 45 Japanese). This genotyping was done at the Broad Institute of Harvard and MIT, Illumina, Baylor College of Medicine, McGill University & Genome Quebec Innovation Centre, and the University of California, San Francisco.
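
The figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only numbers stated on this page (the SNP counts are the rounded figures given in the text, and the 500kb size of each random region comes from the Target Selection section); the variable names are illustrative.

```python
# Sample counts as stated in the text.
resequenced = {"Yoruba": 16, "CEPH": 16, "Han Chinese": 8, "Japanese": 8}
genotyped = {"CEPH": 90, "Yoruba": 90, "Han Chinese": 45, "Japanese": 45}

# Rounded SNP counts as stated in the text.
snps_identified = 30_000
snps_not_analyzed = 10_000
snps_genotyped = snps_identified - snps_not_analyzed

# Ten random regions of 500kb each (region size from Target Selection).
region_count = 10
region_size_bp = 500_000
total_bp = region_count * region_size_bp

# Approximate average spacing between genotyped SNPs.
bases_per_genotyped_snp = total_bp / snps_genotyped

if __name__ == "__main__":
    print(sum(resequenced.values()), "individuals resequenced")
    print(sum(genotyped.values()), "samples genotyped")
    print(snps_genotyped, "SNPs genotyped")
    print(bases_per_genotyped_snp, "bases per genotyped SNP on average")
```

On these rounded figures, the genotyped SNPs average roughly one per 250 bases across the 5 Mb covered by the ten regions.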

The ENCODE-HapMap genotyping data set is considered to be a "gold standard" data set because of the high density of SNP coverage. The genotype data from these regions will be used to determine the best way to choose tag SNPs and to assess the adequacy of the entire HapMap for many analyses, such as coverage, linkage disequilibrium (LD) measures, and haplotype inference.
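
To make the linkage disequilibrium measures mentioned above concrete, the sketch below computes the standard pairwise statistics D, |D'| and r² for two biallelic SNPs from phased haplotype counts. It is a textbook illustration, not the HapMap analysis pipeline; the function name and the four-count input format are assumptions for the example.

```python
from typing import Tuple


def ld_measures(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> Tuple[float, float, float]:
    """Return (D, |D'|, r^2) for two biallelic SNPs from haplotype counts.

    n_AB is the number of haplotypes carrying allele A at the first SNP and
    allele B at the second, and so on for the other three combinations.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")

    d = p_AB - p_A * p_B           # raw disequilibrium coefficient
    r2 = d * d / denom             # squared allelic correlation

    # D' rescales D by its maximum possible magnitude given allele frequencies.
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    return d, d_prime, r2
```

An r² of 1 means one SNP perfectly predicts the other, which is the property that tag SNP selection relies on.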

Meeting Reports

ENCODE Pilot Project Launch Meeting

July 23-24, 2002: Workshop on the Comprehensive Extraction of Biological Information From Genomic Sequence

Program Director

Last updated: October 18, 2012