ENCODE Pilot Project: Target Selection

For use in the ENCODE pilot project, defined regions of the human genome - corresponding to 30Mb, roughly 1 percent of the total human genome - have been selected. These regions serve as the foundation on which to test and evaluate the effectiveness and efficiency of a diverse set of methods and technologies for finding various functional elements in human DNA.

Before target selection began, it was decided that 50 percent of the 30Mb of sequence would be selected manually and the remaining sequence would be selected randomly. The two main criteria for manually selected regions were: 1) the presence of well-studied genes or other known sequence elements, and 2) the existence of a substantial amount of comparative sequence data. A total of 14.82Mb of sequence was manually selected using this approach, consisting of 14 targets that range in size from 500kb to 2Mb.

The remaining 50 percent of the 30Mb of sequence was composed of thirty 500kb regions selected according to a stratified random-sampling strategy based on gene density and level of non-exonic conservation. These criteria were chosen to ensure a good sampling of genomic regions varying widely in their content of genes and other functional elements. The human genome was divided into three parts - top 20 percent, middle 30 percent, and bottom 50 percent - along each of two axes: 1) gene density and 2) level of non-exonic conservation with respect to the orthologous mouse genomic sequence (see below), for a total of nine strata. From each stratum, three random regions were chosen for the pilot project, giving 27 regions. For those strata underrepresented by the manual picks, a fourth region was chosen, resulting in a total of 30 regions (which implies that three strata received a fourth region). For all strata, a "backup" region was designated for use in the event of unforeseen technical problems.
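The stratified sampling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for ENCODE: the function names, the tie-breaking rule for percentile ranks, the random seed, and the choice of which strata receive a fourth region are all hypothetical, since the source does not specify them.

```python
import random
from collections import defaultdict


def percentile_ranks(scores):
    """Return each score's rank fraction in [0, 1); 0 is the lowest score.

    Assumption: ties are broken by input order.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position / len(scores)
    return ranks


def tier(rank_fraction):
    """Map a rank fraction to bottom 50 percent, middle 30 percent, or top 20 percent."""
    if rank_fraction < 0.5:
        return "bottom"
    if rank_fraction < 0.8:
        return "middle"
    return "top"


def assign_strata(gene_density, conservation):
    """Group window indices by (gene density tier, conservation tier)."""
    density_ranks = percentile_ranks(gene_density)
    conservation_ranks = percentile_ranks(conservation)
    strata = defaultdict(list)
    for window, (d, c) in enumerate(zip(density_ranks, conservation_ranks)):
        strata[(tier(d), tier(c))].append(window)
    return strata


def pick_regions(strata, per_stratum=3, extra=(), seed=0):
    """Randomly pick windows per stratum; strata listed in `extra` get one more.

    `extra` is a hypothetical stand-in for the strata that were
    underrepresented by the manual picks.
    """
    rng = random.Random(seed)
    picks = {}
    for key, windows in strata.items():
        count = per_stratum + (1 if key in extra else 0)
        picks[key] = rng.sample(windows, min(count, len(windows)))
    return picks
```

With two score lists of equal length (one entry per 500kb window), `assign_strata` yields up to nine strata, and `pick_regions` with three entries in `extra` returns 30 windows, provided every stratum holds enough windows to sample from.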

In greater detail, the stratification criteria were as follows:

  • Gene density: The gene density score of a region was the percentage of bases covered either by genes in the Ensembl database, or by human mRNA best BLAT (BLAST-like alignment tool) alignments in the UCSC browser database.

  • Non-exonic conservation: The region was divided into non-overlapping subwindows of 125 bases. Subwindows that showed less than 75 percent base alignment with mouse sequence were discarded. Among the remaining subwindows, the non-exonic conservation score was the percentage that had at least 80 percent base identity to mouse and did not correspond to Ensembl genes, GenBank mRNA BLASTZ alignments, Fgenesh++ gene predictions, TwinScan gene predictions, spliced EST alignments, or repeats.

The above scores were computed within non-overlapping 500kb windows of finished sequence across the genome, and used to assign each window to a stratum.
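As a rough illustration of the two scores, the sketch below computes them for a single window. It is a minimal example under assumptions that the source does not state: annotation intervals are taken to be zero-based and half-open, each 125-base subwindow is assumed to arrive pre-summarized (a 500kb window holds 4,000 of them), and the `Subwindow` type and both function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Subwindow:
    """Hypothetical per-subwindow summary (one per 125 bases)."""
    aligned_fraction: float   # fraction of bases aligned to mouse
    identity_fraction: float  # fraction of bases identical to mouse
    annotated: bool           # overlaps a gene, mRNA/EST alignment, prediction, or repeat


def gene_density_score(region_length, intervals):
    """Percentage of bases covered by the union of annotation intervals.

    Assumption: intervals are (start, end) pairs, zero-based and half-open.
    """
    covered = 0
    last_end = 0
    for start, end in sorted(intervals):
        start = max(start, last_end, 0)  # skip bases already counted
        end = min(end, region_length)    # clip to the region
        if end > start:
            covered += end - start
            last_end = end
    return 100.0 * covered / region_length


def non_exonic_conservation_score(subwindows):
    """Percentage of retained subwindows that are conserved and unannotated."""
    # Discard subwindows with less than 75 percent base alignment to mouse.
    kept = [s for s in subwindows if s.aligned_fraction >= 0.75]
    if not kept:
        return 0.0  # assumption: a window with no usable subwindows scores zero
    conserved = sum(
        1 for s in kept if s.identity_fraction >= 0.80 and not s.annotated
    )
    return 100.0 * conserved / len(kept)
```

Whether identity is measured over all 125 bases or only over the aligned bases is not stated in the text, so the sketch leaves that decision to whatever produces `identity_fraction`.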

ENCODE Regions on the UCSC Genome Browser [genome.ucsc.edu]

Last Reviewed: February 19, 2012

Last updated: February 19, 2012