ENCODE Pilot Project: Target Selection

National Human Genome Research Institute

National Institutes of Health
U.S. Department of Health and Human Services


For the ENCODE pilot project, a defined set of regions of the human genome, totaling about 30Mb or roughly 1 percent of the genome, was selected. These regions serve as the foundation for testing and evaluating the effectiveness and efficiency of a diverse set of methods and technologies for finding functional elements in human DNA.

Before target selection began, it was decided that 50 percent of the 30Mb of sequence would be selected manually and the remainder randomly. The two main criteria for manually selected regions were: 1) the presence of well-studied genes or other known sequence elements, and 2) the existence of a substantial amount of comparative sequence data. A total of 14.82Mb of sequence was selected manually using these criteria, consisting of 14 targets that range in size from 500kb to 2Mb.

The remaining 50 percent of the 30Mb of sequence consists of thirty 500kb regions selected according to a stratified random-sampling strategy based on gene density and level of non-exonic conservation. These criteria were chosen to ensure a good sampling of genomic regions that vary widely in their content of genes and other functional elements. The human genome was divided into three parts (top 20 percent, middle 30 percent, and bottom 50 percent) along each of two axes: 1) gene density and 2) level of non-exonic conservation with respect to the orthologous mouse genomic sequence (see below), for a total of nine strata. From each stratum, three random regions were chosen for the pilot project. For those strata underrepresented by the manual picks, a fourth region was chosen, resulting in a total of 30 regions. For every stratum, a "backup" region was also designated for use in the event of unforeseen technical problems.
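The region counts and sequence totals quoted above can be checked with a few lines of arithmetic. The sketch below uses only figures stated in the text. The number of strata that received a fourth region is not given explicitly, so it is inferred here from the totals, and the variable names are illustrative.

```python
# Figures taken from the text above.
MANUAL_MB = 14.82            # manually selected sequence, in Mb
RANDOM_REGION_MB = 0.5       # each random region is 500kb
TOTAL_RANDOM_REGIONS = 30    # total number of randomly selected regions
STRATA = 3 * 3               # three bins on each of the two axes
BASE_PICKS_PER_STRATUM = 3   # regions initially drawn from every stratum

# Three picks from each of nine strata.
base_regions = STRATA * BASE_PICKS_PER_STRATUM

# Inferred, not stated: how many strata must have received a fourth region.
extra_regions = TOTAL_RANDOM_REGIONS - base_regions

# Sequence covered by the random picks and by the whole target set.
random_mb = TOTAL_RANDOM_REGIONS * RANDOM_REGION_MB
total_mb = MANUAL_MB + random_mb

print(f"base picks: {base_regions}, fourth-region picks: {extra_regions}")
print(f"random: {random_mb:.2f}Mb, manual: {MANUAL_MB:.2f}Mb, total: {total_mb:.2f}Mb")
```

Under these figures the manual and random portions are each close to 15Mb, which is consistent with the 50/50 split described above.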

In greater detail, the stratification criteria were as follows:

The above scores were computed within non-overlapping 500kb windows of finished sequence across the genome, and used to assign each window to a stratum.
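The window-to-stratum assignment and the per-stratum random draw can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used by the project. The window scores are synthetic random numbers, the 20/30/50 cutoffs are applied by rank, and the names `bin_labels`, `stratify`, and `sample_regions` are hypothetical. How the real gene-density and conservation scores were computed is not reproduced here.

```python
import random
from collections import defaultdict


def bin_labels(scores):
    """Label each score by rank: 'top' (highest 20%), 'middle' (next 30%), 'bottom' (lowest 50%)."""
    n = len(scores)
    # Indices ordered from highest score to lowest.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    labels = [None] * n
    for rank, idx in enumerate(order):
        # Integer comparisons avoid floating-point edge cases at the cutoffs.
        if rank * 100 < 20 * n:
            labels[idx] = "top"
        elif rank * 100 < 50 * n:
            labels[idx] = "middle"
        else:
            labels[idx] = "bottom"
    return labels


def stratify(windows):
    """Group window ids into strata keyed by (gene-density bin, conservation bin)."""
    gene_bins = bin_labels([w["gene_density"] for w in windows])
    cons_bins = bin_labels([w["conservation"] for w in windows])
    strata = defaultdict(list)
    for w, g, c in zip(windows, gene_bins, cons_bins):
        strata[(g, c)].append(w["id"])
    return strata


def sample_regions(strata, per_stratum=3, seed=0):
    """Draw up to `per_stratum` window ids at random from every stratum."""
    rng = random.Random(seed)
    return {
        key: rng.sample(ids, min(per_stratum, len(ids)))
        for key, ids in sorted(strata.items())
    }


# Synthetic stand-in for per-window scores (assumption: real scores are not given here).
demo_rng = random.Random(1)
windows = [
    {"id": i, "gene_density": demo_rng.random(), "conservation": demo_rng.random()}
    for i in range(1000)
]
strata = stratify(windows)
picks = sample_regions(strata)
for key, ids in picks.items():
    print(key, ids)
```

Binning each axis independently gives up to nine strata of unequal size, so a fixed number of picks per stratum deliberately oversamples the rarer combinations, such as windows that rank in the top 20 percent on both axes.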

ENCODE Regions on the UCSC Genome Browser [genome.ucsc.edu]

Last Reviewed: February 19, 2012