ENCODE Pilot Project: Coordination with HapMap

The International HapMap Project has decided to focus on 10 of the ENCODE random regions for comprehensive genotyping as part of an in-depth study of human genetic variation. The regions were chosen to represent a range of conservation with the mouse genome and of gene density according to the strata identified during the ENCODE target selection process.

The 10 HapMap-ENCODE regions were resequenced in 48 unrelated individuals (16 Yoruba, 16 CEPH, 8 Han Chinese, and 8 Japanese) using a PCR-based method. In total, 30,000 single nucleotide polymorphisms (SNPs) were identified in these regions. Some were already represented in dbSNP, a SNP database managed by the National Center for Biotechnology Information (NCBI), while others were discovered during the resequencing. The newly discovered SNPs were added to dbSNP, and the sequence data from the 48 individuals are stored in NCBI's Trace Archive.

Of the 30,000 SNPs identified in the HapMap-ENCODE regions, 10,000 were not analyzed because of failed assay design or failed genotyping. Genotype data for the remaining 20,000 SNPs were obtained from all 270 samples used for the HapMap Project (90 CEPH, 90 Yoruba, 45 Han Chinese, and 45 Japanese). The genotyping was done at the Broad Institute of Harvard and MIT, Illumina, Baylor College of Medicine, McGill University & Genome Quebec Innovation Centre, and the University of California, San Francisco.

The HapMap-ENCODE genotyping data set is considered a "gold standard" because of its high density of SNP coverage. The genotype data from these regions will be used to determine the best way to choose tag SNPs and to assess how adequate the full HapMap is for analyses such as coverage, linkage disequilibrium (LD) measures, and haplotype inference.
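
To illustrate what is meant by LD measures, the following Python sketch computes three commonly used pairwise statistics (D, |D'|, and r-squared) for two biallelic SNPs. It is only an illustration: the function name and its inputs are hypothetical, it is not part of any HapMap or ENCODE software, and it assumes that phased haplotype frequencies are already available.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, |D'|, r_squared) for two biallelic SNPs.

    p_ab: frequency of haplotypes carrying allele A at SNP 1 and allele B at SNP 2
    p_a:  frequency of allele A at SNP 1
    p_b:  frequency of allele B at SNP 2
    (Hypothetical helper; inputs are assumed to come from phased haplotypes.)
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both SNPs must be polymorphic (0 < frequency < 1)")

    # D: difference between observed and expected haplotype frequency
    d = p_ab - p_a * p_b

    # Largest magnitude D can reach given the allele frequencies
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))

    d_prime = abs(d) / d_max
    r_squared = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Made-up frequencies, for illustration only
    print(ld_measures(p_ab=0.2, p_a=0.2, p_b=0.5))
```

In tag SNP selection, a pair of SNPs with r-squared close to 1 is largely redundant, so genotyping one of them captures most of the information about the other. The threshold used to decide this is a design choice and is not specified here.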

Last Reviewed: February 19, 2012