A Catalog of Genome-Wide Association Studies

Full Description of Methods

PubMed is searched weekly using the terms "genome-wide" OR "genome AND identification" OR "genome AND association", with results limited to publications from the current year and to studies in humans.
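As an illustration only, the weekly search could be assembled into a single PubMed query string along the following lines. The function name `build_query` is hypothetical, and expressing the two limits with the PubMed field tags `[dp]` (publication date) and `[mh]` (MeSH heading) is an assumption; the description above does not say how the limits are applied.

```python
from datetime import date
from typing import Optional

# The three search terms quoted in the methods text.
SEARCH_TERMS = [
    '"genome-wide"',
    "(genome AND identification)",
    "(genome AND association)",
]


def build_query(year: Optional[int] = None) -> str:
    """Join the search terms with OR, then add year and human limits.

    Hypothetical helper: the [dp] and [mh] field tags are one possible way
    to express the limits, not necessarily the one used for the catalog.
    """
    if year is None:
        year = date.today().year  # "current year" limit
    terms = " OR ".join(SEARCH_TERMS)
    return f"({terms}) AND {year}[dp] AND humans[mh]"


if __name__ == "__main__":
    print(build_query(2010))
```

The resulting string can be pasted into the PubMed search box; the sketch only builds the string and does not run the search.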

Studies and associations are eligible for inclusion in the NHGRI GWAS catalog if they meet the following criteria:
  1. Inclusion of at least 100,000 SNPs in the initial stage, before quality control filters are applied.
  2. Statistical significance (SNP-trait p-value < 1.0 x 10^-5) in the overall (initial GWAS + replication) population.

      a. If a study does not report a combined p-value, the p-value and effect size from the largest sample size will be reported as long as the initial and replication samples each show an association of p < 1.0 x 10^-5.

      b. If a study does not include a replication stage, significant SNPs from the discovery stage will be reported.

      c. SNP-trait associations that are described as previously known at the time of publication and are statistically significant in the GWAS sample, but are not attempted for replication, are reported.
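The inclusion criteria above can be read as a small decision procedure. The following is a minimal sketch under stated assumptions: the `StageResult` class and both function names are hypothetical, a strict "less than" comparison against 1.0 x 10^-5 is assumed, and ties in sample size are resolved in favor of the initial stage because the text does not address them. Criterion 2c (previously known associations not attempted for replication) is treated the same way as a study without a replication stage.

```python
from dataclasses import dataclass
from typing import Optional

P_THRESHOLD = 1.0e-5  # significance cutoff from criterion 2
MIN_SNPS = 100_000    # minimum SNPs in the initial stage, from criterion 1


@dataclass
class StageResult:
    """Hypothetical container for one stage's result for a SNP-trait pair."""
    p_value: float
    sample_size: int


def has_enough_snps(n_snps_before_qc: int) -> bool:
    """Criterion 1: at least 100,000 SNPs before quality control filters."""
    return n_snps_before_qc >= MIN_SNPS


def reported_p_value(
    initial: StageResult,
    replication: Optional[StageResult] = None,
    combined_p: Optional[float] = None,
) -> Optional[float]:
    """Return the p-value to report, or None if the association is not eligible."""
    # Criterion 2: a combined (initial + replication) p-value takes precedence.
    if combined_p is not None:
        return combined_p if combined_p < P_THRESHOLD else None
    # Criteria 2b and 2c: no replication attempted, so use the discovery stage.
    if replication is None:
        return initial.p_value if initial.p_value < P_THRESHOLD else None
    # Criterion 2a: both stages must pass; report the larger sample's p-value.
    if initial.p_value < P_THRESHOLD and replication.p_value < P_THRESHOLD:
        # max() returns the first argument on ties, i.e. the initial stage
        # (an assumption; the text does not cover ties).
        larger = max(initial, replication, key=lambda s: s.sample_size)
        return larger.p_value
    return None
```

For example, with an initial stage of 2,000 samples (p = 3 x 10^-6) and a replication stage of 5,000 samples (p = 4 x 10^-7) and no combined p-value, the sketch reports the replication p-value, since that stage has the larger sample and both stages pass the threshold.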

Studies and associations are excluded if:
  1. The study was published in a language other than English.
  2. SNPs assayed were limited to those in candidate genes.
  3. Samples were assayed to measure somatic variation (e.g., in tumor samples).
  4. The study does not include any new GWAS data.

Information on the following study-level fields is extracted: author (last name of first author); study date (online publication date, if available); PubMed URL; publication title; disease/trait information; initial sample size (summing across multiple Stage 1 populations, if applicable); replication sample size (summing across multiple populations, if applicable); platform (manufacturer); number of SNPs passing quality control metrics (using "up to [maximum number of SNPs]" if multiple platforms are used without imputation, the total number of imputed SNPs, or "pooled" to denote studies of pooled DNA, as applicable); whether the study examined copy number variants (initially excluded; additional studies to be added).

For each identified SNP, we extract: chromosomal region (from UCSC Genome Browser); gene (as reported); rs number and risk allele (as reported); risk allele frequency in controls (if not available among all controls, among the control group with the largest sample size); p-value and any relevant text (e.g., subgroups where applicable); OR (or % variance explained, SD increment, or unit difference for quantitative traits), 95% CI and any relevant text (e.g., subgroups). If the p-value, OR, and 95% CI fields are not available for the combined population, we extract estimates from the population group with the largest sample size.

In extracting information, we follow these additional guidelines:
  1. Missing or not applicable fields are denoted as follows: ?, allele not reported; NS, not significant (no associations at p < 1.0 x 10^-5 identified); NR, not reported.
  2. Where multiple genetic models are available, effect sizes (ORs or beta-coefficients) are prioritized as follows: 1) genotypic model, per-allele estimate; 2) genotypic model, heterozygote estimate; 3) allelic model, allelic estimate.
  3. Focusing on risk alleles, we invert ORs < 1 and their associated confidence intervals, and report the opposite allele if available.
  4. If 95% CIs are not published, we estimate them using standard errors where available.
  5. If more than one SNP within a gene meets the above criteria, we report one SNP unless there is evidence for an independent association.
  6. Associations attributed to a combination of one or more genetic variants are denoted as such in the rs number column (e.g., "rs1015362-G + rs4911414-T," "3-SNP haplotype 1"). If available, rs numbers for SNPs comprising the haplotype are indexed so that they are searchable using the SNP search features.
  7. Genes attributed to a SNP are extracted verbatim from the published report. "Intergenic" denotes a location that was not attributed to a particular gene (if it appeared that gene information was sought); "NR" (not reported) denotes an absence of reporting on location information.
  8. The term "pending" is used to identify an eligible GWAS for which SNP information has not yet been extracted; studies of CNVs, which are known to be incompletely ascertained, are also noted as pending.
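The guidelines above on inverting ORs and estimating missing confidence intervals can be sketched as follows. Both function names are hypothetical. The interval estimate assumes a Wald-type interval on the log-odds scale (log OR plus or minus 1.96 standard errors, with the standard error given for the log OR); the text says only that standard errors are used and does not state the formula.

```python
import math
from typing import Tuple


def invert_odds_ratio(
    odds_ratio: float, ci_low: float, ci_high: float
) -> Tuple[float, float, float]:
    """Re-express an OR below 1 in terms of the opposite (risk) allele.

    Taking reciprocals swaps the interval bounds: the new lower bound is
    1 / old upper bound, and the new upper bound is 1 / old lower bound.
    """
    if odds_ratio >= 1.0:
        return odds_ratio, ci_low, ci_high
    return 1.0 / odds_ratio, 1.0 / ci_high, 1.0 / ci_low


def ci_from_standard_error(
    log_odds_ratio: float, standard_error: float, z: float = 1.96
) -> Tuple[float, float]:
    """Approximate 95% CI for an OR from the standard error of log(OR).

    Assumption: a Wald interval on the log scale, exponentiated back.
    """
    low = math.exp(log_odds_ratio - z * standard_error)
    high = math.exp(log_odds_ratio + z * standard_error)
    return low, high


if __name__ == "__main__":
    # OR 0.80 (95% CI 0.70-0.90) for the reported allele becomes roughly
    # OR 1.25 (95% CI 1.11-1.43) for the opposite allele.
    print(invert_odds_ratio(0.80, 0.70, 0.90))
    print(ci_from_standard_error(math.log(1.25), 0.06))
```

When an OR is inverted in this way, the allele reported alongside it also has to be switched to the opposite allele, as the guideline states; the sketch handles only the numeric part.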

Last updated: November 29, 2010