Principal components analysis corrects for stratification in genome-wide association studies

Abstract

Population stratification—allele frequency differences between cases and controls due to systematic ancestry differences—can cause spurious associations in disease studies. We describe a method that enables explicit detection and correction of population stratification on a genome-wide scale. Our method uses principal components analysis to explicitly model ancestry differences between cases and controls. The resulting correction is specific to a candidate marker's variation in frequency across ancestral populations, minimizing spurious associations while maximizing power to detect true associations. Our simple, efficient approach can easily be applied to disease studies with hundreds of thousands of markers.
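The abstract does not spell out the computation, so the following is only a minimal sketch of the general idea, under stated assumptions: ancestry is summarized by a single top axis of variation (found here by power iteration rather than a full eigendecomposition), genotypes and phenotypes are each adjusted by regressing on that axis, and association is scored as (N − K − 1) times the squared correlation of the adjusted values, with K = 1. The function names, the two-population simulation, and the omission of per-marker normalization are all illustrative choices, not the authors' implementation.

```python
import math
import random

def center(values):
    """Subtract the mean so that a fit on the axis needs no intercept."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def top_axis(snp_rows, iters=50, seed=0):
    """Leading axis of variation across individuals, by power iteration.

    snp_rows: M lists, one per marker, each holding N genotypes coded 0/1/2.
    Simplification (assumed): markers are centered but not variance-normalized.
    """
    rows = [center(row) for row in snp_rows]
    rng = random.Random(seed)
    axis = [rng.random() for _ in rows[0]]
    for _ in range(iters):
        # Apply X^T X to the axis without forming the N x N covariance matrix.
        loadings = [sum(x * a for x, a in zip(row, axis)) for row in rows]
        axis = [sum(w * row[j] for w, row in zip(loadings, rows))
                for j in range(len(axis))]
        norm = math.sqrt(sum(a * a for a in axis))
        axis = [a / norm for a in axis]
    return axis

def adjust(values, axis):
    """Residual of `values` after regressing on a unit-length, mean-zero axis."""
    centered = center(values)
    gamma = sum(v * a for v, a in zip(centered, axis))
    return [v - gamma * a for v, a in zip(centered, axis)]

def squared_correlation(x, y):
    """Squared correlation of two mean-zero vectors."""
    sxy = sum(a * b for a, b in zip(x, y))
    return sxy * sxy / (sum(a * a for a in x) * sum(b * b for b in y))

# Hypothetical simulation: two populations with different allele frequencies.
rng = random.Random(1)
n_per_pop = 100
n = 2 * n_per_pop

def genotype(freq):
    # Two independent allele draws give a genotype of 0, 1 or 2.
    return (rng.random() < freq) + (rng.random() < freq)

def snp_row(freq_a, freq_b):
    return ([genotype(freq_a) for _ in range(n_per_pop)] +
            [genotype(freq_b) for _ in range(n_per_pop)])

markers = [snp_row(rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9))
           for _ in range(100)]

# Candidate marker differs strongly in frequency but has no effect on disease.
candidate = snp_row(0.9, 0.1)
# Cases (1) are over-represented in the first population: 75% versus 25%.
status = [1] * 75 + [0] * 25 + [1] * 25 + [0] * 75

axis = top_axis(markers)
uncorrected = n * squared_correlation(center(candidate), center(status))
corrected = (n - 2) * squared_correlation(adjust(candidate, axis),
                                          adjust(status, axis))
print(f"uncorrected statistic: {uncorrected:.2f}")
print(f"corrected statistic:   {corrected:.2f}")
```

In this simulated setting the candidate marker is associated with disease only through ancestry, so the adjusted statistic should be much smaller than the unadjusted one. Because the adjustment is computed per marker, a marker with little frequency variation across populations would lose little signal, which is the marker-specific property the abstract highlights.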

Figure 1: The EIGENSTRAT algorithm, illustrated on simulated data.
Figure 2: The top two axes of variation of European American samples.

References

  1. Lander, E.S. & Schork, N.J. Genetic dissection of complex traits. Science 265, 2037–2048 (1994).

  2. Lohmueller, K. et al. Meta-analysis of genetic association studies supports a contribution of common variants to susceptibility to common disease. Nat. Genet. 33, 177–182 (2003).

  3. Freedman, M. et al. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36, 388–393 (2004).

  4. Marchini, J. et al. The effects of human population structure on large genetic association studies. Nat. Genet. 36, 512–517 (2004).

  5. Helgason, A. et al. An Icelandic example of the impact of population structure on association studies. Nat. Genet. 37, 90–95 (2005).

  6. Campbell, C.D. et al. Demonstrating stratification in a European American population. Nat. Genet. 37, 868–872 (2005).

  7. Hirschhorn, J.N. & Daly, M.J. Genome-wide association studies for common diseases and complex traits. Nat. Rev. Genet. 6, 95–108 (2005).

  8. Thomas, D.C. et al. Recent developments in genomewide association scans: a workshop summary and review. Am. J. Hum. Genet. 77, 337–345 (2005).

  9. Reich, D. & Goldstein, D. Detecting association in a case-control study while allowing for population stratification. Genet. Epidemiol. 20, 4–16 (2001).

  10. Devlin, B. & Roeder, K. Genomic control for association studies. Biometrics 55, 997–1004 (1999).

  11. Devlin, B. et al. Genomic control to the extreme. Nat. Genet. 36, 1129–1130 (2004).

  12. Pritchard, J.K. et al. Association mapping in structured populations. Am. J. Hum. Genet. 67, 170–181 (2000).

  13. Satten, G. et al. Accounting for unmeasured population substructure in case-control studies of genetic association using a novel latent-class model. Am. J. Hum. Genet. 68, 466–477 (2001).

  14. Setakis, E., Stirnadel, H. & Balding, D.J. Logistic regression protects against population structure in genetic association studies. Genome Res. 16, 290–296 (2006).

  15. Pritchard, J.K. et al. Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).

  16. Serre, D. & Paabo, S. Evidence for gradients of human genetic diversity within and among continents. Genome Res. 14, 1679–1685 (2004).

  17. Jackson, J.E. A User's Guide to Principal Components (John Wiley & Sons, New York, 2003).

  18. Menozzi, P., Piazza, A. & Cavalli-Sforza, L. Synthetic maps of human gene frequencies in Europeans. Science 201, 786–792 (1978).

  19. Cavalli-Sforza, L.L., Menozzi, P. & Piazza, A. Demic expansions and human evolution. Science 259, 639–646 (1993).

  20. Johnstone, I. On the distribution of the largest eigenvalue in principal components analysis. Ann. Stat. 29, 295–327 (2001).

  21. Soshnikov, A. A note on universality of the distribution of the largest eigenvalues in certain sample covariance matrices. J. Stat. Phys. 108, 1033–1056 (2002).

  22. Baik, J., Ben Arous, G. & Peche, S. Phase transition of the largest eigenvalue for non-null complex sample covariance matrices. Ann. Probab. 33, 1643–1697 (2005).

  23. Rosenberg, N.A. et al. Clines, clusters, and the effect of study design on the inference of human population structure. PLoS Genetics 1, 660–671 (2005).

  24. Pritchard, J.K. & Donnelly, P. Case-control studies of association in structured or admixed populations. Theor. Popul. Biol. 60, 227–237 (2001).

  25. Balding, D.J. & Nichols, R.A. A method for quantifying differentiation between populations at multi-allelic loci and its implications for investigating identity and paternity. Genetica 96, 3–12 (1995).

  26. Cavalli-Sforza, L.L., Menozzi, P. & Piazza, A. The History and Geography of Human Genes (Princeton Univ. Press, Princeton, New Jersey, 1994).

  27. Nicholson, G. et al. Assessing population differentiation and isolation from single-nucleotide polymorphism data. J. R. Statist. Soc. (B) 64, 695–715 (2002).

  28. Bersaglieri, T. et al. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74, 1111–1120 (2004).

  29. Armitage, P. Tests for linear trends in proportions and frequencies. Biometrics 11, 375–386 (1955).

  30. Enattah, N.S. et al. Identification of a variant associated with adult-type hypolactasia. Nat. Genet. 30, 233–237 (2002).

  31. The International HapMap Consortium. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).

  32. Cimmino, M.A. et al. Prevalence of rheumatoid arthritis in Italy: the Chiavari study. Ann. Rheum. Dis. 57, 315–318 (1998).

  33. Rosati, G. The prevalence of multiple sclerosis in the world: an update. Neurol. Sci. 22, 117–139 (2001).

  34. Panza, F. et al. Shifts in angiotensin I converting enzyme insertion allele frequency across Europe: implications for Alzheimer's disease risk. J. Neurol. Neurosurg. Psychiatry 74, 1159–1161 (2003).

  35. Bernardi, F. et al. Contribution of factor VII genotype to activated FVII levels. Differences in genotype frequencies between northern and southern European populations. Arterioscler. Thromb. Vasc. Biol. 17, 2548–2553 (1997).

  36. Angastiniotis, M. & Modell, B. Global epidemiology of hemoglobin disorders. Ann. NY Acad. Sci. 850, 251–269 (1998).

  37. Clayton, D.G. et al. Population structure, differential bias and genomic control in a large-scale, case-control association study. Nat. Genet. 37, 1243–1246 (2005).

  38. Wright, S. The genetical structure of populations. Ann. Eugen. 15, 323–354 (1951).

  39. Benito-Garcia, E. et al. Dietary caffeine does not affect methotrexate efficacy in rheumatoid arthritis patients. J. Rheumatol. (in the press).

Acknowledgements

The authors are grateful to B. Blumenstiel, M. DeFelice, M. Parkin, R. Barry, W. Winslow, C. Healy and S. Gabriel for generation of the Affymetrix genotype data. We are grateful to the BRASS study participants, the BRASS study team, and our rheumatology colleagues at the Brigham and Women's Hospital Arthritis Center. We thank C. Campbell and J. Hirschhorn for helpful comments and sharing data from their paper (ref. 6). The BRASS study was supported by a grant from Millennium Pharmaceuticals. D.R. is supported in part by a Burroughs Wellcome Career Development Award in the Biomedical Sciences.

Author information

Corresponding author

Correspondence to Alkes L Price.

Ethics declarations

Competing interests

M.E.W. serves as a consultant to Millennium Pharmaceuticals; the BRASS study, which produced a data set described in the paper, was supported by a grant from Millennium Pharmaceuticals.

Supplementary information

Supplementary Fig. 1

P-P plot of EIGENSTRAT test statistics. (PDF 429 kb)

Supplementary Table 1

Simulations using K axes of variation. (PDF 58 kb)

Supplementary Table 2

Simulations using M SNPs. (PDF 66 kb)

Supplementary Table 3

Simulations of Pritchard and Donnelly. (PDF 68 kb)

Supplementary Table 4

Simulations with no stratification and n subpopulations. (PDF 73 kb)

Supplementary Table 5

Stratification correction at rs10511418 using M SNPs. (PDF 73 kb)

Supplementary Note (PDF 207 kb)

About this article

Cite this article

Price, A., Patterson, N., Plenge, R. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat Genet 38, 904–909 (2006). https://doi.org/10.1038/ng1847

  • DOI: https://doi.org/10.1038/ng1847
