Tracing Sub-Structure in the European American Population with PCA-Informative Markers

Author

Listed:

Peristera Paschou
Petros Drineas
Jamey Lewis
Caroline M Nievergelt
Deborah A Nickerson
Joshua D Smith
Paul M Ridker
Daniel I Chasman
Ronald M Krauss
Elad Ziv

Abstract

Genetic structure in the European American population reflects waves of migration and recent gene flow among different populations. This complex structure can introduce bias in genetic association studies. Using Principal Components Analysis (PCA), we analyze the structure of two independent European American datasets (1,521 individuals and 307,315 autosomal SNPs). Individual variation lies along a continuum, with some individuals showing high degrees of admixture with non-European populations, as demonstrated through joint analysis with HapMap data. The CEPH Europeans represent only a small fraction of the variation encountered in the larger European American datasets we studied. We interpret the first eigenvector of these data as correlated with ancestry, and we apply an algorithm that we have previously described to select PCA-informative markers (PCAIMs) that can reproduce this structure. Importantly, we develop a novel method that removes redundancy from the selected SNP panels and show that it effectively removes correlated markers, thus increasing genotyping savings. Only 150–200 PCAIMs suffice to accurately predict fine structure in European American datasets, as identified by PCA. Simulating association studies, we couple our method with a PCA-based stratification correction tool and demonstrate that a small number of PCAIMs can efficiently remove false correlations with almost no loss in power. The structure-informative SNPs that we propose are an important resource for genetic association studies of European Americans. Furthermore, our redundancy removal algorithm can be applied to sets of ancestry-informative markers selected with any method in order to select the most uncorrelated SNPs, significantly decreasing genotyping costs.

Author Summary: Genetic association studies seek to identify disease susceptibility genes through the analysis of genetic markers such as single nucleotide polymorphisms (SNPs) in large numbers of cases and controls. In such settings, the existence of sub-structure in the population under study (i.e., differences in ancestry among cases and controls) may lead to spurious results. It is therefore imperative to control for this possible bias. Such biases may arise, for example, when studying the European American population, which consists of individuals with diverse proportions of ancestry from different European countries and, to some degree, from African and Native American populations. Here, we study the genetic sub-structure of the European American population, analyzing 1,521 individuals for over 300,000 SNPs across the entire genome. Applying a powerful method based on dimensionality reduction (Principal Components Analysis), we are able to identify 200 SNPs that successfully represent the complete dataset. Importantly, we introduce a novel method that effectively removes redundancy from any set of genetic markers and may prove extremely useful in a variety of research scenarios for significantly reducing the cost of a study.
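
The two steps described in the abstract, scoring SNPs by how much they contribute to the leading principal components and then pruning correlated markers, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring rule or the redundancy removal procedure, so the sketch assumes (a) a score equal to the sum of squared loadings on the top `k` principal components and (b) a simple greedy correlation filter as a stand-in for the redundancy step. The function names, the `max_abs_r` threshold, and the simulated genotypes are all hypothetical.

```python
import numpy as np

def pca_marker_scores(genotypes, k=1):
    """Score each SNP by its squared loadings on the top-k principal axes.

    Assumption: this leverage-style score is an illustrative choice,
    not a rule stated in the abstract.
    """
    # genotypes: (individuals x SNPs) matrix of allele counts 0/1/2
    centered = genotypes - genotypes.mean(axis=0)
    # rows of vt are orthonormal principal axes in SNP space
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (vt[:k] ** 2).sum(axis=0)

def select_uncorrelated(genotypes, scores, n_markers, max_abs_r=0.8):
    """Greedy stand-in for redundancy removal (hypothetical, not the paper's).

    Walks SNPs from highest to lowest score and keeps a SNP only if its
    absolute correlation with every SNP already kept is below max_abs_r.
    Assumes every SNP is polymorphic, so correlations are well defined.
    """
    order = np.argsort(scores)[::-1]
    corr = np.corrcoef(genotypes, rowvar=False)
    kept = []
    for j in order:
        if all(abs(corr[j, i]) < max_abs_r for i in kept):
            kept.append(int(j))
        if len(kept) == n_markers:
            break
    return kept

# Simulated toy data (an assumption, not data from the study):
# two groups that differ in allele frequency at the first 20 SNPs.
rng = np.random.default_rng(0)
n_per_group, n_informative, n_neutral = 100, 20, 40
freq_a = np.r_[np.full(n_informative, 0.2), np.full(n_neutral, 0.5)]
freq_b = np.r_[np.full(n_informative, 0.8), np.full(n_neutral, 0.5)]
group_a = rng.binomial(2, freq_a, size=(n_per_group, freq_a.size))
group_b = rng.binomial(2, freq_b, size=(n_per_group, freq_b.size))
genotypes = np.vstack([group_a, group_b])

# Append an exact copy of SNP 0 to mimic a perfectly redundant marker.
genotypes = np.column_stack([genotypes, genotypes[:, 0]])
dup = genotypes.shape[1] - 1

scores = pca_marker_scores(genotypes, k=1)
panel = select_uncorrelated(genotypes, scores, n_markers=10)
print("selected SNP indices:", panel)
```

Because the duplicated column is perfectly correlated with SNP 0, the filter can keep at most one of the two, which is the kind of genotyping saving the abstract attributes to redundancy removal. On real data the number of components `k` and the correlation threshold would need to be chosen with care.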

Suggested Citation

Peristera Paschou & Petros Drineas & Jamey Lewis & Caroline M Nievergelt & Deborah A Nickerson & Joshua D Smith & Paul M Ridker & Daniel I Chasman & Ronald M Krauss & Elad Ziv, 2008. "Tracing Sub-Structure in the European American Population with PCA-Informative Markers," PLOS Genetics, Public Library of Science, vol. 4(7), pages 1-13, July.

Handle: RePEc:plo:pgen00:1000114
DOI: 10.1371/journal.pgen.1000114

References listed on IDEAS

Noah A Rosenberg & Saurabh Mahajan & Sohini Ramachandran & Chengfeng Zhao & Jonathan K Pritchard & Marcus W Feldman, 2005. "Clines, Clusters, and the Effect of Study Design on the Inference of Human Population Structure," PLOS Genetics, Public Library of Science, vol. 1(6), pages 1-12, December.
Carter,Susan B. & Gartner,Scott Sigmund & Haines,Michael R. & Olmstead,Alan L. & Sutch,Richard & Wri (ed.), 2006. "The Historical Statistics of the United States 5 Volume Hardback Set," Cambridge Books, Cambridge University Press, number 9780521817912, January.
Robert Sladek & Ghislain Rocheleau & Johan Rung & Christian Dina & Lishuang Shen & David Serre & Philippe Boutin & Daniel Vincent & Alexandre Belisle & Samy Hadjadj & Beverley Balkau & Barbara Heude &, 2007. "A genome-wide association study identifies novel risk loci for type 2 diabetes," Nature, Nature, vol. 445(7130), pages 881-885, February.
B. Devlin & Kathryn Roeder, 1999. "Genomic Control for Association Studies," Biometrics, The International Biometric Society, vol. 55(4), pages 997-1004, December.
Chao Tian & Robert M Plenge & Michael Ransom & Annette Lee & Pablo Villoslada & Carlo Selmi & Lars Klareskog & Ann E Pulver & Lihong Qi & Peter K Gregersen & Michael F Seldin, 2008. "Analysis and Application of European Genetic Substructure Using 300 K SNP Information," PLOS Genetics, Public Library of Science, vol. 4(1), pages 1-11, January.
Nick Patterson & Alkes L Price & David Reich, 2006. "Population Structure and Eigenanalysis," PLOS Genetics, Public Library of Science, vol. 2(12), pages 1-20, December.
Hakon Hakonarson & Struan F. A. Grant & Jonathan P. Bradfield & Luc Marchand & Cecilia E. Kim & Joseph T. Glessner & Rosemarie Grabs & Tracy Casalunovo & Shayne P. Taback & Edward C. Frackelton & Marg, 2007. "A genome-wide association study identifies KIAA0350 as a type 1 diabetes gene," Nature, Nature, vol. 448(7153), pages 591-594, August.
Peristera Paschou & Elad Ziv & Esteban G Burchard & Shweta Choudhry & William Rodriguez-Cintron & Michael W Mahoney & Petros Drineas, 2007. "PCA-Correlated SNPs for Structure Identification in Worldwide Human Populations," PLOS Genetics, Public Library of Science, vol. 3(9), pages 1-15, September.

Citations

Cited by:

Jason Sawler & Bruce Reisch & Mallikarjuna K Aradhya & Bernard Prins & Gan-Yuan Zhong & Heidi Schwaninger & Charles Simon & Edward Buckler & Sean Myles, 2013. "Genomics Assisted Ancestry Deconvolution in Grape," PLOS ONE, Public Library of Science, vol. 8(11), pages 1-1, November.
Marie-Claude Babron & Marie de Tayrac & Douglas N Rutledge & Eleftheria Zeggini & Emmanuelle Génin, 2012. "Rare and Low Frequency Variant Stratification in the UK Population: Description and Impact on Association Tests," PLOS ONE, Public Library of Science, vol. 7(10), pages 1-9, October.
Paola Raska & Edwin Iversen & Ann Chen & Zhihua Chen & Brooke L Fridley & Jennifer Permuth-Wey & Ya-Yu Tsai & Robert A Vierkant & Ellen L Goode & Harvey Risch & Joellen M Schildkraut & Thomas A Seller, 2012. "European American Stratification in Ovarian Cancer Case Control Data: The Utility of Genome-Wide Data for Inferring Ancestry," PLOS ONE, Public Library of Science, vol. 7(5), pages 1-9, May.
Lourenço, V.M. & Pires, A.M., 2014. "M-regression, false discovery rates and outlier detection with application to genetic association studies," Computational Statistics & Data Analysis, Elsevier, vol. 78(C), pages 33-42.
Jamey Lewis & Zafiris Abas & Christos Dadousis & Dimitrios Lykidis & Peristera Paschou & Petros Drineas, 2011. "Tracing Cattle Breeds with Principal Components Analysis Ancestry Informative SNPs," PLOS ONE, Public Library of Science, vol. 6(4), pages 1-8, April.
Petros Drineas & Jamey Lewis & Peristera Paschou, 2010. "Inferring Geographic Coordinates of Origin for Europeans Using Small Panels of Ancestry Informative Markers," PLOS ONE, Public Library of Science, vol. 5(8), pages 1-6, August.
Jun Zhang, 2010. "Ancestral Informative Marker Selection and Population Structure Visualization Using Sparse Laplacian Eigenfunctions," PLOS ONE, Public Library of Science, vol. 5(11), pages 1-12, November.

Most related items

These are the items that most often cite the same works as this one and are cited by the same works as this one.

Eric R Londin & Margaret A Keller & Cathleen Maista & Gretchen Smith & Laura A Mamounas & Ran Zhang & Steven J Madore & Katrina Gwinn & Roderick A Corriveau, 2010. "CoAIMs: A Cost-Effective Panel of Ancestry Informative Markers for Determining Continental Origins," PLOS ONE, Public Library of Science, vol. 5(10), pages 1-12, October.
Kai Yu & Zhaoming Wang & Qizhai Li & Sholom Wacholder & David J Hunter & Robert N Hoover & Stephen Chanock & Gilles Thomas, 2008. "Population Substructure and Control Selection in Genome-Wide Association Studies," PLOS ONE, Public Library of Science, vol. 3(7), pages 1-14, July.
Marie-Claude Babron & Marie de Tayrac & Douglas N Rutledge & Eleftheria Zeggini & Emmanuelle Génin, 2012. "Rare and Low Frequency Variant Stratification in the UK Population: Description and Impact on Association Tests," PLOS ONE, Public Library of Science, vol. 7(10), pages 1-9, October.
Andrey V Khrunin & Denis V Khokhrin & Irina N Filippova & Tõnu Esko & Mari Nelis & Natalia A Bebyakova & Natalia L Bolotova & Janis Klovins & Liene Nikitina-Zake & Karola Rehnström & Samuli Ripatti & , 2013. "A Genome-Wide Analysis of Populations from European Russia Reveals a New Pole of Genetic Diversity in Northern Europe," PLOS ONE, Public Library of Science, vol. 8(3), pages 1-9, March.
Ilja M Nolte & Chris Wallace & Stephen J Newhouse & Daryl Waggott & Jingyuan Fu & Nicole Soranzo & Rhian Gwilliam & Panos Deloukas & Irina Savelieva & Dongling Zheng & Chrysoula Dalageorgou & Martin F, 2009. "Common Genetic Variation Near the Phospholamban Gene Is Associated with Cardiac Repolarisation: Meta-Analysis of Three Genome-Wide Association Studies," PLOS ONE, Public Library of Science, vol. 4(7), pages 1-10, July.
Hoicheong Siu & Li Jin & Momiao Xiong, 2012. "Manifold Learning for Human Population Structure Studies," PLOS ONE, Public Library of Science, vol. 7(1), pages 1-18, January.
Nick Patterson & Alkes L Price & David Reich, 2006. "Population Structure and Eigenanalysis," PLOS Genetics, Public Library of Science, vol. 2(12), pages 1-20, December.
Zhao Huaqing & Rebbeck Timothy R. & Mitra Nandita, 2012. "Analyzing Genetic Association Studies with an Extended Propensity Score Approach," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 11(5), pages 1-24, October.
Diana Chang & Alon Keinan, 2014. "Principal Component Analysis Characterizes Shared Pathogenetics from Genome-Wide Association Studies," PLOS Computational Biology, Public Library of Science, vol. 10(9), pages 1-14, September.
Ronald J Nowling & Krystal R Manke & Scott J Emrich, 2020. "Detecting inversions with PCA in the presence of population structure," PLOS ONE, Public Library of Science, vol. 15(10), pages 1-20, October.
Jianzhong Ma & Christopher I Amos, 2012. "Investigation of Inversion Polymorphisms in the Human Genome Using Principal Components Analysis," PLOS ONE, Public Library of Science, vol. 7(7), pages 1-12, July.
Patrick A Reeves & Christopher M Richards, 2009. "Accurate Inference of Subtle Population Structure (and Other Genetic Discontinuities) Using Principal Coordinates," PLOS ONE, Public Library of Science, vol. 4(1), pages 1-11, January.
Markus Neuditschko & Mehar S Khatkar & Herman W Raadsma, 2012. "NetView: A High-Definition Network-Visualization Approach to Detect Fine-Scale Population Structures from Genome-Wide Patterns of Variation," PLOS ONE, Public Library of Science, vol. 7(10), pages 1-13, October.
Cornelia Di Gaetano & Floriana Voglino & Simonetta Guarrera & Giovanni Fiorito & Fabio Rosa & Anna Maria Di Blasio & Paola Manzini & Irma Dianzani & Marta Betti & Daniele Cusi & Francesca Frau & Crist, 2012. "An Overview of the Genetic Structure within the Italian Population from Genome-Wide Data," PLOS ONE, Public Library of Science, vol. 7(9), pages 1-10, September.
Gabriel E Hoffman & Benjamin A Logsdon & Jason G Mezey, 2013. "PUMA: A Unified Framework for Penalized Multiple Regression Analysis of GWAS Data," PLOS Computational Biology, Public Library of Science, vol. 9(6), pages 1-19, June.
Ning Jiang & Minghui Wang & Tianye Jia & Lin Wang & Lindsey Leach & Christine Hackett & David Marshall & Zewei Luo, 2011. "A Robust Statistical Method for Association-Based eQTL Analysis," PLOS ONE, Public Library of Science, vol. 6(8), pages 1-11, August.
André X C N Valente & Joseph Zischkau & Joo Heon Shin & Yuan Gao & Abhijit Sarkar, 2012. "Genome-Wide Association Study Heterogeneous Cohort Homogenization via Subject Weight Knock-Down," PLOS ONE, Public Library of Science, vol. 7(10), pages 1-10, October.
Thomas Charlon & Manuel Martínez-Bueno & Lara Bossini-Castillo & F David Carmona & Alessandro Di Cara & Jérôme Wojcik & Sviatoslav Voloshynovskiy & Javier Martín & Marta E Alarcón-Riquelme, 2016. "Single Nucleotide Polymorphism Clustering in Systemic Autoimmune Diseases," PLOS ONE, Public Library of Science, vol. 11(8), pages 1-10, August.
Maggie C Y Ng & Daniel Shriner & Brian H Chen & Jiang Li & Wei-Min Chen & Xiuqing Guo & Jiankang Liu & Suzette J Bielinski & Lisa R Yanek & Michael A Nalls & Mary E Comeau & Laura J Rasmussen-Torvik &, 2014. "Meta-Analysis of Genome-Wide Association Studies in African Americans Provides Insights into the Genetic Architecture of Type 2 Diabetes," PLOS Genetics, Public Library of Science, vol. 10(8), pages 1-14, August.
Paola Raska & Edwin Iversen & Ann Chen & Zhihua Chen & Brooke L Fridley & Jennifer Permuth-Wey & Ya-Yu Tsai & Robert A Vierkant & Ellen L Goode & Harvey Risch & Joellen M Schildkraut & Thomas A Seller, 2012. "European American Stratification in Ovarian Cancer Case Control Data: The Utility of Genome-Wide Data for Inferring Ancestry," PLOS ONE, Public Library of Science, vol. 7(5), pages 1-9, May.
