Printed from https://ideas.repec.org/a/plo/pone00/0093766.html

Fast Principal Component Analysis of Large-Scale Genome-Wide Data

Author

Listed:
  • Gad Abraham
  • Michael Inouye

Abstract

Principal component analysis (PCA) is routinely used to analyze genome-wide single-nucleotide polymorphism (SNP) data to detect population structure and potential outliers. However, the size of SNP datasets has increased immensely in recent years, and PCA of large datasets has become a time-consuming task. We have developed flashpca, a highly efficient PCA implementation based on randomized algorithms, which extracts the top principal components with the same accuracy as existing tools in substantially less time. We demonstrate the utility of flashpca on both HapMap3 and a large Immunochip dataset. For the latter, flashpca performed PCA of 15,000 individuals up to 125 times faster than existing tools, with identical results, and PCA of 150,000 individuals using flashpca completed in 4 hours. The increasing size of SNP datasets will make tools such as flashpca essential, as traditional approaches will not scale adequately. This approach will also help to scale other applications that rely on PCA or eigen-decomposition to substantially larger datasets.
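
The abstract names randomized algorithms as the basis of the speed-up but does not spell out the procedure. The following is a minimal sketch of one generic randomized PCA scheme (a random range sketch refined by power iterations, followed by a small exact SVD), written in NumPy. It is not flashpca's implementation or API: the function name randomized_pca and the parameters n_oversample, n_iter and seed are illustrative assumptions, and any SNP-specific standardization the tool may apply is omitted.

```python
import numpy as np


def randomized_pca(X, k, n_oversample=10, n_iter=4, seed=0):
    """Approximate the top-k principal components of X (samples x features).

    Illustrative sketch only; names and defaults are assumptions,
    not taken from flashpca.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                  # center each feature (column)
    n, p = Xc.shape
    ell = min(k + n_oversample, n, p)        # sketch width with oversampling

    # Project onto a random test matrix and orthonormalize the result
    Q = np.linalg.qr(Xc @ rng.standard_normal((p, ell)))[0]

    # Power iterations pull the sketch toward the dominant subspace;
    # re-orthonormalizing at each step keeps the basis well conditioned
    for _ in range(n_iter):
        Q = np.linalg.qr(Xc.T @ Q)[0]
        Q = np.linalg.qr(Xc @ Q)[0]

    # Exact SVD of the small (ell x p) projected matrix
    B = Q.T @ Xc
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)

    scores = (Q @ Ub[:, :k]) * s[:k]         # sample coordinates on the top-k PCs
    loadings = Vt[:k].T                      # feature loadings, shape (p, k)
    eigvals = s[:k] ** 2 / (n - 1)           # variance along each component
    return scores, loadings, eigvals
```

Calling randomized_pca(X, k=10) on a samples-by-SNPs matrix X would return the sample scores, the SNP loadings and the variance along each component. The expensive steps are a handful of matrix products with a thin matrix of k + n_oversample columns, rather than a full eigen-decomposition of the sample covariance, which is the general reason such schemes scale to large datasets.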

Suggested Citation

  • Gad Abraham & Michael Inouye, 2014. "Fast Principal Component Analysis of Large-Scale Genome-Wide Data," PLOS ONE, Public Library of Science, vol. 9(4), pages 1-5, April.
  • Handle: RePEc:plo:pone00:0093766
    DOI: 10.1371/journal.pone.0093766

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0093766
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0093766&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. John Novembre & Toby Johnson & Katarzyna Bryc & Zoltán Kutalik & Adam R. Boyko & Adam Auton & Amit Indap & Karen S. King & Sven Bergmann & Matthew R. Nelson & Matthew Stephens & Carlos D. Bustamante, 2008. "Genes mirror geography within Europe," Nature, Nature, vol. 456(7219), pages 274-274, November.
    2. John Novembre & Toby Johnson & Katarzyna Bryc & Zoltán Kutalik & Adam R. Boyko & Adam Auton & Amit Indap & Karen S. King & Sven Bergmann & Matthew R. Nelson & Matthew Stephens & Carlos D. Bustamante, 2008. "Genes mirror geography within Europe," Nature, Nature, vol. 456(7218), pages 98-101, November.
    3. Nick Patterson & Alkes L Price & David Reich, 2006. "Population Structure and Eigenanalysis," PLOS Genetics, Public Library of Science, vol. 2(12), pages 1-20, December.

    Citations

    Cited by:

    1. Lili Liu & Atlas Khan & Elena Sanchez-Rodriguez & Francesca Zanoni & Yifu Li & Nicholas Steers & Olivia Balderes & Junying Zhang & Priya Krithivasan & Robert A. LeDesma & Clara Fischman & Scott J. Heb, 2022. "Genetic regulation of serum IgA levels and susceptibility to common immune, infectious, kidney, and cardio-metabolic traits," Nature Communications, Nature, vol. 13(1), pages 1-17, December.
    2. Hugh G Gauch Jr. & Sheng Qian & Hans-Peter Piepho & Linda Zhou & Rui Chen, 2019. "Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure," PLOS ONE, Public Library of Science, vol. 14(6), pages 1-26, June.
    3. Nagel, Mats, 2020. "Changing perspectives: Towards detailed phenotyping in genetics," Thesis Commons a4nz2, Center for Open Science.
    4. Atlas Khan & Ning Shang & Jordan G. Nestor & Chunhua Weng & George Hripcsak & Peter C. Harris & Ali G. Gharavi & Krzysztof Kiryluk, 2023. "Polygenic risk alters the penetrance of monogenic kidney disease," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
    5. Emmanuel Paradis, 2022. "Reduced multidimensional scaling," Computational Statistics, Springer, vol. 37(1), pages 91-105, March.
    6. Michael Greenacre & Patrick J. F Groenen & Trevor Hastie & Alfonso Iodice d’Enza & Angelos Markos & Elena Tuzhilina, 2023. "Principal component analysis," Economics Working Papers 1856, Department of Economics and Business, Universitat Pompeu Fabra.
    7. Torrecilla, José L. & Romo, Juan, 2018. "Data learning from big data," Statistics & Probability Letters, Elsevier, vol. 136(C), pages 15-19.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Andrey V Khrunin & Denis V Khokhrin & Irina N Filippova & Tõnu Esko & Mari Nelis & Natalia A Bebyakova & Natalia L Bolotova & Janis Klovins & Liene Nikitina-Zake & Karola Rehnström & Samuli Ripatti & , 2013. "A Genome-Wide Analysis of Populations from European Russia Reveals a New Pole of Genetic Diversity in Northern Europe," PLOS ONE, Public Library of Science, vol. 8(3), pages 1-9, March.
    2. Pierre Luisi & Angelina García & Juan Manuel Berros & Josefina M B Motti & Darío A Demarchi & Emma Alfaro & Eliana Aquilano & Carina Argüelles & Sergio Avena & Graciela Bailliet & Julieta Beltramo & C, 2020. "Fine-scale genomic analyses of admixed individuals reveal unrecognized genetic ancestry components in Argentina," PLOS ONE, Public Library of Science, vol. 15(7), pages 1-30, July.
    3. Diana Chang & Alon Keinan, 2014. "Principal Component Analysis Characterizes Shared Pathogenetics from Genome-Wide Association Studies," PLOS Computational Biology, Public Library of Science, vol. 10(9), pages 1-14, September.
    4. Bryc, Katarzyna & Bryc, Wlodek & Silverstein, Jack W., 2013. "Separation of the largest eigenvalues in eigenanalysis of genotype data from discrete subpopulations," Theoretical Population Biology, Elsevier, vol. 89(C), pages 34-43.
    5. Gil McVean, 2009. "A Genealogical Interpretation of Principal Components Analysis," PLOS Genetics, Public Library of Science, vol. 5(10), pages 1-10, October.
    6. Guindon, Stéphane & Guo, Hongbin & Welch, David, 2016. "Demographic inference under the coalescent in a spatial continuum," Theoretical Population Biology, Elsevier, vol. 111(C), pages 43-50.
    7. Marie-Claude Babron & Marie de Tayrac & Douglas N Rutledge & Eleftheria Zeggini & Emmanuelle Génin, 2012. "Rare and Low Frequency Variant Stratification in the UK Population: Description and Impact on Association Tests," PLOS ONE, Public Library of Science, vol. 7(10), pages 1-9, October.
    8. Priya Moorjani & Nick Patterson & Joel N Hirschhorn & Alon Keinan & Li Hao & Gil Atzmon & Edward Burns & Harry Ostrer & Alkes L Price & David Reich, 2011. "The History of African Gene Flow into Southern Europeans, Levantines, and Jews," PLOS Genetics, Public Library of Science, vol. 7(4), pages 1-13, April.
    9. Wang Chaolong & Szpiech Zachary A & Degnan James H & Jakobsson Mattias & Pemberton Trevor J & Hardy John A & Singleton Andrew B & Rosenberg Noah A, 2010. "Comparing Spatial Maps of Human Population-Genetic Variation Using Procrustes Analysis," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 9(1), pages 1-22, January.
    10. Thomas Charlon & Manuel Martínez-Bueno & Lara Bossini-Castillo & F David Carmona & Alessandro Di Cara & Jérôme Wojcik & Sviatoslav Voloshynovskiy & Javier Martín & Marta E Alarcón-Riquelme, 2016. "Single Nucleotide Polymorphism Clustering in Systemic Autoimmune Diseases," PLOS ONE, Public Library of Science, vol. 11(8), pages 1-10, August.
    11. Diana Chang & Feng Gao & Andrea Slavney & Li Ma & Yedael Y Waldman & Aaron J Sams & Paul Billing-Ross & Aviv Madar & Richard Spritz & Alon Keinan, 2014. "Accounting for eXentricities: Analysis of the X Chromosome in GWAS Reveals X-Linked Genes Implicated in Autoimmune Diseases," PLOS ONE, Public Library of Science, vol. 9(12), pages 1-31, December.
    12. Duforet-Frebourg, Nicolas & Slatkin, Montgomery, 2016. "Isolation-by-distance-and-time in a stepping-stone model," Theoretical Population Biology, Elsevier, vol. 108(C), pages 24-35.
    13. Aman Agrawal & Alec M Chiu & Minh Le & Eran Halperin & Sriram Sankararaman, 2020. "Scalable probabilistic PCA for large-scale genetic variation data," PLOS Genetics, Public Library of Science, vol. 16(5), pages 1-19, May.
    14. Thalida E Arpawong & Neil Pendleton & Krisztina Mekli & John J McArdle & Margaret Gatz & Chris Armoskus & James A Knowles & Carol A Prescott, 2017. "Genetic variants specific to aging-related verbal memory: Insights from GWASs in a population-based cohort," PLOS ONE, Public Library of Science, vol. 12(8), pages 1-27, August.
    15. Matthieu Marbac & Mohammed Sedki & Tienne Patin, 2020. "Variable Selection for Mixed Data Clustering: Application in Human Population Genomics," Journal of Classification, Springer;The Classification Society, vol. 37(1), pages 124-142, April.
    16. Isabel Alves & Joanna Giemza & Michael G. B. Blum & Carolina Bernhardsson & Stéphanie Chatel & Matilde Karakachoff & Aude Pierre & Anthony F. Herzig & Robert Olaso & Martial Monteil & Véronique Gallie, 2024. "Human genetic structure in Northwest France provides new insights into West European historical demography," Nature Communications, Nature, vol. 15(1), pages 1-18, December.
    17. Zheng, Xiuwen & Weir, Bruce S., 2016. "Eigenanalysis of SNP data with an identity by descent interpretation," Theoretical Population Biology, Elsevier, vol. 107(C), pages 65-76.
    18. Jason Sawler & Bruce Reisch & Mallikarjuna K Aradhya & Bernard Prins & Gan-Yuan Zhong & Heidi Schwaninger & Charles Simon & Edward Buckler & Sean Myles, 2013. "Genomics Assisted Ancestry Deconvolution in Grape," PLOS ONE, Public Library of Science, vol. 8(11), pages 1-1, November.
    19. Marco Lopez-Cruz & Fernando M. Aguate & Jacob D. Washburn & Natalia Leon & Shawn M. Kaeppler & Dayane Cristina Lima & Ruijuan Tan & Addie Thompson & Laurence Willard Bretonne & Gustavo los Campos, 2023. "Leveraging data from the Genomes-to-Fields Initiative to investigate genotype-by-environment interactions in maize in North America," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    20. Beatrix Eugster & Rafael Lalive & Andreas Steinhauer & Josef Zweimüller, 2011. "The Demand for Social Insurance: Does Culture Matter?," Economic Journal, Royal Economic Society, vol. 121(556), pages 413-448, November.
