Printed from https://ideas.repec.org/a/bla/biomet/v79y2023i2p891-902.html

An eigenvalue ratio approach to inferring population structure from whole genome sequencing data

Author

Listed:
  • Yuyang Xu
  • Zhonghua Liu
  • Jianfeng Yao

Abstract

Inference of population structure from genetic data plays an important role in population and medical genetics studies. With the advancement and decreasing cost of sequencing technology, the increasingly available whole genome sequencing data provide much richer information about the underlying population structure. The traditional method originally developed for array‐based genotype data for computing and selecting top principal components (PCs) that capture population structure may not perform well on sequencing data for two reasons. First, the number of genetic variants p is much larger than the sample size n in sequencing data such that the sample‐to‐marker ratio n/p is nearly zero, violating the assumption of the Tracy‐Widom test used in their method. Second, their method might not be able to handle the linkage disequilibrium well in sequencing data. To resolve those two practical issues, we propose a new method called ERStruct to determine the number of top informative PCs based on sequencing data. More specifically, we propose to use the ratio of consecutive eigenvalues as a more robust test statistic, and then we approximate its null distribution using modern random matrix theory. Both simulation studies and applications to two public data sets from the HapMap 3 and the 1000 Genomes Projects demonstrate the empirical performance of our ERStruct method.
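
The idea of using ratios of consecutive eigenvalues as a statistic can be illustrated with a minimal sketch. This is not the ERStruct procedure: the abstract says ERStruct compares the ratio against an approximated null distribution from random matrix theory, whereas the sketch below substitutes a simple heuristic (pick the K with the smallest ratio among the first few). The function names, the toy simulation model, and all parameter values (`n_per_pop`, `p`, `spread`, `k_max`) are assumptions made for illustration only.

```python
import numpy as np


def simulate_genotypes(n_per_pop=40, n_pops=3, p=3000, spread=0.15, seed=0):
    """Toy simulation (an assumption, not from the paper): n x p genotypes in {0, 1, 2}."""
    rng = np.random.default_rng(seed)
    ancestral = rng.uniform(0.1, 0.9, size=p)
    # Each population's allele frequencies deviate from the ancestral frequencies.
    pop_freqs = np.clip(
        ancestral + rng.normal(0.0, spread, size=(n_pops, p)), 0.05, 0.95
    )
    labels = np.repeat(np.arange(n_pops), n_per_pop)
    return rng.binomial(2, pop_freqs[labels])


def eigenvalue_ratios(X):
    """Return descending eigenvalues of the n x n Gram matrix and their consecutive ratios."""
    X = np.asarray(X, dtype=float)
    sd = X.std(axis=0)
    keep = sd > 0  # drop markers with no variation in the sample
    Z = (X[:, keep] - X[:, keep].mean(axis=0)) / sd[keep]
    # With p >> n, the n x n matrix is far cheaper than the p x p covariance
    # and shares its nonzero eigenvalues.
    G = Z @ Z.T / Z.shape[1]
    eigvals = np.linalg.eigvalsh(G)[::-1]  # eigvalsh returns ascending order
    eigvals = eigvals[:-1]  # column centering forces one eigenvalue to ~0
    ratios = eigvals[1:] / eigvals[:-1]  # ratios[k-1] = lambda_{k+1} / lambda_k
    return eigvals, ratios


def estimate_num_pcs(X, k_max=10):
    """Heuristic stand-in for a formal test: the K with the smallest ratio."""
    _, ratios = eigenvalue_ratios(X)
    return int(np.argmin(ratios[:k_max])) + 1


if __name__ == "__main__":
    X = simulate_genotypes()
    eigvals, ratios = eigenvalue_ratios(X)
    print("top eigenvalues:", np.round(eigvals[:5], 2))
    print("first ratios:", np.round(ratios[:5], 2))
    print("estimated number of informative PCs:", estimate_num_pcs(X))
```

In this toy setting, three populations correspond to two informative PCs after centering, so a sharp drop is expected between the second and third eigenvalues. A ratio-based rule is attractive because it is invariant to the overall scale of the eigenvalues; the argmin rule here is only a placeholder for a properly calibrated critical value.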

Suggested Citation

  • Yuyang Xu & Zhonghua Liu & Jianfeng Yao, 2023. "An eigenvalue ratio approach to inferring population structure from whole genome sequencing data," Biometrics, The International Biometric Society, vol. 79(2), pages 891-902, June.
  • Handle: RePEc:bla:biomet:v:79:y:2023:i:2:p:891-902
    DOI: 10.1111/biom.13691

    Download full text from publisher

    File URL: https://doi.org/10.1111/biom.13691
    Download Restriction: no


    References listed on IDEAS

    1. Clare Bycroft & Colin Freeman & Desislava Petkova & Gavin Band & Lloyd T. Elliott & Kevin Sharp & Allan Motyer & Damjan Vukcevic & Olivier Delaneau & Jared O’Connell & Adrian Cortes & Samantha Welsh &, 2018. "The UK Biobank resource with deep phenotyping and genomic data," Nature, Nature, vol. 562(7726), pages 203-209, October.
    2. Nick Patterson & Alkes L Price & David Reich, 2006. "Population Structure and Eigenanalysis," PLOS Genetics, Public Library of Science, vol. 2(12), pages 1-20, December.
    3. Yi‐Hui Zhou & J. S. Marron & Fred A. Wright, 2018. "Eigenvalue significance testing for genetic association," Biometrics, The International Biometric Society, vol. 74(2), pages 439-447, June.
    4. Baik, Jinho & Silverstein, Jack W., 2006. "Eigenvalues of large sample covariance matrices of spiked population models," Journal of Multivariate Analysis, Elsevier, vol. 97(6), pages 1382-1408, July.
    5. Seung C. Ahn & Alex R. Horenstein, 2013. "Eigenvalue Ratio Test for the Number of Factors," Econometrica, Econometric Society, vol. 81(3), pages 1203-1227, May.
    6. Alexei Onatski, 2009. "Testing Hypotheses About the Number of Factors in Large Factor Models," Econometrica, Econometric Society, vol. 77(5), pages 1447-1479, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Bo Zhang & Jiti Gao & Guangming Pan & Yanrong Yang, 2019. "Spiked Eigenvalues of High-Dimensional Separable Sample Covariance Matrices," Monash Econometrics and Business Statistics Working Papers 31/19, Monash University, Department of Econometrics and Business Statistics.
    2. Anna Bykhovskaya & Vadim Gorin, 2023. "High-Dimensional Canonical Correlation Analysis," Papers 2306.16393, arXiv.org, revised Aug 2023.
    3. GUO-FITOUSSI, Liang, 2013. "A Comparison of the Finite Sample Properties of Selection Rules of Factor Numbers in Large Datasets," MPRA Paper 50005, University Library of Munich, Germany.
    4. Li, Hongjun & Li, Qi & Shi, Yutang, 2017. "Determining the number of factors when the number of factors can increase with sample size," Journal of Econometrics, Elsevier, vol. 197(1), pages 76-86.
    5. Shuquan Yang & Nengxiang Ling & Yulin Gong, 2022. "Robust estimation of the number of factors for the pair-elliptical factor models," Computational Statistics, Springer, vol. 37(3), pages 1495-1522, July.
    6. Junyang Qian & Yosuke Tanigawa & Wenfei Du & Matthew Aguirre & Chris Chang & Robert Tibshirani & Manuel A Rivas & Trevor Hastie, 2020. "A fast and scalable framework for large-scale and ultrahigh-dimensional sparse regression with application to the UK Biobank," PLOS Genetics, Public Library of Science, vol. 16(10), pages 1-30, October.
    7. Alain-Philippe Fortin & Patrick Gagliardini & O. Scaillet, 2022. "Eigenvalue tests for the number of latent factors in short panels," Swiss Finance Institute Research Paper Series 22-81, Swiss Finance Institute.
    8. Artūras Juodis & Simas Kučinskas, 2023. "Quantifying noise in survey expectations," Quantitative Economics, Econometric Society, vol. 14(2), pages 609-650, May.
    9. Jushan Bai & Serena Ng, 2020. "Simpler Proofs for Approximate Factor Models of Large Dimensions," Papers 2008.00254, arXiv.org.
    10. Forzani, Liliana & Gieco, Antonella & Tolmasky, Carlos, 2017. "Likelihood ratio test for partial sphericity in high and ultra-high dimensions," Journal of Multivariate Analysis, Elsevier, vol. 159(C), pages 18-38.
    11. Matteo Barigozzi & Marc Hallin, 2023. "Dynamic Factor Models: a Genealogy," Papers 2310.17278, arXiv.org, revised Jan 2024.
    12. Oguzhan Cepni & I. Ethem Guney & Norman R. Swanson, 2020. "Forecasting and nowcasting emerging market GDP growth rates: The role of latent global economic policy uncertainty and macroeconomic data surprise factors," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 39(1), pages 18-36, January.
    13. Zhou, Ruichao & Wu, Jianhong, 2023. "Determining the number of change-points in high-dimensional factor models by cross-validation with matrix completion," Economics Letters, Elsevier, vol. 232(C).
    14. Ryan Greenaway‐McGrevy & Nelson C. Mark & Donggyu Sul & Jyh‐Lin Wu, 2018. "Identifying Exchange Rate Common Factors," International Economic Review, Department of Economics, University of Pennsylvania and Osaka University Institute of Social and Economic Research Association, vol. 59(4), pages 2193-2218, November.
    15. Gagliardini, Patrick & Ossola, Elisa & Scaillet, Olivier, 2019. "A diagnostic criterion for approximate factor structure," Journal of Econometrics, Elsevier, vol. 212(2), pages 503-521.
    16. Xu Cheng & Zhipeng Liao & Frank Schorfheide, 2016. "Shrinkage Estimation of High-Dimensional Factor Models with Structural Instabilities," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 83(4), pages 1511-1543.
    17. Bo Zhang & Jiti Gao & Guangming Pan & Yanrong Yang, 2023. "Eigen-Analysis for High-Dimensional Time Series Clustering," Monash Econometrics and Business Statistics Working Papers 22/23, Monash University, Department of Econometrics and Business Statistics.
    18. Matteo Barigozzi & Marco Lippi & Matteo Luciani, 2014. "Dynamic Factor Models, Cointegration and Error Correction Mechanisms," Working Papers ECARES ECARES 2014-14, ULB -- Universite Libre de Bruxelles.
    19. Bai, Jushan & Duan, Jiangtao & Han, Xu, 2024. "The likelihood ratio test for structural changes in factor models," Journal of Econometrics, Elsevier, vol. 238(2).
    20. In Choi & Dukpa Kim & Yun Jung Kim & Noh‐Sun Kwark, 2018. "A multilevel factor model: Identification, asymptotic theory and applications," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 33(3), pages 355-377, April.
