Empirical Bayes PCA in high dimensions

Authors

  • Xinyi Zhong
  • Chang Su
  • Zhou Fan

Abstract

When the dimension of data is comparable to or larger than the number of data samples, principal components analysis (PCA) may exhibit problematic high‐dimensional noise. In this work, we propose an empirical Bayes PCA method that reduces this noise by estimating a joint prior distribution for the principal components. EB‐PCA is based on the classical Kiefer–Wolfowitz non‐parametric maximum likelihood estimator for empirical Bayes estimation, distributional results derived from random matrix theory for the sample PCs and iterative refinement using an approximate message passing (AMP) algorithm. In theoretical ‘spiked’ models, EB‐PCA achieves Bayes‐optimal estimation accuracy in the same settings as an oracle Bayes AMP procedure that knows the true priors. Empirically, EB‐PCA significantly improves over PCA when there is strong prior structure, both in simulation and on quantitative benchmarks constructed from the 1000 Genomes Project and the International HapMap Project. An illustration is presented for analysis of gene expression data obtained by single‐cell RNA‐seq.
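
To make the empirical Bayes step concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a rank-one setting in which each coordinate of a rescaled sample PC behaves like mu * u_i + sigma * z_i with standard Gaussian z_i, and it treats mu and sigma as known; in EB-PCA these would instead be estimated using the random matrix theory results the abstract refers to. The prior is approximated by EM over a fixed grid of support points, a common surrogate for the Kiefer–Wolfowitz NPMLE, and the AMP refinement is omitted entirely. The function names, the grid, the two-point prior and the values of mu and sigma are all illustrative assumptions.

```python
import numpy as np


def _row_posteriors(obs, mu, sigma, grid, weights):
    """Posterior over grid points for each observation, under obs = mu*u + sigma*z."""
    resid = obs[:, None] - mu * grid[None, :]
    log_lik = -0.5 * (resid / sigma) ** 2
    # Subtract the row maximum for numerical stability; the row-wise
    # normalisation below makes this rescaling harmless.
    log_lik -= log_lik.max(axis=1, keepdims=True)
    post = np.exp(log_lik) * weights
    return post / post.sum(axis=1, keepdims=True)


def fit_prior_em(obs, mu, sigma, grid, n_iter=300):
    """Grid-based EM approximation to the nonparametric MLE of the prior weights."""
    weights = np.full(grid.size, 1.0 / grid.size)
    for _ in range(n_iter):
        # EM update: new weight = average posterior mass on each grid point.
        weights = _row_posteriors(obs, mu, sigma, grid, weights).mean(axis=0)
    return weights


def denoise(obs, mu, sigma, grid, weights):
    """Posterior-mean estimate of u under the fitted discrete prior."""
    return _row_posteriors(obs, mu, sigma, grid, weights) @ grid


# Illustrative data (assumed, not from the paper): a two-point prior on {-1, +1}.
rng = np.random.default_rng(0)
n, mu, sigma = 2000, 0.8, 0.6
u = rng.choice([-1.0, 1.0], size=n)
obs = mu * u + sigma * rng.standard_normal(n)

grid = np.linspace(-3.0, 3.0, 61)
weights = fit_prior_em(obs, mu, sigma, grid)
u_eb = denoise(obs, mu, sigma, grid, weights)

mse_naive = np.mean((obs / mu - u) ** 2)  # rescale the noisy PC only
mse_eb = np.mean((u_eb - u) ** 2)         # empirical Bayes posterior mean
print(f"naive MSE: {mse_naive:.3f}, empirical Bayes MSE: {mse_eb:.3f}")
```

With a strongly structured prior such as this two-point example, the posterior-mean estimate should have lower mean squared error than simply rescaling the noisy observations, which mirrors the abstract's claim that the gain over PCA is largest when there is strong prior structure.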

Suggested Citation

  • Xinyi Zhong & Chang Su & Zhou Fan, 2022. "Empirical Bayes PCA in high dimensions," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 84(3), pages 853-878, July.
  • Handle: RePEc:bla:jorssb:v:84:y:2022:i:3:p:853-878
    DOI: 10.1111/rssb.12490

    Download full text from publisher

    File URL: https://doi.org/10.1111/rssb.12490
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Barigozzi, Matteo & Trapani, Lorenzo, 2020. "Sequential testing for structural stability in approximate factor models," Stochastic Processes and their Applications, Elsevier, vol. 130(8), pages 5149-5187.
    2. Ding, Xiucai & Ji, Hong Chang, 2023. "Spiked multiplicative random matrices and principal components," Stochastic Processes and their Applications, Elsevier, vol. 163(C), pages 25-60.
    3. Anna Bykhovskaya & Vadim Gorin, 2023. "High-Dimensional Canonical Correlation Analysis," Papers 2306.16393, arXiv.org, revised Aug 2023.
    4. Kim, Donggyu & Wang, Yazhen, 2016. "Sparse PCA-based on high-dimensional Itô processes with measurement errors," Journal of Multivariate Analysis, Elsevier, vol. 152(C), pages 172-189.
    5. Hong, David & Balzano, Laura & Fessler, Jeffrey A., 2018. "Asymptotic performance of PCA for high-dimensional heteroscedastic data," Journal of Multivariate Analysis, Elsevier, vol. 167(C), pages 435-452.
    6. Jianqing Fan & Yuan Liao & Martina Mincheva, 2013. "Large covariance estimation by thresholding principal orthogonal complements," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 75(4), pages 603-680, September.
    7. Dey, Rounak & Lee, Seunggeun, 2019. "Asymptotic properties of principal component analysis and shrinkage-bias adjustment under the generalized spiked population model," Journal of Multivariate Analysis, Elsevier, vol. 173(C), pages 145-164.
    8. Damien Passemier & Zhaoyuan Li & Jianfeng Yao, 2017. "On estimation of the noise variance in high dimensional probabilistic principal component analysis," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 79(1), pages 51-67, January.
    9. Fan, Jianqing & Jiang, Bai & Sun, Qiang, 2022. "Bayesian factor-adjusted sparse regression," Journal of Econometrics, Elsevier, vol. 230(1), pages 3-19.
    10. Yata, Kazuyoshi & Aoshima, Makoto, 2013. "PCA consistency for the power spiked model in high-dimensional settings," Journal of Multivariate Analysis, Elsevier, vol. 122(C), pages 334-354.
    11. Wang, Yihe & Zhao, Sihai Dave, 2021. "A nonparametric empirical Bayes approach to large-scale multivariate regression," Computational Statistics & Data Analysis, Elsevier, vol. 156(C).
    12. Steland, Ansgar, 2020. "Testing and estimating change-points in the covariance matrix of a high-dimensional time series," Journal of Multivariate Analysis, Elsevier, vol. 177(C).
    13. Couillet, Romain, 2015. "Robust spiked random matrices and a robust G-MUSIC estimator," Journal of Multivariate Analysis, Elsevier, vol. 140(C), pages 139-161.
    14. Ziwei Zhu & Tengyao Wang & Richard J. Samworth, 2022. "High‐dimensional principal component analysis with heterogeneous missingness," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 84(5), pages 2000-2031, November.
    15. Joongyeub Yeo & George Papanicolaou, 2016. "Random matrix approach to estimation of high-dimensional factor models," Papers 1611.05571, arXiv.org, revised Nov 2017.
    16. Choi, Sung Hoon & Kim, Donggyu, 2023. "Large volatility matrix analysis using global and national factor models," Journal of Econometrics, Elsevier, vol. 235(2), pages 1917-1933.
    17. Timothy B. Armstrong & Michal Kolesár & Mikkel Plagborg‐Møller, 2022. "Robust Empirical Bayes Confidence Intervals," Econometrica, Econometric Society, vol. 90(6), pages 2567-2602, November.
    18. Lam, Clifford, 2020. "High-dimensional covariance matrix estimation," LSE Research Online Documents on Economics 101667, London School of Economics and Political Science, LSE Library.
    19. Feng, Long & Dicker, Lee H., 2018. "Approximate nonparametric maximum likelihood for mixture models: A convex optimization approach to fitting arbitrary multivariate mixing distributions," Computational Statistics & Data Analysis, Elsevier, vol. 122(C), pages 80-91.
    20. Zhu, Ziwei & Wang, Tengyao & Samworth, Richard J., 2022. "High-dimensional principal component analysis with heterogeneous missingness," LSE Research Online Documents on Economics 117647, London School of Economics and Political Science, LSE Library.
