IDEAS home Printed from https://ideas.repec.org/a/spr/advdac/v13y2019i2d10.1007_s11634-018-0310-9.html

Comparisons among several methods for handling missing data in principal component analysis (PCA)

Authors

Listed:
  • Sébastien Loisel

    (Heriot-Watt University)

  • Yoshio Takane

    (University of Victoria)

Abstract

Missing data are prevalent in many data analytic situations, and those in which principal component analysis (PCA) is applied are no exception. This study investigates the performance of five methods for handling missing data in PCA: the missing data passive method, the weighted low rank approximation (WLRA) method, the regularized PCA (RPCA) method, the trimmed scores regression method, and the data augmentation (DA) method. Three complete data sets of varying sizes were selected, in which missing data were created randomly and non-randomly. These data were then analyzed by the five methods, and their parameter recovery capability, measured by the mean congruence coefficient between loadings obtained from full and missing data, was compared as a function of the number of extracted components (dimensionality) and the proportion of missing data (censor rate). For randomly censored data, all five methods worked well when the dimensionality and censor rate were small. Their performance deteriorated as the dimensionality and censor rate increased, and the deterioration was distinctly faster with the WLRA method. The RPCA method worked best, and the DA method came a close second in terms of parameter recovery. However, the latter, as implemented here, was found to be extremely time-consuming. For non-randomly censored data, recovery was also affected by the degree of non-randomness in the censoring process. Again the RPCA method worked best, maintaining good to excellent recovery when the censor rate was small and the dimensionality of the solution was not excessive.
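
The evaluation loop described in the abstract (censor a complete data set, analyze the censored copy, then compare loadings with those from the full data) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' code: the synthetic data, the function names, and the imputation routine are all hypothetical. In particular, the routine is a plain iterative low-rank imputation, used as a simple stand-in rather than an implementation of any of the five methods compared in the paper. Components are matched by order and sign is ignored, which is a simplification. The congruence coefficient between two loading vectors a and b is taken to be sum(a*b) / sqrt(sum(a^2) * sum(b^2)).

```python
import numpy as np

def pca_loadings(X, k):
    """Loadings (variables x k) of column-centered X, computed via SVD."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k].T * (s[:k] / np.sqrt(X.shape[0] - 1))

def censor_at_random(X, rate, rng):
    """Copy of X with roughly a fraction `rate` of entries set to NaN."""
    Xm = X.copy()
    Xm[rng.random(X.shape) < rate] = np.nan
    return Xm

def iterative_pca_impute(Xm, k, n_iter=200, tol=1e-8):
    """Fill NaNs by alternating a rank-k fit with re-imputation (stand-in method)."""
    mask = np.isnan(Xm)
    # Start from column means of the observed entries.
    X = np.where(mask, np.nanmean(Xm, axis=0), Xm)
    for _ in range(n_iter):
        mu = X.mean(axis=0)
        U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
        fitted = (U[:, :k] * s[:k]) @ Vt[:k] + mu
        delta = np.max(np.abs(X[mask] - fitted[mask])) if mask.any() else 0.0
        X[mask] = fitted[mask]  # only missing cells are overwritten
        if delta < tol:
            break
    return X

def mean_congruence(A, B):
    """Mean absolute congruence coefficient between matched columns of A and B."""
    num = np.sum(A * B, axis=0)
    den = np.sqrt(np.sum(A**2, axis=0) * np.sum(B**2, axis=0))
    return float(np.mean(np.abs(num / den)))

# Synthetic two-component data (an assumption, not one of the paper's data sets).
rng = np.random.default_rng(0)
n, p, k = 200, 8, 2
scores = rng.normal(size=(n, k)) * np.array([3.0, 1.5])
W = np.linalg.qr(rng.normal(size=(p, k)))[0]
X_full = scores @ W.T + 0.1 * rng.normal(size=(n, p))

X_miss = censor_at_random(X_full, rate=0.1, rng=rng)
X_hat = iterative_pca_impute(X_miss, k)
phi = mean_congruence(pca_loadings(X_full, k), pca_loadings(X_hat, k))
print(round(phi, 3))
```

Repeating the last four lines over a grid of `k` and `rate` values would give recovery curves of the kind the abstract describes, as functions of dimensionality and censor rate.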

Suggested Citation

  • Sébastien Loisel & Yoshio Takane, 2019. "Comparisons among several methods for handling missing data in principal component analysis (PCA)," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(2), pages 495-518, June.
  • Handle: RePEc:spr:advdac:v:13:y:2019:i:2:d:10.1007_s11634-018-0310-9
    DOI: 10.1007/s11634-018-0310-9

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11634-018-0310-9
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Citations



    Cited by:

    1. A. Iodice D’Enza & A. Markos & F. Palumbo, 2022. "Chunk-wise regularised PCA-based imputation of missing data," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 31(2), pages 365-386, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Julie Josse & Jérôme Pagès & François Husson, 2011. "Multiple imputation in principal component analysis," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 5(3), pages 231-246, October.
    2. Hron, K. & Templ, M. & Filzmoser, P., 2010. "Imputation of missing values for compositional data using classical and robust methods," Computational Statistics & Data Analysis, Elsevier, vol. 54(12), pages 3095-3107, December.
    3. Wasito, Ito & Mirkin, Boris, 2006. "Nearest neighbours in least-squares data imputation algorithms with different missing patterns," Computational Statistics & Data Analysis, Elsevier, vol. 50(4), pages 926-949, February.
    4. Wang, Zihan & Daeipour, Mohamad & Xu, Hongyi, 2023. "Quantification and propagation of Aleatoric uncertainties in topological structures," Reliability Engineering and System Safety, Elsevier, vol. 233(C).
    5. Qing Li & Long Hai Vo, 2021. "Intangible Capital and Innovation: An Empirical Analysis of Vietnamese Enterprises," Economics Discussion / Working Papers 21-02, The University of Western Australia, Department of Economics.
    6. Joost Ginkel & Pieter Kroonenberg, 2014. "Using Generalized Procrustes Analysis for Multiple Imputation in Principal Component Analysis," Journal of Classification, Springer;The Classification Society, vol. 31(2), pages 242-269, July.
    7. Husson, François & Josse, Julie & Saporta, Gilbert, 2016. "Jan de Leeuw and the French School of Data Analysis," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 73(i06).
    8. Matteo Barigozzi & Matteo Luciani, 2024. "Quasi Maximum Likelihood Estimation and Inference of Large Approximate Dynamic Factor Models via the EM algorithm," Finance and Economics Discussion Series 2024-086, Board of Governors of the Federal Reserve System (U.S.).
    9. Xin Xu & Yang Lu & Yupeng Zhou & Zhiguo Fu & Yanjie Fu & Minghao Yin, 2021. "An Information-Explainable Random Walk Based Unsupervised Network Representation Learning Framework on Node Classification Tasks," Mathematics, MDPI, vol. 9(15), pages 1-14, July.
    10. de Leeuw, Jan, 2006. "Principal component analysis of binary data by iterated singular value decomposition," Computational Statistics & Data Analysis, Elsevier, vol. 50(1), pages 21-39, January.
    11. Dorota Toczydlowska & Gareth W. Peters & Man Chung Fung & Pavel V. Shevchenko, 2017. "Stochastic Period and Cohort Effect State-Space Mortality Models Incorporating Demographic Factors via Probabilistic Robust Principal Components," Risks, MDPI, vol. 5(3), pages 1-77, July.
    12. Matteo Barigozzi & Marc Hallin, 2023. "Dynamic Factor Models: a Genealogy," Papers 2310.17278, arXiv.org, revised Jan 2024.
    13. Chen, Tao & Martin, Elaine & Montague, Gary, 2009. "Robust probabilistic PCA with missing data and contribution analysis for outlier detection," Computational Statistics & Data Analysis, Elsevier, vol. 53(10), pages 3706-3716, August.
    14. Duane F. Alwin, 1973. "The Use of Factor Analysis in the Construction of Linear Composites in Social Research," Sociological Methods & Research, , vol. 2(2), pages 191-212, November.
    15. Chen, Andrew Y. & McCoy, Jack, 2024. "Missing values handling for machine learning portfolios," Journal of Financial Economics, Elsevier, vol. 155(C).
    16. Wang, Shao-Hsuan & Huang, Su-Yun, 2022. "Perturbation theory for cross data matrix-based PCA," Journal of Multivariate Analysis, Elsevier, vol. 190(C).
    17. Cook, R. Dennis, 2022. "A slice of multivariate dimension reduction," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    18. Wentao Qu & Xianchao Xiu & Huangyue Chen & Lingchen Kong, 2023. "A Survey on High-Dimensional Subspace Clustering," Mathematics, MDPI, vol. 11(2), pages 1-39, January.
    19. Groenen, P.J.F. & Giaquinto, P. & Kiers, H.A.L., 2003. "Weighted Majorization Algorithms for Weighted Least Squares Decomposition Models," Econometric Institute Research Papers EI 2003-09, Erasmus University Rotterdam, Erasmus School of Economics (ESE), Econometric Institute.
    20. Ligon, Ethan, 2017. "Estimating household welfare from disaggregate expenditures," Department of Agricultural & Resource Economics, UC Berkeley, Working Paper Series qt5gc4h1fm, Department of Agricultural & Resource Economics, UC Berkeley.
