Printed from https://ideas.repec.org/a/eee/csdana/v53y2009i10p3706-3716.html

Robust probabilistic PCA with missing data and contribution analysis for outlier detection

Author

Listed:
  • Chen, Tao
  • Martin, Elaine
  • Montague, Gary

Abstract

Principal component analysis (PCA) is a widely adopted multivariate data analysis technique that can be interpreted both as a classical linear projection and through a probability model (probabilistic PCA, PPCA). Robust PPCA models based on the multivariate t-distribution have recently been proposed for data sets that may contain outliers. This paper presents an overview of the robust PPCA technique and further discusses the issue of missing data. An expectation-maximization (EM) algorithm is presented for the maximum likelihood estimation of the model parameters in the presence of missing data. When robust PPCA is applied to outlier detection, a contribution analysis method is proposed to identify which variables contribute the most to the occurrence of outliers, providing valuable information regarding the source of outlying data. The proposed technique is demonstrated on numerical examples and applied to outlier detection and diagnosis in an industrial fermentation process.
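
To make the idea in the abstract concrete, the following is a minimal sketch of robust PPCA under a multivariate t-distribution, and not the authors' algorithm. It assumes complete data (no missing values), a fixed degrees-of-freedom value nu, and fewer retained components than variables (q < d). It alternates the standard t-distribution weight update u = (nu + d) / (nu + delta^2) with the closed-form PPCA solution of Tipping & Bishop applied to the weighted covariance. The function names ppca_from_cov and fit_robust_ppca are hypothetical, and the paper's missing-data EM and contribution analysis are not reproduced here.

```python
import numpy as np

def ppca_from_cov(S, q):
    """Closed-form maximum-likelihood PPCA for a covariance matrix S (assumes q < d)."""
    evals, evecs = np.linalg.eigh(S)              # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]    # reorder to descending
    sigma2 = evals[q:].mean()                     # noise variance: mean of discarded eigenvalues
    W = evecs[:, :q] * np.sqrt(np.maximum(evals[:q] - sigma2, 0.0))
    return W, sigma2

def fit_robust_ppca(X, q, nu=4.0, n_iter=50):
    """EM-style fit of a t-distributed PPCA model with fixed nu (illustrative only)."""
    n, d = X.shape
    u = np.ones(n)                                # first pass is ordinary PPCA
    for _ in range(n_iter):
        mu = (u[:, None] * X).sum(axis=0) / u.sum()   # weighted mean
        diff = X - mu
        S = (u[:, None] * diff).T @ diff / n          # weighted covariance
        W, sigma2 = ppca_from_cov(S, q)
        C = W @ W.T + sigma2 * np.eye(d)              # model covariance
        delta2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(C), diff)
        u = (nu + d) / (nu + delta2)                  # small weight = likely outlier
    return mu, W, sigma2, u

# Synthetic demonstration: 200 points near a 2-D subspace of R^5, first 5 rows shifted.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) @ (2.0 * rng.normal(size=(5, 2))).T
X += 0.1 * rng.normal(size=X.shape)
X[:5] += 10.0                                     # injected outliers
mu, W, sigma2, u = fit_robust_ppca(X, q=2)
```

Observations with small weights u are the ones the fitted model treats as outlying; a threshold on u or on the underlying squared Mahalanobis distance would be an additional modelling choice.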

Suggested Citation

  • Chen, Tao & Martin, Elaine & Montague, Gary, 2009. "Robust probabilistic PCA with missing data and contribution analysis for outlier detection," Computational Statistics & Data Analysis, Elsevier, vol. 53(10), pages 3706-3716, August.
  • Handle: RePEc:eee:csdana:v:53:y:2009:i:10:p:3706-3716

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167-9473(09)00124-8
    Download Restriction: Full text for ScienceDirect subscribers only.

    References listed on IDEAS

    1. Michael E. Tipping & Christopher M. Bishop, 1999. "Probabilistic Principal Component Analysis," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 61(3), pages 611-622.
    2. Tao Chen & Julian Morris & Elaine Martin, 2006. "Probability density estimation via an infinite Gaussian mixture model: application to statistical process monitoring," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 55(5), pages 699-715, November.
    3. Li, Baibing & Martin, Elaine B. & Morris, A. Julian, 2002. "On principal component analysis in L1," Computational Statistics & Data Analysis, Elsevier, vol. 40(3), pages 471-474, September.
    4. Liu, Chuanhai, 1997. "ML Estimation of the Multivariate t Distribution and the EM Algorithm," Journal of Multivariate Analysis, Elsevier, vol. 63(2), pages 296-312, November.
    5. Kotz, Samuel & Nadarajah, Saralees, 2004. "Multivariate T-Distributions and Their Applications," Cambridge Books, Cambridge University Press, number 9780521826549, October.

    Citations

    Cited by:

    1. Boente, Graciela & Pires, Ana M. & Rodrigues, Isabel M., 2010. "Detecting influential observations in principal components and common principal components," Computational Statistics & Data Analysis, Elsevier, vol. 54(12), pages 2967-2975, December.
    2. Efstathios Panayi & Gareth Peters & Ioannis Kosmidis, 2014. "Liquidity commonality does not imply liquidity resilience commonality: A functional characterisation for ultra-high frequency cross-sectional LOB data," Papers 1406.5486, arXiv.org.
    3. Debruyne, Michiel & Hubert, Mia & Van Horebeek, Johan, 2010. "Detecting influential observations in Kernel PCA," Computational Statistics & Data Analysis, Elsevier, vol. 54(12), pages 3007-3019, December.
    4. Sang-Mok Lee & So-Won Choi & Eul-Bum Lee, 2023. "Prediction Modeling of Flue Gas Control for Combustion Efficiency Optimization for Steel Mill Power Plant Boilers Based on Partial Least Squares Regression (PLSR)," Energies, MDPI, vol. 16(19), pages 1-33, September.
    5. Dorota Toczydlowska & Gareth W. Peters, 2018. "Financial Big Data Solutions for State Space Panel Regression in Interest Rate Dynamics," Econometrics, MDPI, vol. 6(3), pages 1-45, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Dorota Toczydlowska & Gareth W. Peters & Man Chung Fung & Pavel V. Shevchenko, 2017. "Stochastic Period and Cohort Effect State-Space Mortality Models Incorporating Demographic Factors via Probabilistic Robust Principal Components," Risks, MDPI, vol. 5(3), pages 1-77, July.
    2. Marconi, Gabriele, 2014. "European higher education policies and the problem of estimating a complex model with a small cross-section," MPRA Paper 87600, University Library of Munich, Germany.
    3. Matteo Barigozzi, 2023. "Asymptotic equivalence of Principal Components and Quasi Maximum Likelihood estimators in Large Approximate Factor Models," Papers 2307.09864, arXiv.org, revised Jun 2024.
    4. Benaych-Georges, Florent & Nadakuditi, Raj Rao, 2012. "The singular values and vectors of low rank perturbations of large rectangular random matrices," Journal of Multivariate Analysis, Elsevier, vol. 111(C), pages 120-135.
    5. Gaofeng Jia & Alexandros A. Taflanidis & Norberto C. Nadal-Caraballo & Jeffrey A. Melby & Andrew B. Kennedy & Jane M. Smith, 2016. "Surrogate modeling for peak or time-dependent storm surge prediction over an extended coastal region using an existing database of synthetic storms," Natural Hazards: Journal of the International Society for the Prevention and Mitigation of Natural Hazards, Springer;International Society for the Prevention and Mitigation of Natural Hazards, vol. 81(2), pages 909-938, March.
    6. Landgraf, Andrew J. & Lee, Yoonkyung, 2020. "Dimensionality reduction for binary data through the projection of natural parameters," Journal of Multivariate Analysis, Elsevier, vol. 180(C).
    7. Jung, WoongHee & Taflanidis, Alexandros A. & Kyprioti, Aikaterini P. & Zhang, Jize, 2024. "Adaptive multi-fidelity Monte Carlo for real-time probabilistic storm surge predictions," Reliability Engineering and System Safety, Elsevier, vol. 247(C).
    8. Paola Zuccolotto, 2012. "Principal component analysis with interval imputed missing values," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 96(1), pages 1-23, January.
    9. Elizondo Rocío, 2013. "Forecasting the Term Structure of Interest Rates in Mexico Using an Affine Model," Working Papers 2013-03, Banco de México.
    10. Francesco Curreri & Giacomo Fiumara & Maria Gabriella Xibilia, 2020. "Input Selection Methods for Soft Sensor Design: A Survey," Future Internet, MDPI, vol. 12(6), pages 1-24, June.
    11. Gaofeng Jia & Alexandros Taflanidis & Norberto Nadal-Caraballo & Jeffrey Melby & Andrew Kennedy & Jane Smith, 2016. "Surrogate modeling for peak or time-dependent storm surge prediction over an extended coastal region using an existing database of synthetic storms," Natural Hazards: Journal of the International Society for the Prevention and Mitigation of Natural Hazards, Springer;International Society for the Prevention and Mitigation of Natural Hazards, vol. 81(2), pages 909-938, March.
    12. Juan Carlos Chávez & Felipe J. Fonseca & Manuel Gómez-Zaldívar, 2017. "Resoluciones de disputas comerciales y desempeño económico regional en México. (Commercial Disputes Resolution and Regional Economic Performance in Mexico)," Ensayos Revista de Economia, Universidad Autonoma de Nuevo Leon, Facultad de Economia, vol. 0(1), pages 79-93, May.
    13. Chen, Ray-Bing & Chen, Ying & Härdle, Wolfgang K., 2014. "TVICA—Time varying independent component analysis and its application to financial data," Computational Statistics & Data Analysis, Elsevier, vol. 74(C), pages 95-109.
    14. Yan Yu Chen & Chun-Cheih Chao & Fu-Chen Liu & Po-Chen Hsu & Hsueh-Fen Chen & Shih-Chi Peng & Yung-Jen Chuang & Chung-Yu Lan & Wen-Ping Hsieh & David Shan Hill Wong, 2013. "Dynamic Transcript Profiling of Candida albicans Infection in Zebrafish: A Pathogen-Host Interaction Study," PLOS ONE, Public Library of Science, vol. 8(9), pages 1-16, September.
    15. Plat, Richard, 2009. "Stochastic portfolio specific mortality and the quantification of mortality basis risk," Insurance: Mathematics and Economics, Elsevier, vol. 45(1), pages 123-132, August.
    16. Kondylis, Athanassios & Whittaker, Joe, 2008. "Spectral preconditioning of Krylov spaces: Combining PLS and PC regression," Computational Statistics & Data Analysis, Elsevier, vol. 52(5), pages 2588-2603, January.
    17. Chen Tong & Peter Reinhard Hansen & Ilya Archakov, 2024. "Cluster GARCH," Papers 2406.06860, arXiv.org.
    18. Simplice A. Asongu & Nicholas M. Odhiambo, 2019. "Governance, capital flight and industrialisation in Africa," Journal of Economic Structures, Springer;Pan-Pacific Association of Input-Output Studies (PAPAIOS), vol. 8(1), pages 1-22, December.
    19. Wang, Zihan & Daeipour, Mohamad & Xu, Hongyi, 2023. "Quantification and propagation of Aleatoric uncertainties in topological structures," Reliability Engineering and System Safety, Elsevier, vol. 233(C).
    20. M. J. Aziakpono & S. Kleimeier & H. Sander, 2012. "Banking market integration in the SADC countries: evidence from interest rate analyses," Applied Economics, Taylor & Francis Journals, vol. 44(29), pages 3857-3876, October.
