
An empirical comparison of two approaches for CDPCA in high-dimensional data

Author

Listed:
  • Adelaide Freitas

    (University of Aveiro)

  • Eloísa Macedo

    (University of Aveiro)

  • Maurizio Vichi

    (University “La Sapienza”)

Abstract

Modified principal component analysis techniques, especially those yielding sparse solutions, are attractive because they are easier to interpret, particularly in high-dimensional data sets. Clustering and disjoint principal component analysis (CDPCA) is a constrained PCA that promotes sparsity in the loadings matrix. Specifically, CDPCA seeks to describe the data in terms of disjoint (and possibly sparse) components while simultaneously identifying clusters of objects. Using simulated and real gene expression data sets in which the number of variables exceeds the number of objects, we empirically compare the performance of two heuristic iterative procedures proposed in the specialized literature to perform CDPCA, namely the ALS and two-step-SDP algorithms. To avoid possible effects of differing variances among the original variables, all data were standardized. Although both procedures perform well, numerical tests highlight two main features that distinguish the two-step-SDP algorithm: it is faster than ALS and, since it applies a clustering procedure (k-means) to the variables, it outperforms ALS in recovering the true variable partition underlying the generated data sets. Overall, both procedures produce satisfactory results in terms of solution precision, where ALS performs better, and in recovering the true object clusters, where two-step-SDP outperforms ALS for data sets with smaller sample sizes and greater structural complexity (i.e., a higher error level in the CDPCA model). The proportion of variance explained by the components estimated by both algorithms is affected by the complexity of the data structure (the higher the error level, the lower the explained variance) and is similar for the two algorithms, except for data sets with two object clusters, where the two-step-SDP approach yields higher explained variance. Moreover, experimental tests suggest that the two-step-SDP approach is, in general, better able to recover the true number of object clusters, while the ALS algorithm yields higher-quality object clustering, with more homogeneous, compact and well-separated clusters in the reduced space of the CDPCA components.
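
To make the idea of disjoint components concrete, the following is a minimal sketch in Python. It is not the ALS or the two-step-SDP algorithm compared in the article; it is a simplified heuristic written only for illustration, assuming NumPy is available and that no variable is constant. The function names `kmeans` and `disjoint_components` and the simulated data are hypothetical. The sketch standardizes the data, partitions the variables with k-means, builds a loadings matrix in which each variable loads on exactly one component, and then clusters the objects in the reduced space.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one integer label per row of `points`."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dist = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = points[labels == c].mean(axis=0)
    return labels


def disjoint_components(X, n_components, n_clusters, seed=0):
    """Illustrative heuristic (an assumption, not the article's method)."""
    # Standardize each variable, as the abstract says was done for all data.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Partition the variables: each column of Z is treated as a point.
    var_labels = kmeans(Z.T, n_components, seed=seed)
    # Build a loadings matrix with at most one nonzero entry per row.
    A = np.zeros((Z.shape[1], n_components))
    for q in range(n_components):
        idx = np.flatnonzero(var_labels == q)
        if idx.size == 0:
            continue  # an empty variable group leaves a zero column
        # First right singular vector = first PC loadings of this block.
        _, _, vt = np.linalg.svd(Z[:, idx], full_matrices=False)
        A[idx, q] = vt[0]
    # Component scores, then k-means on the objects in the reduced space.
    scores = Z @ A
    obj_labels = kmeans(scores, n_clusters, seed=seed)
    return A, var_labels, obj_labels


# Simulated example with more variables (60) than objects (20).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 60))
A, var_labels, obj_labels = disjoint_components(X, n_components=3, n_clusters=2)
```

Because the columns of `A` have non-overlapping supports, `A.T @ A` is diagonal, which is the sense in which the components are disjoint. The procedures studied in the article optimize the variable partition, the loadings and the object clusters jointly, which this one-pass sketch does not attempt.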

Suggested Citation

  • Adelaide Freitas & Eloísa Macedo & Maurizio Vichi, 2021. "An empirical comparison of two approaches for CDPCA in high-dimensional data," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(3), pages 1007-1031, September.
  • Handle: RePEc:spr:stmapp:v:30:y:2021:i:3:d:10.1007_s10260-020-00546-2
    DOI: 10.1007/s10260-020-00546-2

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10260-020-00546-2
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Carlo Cavicchia & Maurizio Vichi & Giorgia Zaccaria, 2023. "Hierarchical disjoint principal component analysis," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 107(3), pages 537-574, September.
    2. Nickolay Trendafilov, 2014. "From simple structure to sparse components: a review," Computational Statistics, Springer, vol. 29(3), pages 431-454, June.
    3. Naoto Yamashita, 2023. "Principal component analysis constrained by layered simple structures," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(2), pages 347-367, June.
    4. José Fernando Romero Cañizares & Purificación Vicente Galindo & Yannis Phillis & Evangelos Grigoroudis, 2022. "Graphical sustainability analysis using disjoint biplots," Operational Research, Springer, vol. 22(2), pages 1575-1596, April.
    5. Bolívar, Fernando & Duran, Miguel A. & Lozano-Vivas, Ana, 2023. "Bank business models, size, and profitability," Finance Research Letters, Elsevier, vol. 53(C).
    6. Kim, Hyun Hak & Swanson, Norman R., 2018. "Mining big data using parsimonious factor, machine learning, variable selection and shrinkage methods," International Journal of Forecasting, Elsevier, vol. 34(2), pages 339-354.
    7. Alfonso Iodice D’Enza & Francesco Palumbo, 2013. "Iterative factor clustering of binary data," Computational Statistics, Springer, vol. 28(2), pages 789-807, April.
    8. Roopam Shukla & Ankit Agarwal & Kamna Sachdeva & Juergen Kurths & P. K. Joshi, 2019. "Climate change perception: an analysis of climate change and risk perceptions among farmer types of Indian Western Himalayas," Climatic Change, Springer, vol. 152(1), pages 103-119, January.
    9. Yannis Yatracos, 2013. "Detecting Clusters in the Data from Variance Decompositions of Its Projections," Journal of Classification, Springer;The Classification Society, vol. 30(1), pages 30-55, April.
    10. Saemi Shin & Won Suck Yoon & Sang-Hoon Byeon, 2022. "Trends in Occupational Infectious Diseases in South Korea and Classification of Industries According to the Risk of Biological Hazards Using K-Means Clustering," IJERPH, MDPI, vol. 19(19), pages 1-19, September.
    11. Blasius, J. & Greenacre, M. & Groenen, P.J.F. & van de Velden, M., 2009. "Special issue on correspondence analysis and related methods," Computational Statistics & Data Analysis, Elsevier, vol. 53(8), pages 3103-3106, June.
    12. Jihane El Ouadi & Hanae Errousso & Nicolas Malhene & Siham Benhadou & Hicham Medromi, 2022. "A machine-learning based hybrid algorithm for strategic location of urban bundling hubs to support shared public transport," Quality & Quantity: International Journal of Methodology, Springer, vol. 56(5), pages 3215-3258, October.
    13. Kreitmair, Ursula & Bower-Bir, Jacob, 2021. "Too different to solve climate change? Experimental evidence on the effects of production and benefit heterogeneity on collective action," Ecological Economics, Elsevier, vol. 184(C).
    14. Getaneh Addis Tessema & Jan van der Borg & Anton Van Rompaey & Steven Van Passel & Enyew Adgo & Amare Sewnet Minale & Kerebih Asrese & Amaury Frankl & Jean Poesen, 2022. "Benefit Segmentation of Tourists to Geosites and Its Implications for Sustainable Development of Geotourism in the Southern Lake Tana Region, Ethiopia," Sustainability, MDPI, vol. 14(6), pages 1-25, March.
    15. Wu, Tong & Rocha, Juan C. & Berry, Kevin & Chaigneau, Tomas & Hamann, Maike & Lindkvist, Emilie & Qiu, Jiangxiao & Schill, Caroline & Shepon, Alon & Crépin, Anne-Sophie & Folke, Carl, 2024. "Triple Bottom Line or Trilemma? Global Tradeoffs Between Prosperity, Inequality, and the Environment," World Development, Elsevier, vol. 178(C).
    16. Turati, Pietro & Pedroni, Nicola & Zio, Enrico, 2017. "Simulation-based exploration of high-dimensional system models for identifying unexpected events," Reliability Engineering and System Safety, Elsevier, vol. 165(C), pages 317-330.
    17. Ben Beck & Meghan Winters & Trisalyn Nelson & Chris Pettit & Simone Z Leao & Meead Saberi & Jason Thompson & Sachith Seneviratne & Kerry Nice & Mark Stevenson, 2023. "Developing urban biking typologies: Quantifying the complex interactions of bicycle ridership, bicycle network and built environment characteristics," Environment and Planning B, vol. 50(1), pages 7-23, January.
    18. Maurizio Vichi, 2017. "Disjoint factor analysis with cross-loadings," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 11(3), pages 563-591, September.
    19. Haytham Mohamed Salem & Linda R. Schott & Julia Piaskowski & Asmita Chapagain & Jenifer L. Yost & Erin Brooks & Kendall Kahl & Jodi Johnson-Maynard, 2024. "Evaluating Intra-Field Spatial Variability for Nutrient Management Zone Delineation through Geospatial Techniques and Multivariate Analysis," Sustainability, MDPI, vol. 16(2), pages 1-23, January.
    20. Raquel Lourenço Carvalhal Monteiro & Valdecy Pereira & Helder Gomes Costa, 2019. "Analysis of the Better Life Index Trough a Cluster Algorithm," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 142(2), pages 477-506, April.
