IDEAS home Printed from https://ideas.repec.org/p/grt/wpegrt/2010-02.html

Clustering of categorical variables around latent variables

Author

Listed:
  • Jérome SARACCO
  • Marie CHAVENT
  • Vanessa KUENTZ

Abstract

In the framework of clustering, the usual aim is to cluster observations rather than variables. However, the issue of variable clustering clearly arises in dimension reduction, variable selection, and some case studies (sensory analysis, biochemistry, marketing, etc.). Clustering of variables is then studied as a way to arrange variables into homogeneous clusters, thereby organizing data into meaningful structures. Once the variables are clustered into groups such that each variable is similar to the other variables in its cluster, a subset of variables can be selected. Several specific methods have been developed for the clustering of numerical variables. For categorical variables, however, far fewer methods have been proposed. In this paper we extend the criterion used by Vigneau and Qannari (2003) in their Clustering around Latent Variables approach for numerical variables to the case of categorical data. The homogeneity criterion of a cluster of categorical variables is defined as the sum of the correlation ratios between the categorical variables and a latent variable, which is in this case a numerical variable. We show that the latent variable maximizing the homogeneity of a cluster can be obtained with Multiple Correspondence Analysis. Different algorithms for the clustering of categorical variables are proposed: an iterative relocation algorithm, and ascending and divisive hierarchical clustering. The proposed methodology is illustrated by a real data application to the satisfaction of pleasure craft operators.
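The homogeneity criterion described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`correlation_ratio`, `homogeneity`, `latent_variable`) and the integer-coded data layout are assumptions made for illustration. The sketch computes the correlation ratio between a categorical variable and a numerical variable, sums it over a cluster, and obtains a maximizing latent variable as the first non-trivial component of a Multiple Correspondence Analysis, here computed through an SVD of the centered, scaled indicator matrix (one standard way to carry out MCA).

```python
import numpy as np

def correlation_ratio(codes, y):
    """Correlation ratio eta^2: between-category variance of y over its total variance."""
    codes = np.asarray(codes)
    y = np.asarray(y, dtype=float)
    total = ((y - y.mean()) ** 2).sum()
    between = sum(
        (codes == c).sum() * (y[codes == c].mean() - y.mean()) ** 2
        for c in np.unique(codes)
    )
    return between / total

def homogeneity(X, y):
    """Sum of correlation ratios between each categorical column of X and y."""
    return sum(correlation_ratio(X[:, j], y) for j in range(X.shape[1]))

def latent_variable(X):
    """First non-trivial MCA component of the categorical columns of X.

    Returns the latent variable (unit norm, zero mean) and the homogeneity
    it attains, which equals the squared leading singular value.
    """
    blocks = []
    for j in range(X.shape[1]):
        levels = np.unique(X[:, j])
        # Indicator (dummy) matrix of variable j, one column per observed category.
        G = (X[:, j][:, None] == levels[None, :]).astype(float)
        # Scale each column by 1/sqrt(category count).
        blocks.append(G / np.sqrt(G.sum(axis=0)))
    Z = np.hstack(blocks)
    # Centering the columns removes the trivial constant solution.
    Zc = Z - Z.mean(axis=0)
    U, s, _ = np.linalg.svd(Zc, full_matrices=False)
    return U[:, 0], s[0] ** 2

# Illustrative synthetic data: 60 observations, 4 variables with 3 categories each.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(60, 4))
y, h = latent_variable(X)
print("homogeneity of the cluster:", h)
```

Under these assumptions, `homogeneity(X, y)` for the returned latent variable matches the squared leading singular value, and no other numerical variable gives a larger sum of correlation ratios for the same cluster.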

Suggested Citation

  • Jérome SARACCO & Marie CHAVENT & Vanessa KUENTZ, 2010. "Clustering of categorical variables around latent variables," Cahiers du GREThA (2007-2019) 2010-02, Groupe de Recherche en Economie Théorique et Appliquée (GREThA).
  • Handle: RePEc:grt:wpegrt:2010-02

    Download full text from publisher

    File URL: http://cahiersdugretha.u-bordeaux.fr/2010/2010-02.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Valentina Stan & Gilbert Saporta, 2005. "Conjoint use of variables clustering and PLS structural equations modelling," Post-Print hal-01125056, HAL.
    2. Vichi, Maurizio & Kiers, Henk A. L., 2001. "Factorial k-means analysis for two-way data," Computational Statistics & Data Analysis, Elsevier, vol. 37(1), pages 49-64, July.
    3. Vichi, Maurizio & Saporta, Gilbert, 2009. "Clustering and disjoint principal component analysis," Computational Statistics & Data Analysis, Elsevier, vol. 53(8), pages 3194-3208, June.
    4. Michael Greenacre, 2008. "Correspondence analysis of raw data," Economics Working Papers 1112, Department of Economics and Business, Universitat Pompeu Fabra, revised Jul 2009.
    5. W. J. Krzanowski, 1987. "Selection of Variables to Preserve Multivariate Data Structure, Using Principal Components," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 36(1), pages 22-33, March.
    6. Plasse, Marie & Niang, Ndeye & Saporta, Gilbert & Villeminot, Alexandre & Leblond, Laurent, 2007. "Combined use of association rules mining and clustering methods to find relevant links between binary rare attributes in a large data set," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 596-613, September.
    7. I. T. Jolliffe, 1972. "Discarding Variables in a Principal Component Analysis. I: Artificial Data," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 21(2), pages 160-173, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Alfonso Iodice D’Enza & Francesco Palumbo, 2013. "Iterative factor clustering of binary data," Computational Statistics, Springer, vol. 28(2), pages 789-807, April.
    2. Pacheco, Joaquín & Casado, Silvia & Porras, Santiago, 2013. "Exact methods for variable selection in principal component analysis: Guide functions and pre-selection," Computational Statistics & Data Analysis, Elsevier, vol. 57(1), pages 95-111.
    3. Donatella Vicari & Paolo Giordani, 2023. "CPclus: Candecomp/Parafac Clustering Model for Three-Way Data," Journal of Classification, Springer;The Classification Society, vol. 40(2), pages 432-465, July.
    4. Cumming, J.A. & Wooff, D.A., 2007. "Dimension reduction via principal variables," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 550-565, September.
    5. Lazhar Labiod & Mohamed Nadif, 2021. "Efficient regularized spectral data embedding," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 15(1), pages 99-119, March.
    6. Vanessa Kuentz-Simonet & Amaury Labenne & Tina Rambonilaza, 2017. "Using ClustOfVar to Construct Quality of Life Indicators for Vulnerability Assessment Municipality Trajectories in Southwest France from 1999 to 2009," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 131(3), pages 973-997, April.
    7. Brusco, Michael J., 2014. "A comparison of simulated annealing algorithms for variable selection in principal component analysis and discriminant analysis," Computational Statistics & Data Analysis, Elsevier, vol. 77(C), pages 38-53.
    8. Cristina Tortora & Mireille Gettler Summa & Marina Marino & Francesco Palumbo, 2016. "Factor probabilistic distance clustering (FPDC): a new clustering method," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 10(4), pages 441-464, December.
    9. Jolliffe, Ian, 2022. "A 50-year personal journey through time with principal component analysis," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    10. António Pedro Duarte Silva, 2002. "Discarding Variables in a Principal Component Analysis: Algorithms for All-Subsets Comparisons," Computational Statistics, Springer, vol. 17(2), pages 251-271, July.
    11. José Fernando Romero Cañizares & Purificación Vicente Galindo & Yannis Phillis & Evangelos Grigoroudis, 2022. "Graphical sustainability analysis using disjoint biplots," Operational Research, Springer, vol. 22(2), pages 1575-1596, April.
    12. Michael Brusco & Renu Singh & Douglas Steinley, 2009. "Variable Neighborhood Search Heuristics for Selecting a Subset of Variables in Principal Component Analysis," Psychometrika, Springer;The Psychometric Society, vol. 74(4), pages 705-726, December.
    13. Heungsun Hwang & Hec Montréal & William Dillon & Yoshio Takane, 2006. "An Extension of Multiple Correspondence Analysis for Identifying Heterogeneous Subgroups of Respondents," Psychometrika, Springer;The Psychometric Society, vol. 71(1), pages 161-171, March.
    14. Hertrich Markus, 2019. "A Novel Housing Price Misalignment Indicator for Germany," German Economic Review, De Gruyter, vol. 20(4), pages 759-794, December.
    15. Eric Beh & Luigi D’Ambra, 2009. "Some Interpretative Tools for Non-Symmetrical Correspondence Analysis," Journal of Classification, Springer;The Classification Society, vol. 26(1), pages 55-76, April.
    16. Sonika Redhu & Pragati Jain, 2024. "Unveiling the nexus between water scarcity and socioeconomic development in the water-scarce countries," Environment, Development and Sustainability: A Multidisciplinary Approach to the Theory and Practice of Sustainable Development, Springer, vol. 26(8), pages 19557-19577, August.
    17. Pilar García Gómez & Ángel López Nicolás, 2005. "Socio-economic inequalities in health in Catalonia," Hacienda Pública Española / Review of Public Economics, IEF, vol. 175(4), pages 103-121, December.
    18. Michael Greenacre, 2011. "A Simple Permutation Test for Clusteredness," Working Papers 555, Barcelona School of Economics.
    19. David Bholat & Stephen Hans & Pedro Santos & Cheryl Schonhardt-Bailey, 2015. "Text mining for central banks," Handbooks, Centre for Central Banking Studies, Bank of England, number 33, April.
    20. Michael Greenacre, 2012. "Fuzzy coding in constrained ordinations," Economics Working Papers 1325, Department of Economics and Business, Universitat Pompeu Fabra.

    More about this item

    Keywords

    clustering of categorical variables; correlation ratio; iterative relocation algorithm; hierarchical clustering;

    JEL classification:

    • C49 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods: Special Topics - - - Other
    • C69 - Mathematical and Quantitative Methods - - Mathematical Methods; Programming Models; Mathematical and Simulation Modeling - - - Other

