
Assessing variable importance in clustering: a new method based on unsupervised binary decision trees

Author

Listed:
  • Ghattas Badih

    (I2M - Institut de Mathématiques de Marseille - AMU - Aix Marseille Université - ECM - École Centrale de Marseille - CNRS - Centre National de la Recherche Scientifique)

  • Michel Pierre

    (I2M - Institut de Mathématiques de Marseille - AMU - Aix Marseille Université - ECM - École Centrale de Marseille - CNRS - Centre National de la Recherche Scientifique, CEReSS - Centre d'études et de recherche sur les services de santé et la qualité de vie - AMU - Aix Marseille Université, AMSE - Aix-Marseille Sciences Economiques - EHESS - École des hautes études en sciences sociales - AMU - Aix Marseille Université - ECM - École Centrale de Marseille - CNRS - Centre National de la Recherche Scientifique)

  • Boyer Laurent

    (CEReSS - Centre d'études et de recherche sur les services de santé et la qualité de vie - AMU - Aix Marseille Université)

Abstract

We consider different approaches for assessing variable importance in clustering. We focus on clustering using binary decision trees (CUBT), a non-parametric top-down hierarchical clustering method designed for both continuous and nominal data. We suggest a measure of variable importance for this method similar to the one used in Breiman's classification and regression trees. This score is useful for ranking the variables in a dataset, determining which variables are the most important, and detecting irrelevant ones. We analyze both the stability and the efficiency of this score on different data simulation models in the presence of noise, and compare it to other classical variable importance measures. Our experiments show that variable importance based on CUBT is much more efficient than other approaches in a large variety of situations.
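
For intuition only, the following sketch (not the authors' CUBT implementation) illustrates the idea in Python: grow a top-down binary tree on unlabeled data by greedily maximizing a deviance reduction at each split, and credit each variable with the total deviance reduction of the splits it defines, in the spirit of CART variable importance. The sum-of-squares deviance, the fixed-depth stopping rule, and the helper names (deviance, best_split, grow) are simplifying assumptions made for this example.

    import numpy as np

    def deviance(X):
        # Within-node heterogeneity: total squared deviation from the node centroid.
        if len(X) == 0:
            return 0.0
        return float(((X - X.mean(axis=0)) ** 2).sum())

    def best_split(X):
        # Exhaustive search for the (variable, threshold) pair giving the largest
        # deviance reduction; a CART-like split criterion applied to unlabeled data.
        parent = deviance(X)
        best_j, best_t, best_gain = None, None, 0.0
        for j in range(X.shape[1]):
            for t in np.unique(X[:, j])[:-1]:
                left, right = X[X[:, j] <= t], X[X[:, j] > t]
                gain = parent - deviance(left) - deviance(right)
                if gain > best_gain:
                    best_j, best_t, best_gain = j, t, gain
        return best_j, best_t, best_gain

    def grow(X, importance, min_size=10, depth=0, max_depth=3):
        # Recursively split nodes, accumulating each splitting variable's
        # deviance reduction over the whole tree.
        if len(X) < 2 * min_size or depth >= max_depth:
            return
        j, t, gain = best_split(X)
        if j is None or gain <= 0:
            return
        importance[j] += gain
        grow(X[X[:, j] <= t], importance, min_size, depth + 1, max_depth)
        grow(X[X[:, j] > t], importance, min_size, depth + 1, max_depth)

    rng = np.random.default_rng(0)
    # Two well-separated clusters along variable 0; variables 1 and 2 are pure noise.
    X = np.vstack([rng.normal([0.0, 0.0, 0.0], 1.0, (100, 3)),
                   rng.normal([6.0, 0.0, 0.0], 1.0, (100, 3))])
    importance = np.zeros(X.shape[1])
    grow(X, importance)
    print("normalized importance:", importance / importance.sum())

On such data the normalized importance concentrates almost entirely on variable 0, which is the qualitative behaviour the paper's experiments examine under noisier and more varied simulation models.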

Suggested Citation

  • Ghattas Badih & Michel Pierre & Boyer Laurent, 2019. "Assessing variable importance in clustering: a new method based on unsupervised binary decision trees," Post-Print hal-02007388, HAL.
  • Handle: RePEc:hal:journl:hal-02007388
    DOI: 10.1007/s00180-018-0857-0
    Note: View the original document on HAL open archive server: https://amu.hal.science/hal-02007388

    Download full text from publisher

    File URL: https://amu.hal.science/hal-02007388/document
    Download Restriction: no

    File URL: https://libkey.io/10.1007/s00180-018-0857-0?utm_source=ideas
    LibKey link: if access is restricted and your library uses this service, LibKey will redirect you to a copy you can access through your library subscription


    References listed on IDEAS

    1. Ricardo Fraiman & Badih Ghattas & Marcela Svarc, 2013. "Interpretable clustering using unsupervised binary trees," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 7(2), pages 125-145, June.
    2. R. Darrell Bock, 1972. "Estimating item parameters and latent ability when responses are scored in two or more nominal categories," Psychometrika, Springer;The Psychometric Society, vol. 37(1), pages 29-51, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Luo, Nanyu & Ji, Feng & Han, Yuting & He, Jinbo & Zhang, Xiaoya, 2024. "Fitting item response theory models using deep learning computational frameworks," OSF Preprints tjxab, Center for Open Science.
    2. Michelle M. LaMar, 2018. "Markov Decision Process Measurement Model," Psychometrika, Springer;The Psychometric Society, vol. 83(1), pages 67-88, March.
    3. Bas Hemker & Klaas Sijtsma & Ivo Molenaar & Brian Junker, 1996. "Polytomous IRT models and monotone likelihood ratio of the total score," Psychometrika, Springer;The Psychometric Society, vol. 61(4), pages 679-693, December.
    4. Sijia Huang & Li Cai, 2024. "Cross-Classified Item Response Theory Modeling With an Application to Student Evaluation of Teaching," Journal of Educational and Behavioral Statistics, , vol. 49(3), pages 311-341, June.
    5. Björn Andersson & Tao Xin, 2021. "Estimation of Latent Regression Item Response Theory Models Using a Second-Order Laplace Approximation," Journal of Educational and Behavioral Statistics, , vol. 46(2), pages 244-265, April.
    6. Yang Liu & Jan Hannig & Abhishek Pal Majumder, 2019. "Second-Order Probability Matching Priors for the Person Parameter in Unidimensional IRT Models," Psychometrika, Springer;The Psychometric Society, vol. 84(3), pages 701-718, September.
    7. Hsiao, Cheng & Sun, Bao-Hong, 1998. "Modeling survey response bias - with an analysis of the demand for an advanced electronic device," Journal of Econometrics, Elsevier, vol. 89(1-2), pages 15-39, November.
    8. Golovkine, Steven & Klutchnikoff, Nicolas & Patilea, Valentin, 2022. "Clustering multivariate functional data using unsupervised binary trees," Computational Statistics & Data Analysis, Elsevier, vol. 168(C).
    9. Roderick McDonald, 1986. "Describing the elephant: Structure and function in multivariate data," Psychometrika, Springer;The Psychometric Society, vol. 51(4), pages 513-534, December.
    10. Jouni Kuha & Myrsini Katsikatsou & Irini Moustaki, 2018. "Latent variable modelling with non‐ignorable item non‐response: multigroup response propensity models for cross‐national analysis," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 181(4), pages 1169-1192, October.
    11. Laine Bradshaw & Jonathan Templin, 2014. "Combining Item Response Theory and Diagnostic Classification Models: A Psychometric Model for Scaling Ability and Diagnosing Misconceptions," Psychometrika, Springer;The Psychometric Society, vol. 79(3), pages 403-425, July.
    12. Javier Revuelta, 2004. "Analysis of distractor difficulty in multiple-choice items," Psychometrika, Springer;The Psychometric Society, vol. 69(2), pages 217-234, June.
    13. Peida Zhan & Wen-Chung Wang & Xiaomin Li, 2020. "A Partial Mastery, Higher-Order Latent Structural Model for Polytomous Attributes in Cognitive Diagnostic Assessments," Journal of Classification, Springer;The Classification Society, vol. 37(2), pages 328-351, July.
    14. Gerhard Tutz & Moritz Berger, 2016. "Response Styles in Rating Scales," Journal of Educational and Behavioral Statistics, , vol. 41(3), pages 239-268, June.
    15. Ulf Böckenholt, 2012. "The Cognitive-Miser Response Model: Testing for Intuitive and Deliberate Reasoning," Psychometrika, Springer;The Psychometric Society, vol. 77(2), pages 388-399, April.
    16. Albert Yu & Jeffrey A. Douglas, 2023. "IRT Models for Learning With Item-Specific Learning Parameters," Journal of Educational and Behavioral Statistics, , vol. 48(6), pages 866-888, December.
    17. Jochen Ranger & Kay Brauer, 2022. "On the Generalized S − X 2 –Test of Item Fit: Some Variants, Residuals, and a Graphical Visualization," Journal of Educational and Behavioral Statistics, , vol. 47(2), pages 202-230, April.
    18. Albert Maydeu-Olivares & Harry Joe, 2006. "Limited Information Goodness-of-fit Testing in Multidimensional Contingency Tables," Psychometrika, Springer;The Psychometric Society, vol. 71(4), pages 713-732, December.
    19. César Martinelli & Susan W. Parker & Ana Cristina Pérez-Gea & Rodimiro Rodrigo, 2018. "Cheating and Incentives: Learning from a Policy Experiment," American Economic Journal: Economic Policy, American Economic Association, vol. 10(1), pages 298-325, February.
    20. John Hsu & Tom Leonard & Kam-Wah Tsui, 1991. "Statistical inference for multiple choice tests," Psychometrika, Springer;The Psychometric Society, vol. 56(2), pages 327-348, June.

    More about this item

    Keywords

    Variables ranking; Unsupervised learning; CUBT; Deviance; Variable importance;

    NEP fields

    This paper has been announced in the following NEP Reports:

    Statistics


    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:hal:journl:hal-02007388. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows you to link your profile to this item and to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form.

    If you know of missing items citing this one, you can help us create those links by adding the relevant references in the same way as above, for each referring item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: CCSD. General contact details of provider: https://hal.archives-ouvertes.fr/.

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.