
Classification tree algorithm for grouped variables

Authors

  • A. Poterie (Univ Rennes)
  • J.-F. Dupuy (Univ Rennes)
  • V. Monbet (Univ Rennes)
  • L. Rouvière (Univ Rennes)

Abstract

We consider the problem of predicting a categorical variable based on groups of inputs. Some methods have already been proposed to build classification rules based on groups of variables (e.g. group lasso for logistic regression). However, to our knowledge, no tree-based approach has been proposed to tackle this issue. Here, we propose the Tree Penalized Linear Discriminant Analysis algorithm (TPLDA), a new tree-based approach which constructs a classification rule based on groups of variables. TPLDA splits a node by repeatedly selecting a group and then applying a regularized linear discriminant analysis based on this group. This process is repeated until some stopping criterion is satisfied. A pruning strategy is proposed to select an optimal tree. Compared to the existing multivariate classification tree methods, the proposed method is computationally less demanding and the resulting trees are more easily interpretable. Furthermore, TPLDA automatically provides a measure of importance for each group of variables. This score makes it possible to rank groups of variables with respect to their ability to predict the response and can also be used to perform group variable selection. The performance of the proposed algorithm and its value in terms of prediction accuracy, interpretation and group variable selection are assessed and compared with alternative reference methods through simulations and applications to real datasets.
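
To make the node-splitting idea in the abstract concrete, here is a minimal Python sketch of one group-wise split. It is not the authors' implementation: the abstract does not specify the penalty or the split criterion, so the ridge-type shrinkage `lam`, the Gini impurity decrease, the restriction to two classes coded 0/1, and every function name below are assumptions made for illustration only.

```python
# Illustrative sketch: a ridge-shrunk two-class LDA is fitted on each group of
# columns, and the group whose LDA split gives the largest Gini decrease wins.
# ASSUMPTIONS (not from the paper): penalty form, Gini criterion, all names.

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][k] * x[k] for k in range(i + 1, n))) / m[i][i]
    return x

def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def lda_split(rows, y, cols, lam):
    """Fit shrunk LDA on columns `cols`; return (Gini decrease, direction, threshold).

    Assumes both classes are present in the node.
    """
    x = [[r[j] for j in cols] for r in rows]
    d = len(cols)
    means = []
    for cls in (0, 1):
        xs = [xi for xi, yi in zip(x, y) if yi == cls]
        means.append([sum(v[j] for v in xs) / len(xs) for j in range(d)])
    # Pooled within-class covariance plus lam * identity (the assumed penalty).
    s = [[lam if i == j else 0.0 for j in range(d)] for i in range(d)]
    for xi, yi in zip(x, y):
        dev = [xi[j] - means[yi][j] for j in range(d)]
        for i in range(d):
            for j in range(d):
                s[i][j] += dev[i] * dev[j] / len(x)
    w = solve(s, [means[1][j] - means[0][j] for j in range(d)])
    # Threshold at the projection of the midpoint between the two class means.
    t = sum(w[j] * (means[0][j] + means[1][j]) / 2.0 for j in range(d))
    right = [sum(wj * v for wj, v in zip(w, xi)) > t for xi in x]
    yl = [yi for yi, g in zip(y, right) if not g]
    yr = [yi for yi, g in zip(y, right) if g]
    decrease = gini(y) - (len(yl) * gini(yl) + len(yr) * gini(yr)) / len(y)
    return decrease, w, t

def best_group_split(rows, y, groups, lam=0.1):
    """Return (group name, Gini decrease, direction, threshold) of the best group."""
    scored = [(name,) + lda_split(rows, y, cols, lam) for name, cols in groups.items()]
    return max(scored, key=lambda s: s[1])

# Toy data: columns 0-1 form an informative group, columns 2-3 a noise group.
rows = [[0.0, 0.1, 1.0, 0.0], [0.2, 0.0, 0.0, 1.0], [0.1, 0.2, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0],
        [1.0, 0.9, 1.0, 0.0], [0.8, 1.0, 0.0, 1.0], [0.9, 0.8, 1.0, 1.0], [1.0, 1.0, 0.0, 0.0]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
groups = {"signal": [0, 1], "noise": [2, 3]}
name, decrease, w, t = best_group_split(rows, y, groups)
print(name, round(decrease, 3))  # prints: signal 0.5
```

A full tree would apply `best_group_split` recursively to the two child nodes until a stopping rule is met and then prune. Those steps, and the group importance score mentioned in the abstract, are left out here because the abstract does not define them.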

Suggested Citation

  • A. Poterie & J.-F. Dupuy & V. Monbet & L. Rouvière, 2019. "Classification tree algorithm for grouped variables," Computational Statistics, Springer, vol. 34(4), pages 1613-1648, December.
  • Handle: RePEc:spr:compst:v:34:y:2019:i:4:d:10.1007_s00180-019-00894-y
    DOI: 10.1007/s00180-019-00894-y

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s00180-019-00894-y
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    References listed on IDEAS

    1. Xu, Ping & Brock, Guy N. & Parrish, Rudolph S., 2009. "Modified linear discriminant analysis approaches for classification of high-dimensional microarray data," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1674-1687, March.
    2. Lukas Meier & Sara Van De Geer & Peter Bühlmann, 2008. "The group lasso for logistic regression," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 70(1), pages 53-71, February.
    3. Gregorutti, Baptiste & Michel, Bertrand & Saint-Pierre, Philippe, 2015. "Grouped variable importance with random forests and application to multiple functional data analysis," Computational Statistics & Data Analysis, Elsevier, vol. 90(C), pages 15-35.
    4. Dudoit S. & Fridlyand J. & Speed T. P, 2002. "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, American Statistical Association, vol. 97, pages 77-87, March.
    5. Wei-Yin Loh, 2014. "Fifty Years of Classification and Regression Trees," International Statistical Review, International Statistical Institute, vol. 82(3), pages 329-348, December.
    6. Wickramarachchi, D.C. & Robertson, B.L. & Reale, M. & Price, C.J. & Brown, J., 2016. "HHCART: An oblique decision tree," Computational Statistics & Data Analysis, Elsevier, vol. 96(C), pages 12-23.

    Citations

    Cited by:

    1. Hornung, Roman & Boulesteix, Anne-Laure, 2022. "Interaction forests: Identifying and exploiting interpretable quantitative and qualitative interaction effects," Computational Statistics & Data Analysis, Elsevier, vol. 171(C).
    2. Grzegorz Wałęga & Agnieszka Wałęga, 2021. "Over-indebted Households in Poland: Classification Tree Analysis," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 153(2), pages 561-584, January.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Bilin Zeng & Xuerong Meggie Wen & Lixing Zhu, 2017. "A link-free sparse group variable selection method for single-index model," Journal of Applied Statistics, Taylor & Francis Journals, vol. 44(13), pages 2388-2400, October.
    2. Brendan P. W. Ames & Mingyi Hong, 2016. "Alternating direction method of multipliers for penalized zero-variance discriminant analysis," Computational Optimization and Applications, Springer, vol. 64(3), pages 725-754, July.
    3. Irina Gaynanova & James G. Booth & Martin T. Wells, 2016. "Simultaneous Sparse Estimation of Canonical Vectors in the p ≫ N Setting," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(514), pages 696-706, April.
    4. Emilio Carrizosa & Cristina Molero-Río & Dolores Romero Morales, 2021. "Mathematical optimization in classification and regression trees," TOP: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(1), pages 5-33, April.
    5. Pedro Duarte Silva, A., 2011. "Two-group classification with high-dimensional correlated data: A factor model approach," Computational Statistics & Data Analysis, Elsevier, vol. 55(11), pages 2975-2990, November.
    6. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    7. Frénay, Benoît & Doquire, Gauthier & Verleysen, Michel, 2014. "Estimating mutual information for feature selection in the presence of label noise," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 832-848.
    8. Andreas Dellnitz & Andreas Kleine & Madjid Tavana, 2024. "An integrated data envelopment analysis and regression tree method for new product price estimation," OR Spectrum: Quantitative Approaches in Management, Springer;Gesellschaft für Operations Research e.V., vol. 46(4), pages 1189-1211, December.
    9. Ye, Ya-Fen & Shao, Yuan-Hai & Deng, Nai-Yang & Li, Chun-Na & Hua, Xiang-Yu, 2017. "Robust Lp-norm least squares support vector regression with feature selection," Applied Mathematics and Computation, Elsevier, vol. 305(C), pages 32-52.
    10. Kubokawa, Tatsuya & Srivastava, Muni S., 2008. "Estimation of the precision matrix of a singular Wishart distribution and its application in high-dimensional data," Journal of Multivariate Analysis, Elsevier, vol. 99(9), pages 1906-1928, October.
    11. Vincent, Martin & Hansen, Niels Richard, 2014. "Sparse group lasso and high dimensional multinomial classification," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 771-786.
    12. Parrish, Rudolph S. & Spencer III, Horace J. & Xu, Ping, 2009. "Distribution modeling and simulation of gene expression data," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1650-1660, March.
    13. Hossain, Ahmed & Beyene, Joseph & Willan, Andrew R. & Hu, Pingzhao, 2009. "A flexible approximate likelihood ratio test for detecting differential expression in microarray data," Computational Statistics & Data Analysis, Elsevier, vol. 53(10), pages 3685-3695, August.
    14. Luca Scrucca, 2014. "Graphical tools for model-based mixture discriminant analysis," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 8(2), pages 147-165, June.
    15. Ariana Chang & Tian‐Shyug Lee & Hsiu‐Mei Lee, 2024. "Applying sustainable development goals in financial forecasting using machine learning techniques," Corporate Social Responsibility and Environmental Management, John Wiley & Sons, vol. 31(3), pages 2277-2289, May.
    16. Croux, Christophe & Jagtiani, Julapa & Korivi, Tarunsai & Vulanovic, Milos, 2020. "Important factors determining Fintech loan default: Evidence from a lendingclub consumer platform," Journal of Economic Behavior & Organization, Elsevier, vol. 173(C), pages 270-296.
    17. Caner, Mehmet, 2023. "Generalized linear models with structured sparsity estimators," Journal of Econometrics, Elsevier, vol. 236(2).
    18. repec:jss:jstsof:33:i01 is not listed on IDEAS
    19. J. Burez & D. Van Den Poel, 2005. "CRM at a Pay-TV Company: Using Analytical Models to Reduce Customer Attrition by Targeted Marketing for Subscription Services," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 05/348, Ghent University, Faculty of Economics and Business Administration.
    20. Won, Joong-Ho & Lim, Johan & Yu, Donghyeon & Kim, Byung Soo & Kim, Kyunga, 2014. "Monotone false discovery rate," Statistics & Probability Letters, Elsevier, vol. 87(C), pages 86-93.
    21. Budczies, Jan & Kosztyla, Daniel & von Törne, Christian & Stenzinger, Albrecht & Darb-Esfahani, Silvia & Dietel, Manfred & Denkert, Carsten, 2014. "cancerclass: An R Package for Development and Validation of Diagnostic Tests from High-Dimensional Molecular Data," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 59(i01).


    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.