
C443: a Methodology to See a Forest for the Trees

Author

  • Aniek Sies (KU Leuven)
  • Iven Mechelen (KU Leuven)

Abstract

Often tree-based accounts of statistical learning problems yield multiple decision trees which together constitute a forest. Reasons for this include examining tree instability, improving prediction accuracy, accounting for missingness in the data, and taking into account multiple outcome variables. A key disadvantage of forests, unlike individual decision trees, is their lack of transparency. Hence, an obvious challenge is whether it is possible to recover some of the insightfulness of individual trees from a forest. In this paper, we will propose a conceptual framework and methodology to do so by reducing forests into one or a small number of summary trees, which may be used to gain insight into the central tendency as well as the heterogeneity of the forest. This is done by clustering the trees in the forest based on similarities between them. By means of simulated data, we will demonstrate how and why different similarity types in the proposed methodology may lead to markedly different conclusions, and explain when and why certain approaches may be recommended over other ones. We will finally illustrate the methodology with an empirical data set on the prediction of cocaine use on the basis of personality characteristics.
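The core idea in the abstract (cluster the trees of a forest by their similarity, then read off one summary tree per cluster) can be sketched in a few lines. This is a toy illustration only, not the authors' C443 implementation: the one-split "stump" representation, both similarity functions, and the exhaustive medoid search are assumptions chosen to keep the example small. The two predictors are made identical in the data so that the two similarity types disagree about which trees belong together.

```python
from itertools import combinations

def make_stump(var, threshold):
    """Hypothetical one-split tree: predicts 1 when x[var] > threshold."""
    return {"vars": {var}, "predict": lambda x: int(x[var] > threshold)}

def prediction_similarity(t1, t2, data):
    """Share of observations on which the two trees predict the same class."""
    agree = sum(t1["predict"](x) == t2["predict"](x) for x in data)
    return agree / len(data)

def variable_similarity(t1, t2, data=None):
    """Jaccard overlap of the sets of split variables (ignores the data)."""
    return len(t1["vars"] & t2["vars"]) / len(t1["vars"] | t2["vars"])

def summarize_forest(forest, similarity, data, k):
    """Pick k medoid trees by exhaustive search (feasible for tiny forests only)."""
    n = len(forest)
    sim = [[similarity(a, b, data) for b in forest] for a in forest]
    best_score, best_medoids = -1.0, None
    for medoids in combinations(range(n), k):
        # Score a candidate set by each tree's similarity to its closest medoid.
        score = sum(max(sim[i][m] for m in medoids) for i in range(n))
        if score > best_score:
            best_score, best_medoids = score, medoids
    # Assign every tree to its most similar medoid (first medoid wins ties).
    labels = [max(range(k), key=lambda c: sim[i][best_medoids[c]]) for i in range(n)]
    return list(best_medoids), labels

# Assumed toy data: x1 and x2 are perfectly correlated.
data = [{"x1": (i + 0.5) / 10, "x2": (i + 0.5) / 10} for i in range(10)]
forest = [make_stump("x1", 0.5), make_stump("x2", 0.5),
          make_stump("x1", 0.2), make_stump("x2", 0.2)]

_, labels_by_prediction = summarize_forest(forest, prediction_similarity, data, k=2)
_, labels_by_variables = summarize_forest(forest, variable_similarity, data, k=2)
print(labels_by_prediction)  # [0, 0, 1, 1]: grouped by threshold
print(labels_by_variables)   # [0, 1, 0, 1]: grouped by split variable
```

With prediction agreement, trees that split at the same threshold end up together regardless of which variable they use. With variable overlap, the grouping follows the split variable instead. The medoid of each cluster plays the role of a summary tree in this sketch.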

Suggested Citation

  • Aniek Sies & Iven Mechelen, 2020. "C443: a Methodology to See a Forest for the Trees," Journal of Classification, Springer;The Classification Society, vol. 37(3), pages 730-753, October.
  • Handle: RePEc:spr:jclass:v:37:y:2020:i:3:d:10.1007_s00357-019-09350-4
    DOI: 10.1007/s00357-019-09350-4

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s00357-019-09350-4
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Glenn Milligan & Martha Cooper, 1985. "An examination of procedures for determining the number of clusters in a data set," Psychometrika, Springer;The Psychometric Society, vol. 50(2), pages 159-179, June.
    2. Miglio, Rossella & Soffritti, Gabriele, 2004. "The comparison between classification trees through proximity measures," Computational Statistics & Data Analysis, Elsevier, vol. 45(3), pages 577-593, April.
    3. Briand, Bénédicte & Ducharme, Gilles R. & Parache, Vanessa & Mercat-Rommens, Catherine, 2009. "A similarity measure to assess the stability of classification trees," Computational Statistics & Data Analysis, Elsevier, vol. 53(4), pages 1208-1217, February.

    Citations


    Cited by:

    1. Tiffany Elsten & Mark Rooij, 2022. "SUBiNN: a stacked uni- and bivariate kNN sparse ensemble," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 16(4), pages 847-874, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Liu, Pei-chen Barry & Hansen, Mark & Mukherjee, Avijit, 2008. "Scenario-based air traffic flow management: From theory to practice," Transportation Research Part B: Methodological, Elsevier, vol. 42(7-8), pages 685-702, August.
    2. Li, Pai-Ling & Chiou, Jeng-Min, 2011. "Identifying cluster number for subspace projected functional data clustering," Computational Statistics & Data Analysis, Elsevier, vol. 55(6), pages 2090-2103, June.
    3. Alessandra Cepparulo & Antonello Zanfei, 2019. "The diffusion of public eServices in European cities," Working Papers 1904, University of Urbino Carlo Bo, Department of Economics, Society & Politics - Scientific Committee - L. Stefanini & G. Travaglini, revised 2019.
    4. Noelia Caceres & Luis M. Romero & Francisco J. Morales & Antonio Reyes & Francisco G. Benitez, 2018. "Estimating traffic volumes on intercity road locations using roadway attributes, socioeconomic features and other work-related activity characteristics," Transportation, Springer, vol. 45(5), pages 1449-1473, September.
    5. Michele Cincera, 2005. "Firms' productivity growth and R&D spillovers: An analysis of alternative technological proximity measures," Economics of Innovation and New Technology, Taylor & Francis Journals, vol. 14(8), pages 657-682.
    6. Douglas L. Steinley & M. J. Brusco, 2019. "Using an Iterative Reallocation Partitioning Algorithm to Verify Test Multidimensionality," Journal of Classification, Springer;The Classification Society, vol. 36(3), pages 397-413, October.
    7. Javier Sevil-Serrano & Alberto Aibar-Solana & Ángel Abós & José Antonio Julián & Luis García-González, 2019. "Healthy or Unhealthy? The Cocktail of Health-Related Behavior Profiles in Spanish Adolescents," IJERPH, MDPI, vol. 16(17), pages 1-14, August.
    8. Jacques-Antoine Gauthier & Eric D. Widmer & Philipp Bucher & Cédric Notredame, 2009. "How Much Does It Cost?," Sociological Methods & Research, vol. 38(1), pages 197-231, August.
    9. Jack DeWaard & Keuntae Kim & James Raymer, 2012. "Migration Systems in Europe: Evidence From Harmonized Flow Data," Demography, Springer;Population Association of America (PAA), vol. 49(4), pages 1307-1333, November.
    10. Vicente Rodríguez Montequín & Joaquín Villanueva Balsera & Sonia María Cousillas Fernández & Francisco Ortega Fernández, 2018. "Exploring Project Complexity through Project Failure Factors: Analysis of Cluster Patterns Using Self-Organizing Maps," Complexity, Hindawi, vol. 2018, pages 1-17, May.
    11. Goethner, Maximilian & Hornuf, Lars & Regner, Tobias, 2021. "Protecting investors in equity crowdfunding: An empirical analysis of the small investor protection act," Technological Forecasting and Social Change, Elsevier, vol. 162(C).
    12. Maria Lo Bue & Stephan Klasen, 2013. "Identifying Synergies and Complementarities Between MDGs: Results from Cluster Analysis," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 113(2), pages 647-670, September.
    13. Valérie Barraud-Didier & Marie-Christine Henninger & Pierre Triboulet, 2014. "La Participation des Adhérents Dans Leurs Coopératives Agricoles: Une Étude Exploratoire du Secteur Céréalier Français," Canadian Journal of Agricultural Economics/Revue canadienne d'agroeconomie, Canadian Agricultural Economics Society/Societe canadienne d'agroeconomie, vol. 62(1), pages 125-148, March.
    14. Latruffe, Laure & Dupuy, Aurelia & Desjeux, Yann, 2012. "What would farmers’ strategies be in a no-CAP situation? An illustration to France," 86th Annual Conference, April 16-18, 2012, Warwick University, Coventry, UK 134989, Agricultural Economics Society.
    15. J. Fernando Vera & Rodrigo Macías, 2021. "On the Behaviour of K-Means Clustering of a Dissimilarity Matrix by Means of Full Multidimensional Scaling," Psychometrika, Springer;The Psychometric Society, vol. 86(2), pages 489-513, June.
    16. Alessio Ishizaka & Philippe Nemery, 2013. "A Multi-Criteria Group Decision Framework for Partner Grouping When Sharing Facilities," Group Decision and Negotiation, Springer, vol. 22(4), pages 773-799, July.
    17. Corey Ducharme & Bruno Agard & Martin Trépanier, 2024. "Improving demand forecasting for customers with missing downstream data in intermittent demand supply chains with supervised multivariate clustering," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 43(5), pages 1661-1681, August.
    18. Victor Kaufman & Anthony Rodriguez & Lisa C. Walsh & Edward Shafranske & Shelly P. Harrell, 2022. "Unique Ways in Which the Quality of Friendships Matter for Life Satisfaction," Journal of Happiness Studies, Springer, vol. 23(6), pages 2563-2580, August.
    19. Henner Gimpel & Daniel Rau & Maximilian Röglinger, 2018. "Understanding FinTech start-ups – a taxonomy of consumer-oriented service offerings," Electronic Markets, Springer;IIM University of St. Gallen, vol. 28(3), pages 245-264, August.
    20. Parnes, Dror & Gormus, Alper, 2024. "Prescreening bank failures with K-means clustering: Pros and cons," International Review of Financial Analysis, Elsevier, vol. 93(C).
