Printed from https://ideas.repec.org/p/inn/wpaper/2011-20.html

evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R

Authors

  • Thomas Grubinger
  • Achim Zeileis
  • Karl-Peter Pfeiffer

Abstract

Commonly used classification and regression tree methods like the CART algorithm are recursive partitioning methods that build the model in a forward stepwise search. Although this approach is known to be an efficient heuristic, the results of recursive tree methods are only locally optimal, as splits are chosen to maximize homogeneity at the next step only. An alternative way to search over the parameter space of trees is to use global optimization methods like evolutionary algorithms. This paper describes the "evtree" package, which implements an evolutionary algorithm for learning globally optimal classification and regression trees in R. Computationally intensive tasks are fully computed in C++ while the "partykit" (Hothorn and Zeileis 2011) package is leveraged for representing the resulting trees in R, providing unified infrastructure for summaries, visualizations, and predictions. "evtree" is compared to "rpart" (Therneau and Atkinson 1997), the open-source CART implementation, and conditional inference trees ("ctree", Hothorn, Hornik, and Zeileis 2006). The usefulness of "evtree" is illustrated in a textbook customer classification task and a benchmark study of predictive accuracy in which "evtree" achieved at least similar and most of the time better results compared to the recursive algorithms "rpart" and "ctree".
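
The contrast between a forward stepwise search and a global search can be illustrated with a small, self-contained Python sketch. It is not the "evtree" implementation (which, per the abstract, does its heavy computation in C++ behind an R interface). The XOR-style toy data, the fixed depth-2 tree encoding, and the helper names (`tree_errors`, `greedy_tree`, `evolve`) are all assumptions made for illustration. The evolutionary loop uses one mutation operator and keeps a child only if it is no worse than its parent, which is a simplification and not the paper's set of operators.

```python
import random

# Toy data (an assumption): a 4x4 grid whose label is the XOR of two half-planes.
X = [(a, b) for a in range(4) for b in range(4)]
y = [int((a >= 2) != (b >= 2)) for a, b in X]
# Candidate split rules (feature, threshold): x[feature] < threshold goes left.
SPLITS = [(f, t) for f in (0, 1) for t in (1, 2, 3)]


def leaf_errors(idx):
    """Misclassifications when a leaf predicts its majority class."""
    ones = sum(y[i] for i in idx)
    return min(ones, len(idx) - ones)


def partition(idx, split):
    """Divide the observation indices into left and right children."""
    f, t = split
    left = [i for i in idx if X[i][f] < t]
    right = [i for i in idx if X[i][f] >= t]
    return left, right


def tree_errors(tree):
    """Training errors of a depth-2 tree given as (root, left, right) splits."""
    root, lsplit, rsplit = tree
    left, right = partition(range(len(X)), root)
    return sum(leaf_errors(part)
               for idx, s in ((left, lsplit), (right, rsplit))
               for part in partition(idx, s))


def best_split(idx):
    """Greedy step: minimise errors one level ahead (first candidate wins ties)."""
    return min(SPLITS, key=lambda s: sum(leaf_errors(p) for p in partition(idx, s)))


def greedy_tree():
    """Forward stepwise search: fix the root first, then each child in turn."""
    root = best_split(range(len(X)))
    left, right = partition(range(len(X)), root)
    return (root, best_split(left), best_split(right))


def evolve(generations=200, pop_size=20, seed=0):
    """Global search: mutate whole trees and score them on total training error."""
    rng = random.Random(seed)
    pop = [tuple(rng.choice(SPLITS) for _ in range(3)) for _ in range(pop_size)]
    history = [min(map(tree_errors, pop))]
    for _ in range(generations):
        for k, parent in enumerate(pop):
            child = list(parent)
            child[rng.randrange(3)] = rng.choice(SPLITS)  # re-draw one split rule
            child = tuple(child)
            if tree_errors(child) <= tree_errors(parent):  # child replaces parent
                pop[k] = child
        history.append(min(map(tree_errors, pop)))
    return min(pop, key=tree_errors), history


print("greedy errors:", tree_errors(greedy_tree()))
best, history = evolve()
print("evolved errors:", history[-1], "tree:", best)
```

On this data every candidate root split leaves both children evenly mixed, so the greedy search has no signal at the first step and, with first-wins tie-breaking, ends at 4 training errors, whereas the tree `((0, 2), (1, 2), (1, 2))` classifies all 16 points correctly. The search space here holds only 216 trees, so the evolutionary loop will usually find a zero-error tree, though a given seed is not guaranteed to do so.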

Suggested Citation

  • Thomas Grubinger & Achim Zeileis & Karl-Peter Pfeiffer, 2011. "evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R," Working Papers 2011-20, Faculty of Economics and Statistics, Universität Innsbruck.
  • Handle: RePEc:inn:wpaper:2011-20

    Download full text from publisher

    File URL: https://www2.uibk.ac.at/downloads/c4041030/wpaper/2011-20.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Kurt Hornik & Christian Buchta & Achim Zeileis, 2009. "Open-source machine learning: R meets Weka," Computational Statistics, Springer, vol. 24(2), pages 225-232, May.
    2. Torsten Hothorn & Achim Zeileis, 2014. "partykit: A Modular Toolkit for Recursive Partytioning in R," Working Papers 2014-10, Faculty of Economics and Statistics, Universität Innsbruck.
    3. Scrucca, Luca, 2013. "GA: A Package for Genetic Algorithms in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 53(i04).
    4. Calcagno, Vincent & de Mazancourt, Claire, 2010. "glmulti: An R Package for Easy Automated Model Selection with (Generalized) Linear Models," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 34(i12).
    5. Karatzoglou, Alexandros & Smola, Alexandros & Hornik, Kurt & Zeileis, Achim, 2004. "kernlab - An S4 Package for Kernel Methods in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 11(i09).

    Citations

    Cited by:

    1. Susan Athey & Stefan Wager, 2021. "Policy Learning With Observational Data," Econometrica, Econometric Society, vol. 89(1), pages 133-161, January.
    2. Max Tabord-Meehan, 2023. "Stratification Trees for Adaptive Randomisation in Randomised Controlled Trials," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 90(5), pages 2646-2673.
    3. Meryem Pulat & İpek Deveci Kocakoç, 2024. "Classification with machine learning algorithms after hybrid feature selection in imbalanced data sets," Operations Research and Decisions, Wroclaw University of Science and Technology, Faculty of Management, vol. 34(4), pages 157-183.
    4. Emmanuel Jordy Menvouta & Jolien Ponnet & Robin Van Oirbeek & Tim Verdonck, 2022. "mCube: Multinomial Micro-level reserving Model," Papers 2212.00101, arXiv.org.
    5. Yagli, Gokhan Mert & Yang, Dazhi & Srinivasan, Dipti, 2019. "Automatic hourly solar forecasting using machine learning models," Renewable and Sustainable Energy Reviews, Elsevier, vol. 105(C), pages 487-498.
    6. Alvarez-Iglesias, Alberto & Hinde, John & Ferguson, John & Newell, John, 2017. "An alternative pruning based approach to unbiased recursive partitioning," Computational Statistics & Data Analysis, Elsevier, vol. 106(C), pages 90-102.
    7. Fernandez Martinez, Roberto & Lostado Lorza, Ruben & Santos Delgado, Ana Alexandra & Piedra, Nelson, 2021. "Use of classification trees and rule-based models to optimize the funding assignment to research projects: A case study of UTPL," Journal of Informetrics, Elsevier, vol. 15(1).
    8. Höppner, Sebastiaan & Stripling, Eugen & Baesens, Bart & Broucke, Seppe vanden & Verdonck, Tim, 2020. "Profit driven decision trees for churn prediction," European Journal of Operational Research, Elsevier, vol. 284(3), pages 920-933.
    9. Anja Breuer & Yves Staudt, 2022. "Equalization Reserves for Reinsurance and Non-Life Undertakings in Switzerland," Risks, MDPI, vol. 10(3), pages 1-41, March.
    10. Davide Natalini & Giangiacomo Bravo & Aled Wynne Jones, 2019. "Global food security and food riots – an agent-based modelling approach," Food Security: The Science, Sociology and Economics of Food Production and Access to Food, Springer;The International Society for Plant Pathology, vol. 11(5), pages 1153-1173, October.
    11. Chi-Chang Chang & Tse-Hung Huang & Pei-Wei Shueng & Ssu-Han Chen & Chun-Chia Chen & Chi-Jie Lu & Yi-Ju Tseng, 2021. "Developing a Stacked Ensemble-Based Classification Scheme to Predict Second Primary Cancers in Head and Neck Cancer Survivors," IJERPH, MDPI, vol. 18(23), pages 1-10, November.
    12. Emilio Carrizosa & Cristina Molero-Río & Dolores Romero Morales, 2021. "Mathematical optimization in classification and regression trees," TOP: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(1), pages 5-33, April.
    13. Yves Staudt & Joël Wagner, 2021. "Assessing the Performance of Random Forests for Modeling Claim Severity in Collision Car Insurance," Risks, MDPI, vol. 9(3), pages 1-28, March.
    14. Vrigazova Borislava, 2021. "The Proportion for Splitting Data into Training and Test Set for the Bootstrap in Classification Problems," Business Systems Research, Sciendo, vol. 12(1), pages 228-242, May.
    15. Islam, Towhidul & Meade, Nigel & Carson, Richard T. & Louviere, Jordan J. & Wang, Juan, 2022. "The usefulness of socio-demographic variables in predicting purchase decisions: Evidence from machine learning procedures," Journal of Business Research, Elsevier, vol. 151(C), pages 324-338.
    16. Ronilo Ragodos & Tong Wang, 2022. "Disjunctive Rule Lists," INFORMS Journal on Computing, INFORMS, vol. 34(6), pages 3259-3276, November.
    17. Claudio Conversano & Elise Dusseldorp, 2017. "Modeling Threshold Interaction Effects Through the Logistic Classification Trunk," Journal of Classification, Springer;The Classification Society, vol. 34(3), pages 399-426, October.
    18. Federico Divina & Miguel García Torres & Francisco A. Goméz Vela & José Luis Vázquez Noguera, 2019. "A Comparative Study of Time Series Forecasting Methods for Short Term Electric Energy Consumption Prediction in Smart Buildings," Energies, MDPI, vol. 12(10), pages 1-23, May.
    19. Patrick Rehill, 2024. "Distilling interpretable causal trees from causal forests," Papers 2408.01023, arXiv.org.
    20. Patrick Rehill & Nicholas Biddle, 2022. "Policy learning for many outcomes of interest: Combining optimal policy trees with multi-objective Bayesian optimisation," Papers 2212.06312, arXiv.org, revised Oct 2023.
    21. Federico Divina & Aude Gilson & Francisco Goméz-Vela & Miguel García Torres & José F. Torres, 2018. "Stacking Ensemble Learning for Short-Term Electricity Consumption Forecasting," Energies, MDPI, vol. 11(4), pages 1-31, April.
    22. Roberto Chiosa & Marco Savino Piscitelli & Alfonso Capozzoli, 2021. "A Data Analytics-Based Energy Information System (EIS) Tool to Perform Meter-Level Anomaly Detection and Diagnosis in Buildings," Energies, MDPI, vol. 14(1), pages 1-28, January.
    23. Hajko, Vladimír, 2017. "The failure of Energy-Economy Nexus: A meta-analysis of 104 studies," Energy, Elsevier, vol. 125(C), pages 771-787.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Tsukioka, Yasutomo & Yanagi, Junya & Takada, Teruko, 2018. "Investor sentiment extracted from internet stock message boards and IPO puzzles," International Review of Economics & Finance, Elsevier, vol. 56(C), pages 205-217.
    2. Bergeaud, Antonin & Raimbault, Juste, 2020. "An empirical analysis of the spatial variability of fuel prices in the United States," Transportation Research Part A: Policy and Practice, Elsevier, vol. 132(C), pages 131-143.
    3. Bernard W T Coetzee & Kevin J Gaston & Steven L Chown, 2014. "Local Scale Comparisons of Biodiversity as a Test for Global Protected Area Ecological Performance: A Meta-Analysis," PLOS ONE, Public Library of Science, vol. 9(8), pages 1-11, August.
    4. Lazzari, Florencia & Mor, Gerard & Cipriano, Jordi & Solsona, Francesc & Chemisana, Daniel & Guericke, Daniela, 2023. "Optimizing planning and operation of renewable energy communities with genetic algorithms," Applied Energy, Elsevier, vol. 338(C).
    5. Scrucca, Luca, 2013. "GA: A Package for Genetic Algorithms in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 53(i04).
    6. Ji, Yonggang & Lin, Nan & Zhang, Baoxue, 2012. "Model selection in binary and tobit quantile regression using the Gibbs sampler," Computational Statistics & Data Analysis, Elsevier, vol. 56(4), pages 827-839.
    7. Andrea S Martinez-Vernon & James A Covington & Ramesh P Arasaradnam & Siavash Esfahani & Nicola O’Connell & Ioannis Kyrou & Richard S Savage, 2018. "An improved machine learning pipeline for urinary volatiles disease detection: Diagnosing diabetes," PLOS ONE, Public Library of Science, vol. 13(9), pages 1-20, September.
    8. Madhumita Sahoo & Aman Kasot & Anirban Dhar & Amlanjyoti Kar, 2018. "On Predictability of Groundwater Level in Shallow Wells Using Satellite Observations," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 32(4), pages 1225-1244, March.
    9. P. J. Zarco-Tejada & T. Poblete & C. Camino & V. Gonzalez-Dugo & R. Calderon & A. Hornero & R. Hernandez-Clemente & M. Román-Écija & M. P. Velasco-Amo & B. B. Landa & P. S. A. Beck & M. Saponari & D. , 2021. "Divergent abiotic spectral pathways unravel pathogen stress signals across species," Nature Communications, Nature, vol. 12(1), pages 1-11, December.
    10. László Kovács, 2019. "Applications of Metaheuristics in Insurance," Society and Economy, Akadémiai Kiadó, Hungary, vol. 41(3), pages 371-395, September.
    11. Verónica Lloréns-Rico & Ann C. Gregory & Johan Van Weyenbergh & Sander Jansen & Tina Van Buyten & Junbin Qian & Marcos Braz & Soraya Maria Menezes & Pierre Van Mol & Lore Vanderbeke & Christophe Dooms, 2021. "Clinical practices underlie COVID-19 patient respiratory microbiome composition and its interactions with the host," Nature Communications, Nature, vol. 12(1), pages 1-12, December.
    12. Uwe Ligges & Sebastian Krey, 2011. "Feature clustering for instrument classification," Computational Statistics, Springer, vol. 26(2), pages 279-291, June.
    13. Arnout Van Messem & Andreas Christmann, 2010. "A review on consistency and robustness properties of support vector machines for heavy-tailed distributions," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 4(2), pages 199-220, September.
    14. Sun, Mucun & Feng, Cong & Zhang, Jie, 2020. "Multi-distribution ensemble probabilistic wind power forecasting," Renewable Energy, Elsevier, vol. 148(C), pages 135-149.
    15. Guangzhou Wang & Haley M. Burrill & Laura Y. Podzikowski & Maarten B. Eppinga & Fusuo Zhang & Junling Zhang & Peggy A. Schultz & James D. Bever, 2023. "Dilution of specialist pathogens drives productivity benefits from diversity in plant mixtures," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    16. Imbert, Clément & Papp, John, 2020. "Costs and benefits of rural-urban migration: Evidence from India," Journal of Development Economics, Elsevier, vol. 146(C).
    17. Ana Patrícia Rocha & Hugo Miguel Pereira Choupina & Maria do Carmo Vilas-Boas & José Maria Fernandes & João Paulo Silva Cunha, 2018. "System for automatic gait analysis based on a single RGB-D camera," PLOS ONE, Public Library of Science, vol. 13(8), pages 1-24, August.
    18. Krityakierne, Tipaluck & Baowan, Duangkamon, 2020. "Aggregated GP-based Optimization for Contaminant Source Localization," Operations Research Perspectives, Elsevier, vol. 7(C).
    19. Hongbo Guo & Enzai Du & César Terrer & Robert B. Jackson, 2024. "Global distribution of surface soil organic carbon in urban greenspaces," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    20. Lisa Cherry & Darren Mollendor & Bill Eisenstein & Terri S. Hogue & Katharyn Peterman & John E. McCray, 2019. "Predicting Parcel-Scale Redevelopment Using Linear and Logistic Regression—the Berkeley Neighborhood Denver, Colorado Case Study," Sustainability, MDPI, vol. 11(7), pages 1-16, March.

    More about this item

    Keywords

    machine learning; classification trees; regression trees; evolutionary algorithms; R;

    JEL classification:

    • C14 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Semiparametric and Nonparametric Methods: General
    • C45 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods: Special Topics - - - Neural Networks and Related Topics
    • C87 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Econometric Software
