
An alternative pruning based approach to unbiased recursive partitioning

Author

Listed:
  • Alvarez-Iglesias, Alberto
  • Hinde, John
  • Ferguson, John
  • Newell, John

Abstract

Tree-based methods are a non-parametric modelling strategy that can be used in combination with generalized linear models or Cox proportional hazards models, mostly at an exploratory stage. Their popularity is mainly due to the simplicity of the technique and the ease with which the resulting model can be interpreted. One problem associated with tree-based methods is variable selection bias, which arises from variables with many possible splits or missing values. A number of unbiased recursive partitioning algorithms have been proposed that avoid this bias by using p-values in the splitting procedure of the algorithm. The final tree is obtained either by using direct stopping rules (pre-pruning) or by growing a large tree first and pruning it afterwards (post-pruning). Some of the drawbacks of pre-pruned trees based on p-values in the presence of interaction effects and a large number of explanatory variables are discussed, and a simple alternative post-pruning solution is presented that allows the identification of such interactions. The proposed method includes a novel pruning algorithm that uses a false discovery rate (FDR) controlling procedure to determine which splits correspond to significant tests. The new approach is demonstrated with simulated and real-life examples.
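
To make the post-pruning idea in the abstract concrete, the following is a minimal sketch in Python, not the authors' algorithm. It assumes that every split in a fully grown tree already carries a p-value, that the FDR controlling procedure is the Benjamini-Hochberg step-up rule (the abstract does not say which procedure is used), and that a split is retained when it, or any split below it, is declared significant. The names `Node`, `benjamini_hochberg`, `internal_nodes` and `prune` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Return one flag per p-value: True if rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= k * q / m.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff = rank
    rejected = [False] * m
    for idx in order[:cutoff]:
        rejected[idx] = True
    return rejected


@dataclass
class Node:
    name: str
    p_value: Optional[float] = None  # None for a leaf (no split)
    children: List["Node"] = field(default_factory=list)


def internal_nodes(node: Node) -> Iterator[Node]:
    """Yield every node that has a split, in pre-order."""
    if node.children:
        yield node
        for child in node.children:
            yield from internal_nodes(child)


def prune(root: Node, q: float = 0.05) -> Node:
    """Collapse splits that are not significant and have no significant split below."""
    nodes = list(internal_nodes(root))
    flags = benjamini_hochberg([n.p_value for n in nodes], q)
    significant = {id(n) for n, flag in zip(nodes, flags) if flag}

    def keep(node: Node) -> bool:
        if not node.children:
            return False
        kept_below = [keep(child) for child in node.children]  # visit all children
        if id(node) in significant or any(kept_below):
            return True
        node.children = []  # turn this node into a leaf
        return False

    keep(root)
    return root
```

Under these assumptions, a root split with a large p-value survives pruning when the splits beneath it are significant, which is the situation a p-value-based stopping rule would miss if it halted at the root. Whether the published method retains such splits in exactly this way is not stated in the abstract.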

Suggested Citation

  • Alvarez-Iglesias, Alberto & Hinde, John & Ferguson, John & Newell, John, 2017. "An alternative pruning based approach to unbiased recursive partitioning," Computational Statistics & Data Analysis, Elsevier, vol. 106(C), pages 90-102.
  • Handle: RePEc:eee:csdana:v:106:y:2017:i:c:p:90-102
    DOI: 10.1016/j.csda.2016.08.011

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S016794731630192X
    Download Restriction: Full text for ScienceDirect subscribers only.

    File URL: https://libkey.io/10.1016/j.csda.2016.08.011?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    References listed on IDEAS

    1. Grubinger, Thomas & Zeileis, Achim & Pfeiffer, Karl-Peter, 2014. "evtree: Evolutionary Learning of Globally Optimal Classification and Regression Trees in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 61(i01).
    2. Kim H. & Loh W.Y., 2001. "Classification Trees With Unbiased Multiway Splits," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 589-604, June.
    3. Shih, Yu-Shan & Tsai, Hsin-Wen, 2004. "Variable selection bias in regression trees with constant fits," Computational Statistics & Data Analysis, Elsevier, vol. 45(3), pages 595-607, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Strobl, Carolin & Boulesteix, Anne-Laure & Augustin, Thomas, 2007. "Unbiased split selection for classification trees based on the Gini Index," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 483-501, September.
    2. Emilio Carrizosa & Cristina Molero-Río & Dolores Romero Morales, 2021. "Mathematical optimization in classification and regression trees," TOP: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(1), pages 5-33, April.
    3. Fernandez Martinez, Roberto & Lostado Lorza, Ruben & Santos Delgado, Ana Alexandra & Piedra, Nelson, 2021. "Use of classification trees and rule-based models to optimize the funding assignment to research projects: A case study of UTPL," Journal of Informetrics, Elsevier, vol. 15(1).
    4. Ollech, Daniel & Webel, Karsten, 2020. "A random forest-based approach to identifying the most informative seasonality tests," Discussion Papers 55/2020, Deutsche Bundesbank.
    5. Yagli, Gokhan Mert & Yang, Dazhi & Srinivasan, Dipti, 2019. "Automatic hourly solar forecasting using machine learning models," Renewable and Sustainable Energy Reviews, Elsevier, vol. 105(C), pages 487-498.
    6. S. H. C. M. van Veen & R. C. van Kleef & W. P. M. M. van de Ven & R. C. J. A. van Vliet, 2018. "Exploring the predictive power of interaction terms in a sophisticated risk equalization model using regression trees," Health Economics, John Wiley & Sons, Ltd., vol. 27(2), pages 1-12, February.
    7. Davide Natalini & Giangiacomo Bravo & Aled Wynne Jones, 2019. "Global food security and food riots – an agent-based modelling approach," Food Security: The Science, Sociology and Economics of Food Production and Access to Food, Springer;The International Society for Plant Pathology, vol. 11(5), pages 1153-1173, October.
    8. Shih, Y. -S., 2004. "A note on split selection bias in classification trees," Computational Statistics & Data Analysis, Elsevier, vol. 45(3), pages 457-466, April.
    9. repec:hum:wpaper:sfb649dp2008-035 is not listed on IDEAS
    10. Postiglione, Paolo & Benedetti, Roberto & Lafratta, Giovanni, 2010. "A regression tree algorithm for the identification of convergence clubs," Computational Statistics & Data Analysis, Elsevier, vol. 54(11), pages 2776-2785, November.
    11. Federico Divina & Aude Gilson & Francisco Gómez-Vela & Miguel García Torres & José F. Torres, 2018. "Stacking Ensemble Learning for Short-Term Electricity Consumption Forecasting," Energies, MDPI, vol. 11(4), pages 1-31, April.
    12. Max Tabord-Meehan, 2023. "Stratification Trees for Adaptive Randomisation in Randomised Controlled Trials," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 90(5), pages 2646-2673.
    13. Anja Breuer & Yves Staudt, 2022. "Equalization Reserves for Reinsurance and Non-Life Undertakings in Switzerland," Risks, MDPI, vol. 10(3), pages 1-41, March.
    14. Patrick Rehill & Nicholas Biddle, 2022. "Policy learning for many outcomes of interest: Combining optimal policy trees with multi-objective Bayesian optimisation," Papers 2212.06312, arXiv.org, revised Oct 2023.
    15. Hothorn, Torsten & Lausen, Berthold, 2005. "Bundling classifiers by bagging trees," Computational Statistics & Data Analysis, Elsevier, vol. 49(4), pages 1068-1078, June.
    16. Vrigazova Borislava, 2021. "The Proportion for Splitting Data into Training and Test Set for the Bootstrap in Classification Problems," Business Systems Research, Sciendo, vol. 12(1), pages 228-242, May.
    17. Gerhard Tutz & Moritz Berger, 2016. "Item-focussed Trees for the Identification of Items in Differential Item Functioning," Psychometrika, Springer;The Psychometric Society, vol. 81(3), pages 727-750, September.
    18. Gray, J. Brian & Fan, Guangzhe, 2008. "Classification tree analysis using TARGET," Computational Statistics & Data Analysis, Elsevier, vol. 52(3), pages 1362-1372, January.
    19. Noh, Hyun Gon & Song, Moon Sup & Park, Sung Hyun, 2004. "An unbiased method for constructing multilabel classification trees," Computational Statistics & Data Analysis, Elsevier, vol. 47(1), pages 149-164, August.
    20. Dimitris Bertsimas & Margrét V. Bjarnadóttir & Michael A. Kane & J. Christian Kryder & Rudra Pandey & Santosh Vempala & Grant Wang, 2008. "Algorithmic Prediction of Health-Care Costs," Operations Research, INFORMS, vol. 56(6), pages 1382-1392, December.
    21. Ronilo Ragodos & Tong Wang, 2022. "Disjunctive Rule Lists," INFORMS Journal on Computing, INFORMS, vol. 34(6), pages 3259-3276, November.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:eee:csdana:v:106:y:2017:i:c:p:90-102. See general information about how to correct material in RePEc.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Catherine Liu (email available below). General contact details of provider: http://www.elsevier.com/locate/csda .
