Printed from https://ideas.repec.org/a/eee/csdana/v60y2013icp50-69.html

A new variable selection approach using Random Forests

Author

Listed:
  • Hapfelmeier, A.
  • Ulm, K.

Abstract

Random Forests are frequently applied as they achieve a high prediction accuracy and have the ability to identify informative variables. Several approaches for variable selection have been proposed to combine and intensify these qualities. An extensive review of the corresponding literature led to the development of a new approach that is based on the theoretical framework of permutation tests and meets important statistical properties. A comparison to another eight popular variable selection methods in three simulation studies and four real data applications indicated that: the new approach can also be used to control the test-wise and family-wise error rate, provides a higher power to distinguish relevant from irrelevant variables and leads to models which are located among the very best performing ones. In addition, it is equally applicable to regression and classification problems.
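
The abstract does not spell out the procedure, so the following is only a generic sketch of the permutation-test idea it refers to, not the authors' method. It assumes a simple stand-in importance statistic (the absolute Pearson correlation between a predictor and the response) in place of a Random Forest importance measure, and uses a Bonferroni correction as one possible way to control the family-wise error rate. All function names, the toy data, and the parameter values are hypothetical.

```python
import random
from math import sqrt


def abs_correlation(x, y):
    """Stand-in importance statistic: absolute Pearson correlation.
    (Assumption: a Random Forest importance measure would be used in practice.)"""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)


def permutation_p_value(x, y, n_perm=500, rng=None):
    """Permutation p-value: how often does a shuffled predictor reach
    an importance at least as large as the observed one?"""
    if rng is None:
        rng = random.Random(0)
    observed = abs_correlation(x, y)
    shuffled = list(x)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any association between x and y
        if abs_correlation(shuffled, y) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic itself, so p is never 0.
    return (exceed + 1) / (n_perm + 1)


def select_variables(columns, y, alpha=0.05, n_perm=500, seed=0):
    """Keep variables whose p-value passes a Bonferroni-adjusted threshold."""
    rng = random.Random(seed)
    threshold = alpha / len(columns)  # Bonferroni: family-wise error control
    p_values = {
        name: permutation_p_value(x, y, n_perm, rng)
        for name, x in columns.items()
    }
    selected = [name for name, p in p_values.items() if p <= threshold]
    return selected, p_values


# Toy data (hypothetical): one informative predictor, one pure-noise predictor.
data_rng = random.Random(42)
n = 200
informative = [data_rng.gauss(0, 1) for _ in range(n)]
noise = [data_rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * v + data_rng.gauss(0, 1) for v in informative]

selected, p_values = select_variables(
    {"informative": informative, "noise": noise}, y
)
print(selected, p_values)
```

Dropping the division by the number of variables in `threshold` would give a test-wise rather than family-wise error level; which of the two is appropriate depends on the application.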

Suggested Citation

  • Hapfelmeier, A. & Ulm, K., 2013. "A new variable selection approach using Random Forests," Computational Statistics & Data Analysis, Elsevier, vol. 60(C), pages 50-69.
  • Handle: RePEc:eee:csdana:v:60:y:2013:i:c:p:50-69
    DOI: 10.1016/j.csda.2012.09.020

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947312003490
    Download Restriction: Full text for ScienceDirect subscribers only.

    References listed on IDEAS

    1. Strobl, Carolin & Boulesteix, Anne-Laure & Augustin, Thomas, 2007. "Unbiased split selection for classification trees based on the Gini Index," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 483-501, September.
    2. Archer, Kellie J. & Kimes, Ryan V., 2008. "Empirical characterization of random forest variable importance measures," Computational Statistics & Data Analysis, Elsevier, vol. 52(4), pages 2249-2260, January.
    3. van Wieringen, Wessel N. & Kun, David & Hampel, Regina & Boulesteix, Anne-Laure, 2009. "Survival prediction using gene expression data: A review and comparison," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1590-1603, March.
    4. Willi Sauerbrei, 1999. "The Use of Resampling Methods to Simplify Regression Models in Medical Statistics," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 48(3), pages 313-329.
    5. Harrison, David Jr. & Rubinfeld, Daniel L., 1978. "Hedonic housing prices and the demand for clean air," Journal of Environmental Economics and Management, Elsevier, vol. 5(1), pages 81-102, March.

    Citations

    Cited by:

    1. Hapfelmeier, Alexander & Hornung, Roman & Haller, Bernhard, 2023. "Efficient permutation testing of variable importance measures by the example of random forests," Computational Statistics & Data Analysis, Elsevier, vol. 181(C).
    2. Jin Li & Maggie Tran & Justy Siwabessy, 2016. "Selecting Optimal Random Forest Predictive Models: A Case Study on Predicting the Spatial Distribution of Seabed Hardness," PLOS ONE, Public Library of Science, vol. 11(2), pages 1-29, February.
    3. Zardad Khan & Asma Gul & Aris Perperoglou & Miftahuddin Miftahuddin & Osama Mahmoud & Werner Adler & Berthold Lausen, 2020. "Ensemble of optimal trees, random forest and random projection ensemble classification," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 14(1), pages 97-116, March.
    4. Saurabh Saxena & Darius Roman & Valentin Robu & David Flynn & Michael Pecht, 2021. "Battery Stress Factor Ranking for Accelerated Degradation Test Planning Using Machine Learning," Energies, MDPI, vol. 14(3), pages 1-17, January.
    5. Abellán, Joaquín & Baker, Rebecca M. & Coolen, Frank P.A. & Crossman, Richard J. & Masegosa, Andrés R., 2014. "Classification with decision trees from a nonparametric predictive inference perspective," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 789-802.
    6. Liangyuan Hu & Lihua Li, 2022. "Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series," IJERPH, MDPI, vol. 19(23), pages 1-13, December.
    7. Dogah, Kingsley E. & Premaratne, Gamini, 2018. "Sectoral exposure of financial markets to oil risk factors in BRICS countries," Energy Economics, Elsevier, vol. 76(C), pages 228-256.
    8. Weijun Wang & Dan Zhao & Liguo Fan & Yulong Jia, 2019. "Study on Icing Prediction of Power Transmission Lines Based on Ensemble Empirical Mode Decomposition and Feature Selection Optimized Extreme Learning Machine," Energies, MDPI, vol. 12(11), pages 1-21, June.
    9. Hermel Homburger & Manuel K Schneider & Sandra Hilfiker & Andreas Lüscher, 2014. "Inferring Behavioral States of Grazing Livestock from High-Frequency Position Data Alone," PLOS ONE, Public Library of Science, vol. 9(12), pages 1-22, December.
    10. Fellinghauer, Bernd & Bühlmann, Peter & Ryffel, Martin & von Rhein, Michael & Reinhardt, Jan D., 2013. "Stable graphical model estimation with Random Forests for discrete, continuous, and mixed variables," Computational Statistics & Data Analysis, Elsevier, vol. 64(C), pages 132-152.
    11. Massimiliano Fessina & Giambattista Albora & Andrea Tacchella & Andrea Zaccaria, 2022. "Which products activate a product? An explainable machine learning approach," Papers 2212.03094, arXiv.org.
    12. Michael Dadole Ubagan & Yun-Sik Lee & Taekjun Lee & Jinsol Hong & Il Hoi Kim & Sook Shin, 2021. "Settlement and Recruitment Potential of Four Invasive and One Indigenous Barnacles in South Korea and Their Future," Sustainability, MDPI, vol. 13(2), pages 1-14, January.
    13. Ingrida Vaiciulyte & Zivile Kalsyte & Leonidas Sakalauskas & Darius Plikynas, 2017. "Assessment of market reaction on the share performance on the basis of its visualization in 2D space," Journal of Business Economics and Management, Taylor & Francis Journals, vol. 18(2), pages 309-318, March.
    14. Hapfelmeier, A. & Ulm, K., 2014. "Variable selection by Random Forests using data with missing values," Computational Statistics & Data Analysis, Elsevier, vol. 80(C), pages 129-139.
    15. Sameer Al-Dahidi & Piero Baraldi & Miriam Fresc & Enrico Zio & Lorenzo Montelatici, 2024. "Feature Selection by Binary Differential Evolution for Predicting the Energy Production of a Wind Plant," Energies, MDPI, vol. 17(10), pages 1-19, May.
    16. Mohan Bi & Huiying Li & Peter Meidl & Yanjie Zhu & Masahiro Ryo & Matthias C. Rillig, 2024. "Number and dissimilarity of global change factors influences soil properties and functions," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    17. Barbara Baranowska & Anna Kajdy & Paulina Pawlicka & Ernest Pokropek & Michał Rabijewski & Dorota Sys & Artur Pokropek, 2020. "What are the Critical Elements of Satisfaction and Experience in Labor and Childbirth—A Cross-Sectional Study," IJERPH, MDPI, vol. 17(24), pages 1-13, December.
    18. Lkhagvadorj Munkhdalai & Tsendsuren Munkhdalai & Oyun-Erdene Namsrai & Jong Yun Lee & Keun Ho Ryu, 2019. "An Empirical Comparison of Machine-Learning Methods on Bank Client Credit Assessments," Sustainability, MDPI, vol. 11(3), pages 1-23, January.
    19. Edward Gage & David Cooper, 2015. "The Influence of Land Cover, Vertical Structure, and Socioeconomic Factors on Outdoor Water Use in a Western US City," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 29(10), pages 3877-3890, August.
    20. Bryan Keller, 2020. "Variable Selection for Causal Effect Estimation: Nonparametric Conditional Independence Testing With Random Forests," Journal of Educational and Behavioral Statistics, vol. 45(2), pages 119-142, April.
    21. Silke Janitza & Ender Celik & Anne-Laure Boulesteix, 2018. "A computationally fast variable importance test for random forests for high-dimensional data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(4), pages 885-915, December.
    22. Polasek, Tomas & Čadík, Martin, 2023. "Predicting photovoltaic power production using high-uncertainty weather forecasts," Applied Energy, Elsevier, vol. 339(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Hapfelmeier, A. & Ulm, K., 2014. "Variable selection by Random Forests using data with missing values," Computational Statistics & Data Analysis, Elsevier, vol. 80(C), pages 129-139.
    2. Daniel L. Chen & Markus Loecher, 2022. "Mood and the Malleability of Moral Reasoning: The Impact of Irrelevant Factors on Judicial Decisions," Working Papers hal-03864854, HAL.
    3. Jianhong Shi & Qian Yang & Xiongya Li & Weixing Song, 2017. "Effects of measurement error on a class of single-index varying coefficient regression models," Computational Statistics, Springer, vol. 32(3), pages 977-1001, September.
    4. Hapfelmeier, Alexander & Hothorn, Torsten & Riediger, Carina & Ulm, Kurt, 2014. "Estimation of a Predictor’s Importance by Random Forests When There Is Missing Data: RISK Prediction in Liver Surgery using Laboratory Data," The International Journal of Biostatistics, De Gruyter, vol. 10(2), pages 165-183, November.
    5. Villalonga, Belen, 2004. "Intangible resources, Tobin's q, and sustainability of performance differences," Journal of Economic Behavior & Organization, Elsevier, vol. 54(2), pages 205-230, June.
    6. Brockmeier, M., 1991. "Entwicklung und Aufhebung von Reinheitsgeboten im Nahrungsmittelbereich – Analyse und Bewertung," Proceedings “Schriften der Gesellschaft für Wirtschafts- und Sozialwissenschaften des Landbaues e.V.”, German Association of Agricultural Economists (GEWISOLA), vol. 27.
    7. Miller, Steve & Startz, Richard, 2019. "Feasible generalized least squares using support vector regression," Economics Letters, Elsevier, vol. 175(C), pages 28-31.
    8. Umberto Amato & Anestis Antoniadis & Italia De Feis & Irene Gijbels, 2021. "Penalised robust estimators for sparse and high-dimensional linear models," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(1), pages 1-48, March.
    9. Prendergast, Luke A. & Li Wai Suen, Connie, 2011. "A new and practical influence measure for subsets of covariance matrix sample principal components with applications to high dimensional datasets," Computational Statistics & Data Analysis, Elsevier, vol. 55(1), pages 752-764, January.
    10. Tizheng Li & Xiaojuan Kang, 2022. "Variable selection of higher-order partially linear spatial autoregressive model with a diverging number of parameters," Statistical Papers, Springer, vol. 63(1), pages 243-285, February.
    11. Deac, Dan Stelian & Schebesch, Klaus Bruno, 2018. "Market Forecasts and Client Behavioral Data: Towards Finding Adequate Model Complexity," Studia Universitatis „Vasile Goldis” Arad – Economics Series, Sciendo, vol. 28(3), pages 50-75, September.
    12. Jörg Kalbfuß & Reto Odermatt & Alois Stutzer, 2018. "Medical marijuana laws and mental health in the United States," CEP Discussion Papers dp1546, Centre for Economic Performance, LSE.
    13. Lamperti, Francesco & Roventini, Andrea & Sani, Amir, 2018. "Agent-based model calibration using machine learning surrogates," Journal of Economic Dynamics and Control, Elsevier, vol. 90(C), pages 366-389.
    14. Juan Ignacio Zoloa, 2020. "Noise pollution and housing markets: A spatial hedonic analysis for La Plata City," Ensayos de Política Económica, Departamento de Investigación Francisco Valsecchi, Facultad de Ciencias Económicas, Pontificia Universidad Católica Argentina., vol. 3(2), pages 129-152, October.
    15. Cheng, Tsung-Chi, 2012. "On simultaneously identifying outliers and heteroscedasticity without specific form," Computational Statistics & Data Analysis, Elsevier, vol. 56(7), pages 2258-2272.
    16. Bodhisattva Sen & Mary Meyer, 2017. "Testing against a linear regression model using ideas from shape-restricted estimation," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 79(2), pages 423-448, March.
    17. Benítez-Peña, Sandra & Blanquero, Rafael & Carrizosa, Emilio & Ramírez-Cobo, Pepa, 2024. "Cost-sensitive probabilistic predictions for support vector machines," European Journal of Operational Research, Elsevier, vol. 314(1), pages 268-279.
    18. repec:wyi:journl:002176 is not listed on IDEAS
    19. Steve Gibbons & Stephan Heblich & Esther Lho & Christopher Timmins, 2016. "Fear of Fracking? The Impact of the Shale Gas Exploration on House Prices in Britain," SERC Discussion Papers 0207, Centre for Economic Performance, LSE.
    20. Sanying Feng & Liugen Xue, 2014. "Bias-corrected statistical inference for partially linear varying coefficient errors-in-variables models with restricted condition," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 66(1), pages 121-140, February.
    21. Mohamed Zine & Fouzi Harrou & Mohammed Terbeche & Mohammed Bellahcene & Abdelkader Dairi & Ying Sun, 2023. "E-Learning Readiness Assessment Using Machine Learning Methods," Sustainability, MDPI, vol. 15(11), pages 1-22, June.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.