IDEAS home Printed from https://ideas.repec.org/a/gam/jmathe/v12y2024i20p3243-d1500258.html
   My bibliography  Save this article

An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance

Author

Listed:
  • Samir K. Safi

    (Department of Statistics and Business Analytics, College of Business and Economics, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates)

  • Sheema Gul

    (Department of Statistics and Business Analytics, College of Business and Economics, United Arab Emirates University, Al Ain P.O. Box 15551, United Arab Emirates)

Abstract

Researchers using machine learning methods for classification can face challenges due to class imbalance, where a certain class is underrepresented. Over or under-sampling of minority or majority class observations, or solely relying on model selection for ensemble methods, may prove ineffective when the class imbalance ratio is extremely high. To address this issue, this paper proposes a method called enhance tree ensemble (ETE), based on generating synthetic data for minority class observations in conjunction with tree selection based on their performance on the training data. The proposed method first generates minority class instances to balance the training data and then uses the idea of tree selection by leveraging out-of-bag ( E T E O O B ) and sub-samples ( E T E S S ) observations, respectively. The efficacy of the proposed method is assessed using twenty benchmark problems for binary classification with moderate to extreme class imbalance, comparing it against other well-known methods such as optimal tree ensemble (OTE), SMOTE random forest ( R F S M O T E ), oversampling random forest ( R F O S ), under-sampling random forest ( R F U S ), k-nearest neighbor (k-NN), support vector machine (SVM), tree, and artificial neural network (ANN). Performance metrics such as classification error rate and precision are used for evaluation purposes. The analyses of the study revealed that the proposed method, based on data balancing and model selection, yielded better results than the other methods.

Suggested Citation

  • Samir K. Safi & Sheema Gul, 2024. "An Enhanced Tree Ensemble for Classification in the Presence of Extreme Class Imbalance," Mathematics, MDPI, vol. 12(20), pages 1-17, October.
  • Handle: RePEc:gam:jmathe:v:12:y:2024:i:20:p:3243-:d:1500258
    as

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/12/20/3243/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/12/20/3243/
    Download Restriction: no
    ---><---

    References listed on IDEAS

    as
    1. Wright, Marvin N. & Ziegler, Andreas, 2017. "ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 77(i01).
    2. Karatzoglou, Alexandros & Smola, Alexandros & Hornik, Kurt & Zeileis, Achim, 2004. "kernlab - An S4 Package for Kernel Methods in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 11(i09).
    Full references (including those not matched with items on IDEAS)

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Bommert, Andrea & Sun, Xudong & Bischl, Bernd & Rahnenführer, Jörg & Lang, Michel, 2020. "Benchmark for filter methods for feature selection in high-dimensional classification data," Computational Statistics & Data Analysis, Elsevier, vol. 143(C).
    2. Fitzpatrick, Trevor & Mues, Christophe, 2021. "How can lenders prosper? Comparing machine learning approaches to identify profitable peer-to-peer loan investments," European Journal of Operational Research, Elsevier, vol. 294(2), pages 711-722.
    3. Schratz, Patrick & Muenchow, Jannes & Iturritxa, Eugenia & Richter, Jakob & Brenning, Alexander, 2019. "Hyperparameter tuning and performance assessment of statistical and machine-learning algorithms using spatial data," Ecological Modelling, Elsevier, vol. 406(C), pages 109-120.
    4. Tsukioka, Yasutomo & Yanagi, Junya & Takada, Teruko, 2018. "Investor sentiment extracted from internet stock message boards and IPO puzzles," International Review of Economics & Finance, Elsevier, vol. 56(C), pages 205-217.
    5. Bokelmann, Björn & Lessmann, Stefan, 2024. "Improving uplift model evaluation on randomized controlled trial data," European Journal of Operational Research, Elsevier, vol. 313(2), pages 691-707.
    6. Joel Podgorski & Oliver Kracht & Luis Araguas-Araguas & Stefan Terzer-Wassmuth & Jodie Miller & Ralf Straub & Rolf Kipfer & Michael Berg, 2024. "Groundwater vulnerability to pollution in Africa’s Sahel region," Nature Sustainability, Nature, vol. 7(5), pages 558-567, May.
    7. Chakravorty, Bhaskar & Arulampalam, Wiji & Bhatiya, Apurav Yash & Imbert, Clément & Rathelot, Roland, 2024. "Can information about jobs improve the effectiveness of vocational training? Experimental evidence from India," Journal of Development Economics, Elsevier, vol. 169(C).
    8. Albert Stuart Reece & Gary Kenneth Hulse, 2022. "European Epidemiological Patterns of Cannabis- and Substance-Related Congenital Neurological Anomalies: Geospatiotemporal and Causal Inferential Study," IJERPH, MDPI, vol. 20(1), pages 1-35, December.
    9. Foutzopoulos, Giorgos & Pandis, Nikolaos & Tsagris, Michail, 2024. "Predicting full retirement attainment of NBA players," MPRA Paper 121540, University Library of Munich, Germany.
    10. Van Belle, Jente & Guns, Tias & Verbeke, Wouter, 2021. "Using shared sell-through data to forecast wholesaler demand in multi-echelon supply chains," European Journal of Operational Research, Elsevier, vol. 288(2), pages 466-479.
    11. Philipp Bach & Victor Chernozhukov & Malte S. Kurz & Martin Spindler & Sven Klaassen, 2021. "DoubleML -- An Object-Oriented Implementation of Double Machine Learning in R," Papers 2103.09603, arXiv.org, revised Jun 2024.
    12. Marchetto, Elisa & Da Re, Daniele & Tordoni, Enrico & Bazzichetto, Manuele & Zannini, Piero & Celebrin, Simone & Chieffallo, Ludovico & Malavasi, Marco & Rocchini, Duccio, 2023. "Testing the effect of sample prevalence and sampling methods on probability- and favourability-based SDMs," Ecological Modelling, Elsevier, vol. 477(C).
    13. Eeva-Katri Kumpula & Pauline Norris & Adam C Pomerleau, 2020. "Stocks of paracetamol products stored in urban New Zealand households: A cross-sectional study," PLOS ONE, Public Library of Science, vol. 15(6), pages 1-11, June.
    14. Michael Bucker & Gero Szepannek & Alicja Gosiewska & Przemyslaw Biecek, 2020. "Transparency, Auditability and eXplainability of Machine Learning Models in Credit Scoring," Papers 2009.13384, arXiv.org.
    15. Jian Lu & Raheel Ahmad & Thomas Nguyen & Jeffrey Cifello & Humza Hemani & Jiangyuan Li & Jinguo Chen & Siyi Li & Jing Wang & Achouak Achour & Joseph Chen & Meagan Colie & Ana Lustig & Christopher Dunn, 2022. "Heterogeneity and transcriptome changes of human CD8+ T cells across nine decades of life," Nature Communications, Nature, vol. 13(1), pages 1-13, December.
    16. Andrea S Martinez-Vernon & James A Covington & Ramesh P Arasaradnam & Siavash Esfahani & Nicola O’Connell & Ioannis Kyrou & Richard S Savage, 2018. "An improved machine learning pipeline for urinary volatiles disease detection: Diagnosing diabetes," PLOS ONE, Public Library of Science, vol. 13(9), pages 1-20, September.
    17. Timo Schulte & Tillmann Wurz & Oliver Groene & Sabine Bohnet-Joschko, 2023. "Big Data Analytics to Reduce Preventable Hospitalizations—Using Real-World Data to Predict Ambulatory Care-Sensitive Conditions," IJERPH, MDPI, vol. 20(6), pages 1-16, March.
    18. Madhumita Sahoo & Aman Kasot & Anirban Dhar & Amlanjyoti Kar, 2018. "On Predictability of Groundwater Level in Shallow Wells Using Satellite Observations," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 32(4), pages 1225-1244, March.
    19. Bennett, Donyetta & Mekelburg, Erik & Strauss, Jack & Williams, T.H., 2024. "Unlocking the black box of sentiment and cryptocurrency: What, which, why, when and how?," Global Finance Journal, Elsevier, vol. 60(C).
    20. P. J. Zarco-Tejada & T. Poblete & C. Camino & V. Gonzalez-Dugo & R. Calderon & A. Hornero & R. Hernandez-Clemente & M. Román-Écija & M. P. Velasco-Amo & B. B. Landa & P. S. A. Beck & M. Saponari & D. , 2021. "Divergent abiotic spectral pathways unravel pathogen stress signals across species," Nature Communications, Nature, vol. 12(1), pages 1-11, December.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:gam:jmathe:v:12:y:2024:i:20:p:3243-:d:1500258. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: MDPI Indexing Manager (email available below). General contact details of provider: https://www.mdpi.com .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.