Printed from https://ideas.repec.org/a/ist/ekoist/v0y2023i38p87-104.html

Predicting Countries’ Development Levels Using the Decision Tree and Random Forest Methods

Author

Listed:
  • Batuhan Özkan

    (Yıldız Technical University, Faculty of Arts and Sciences, Department of Statistics, Istanbul, Türkiye.)

  • Coşkun Parim

    (Yıldız Technical University, Faculty of Arts and Sciences, Department of Statistics, Istanbul, Türkiye.)

  • Erhan Çene

    (Yıldız Technical University, Faculty of Arts and Sciences, Department of Statistics, Istanbul, Türkiye.)

Abstract

Countries’ development levels are closely tied to their economic levels. Countries can be examined against various criteria and grouped by level of development, from underdeveloped to highly developed. Socioeconomic factors generally play a decisive role in determining these levels. Although development level is determined with the help of socioeconomic variables, different organizations (e.g., the United Nations [UN] and the International Monetary Fund [IMF]) may classify countries using different methods. As a result, a country’s development level can fall into different categories depending on the method used and the organization applying it. The aim of this study is to propose a machine learning model that predicts the development level of 193 countries. Development level comprises four categories: high income, upper middle income, lower middle income, and low income. The 26 variables that affect countries’ development levels were obtained from the World Development Indicators (WDI) database. First, random forest-based variable importance was used to identify the variables with the greatest effect on countries’ development levels. Next, countries’ development levels were classified using decision tree and random forest algorithms with the most important variables selected through variable importance. The random forest model correctly classified countries’ development levels with an accuracy of 70%. In addition, the findings show adolescent fertility rate, total fertility rate, and the share of agriculture, forestry, and fisheries in gross domestic product (GDP) to be the most important variables affecting countries’ development levels.
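The two-step procedure described in the abstract (rank variables by random forest importance, then classify with a decision tree and a random forest on the top-ranked variables) might look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the abstract does not name the software used, so scikit-learn is a stand-in; the data are synthetic (193 rows, 26 features, 4 classes) and are not the WDI data; and the 70/30 split, the `top_k=5` cutoff, and the helper name `run_sketch` are hypothetical choices, so the printed accuracies say nothing about the 70% reported in the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def run_sketch(top_k=5, seed=0):
    """Hypothetical helper: importance-based selection, then classification."""
    # Synthetic stand-in for the study's data (assumption, not WDI):
    # 193 "countries", 26 variables, 4 income categories.
    X, y = make_classification(
        n_samples=193, n_features=26, n_informative=6, n_redundant=4,
        n_classes=4, n_clusters_per_class=1, random_state=seed,
    )
    # Assumed 70/30 stratified split; the abstract does not state one.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )

    # Step 1: rank variables by impurity-based random forest importance,
    # fitted on the training split only so the test set stays unseen.
    ranker = RandomForestClassifier(n_estimators=300, random_state=seed)
    ranker.fit(X_train, y_train)
    importances = ranker.feature_importances_
    selected = np.argsort(importances)[::-1][:top_k]  # most important first

    # Step 2: fit both classifiers on the selected variables only.
    models = {
        "decision_tree": DecisionTreeClassifier(random_state=seed),
        "random_forest": RandomForestClassifier(
            n_estimators=300, random_state=seed
        ),
    }
    accuracy = {}
    for name, model in models.items():
        model.fit(X_train[:, selected], y_train)
        predictions = model.predict(X_test[:, selected])
        accuracy[name] = accuracy_score(y_test, predictions)
    return importances, selected, accuracy


importances, selected, accuracy = run_sketch()
print("selected feature indices:", selected.tolist())
print("test accuracy:", accuracy)
```

With real data, the synthetic `X` and `y` would be replaced by the indicator matrix and the income-group labels, and a single train/test split on 193 rows would probably be too noisy, so cross-validation may be a better way to estimate accuracy.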

Suggested Citation

  • Batuhan Özkan & Coşkun Parim & Erhan Çene, 2023. "Predicting Countries’ Development Levels Using the Decision Tree and Random Forest Methods," EKOIST Journal of Econometrics and Statistics, Istanbul University, Faculty of Economics, vol. 0(38), pages 87-104, June.
  • Handle: RePEc:ist:ekoist:v:0:y:2023:i:38:p:87-104
    DOI: 10.26650/ekoist.2023.38.1172190

    Download full text from publisher

    File URL: https://cdn.istanbul.edu.tr/file/JTA6CLJ8T5/E34BEDEC8771401BB444F47B6A7BCC48
    Download Restriction: no

    File URL: https://iupress.istanbul.edu.tr/tr/journal/ekoist/article/ulkelerin-gelismislik-duzeylerinin-karar-agaci-ve-rastgele-orman-yontemleriyle-tahmin-edilmesi
    Download Restriction: no

    File URL: https://libkey.io/10.26650/ekoist.2023.38.1172190?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Ariel Kleiner & Ameet Talwalkar & Purnamrita Sarkar & Michael I. Jordan, 2014. "A scalable bootstrap for massive data," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 76(4), pages 795-816, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Guangbao Guo & Yue Sun & Xuejun Jiang, 2020. "A partitioned quasi-likelihood for distributed statistical inference," Computational Statistics, Springer, vol. 35(4), pages 1577-1596, December.
    2. Xingcai Zhou & Zhaoyang Jing & Chao Huang, 2024. "Distributed Bootstrap Simultaneous Inference for High-Dimensional Quantile Regression," Mathematics, MDPI, vol. 12(5), pages 1-54, February.
    3. Dimitris N Politis, 2024. "Scalable subsampling: computation, aggregation and inference," Biometrika, Biometrika Trust, vol. 111(1), pages 347-354.
    4. Amalan Mahendran & Helen Thompson & James M. McGree, 2023. "A model robust subsampling approach for Generalised Linear Models in big data settings," Statistical Papers, Springer, vol. 64(4), pages 1137-1157, August.
    5. Villoria, Nelson B. & Liu, Jing, 2018. "Using spatially explicit data to improve our understanding of land supply responses: An application to the cropland effects of global sustainable irrigation in the Americas," Land Use Policy, Elsevier, vol. 75(C), pages 411-419.
    6. Vaughan, Gregory, 2020. "Efficient big data model selection with applications to fraud detection," International Journal of Forecasting, Elsevier, vol. 36(3), pages 1116-1127.
    7. Wang, Xiaoqian & Kang, Yanfei & Hyndman, Rob J. & Li, Feng, 2023. "Distributed ARIMA models for ultra-long time series," International Journal of Forecasting, Elsevier, vol. 39(3), pages 1163-1184.
    8. Yang, Xinfeng & Yan, Xiaodong & Huang, Jian, 2019. "High-dimensional integrative analysis with homogeneity and sparsity recovery," Journal of Multivariate Analysis, Elsevier, vol. 174(C).
    9. Fang, Jianglin, 2023. "A split-and-conquer variable selection approach for high-dimensional general semiparametric models with massive data," Journal of Multivariate Analysis, Elsevier, vol. 194(C).
    10. Tang, Lu & Zhou, Ling & Song, Peter X.-K., 2020. "Distributed simultaneous inference in generalized linear models via confidence distribution," Journal of Multivariate Analysis, Elsevier, vol. 176(C).
    11. Beate Franke & Jean-François Plante & Ribana Roscher & En-shiun Annie Lee & Cathal Smyth & Armin Hatefi & Fuqi Chen & Einat Gil & Alexander Schwing & Alessandro Selvitella & Michael M. Hoffman & Roger, 2016. "Statistical Inference, Learning and Models in Big Data," International Statistical Review, International Statistical Institute, vol. 84(3), pages 371-389, December.
    12. Ma, Xuejun & Wang, Shaochen & Zhou, Wang, 2021. "Testing multivariate quantile by empirical likelihood," Journal of Multivariate Analysis, Elsevier, vol. 182(C).
    13. Dean Eckles & Maurits Kaptein, 2019. "Bootstrap Thompson Sampling and Sequential Decision Problems in the Behavioral Sciences," SAGE Open, vol. 9(2), pages 21582440198, June.
    14. Badruddoza, Syed & Amin, Modhurima & McCluskey, Jill, 2019. "Assessing the Importance of an Attribute in a Demand System: Structural Model versus Machine Learning," Working Papers 2019-5, School of Economic Sciences, Washington State University.
    15. Olhede, Sofia C. & Wolfe, Patrick J., 2018. "The future of statistics and data science," Statistics & Probability Letters, Elsevier, vol. 136(C), pages 46-50.
    16. Shi, Chengchun & Lu, Wenbin & Song, Rui, 2018. "A massive data framework for M-estimators with cubic-rate," LSE Research Online Documents on Economics 102111, London School of Economics and Political Science, LSE Library.
    17. Gérard Biau & Erwan Scornet, 2016. "A random forest guided tour," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 25(2), pages 197-227, June.
    18. Mercè Crosas & Gary King & James Honaker & Latanya Sweeney, 2015. "Automating Open Science for Big Data," The ANNALS of the American Academy of Political and Social Science, vol. 659(1), pages 260-273, May.
    19. Zhang, Likun & Castillo, Enrique del & Berglund, Andrew J. & Tingley, Martin P. & Govind, Nirmal, 2020. "Computing confidence intervals from massive data via penalized quantile smoothing splines," Computational Statistics & Data Analysis, Elsevier, vol. 144(C).
    20. Milica Maricic & Jose A. Egea & Veljko Jeremic, 2019. "A Hybrid Enhanced Scatter Search—Composite I-Distance Indicator (eSS-CIDI) Optimization Approach for Determining Weights Within Composite Indicators," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 144(2), pages 497-537, July.
