
XGBoost-Based Framework for Smoking-Induced Noncommunicable Disease Prediction

Author

Listed:
  • Khishigsuren Davagdorj

    (Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju 28644, Korea
    These authors contributed equally to the research.)

  • Van Huy Pham

    (Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh 700000, Vietnam
    These authors contributed equally to the research.)

  • Nipon Theera-Umpon

    (Department of Electrical Engineering, Faculty of Engineering, Chiang Mai University, Chiang Mai 50200, Thailand
    Biomedical Engineering Institute, Chiang Mai University, Chiang Mai 50200, Thailand)

  • Keun Ho Ryu

    (Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh 700000, Vietnam
    Biomedical Engineering Institute, Chiang Mai University, Chiang Mai 50200, Thailand)

Abstract

Smoking-induced noncommunicable diseases (SiNCDs) have become a significant public health threat and cause of death worldwide. In the last decade, numerous studies have applied artificial intelligence techniques to predict the risk of developing SiNCDs. However, determining the most significant features and developing interpretable models remain challenging in such systems. In this study, we propose an efficient extreme gradient boosting (XGBoost) based framework combined with a hybrid feature selection (HFS) method for SiNCD prediction among the general population in South Korea and the United States. First, HFS is performed in three stages: (I) significant features are selected by t-test and chi-square test; (II) multicollinearity analysis is used to retain dissimilar features; (III) the final, most representative features are selected using the least absolute shrinkage and selection operator (LASSO). The selected features are then fed into the XGBoost predictive model. The experimental results show that the proposed model outperforms several existing baseline models. The proposed model also identifies important features, which enhances the interpretability of SiNCD prediction. Consequently, the XGBoost-based framework is expected to contribute to the early diagnosis and prevention of SiNCDs in public health.
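
The three HFS stages described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. This is not the authors' implementation. Several choices here are assumptions made purely for illustration: Stage I uses only a Welch t statistic (the chi-square test for categorical features is omitted), Stage II uses pairwise Pearson correlation as a simple stand-in for the multicollinearity analysis, Stage III uses a small coordinate-descent LASSO, and the thresholds `t_min`, `r_max`, `lam` and the toy feature names are hypothetical.

```python
import math
import statistics as st

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = math.sqrt(st.variance(a) / len(a) + st.variance(b) / len(b))
    return (st.mean(a) - st.mean(b)) / se

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = st.mean(x), st.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def standardize(x):
    """Scale a column to zero mean and unit (population) variance."""
    m, s = st.mean(x), st.pstdev(x)
    return [(v - m) / s for v in x]

def lasso_cd(cols, y, lam, n_iter=200):
    """LASSO by cyclic coordinate descent; cols must be standardized."""
    n = len(y)
    y_mean = st.mean(y)
    resid = [v - y_mean for v in y]          # residual of the centered target
    beta = [0.0] * len(cols)
    for _ in range(n_iter):
        for j, xj in enumerate(cols):
            # Correlation of column j with the partial residual
            rho = sum(x * (r + x * beta[j]) for x, r in zip(xj, resid)) / n
            # Soft-thresholding update
            new = math.copysign(max(abs(rho) - lam, 0.0), rho)
            if new != beta[j]:
                diff = new - beta[j]
                resid = [r - x * diff for x, r in zip(xj, resid)]
                beta[j] = new
    return beta

def hybrid_select(features, y, t_min=2.0, r_max=0.9, lam=0.1):
    """Three-stage filter: t-test, correlation pruning, LASSO (all thresholds assumed)."""
    # Stage I: keep features whose class means differ clearly
    stage1 = []
    for name, col in features.items():
        pos = [v for v, label in zip(col, y) if label == 1]
        neg = [v for v, label in zip(col, y) if label == 0]
        if abs(welch_t(pos, neg)) > t_min:
            stage1.append(name)
    # Stage II: drop features highly correlated with one already kept
    stage2 = []
    for name in stage1:
        if all(abs(pearson(features[name], features[k])) <= r_max for k in stage2):
            stage2.append(name)
    # Stage III: keep features with a nonzero LASSO coefficient
    cols = [standardize(features[name]) for name in stage2]
    beta = lasso_cd(cols, [float(v) for v in y], lam)
    return [name for name, b in zip(stage2, beta) if b != 0.0]

# Deterministic toy data (hypothetical): 200 samples, alternating class labels
n = 200
y = [i % 2 for i in range(n)]
features = {
    "signal":    [2.0 * (i % 2) + 0.1 * ((i // 2) % 7) for i in range(n)],
    "duplicate": [2.0 * (i % 2) + 0.1 * ((i // 2) % 7) + 0.001 * ((i // 2) % 3)
                  for i in range(n)],
    "weak":      [0.5 * (i % 2) + 0.3 * ((i // 2) % 11) for i in range(n)],
    "noise":     [float((i // 2) % 5) for i in range(n)],
}
selected = hybrid_select(features, y)
print(selected)
```

On this toy data, `noise` is removed in Stage I, `duplicate` in Stage II, and `weak` in Stage III, so only `signal` remains. In a full pipeline the selected columns would then be passed to a gradient boosting classifier, for example `XGBClassifier` from the `xgboost` package; that step is left out so the sketch runs without third-party dependencies.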

Suggested Citation

  • Khishigsuren Davagdorj & Van Huy Pham & Nipon Theera-Umpon & Keun Ho Ryu, 2020. "XGBoost-Based Framework for Smoking-Induced Noncommunicable Disease Prediction," IJERPH, MDPI, vol. 17(18), pages 1-22, September.
  • Handle: RePEc:gam:jijerp:v:17:y:2020:i:18:p:6513-:d:410103

    Download full text from publisher

    File URL: https://www.mdpi.com/1660-4601/17/18/6513/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1660-4601/17/18/6513/
    Download Restriction: no

    References listed on IDEAS

    1. Lukas Meier & Sara Van De Geer & Peter Bühlmann, 2008. "The group lasso for logistic regression," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 70(1), pages 53-71, February.
    2. Simiao Chen & Michael Kuhn & Klaus Prettner & David E Bloom, 2018. "The macroeconomic burden of noncommunicable diseases in the United States: Estimates and projections," PLOS ONE, Public Library of Science, vol. 13(11), pages 1-14, November.
    3. Xiao Hu & Yang Wang & Jidong Huang & Rong Zheng, 2019. "Cigarette Affordability and Cigarette Consumption among Adult and Elderly Chinese Smokers: Evidence from A Longitudinal Study," IJERPH, MDPI, vol. 16(23), pages 1-20, December.
    4. Alexandre Belloni & Victor Chernozhukov & Christian Hansen, 2014. "High-Dimensional Methods and Inference on Structural and Treatment Effects," Journal of Economic Perspectives, American Economic Association, vol. 28(2), pages 29-50, Spring.
    5. Esra Zihni & Vince Istvan Madai & Michelle Livne & Ivana Galinovic & Ahmed A Khalil & Jochen B Fiebach & Dietmar Frey, 2020. "Opening the black box of artificial intelligence for clinical decision support: A study predicting stroke outcome," PLOS ONE, Public Library of Science, vol. 15(4), pages 1-15, April.
    6. Roman Salmerón Gómez & José García Pérez & María Del Mar López Martín & Catalina García García, 2016. "Collinearity diagnostic applied in ridge estimation through the variance inflation factor," Journal of Applied Statistics, Taylor & Francis Journals, vol. 43(10), pages 1831-1849, August.
    7. Rongjun Chen & Jinhui Lin, 2020. "Identification of feature risk pathways of smoking-induced lung cancer based on SVM," PLOS ONE, Public Library of Science, vol. 15(6), pages 1-16, June.
    8. Hyunju Dan & Jiyoung Kim & Oksoo Kim, 2020. "Effects of Gender and Age on Dietary Intake and Body Mass Index in Hypertensive Patients: Analysis of the Korea National Health and Nutrition Examination," IJERPH, MDPI, vol. 17(12), pages 1-9, June.
    9. Charles B Breckenridge & Colin Berry & Ellen T Chang & Robert L Sielken Jr. & Jack S Mandel, 2016. "Association between Parkinson’s Disease and Cigarette Smoking, Rural Living, Well-Water Consumption, Farming and Pesticide Use: Systematic Review and Meta-Analysis," PLOS ONE, Public Library of Science, vol. 11(4), pages 1-42, April.

    Citations

    Cited by:

    1. Kwang Ho Park & Erdenebileg Batbaatar & Yongjun Piao & Nipon Theera-Umpon & Keun Ho Ryu, 2021. "Deep Learning Feature Extraction Approach for Hematopoietic Cancer Subtype Classification," IJERPH, MDPI, vol. 18(4), pages 1-24, February.
    2. Cheuk-Kay Sun & Yun-Xuan Tang & Tzu-Chi Liu & Chi-Jie Lu, 2022. "An Integrated Machine Learning Scheme for Predicting Mammographic Anomalies in High-Risk Individuals Using Questionnaire-Based Predictors," IJERPH, MDPI, vol. 19(15), pages 1-17, August.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Croux, Christophe & Jagtiani, Julapa & Korivi, Tarunsai & Vulanovic, Milos, 2020. "Important factors determining Fintech loan default: Evidence from a lendingclub consumer platform," Journal of Economic Behavior & Organization, Elsevier, vol. 173(C), pages 270-296.
    2. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    3. Gonzalez, Felipe & Prem, Mounu & von Dessauer, Cristine, 2023. "Empowerment or Indoctrination? Women Centers Under Dictatorship," SocArXiv 64mf9, Center for Open Science.
    4. Anil Kumar, 2018. "Do Restrictions on Home Equity Extraction Contribute to Lower Mortgage Defaults? Evidence from a Policy Discontinuity at the Texas Border," American Economic Journal: Economic Policy, American Economic Association, vol. 10(1), pages 268-297, February.
    5. Ye, Ya-Fen & Shao, Yuan-Hai & Deng, Nai-Yang & Li, Chun-Na & Hua, Xiang-Yu, 2017. "Robust Lp-norm least squares support vector regression with feature selection," Applied Mathematics and Computation, Elsevier, vol. 305(C), pages 32-52.
    6. Ay, Jean-Sauveur & Le Gallo, Julie, 2021. "The Signaling Values of Nested Wine Names," Working Papers 321851, American Association of Wine Economists.
    7. Vincent, Martin & Hansen, Niels Richard, 2014. "Sparse group lasso and high dimensional multinomial classification," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 771-786.
    8. Yuexin Li & Xiaoyin Ma & Luc Renneboog, 2024. "In Art We Trust," Management Science, INFORMS, vol. 70(1), pages 98-127, January.
    9. Davide Viviano & Jelena Bradic, 2019. "Synthetic learner: model-free inference on treatments over time," Papers 1904.01490, arXiv.org, revised Aug 2022.
    10. Yoganathan, Vignesh & Osburg, Victoria-Sophie, 2024. "The mind in the machine: Estimating mind perception's effect on user satisfaction with voice-based conversational agents," Journal of Business Research, Elsevier, vol. 175(C).
    11. Pedro Carneiro & Sokbae Lee & Daniel Wilhelm, 2020. "Optimal data collection for randomized control trials," The Econometrics Journal, Royal Economic Society, vol. 23(1), pages 1-31.
    12. Bakx, Pieter & Wouterse, Bram & van Doorslaer, Eddy & Wong, Albert, 2020. "Better off at home? Effects of nursing home eligibility on costs, hospitalizations and survival," Journal of Health Economics, Elsevier, vol. 73(C).
    13. Reny Yuliati & Billy Koernianti Sarwono & Abdillah Ahsan & I Gusti Lanang Agung Kharisma Wibhisono & Dian Kusuma, 2021. "Effect of Message Approach and Image Size on Pictorial Health Warning Effectiveness on Cigarette Pack in Indonesia: A Mixed Factorial Experiment," IJERPH, MDPI, vol. 18(13), pages 1-11, June.
    14. Zhu, Manhong & Schmitz, Andrew & Schmitz, Troy G., "undated". "What are the Culprits Causing Obesity? A Machine Learning Approach in Variable Selection and Parameter Coefficient Inference," 2017 Annual Meeting, July 30-August 1, Chicago, Illinois 261220, Agricultural and Applied Economics Association.
    15. Andreas Wagner & Denise Fischer‐Kreer, 2024. "The role of CEO regulatory focus in increasing or reducing corporate carbon emissions," Business Strategy and the Environment, Wiley Blackwell, vol. 33(2), pages 1051-1065, February.
    16. Nelson, Kelly P. & Parton, Lee C. & Brown, Zachary S., 2022. "Biofuels policy and innovation impacts: Evidence from biofuels and agricultural patent indicators," Energy Policy, Elsevier, vol. 162(C).
    17. Sabrin Beg & Waqas Halim & Adrienne M. Lucas & Umar Saif, 2022. "Engaging Teachers with Technology Increased Achievement, Bypassing Teachers Did Not," American Economic Journal: Economic Policy, American Economic Association, vol. 14(2), pages 61-90, May.
    18. Michael C. Knaus, 2021. "A double machine learning approach to estimate the effects of musical practice on student’s skills," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(1), pages 282-300, January.
    19. Clément de Chaisemartin & Nicolás Navarrete H., 2023. "The Direct and Spillover Effects of a Nationwide Socioemotional Learning Program for Disruptive Students," Journal of Labor Economics, University of Chicago Press, vol. 41(3), pages 729-769.
    20. Michael C Knaus & Michael Lechner & Anthony Strittmatter, 2021. "Machine learning estimation of heterogeneous causal effects: Empirical Monte Carlo evidence," The Econometrics Journal, Royal Economic Society, vol. 24(1), pages 134-161.
