Printed from https://ideas.repec.org/a/wsi/jikmxx/v19y2020i01ns0219649220400146.html

Data Imbalance in Autism Pre-Diagnosis Classification Systems: An Experimental Study

Author

Listed:
  • Neda Abdelhamid

    (IT Programme, Auckland Institute of Studies, Auckland, New Zealand)

  • Arun Padmavathy

    (Digital Technologies, Manukau Institute of Technology, Auckland, New Zealand)

  • David Peebles

    (Department of Psychology, University of Huddersfield, Queensgate, Huddersfield HD1 3DH, UK)

  • Fadi Thabtah

    (Digital Technologies, Manukau Institute of Technology, Auckland, New Zealand)

  • Daymond Goulder-Horobin

    (Digital Technologies, Manukau Institute of Technology, Auckland, New Zealand)

Abstract

Machine learning (ML) is a branch of computer science that is rapidly gaining popularity within the healthcare arena due to its ability to explore large datasets and discover useful patterns that can be interpreted for decision-making and prediction. ML techniques are used to analyse clinical parameters and their combinations for prognosis, therapy planning and support, and patient management and wellbeing. In this research, we investigate a crucial problem associated with medical applications such as autism spectrum disorder (ASD): data imbalance, in which cases far outnumber controls in the dataset. In autism diagnosis data, most instances are linked with one class, i.e. the no-ASD class is larger than the ASD class, and this may cause performance issues such as models favouring the majority class and undermining the minority class. This research experimentally measures the impact of the class imbalance issue on the performance of different classifiers on real autism datasets when various data imbalance approaches are utilised in the pre-processing phase. We employ oversampling techniques, such as the Synthetic Minority Oversampling Technique (SMOTE), and undersampling with different classifiers, including Naive Bayes, RIPPER, C4.5 and Random Forest, to measure their impact on the performance of the derived models in terms of area under the curve and other metrics. Results indicate that oversampling techniques are superior to undersampling techniques, at least for the toddlers’ autism dataset that we consider, and suggest that further work should look at incorporating sampling techniques with feature selection to generate models that do not overfit the dataset.
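
The two resampling ideas named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation or tooling; the function names (`smote_sketch`, `random_undersample`), the toy data and the parameter values are hypothetical and chosen only for illustration. It assumes numeric features and Euclidean distance, which may not match the encoding of the autism screening data.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """SMOTE-style oversampling: interpolate between a minority point
    and one of its k nearest minority neighbours (hypothetical helper)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        # The new sample lies at a random position on the segment base -> neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def random_undersample(majority, n_keep, seed=0):
    """Random undersampling: keep n_keep majority samples, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(majority, n_keep)


# Toy data (made up for illustration): 20 majority vs 4 minority samples.
majority = [[float(x), float(x % 3)] for x in range(20)]
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Oversampling grows the minority class up to the majority size.
oversampled_minority = minority + smote_sketch(
    minority, len(majority) - len(minority), k=3
)
# Undersampling shrinks the majority class down to the minority size.
undersampled_majority = random_undersample(majority, len(minority))
```

Either balanced set would then be passed to a classifier in place of the original data. Oversampling keeps every original record, whereas undersampling discards majority records, which is one plausible reason the two approaches can perform differently on small datasets.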

Suggested Citation

  • Neda Abdelhamid & Arun Padmavathy & David Peebles & Fadi Thabtah & Daymond Goulder-Horobin, 2020. "Data Imbalance in Autism Pre-Diagnosis Classification Systems: An Experimental Study," Journal of Information & Knowledge Management (JIKM), World Scientific Publishing Co. Pte. Ltd., vol. 19(01), pages 1-16, March.
  • Handle: RePEc:wsi:jikmxx:v:19:y:2020:i:01:n:s0219649220400146
    DOI: 10.1142/S0219649220400146

    Download full text from publisher

    File URL: https://www.worldscientific.com/doi/abs/10.1142/S0219649220400146
    Download Restriction: Access to full text is restricted to subscribers

    File URL: https://libkey.io/10.1142/S0219649220400146?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Qiang Yang & Xindong Wu, 2006. "10 Challenging Problems In Data Mining Research," International Journal of Information Technology & Decision Making (IJITDM), World Scientific Publishing Co. Pte. Ltd., vol. 5(04), pages 597-604.

    Citations

    Cited by:

    1. Gulsum Alicioglu & Bo Sun & Shen Shyang Ho, 2022. "An Injury-Severity-Prediction-Driven Accident Prevention System," Sustainability, MDPI, vol. 14(11), pages 1-15, May.
    2. Sung-Mook Oh & Jin Park & Jinsun Yang & Young-Gyun Oh & Kyung-Woo Yi, 2023. "Smart classification method to detect irregular nozzle spray patterns inside carbon black reactor using ensemble transfer learning," Journal of Intelligent Manufacturing, Springer, vol. 34(6), pages 2729-2745, August.
    3. Yang Hui & Xuesong Mei & Gedong Jiang & Fei Zhao & Ziwei Ma & Tao Tao, 2022. "Assembly quality evaluation for linear axis of machine tool using data-driven modeling approach," Journal of Intelligent Manufacturing, Springer, vol. 33(3), pages 753-769, March.
    4. Lin, Fengming & Fang, Shu-Cherng & Fang, Xiaolei & Gao, Zheming & Luo, Jian, 2024. "A distributionally robust chance-constrained kernel-free quadratic surface support vector machine," European Journal of Operational Research, Elsevier, vol. 316(1), pages 46-60.
    5. Zhang, Sainan & Zhang, Jun & Song, Weiguo & Yang, Longnan & Zhao, Xuedan, 2024. "Hierarchical-attention-based neural network for gait emotion recognition," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 637(C).
    6. Xinchun Zhu & Yang Wu & Xu Zhao & Yunchen Yang & Shuangquan Liu & Luyi Shi & Yelong Wu, 2024. "Overview of Wind and Photovoltaic Data Stream Classification and Data Drift Issues," Energies, MDPI, vol. 17(17), pages 1-24, September.
    7. Janis Ivanovs & Andreas Haberl & Raitis Melniks, 2024. "Modeling Geospatial Distribution of Peat Layer Thickness Using Machine Learning and Aerial Laser Scanning Data," Land, MDPI, vol. 13(4), pages 1-14, April.
    8. Muhammad Asif Ali Rehmani & Saad Aslam & Shafiqur Rahman Tito & Snjezana Soltic & Pieter Nieuwoudt & Neel Pandey & Mollah Daud Ahmed, 2021. "Power Profile and Thresholding Assisted Multi-Label NILM Classification," Energies, MDPI, vol. 14(22), pages 1-18, November.
    9. Erdener, Burcin Cakir & Feng, Cong & Doubleday, Kate & Florita, Anthony & Hodge, Bri-Mathias, 2022. "A review of behind-the-meter solar forecasting," Renewable and Sustainable Energy Reviews, Elsevier, vol. 160(C).
    10. Mubarak Alrumaidhi & Mohamed M. G. Farag & Hesham A. Rakha, 2023. "Comparative Analysis of Parametric and Non-Parametric Data-Driven Models to Predict Road Crash Severity among Elderly Drivers Using Synthetic Resampling Techniques," Sustainability, MDPI, vol. 15(13), pages 1-30, June.
    11. Firuz Kamalov & Linda Smail & Ikhlaas Gurrib, 2021. "Forecasting with Deep Learning: S&P 500 index," Papers 2103.14080, arXiv.org.
    12. Chen, Shiuann-Shuoh & Choubey, Bhaskar & Singh, Vinay, 2021. "A neural network based price sensitive recommender model to predict customer choices based on price effect," Journal of Retailing and Consumer Services, Elsevier, vol. 61(C).
    13. Rahman, Md Jahidur & Zhu, Hongtao, 2024. "Detecting accounting fraud in family firms: Evidence from machine learning approaches," Advances in accounting, Elsevier, vol. 64(C).
    14. Maria Tragouda & Michalis Doumpos & Constantin Zopounidis, 2024. "Identification of fraudulent financial statements through a multi‐label classification approach," Intelligent Systems in Accounting, Finance and Management, John Wiley & Sons, Ltd., vol. 31(2), June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. DE CNUDDE, Sofie & MARTENS, David & EVGENIOU, Theodoros & PROVOST, Foster, 2017. "A benchmarking study of classification techniques for behavioral data," Working Papers 2017005, University of Antwerp, Faculty of Business and Economics.
    2. Liao, Jui-Jung & Shih, Ching-Hui & Chen, Tai-Feng & Hsu, Ming-Fu, 2014. "An ensemble-based model for two-class imbalanced financial problem," Economic Modelling, Elsevier, vol. 37(C), pages 175-183.
    3. Pancheng Wang & Shasha Li & Haifang Zhou & Jintao Tang & Ting Wang, 2019. "Cited text spans identification with an improved balanced ensemble model," Scientometrics, Springer;Akadémiai Kiadó, vol. 120(3), pages 1111-1145, September.
    4. Keng-Hoong Ng & Chin-Kuan Ho & Somnuk Phon-Amnuaisuk, 2012. "A Hybrid Distance Measure for Clustering Expressed Sequence Tags Originating from the Same Gene Family," PLOS ONE, Public Library of Science, vol. 7(10), pages 1-14, October.
    5. Yan Li & Manoj Thomas & Kweku-Muata Osei-Bryson & Jason Levy, 2016. "Problem Formulation in Knowledge Discovery via Data Analytics (KDDA) for Environmental Risk Management," IJERPH, MDPI, vol. 13(12), pages 1-17, December.
    6. Harshita Patel & Dharmendra Singh Rajput & G Thippa Reddy & Celestine Iwendi & Ali Kashif Bashir & Ohyun Jo, 2020. "A review on classification of imbalanced data for wireless sensor networks," International Journal of Distributed Sensor Networks, vol. 16(4), pages 15501477209, April.
    7. Qi Liu & Gengzhong Feng & Nengmin Wang & Giri Kumar Tayi, 2018. "A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge," Information Systems Frontiers, Springer, vol. 20(2), pages 401-416, April.
    8. Vilém Novák & Soheyla Mirshahi, 2021. "On the Similarity and Dependence of Time Series," Mathematics, MDPI, vol. 9(5), pages 1-14, March.
    9. Riesgo García, María Victoria & Krzemień, Alicja & Manzanedo del Campo, Miguel Ángel & Escanciano García-Miranda, Carmen & Sánchez Lasheras, Fernando, 2018. "Rare earth elements price forecasting by means of transgenic time series developed with ARIMA models," Resources Policy, Elsevier, vol. 59(C), pages 95-102.
    10. Ionuţ ŢĂRANU, 2016. "Data mining in healthcare: decision making and precision," Database Systems Journal, Academy of Economic Studies - Bucharest, Romania, vol. 6(4), pages 33-40, May.
    11. Li, Hailin, 2017. "Distance measure with improved lower bound for multivariate time series," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 468(C), pages 622-637.
    12. Qi Liu & Gengzhong Feng & Nengmin Wang & Giri Kumar Tayi, 0. "A multi-objective model for discovering high-quality knowledge based on data quality and prior knowledge," Information Systems Frontiers, Springer, vol. 0, pages 1-16.
    13. Hady Suryono & Heri Kuswanto & Nur Iriawan, 2022. "Two-Phase Stratified Random Forest for Paddy Growth Phase Classification: A Case of Imbalanced Data," Sustainability, MDPI, vol. 14(22), pages 1-13, November.
