The Effect of Preprocessing Techniques, Applied to Numeric Features, on Classification Algorithms’ Performance

Authors

Listed:
  • Esra’a Alshdaifat

    (Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan)

  • Doa’a Alshdaifat

    (Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan)

  • Ayoub Alsarhan

    (Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan)

  • Fairouz Hussein

    (Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan)

  • Subhieh Moh’d Faraj S. El-Salhi

    (Department of Computer Information System, Faculty of Prince Al-Hussein Bin Abdallah II For Information Technology, The Hashemite University, P.O. Box 330127, Zarqa 13133, Jordan)

Abstract

It is recognized that the performance of any prediction model depends on several factors. One of the most significant is the choice of preprocessing techniques; preprocessing is essential for building an effective and efficient classification model. This paper investigates the impact of the most widely used preprocessing techniques for numeric features on the performance of classification algorithms. The effect of combining various normalization techniques with missing-value handling strategies is assessed on eighteen benchmark datasets, using two well-known classification algorithms, several performance evaluation metrics, and statistical significance tests. According to the reported experimental results, the impact of the adopted preprocessing techniques varies from one classification algorithm to another. In addition, a statistically significant difference between the considered data preprocessing techniques is demonstrated.
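
To make the idea of combining a missing-value strategy with a normalization technique concrete, here is a minimal sketch in Python. The abstract does not name the specific techniques evaluated, so the choices below (mean imputation, min-max scaling, z-score standardization) are assumptions picked for illustration, and all function names and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def impute_mean(column):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]

def min_max(column):
    """Rescale a column linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

def z_score(column):
    """Center a column on its mean and divide by its population std. dev."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

def preprocess(rows, normalizer):
    """Impute, then normalize, each numeric feature (column) independently."""
    columns = [list(col) for col in zip(*rows)]
    cleaned = [normalizer(impute_mean(col)) for col in columns]
    return [list(row) for row in zip(*cleaned)]

# Toy data (assumed): two numeric features, one missing value.
rows = [
    [1.0, 200.0],
    [2.0, None],
    [3.0, 400.0],
]

scaled = preprocess(rows, min_max)        # imputation + min-max
standardized = preprocess(rows, z_score)  # imputation + z-score
```

Each pairing of an imputation step with a normalizer yields a differently preprocessed copy of the same data, and a classifier could then be trained and scored on each copy to compare them. In a real experiment the imputation and scaling statistics would normally be computed on the training split only.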

Suggested Citation

  • Esra’a Alshdaifat & Doa’a Alshdaifat & Ayoub Alsarhan & Fairouz Hussein & Subhieh Moh’d Faraj S. El-Salhi, 2021. "The Effect of Preprocessing Techniques, Applied to Numeric Features, on Classification Algorithms’ Performance," Data, MDPI, vol. 6(2), pages 1-23, January.
  • Handle: RePEc:gam:jdataj:v:6:y:2021:i:2:p:11-:d:484845

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/6/2/11/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/6/2/11/
    Download Restriction: no

    References listed on IDEAS

    1. Crone, Sven F. & Lessmann, Stefan & Stahlbock, Robert, 2006. "The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing," European Journal of Operational Research, Elsevier, vol. 173(3), pages 781-800, September.
    2. Akçay, Hüseyin & Filik, Tansu, 2017. "Short-term wind speed forecasting by spectral analysis from long-term observations with missing values," Applied Energy, Elsevier, vol. 191(C), pages 653-662.

    Citations

    Cited by:

    1. Fairouz Hussein & Ayat Al-Ahmad & Subhieh El-Salhi & Esra’a Alshdaifat & Mo’taz Al-Hami, 2022. "Advances in Contextual Action Recognition: Automatic Cheating Detection Using Machine Learning Techniques," Data, MDPI, vol. 7(9), pages 1-13, August.
    2. Samuka Mohanty & Rajashree Dash, 2023. "A New Dual Normalization for Enhancing the Bitcoin Pricing Capability of an Optimized Low Complexity Neural Net with TOPSIS Evaluation," Mathematics, MDPI, vol. 11(5), pages 1-28, February.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Lee, In Gyu & Yoon, Sang Won & Won, Daehan, 2022. "A Mixed Integer Linear Programming Support Vector Machine for Cost-Effective Group Feature Selection: Branch-Cut-and-Price Approach," European Journal of Operational Research, Elsevier, vol. 299(3), pages 1055-1068.
    2. Crone, Sven F. & Finlay, Steven, 2012. "Instance sampling in credit scoring: An empirical study of sample size and balancing," International Journal of Forecasting, Elsevier, vol. 28(1), pages 224-238.
    3. Georgios Marinakos & Sophia Daskalaki, 2017. "Imbalanced customer classification for bank direct marketing," Journal of Marketing Analytics, Palgrave Macmillan, vol. 5(1), pages 14-30, March.
    4. Brandner, Hubertus & Lessmann, Stefan & Voß, Stefan, 2013. "A memetic approach to construct transductive discrete support vector machines," European Journal of Operational Research, Elsevier, vol. 230(3), pages 581-595.
    5. Lessmann, Stefan & Baesens, Bart & Seow, Hsin-Vonn & Thomas, Lyn C., 2015. "Benchmarking state-of-the-art classification algorithms for credit scoring: An update of research," European Journal of Operational Research, Elsevier, vol. 247(1), pages 124-136.
    6. K. W. De Bock & D. Van Den Poel, 2012. "Reconciling Performance and Interpretability in Customer Churn Prediction using Ensemble Learning based on Generalized Additive Models," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 12/805, Ghent University, Faculty of Economics and Business Administration.
    7. Bose, Indranil & Chen, Xi, 2009. "Quantitative models for direct marketing: A review from systems perspective," European Journal of Operational Research, Elsevier, vol. 195(1), pages 1-16, May.
    8. Stefan Lessmann & Stefan Voß, 2010. "Customer-Centric Decision Support," Business & Information Systems Engineering: The International Journal of WIRTSCHAFTSINFORMATIK, Springer;Gesellschaft für Informatik e.V. (GI), vol. 2(2), pages 79-93, April.
    9. Coussement, Kristof & Buckinx, Wouter, 2011. "A probability-mapping algorithm for calibrating the posterior probabilities: A direct marketing application," European Journal of Operational Research, Elsevier, vol. 214(3), pages 732-738, November.
    10. Ibrahim Al-Shourbaji & Pramod H. Kachare & Samah Alshathri & Salahaldeen Duraibi & Bushra Elnaim & Mohamed Abd Elaziz, 2022. "An Efficient Parallel Reptile Search Algorithm and Snake Optimizer Approach for Feature Selection," Mathematics, MDPI, vol. 10(13), pages 1-20, July.
    11. De Bock, Koen W. & Coussement, Kristof & Caigny, Arno De & Słowiński, Roman & Baesens, Bart & Boute, Robert N. & Choi, Tsan-Ming & Delen, Dursun & Kraus, Mathias & Lessmann, Stefan & Maldonado, Sebast, 2024. "Explainable AI for Operational Research: A defining framework, methods, applications, and a research agenda," European Journal of Operational Research, Elsevier, vol. 317(2), pages 249-272.
    12. Koen W. de Bock & Kristof Coussement & Arno De Caigny & Roman Slowiński & Bart Baesens & Robert N Boute & Tsan-Ming Choi & Dursun Delen & Mathias Kraus & Stefan Lessmann & Sebastián Maldonado & David , 2023. "Explainable AI for Operational Research: A Defining Framework, Methods, Applications, and a Research Agenda," Post-Print hal-04219546, HAL.
    13. Meisel, Stephan & Mattfeld, Dirk, 2010. "Synergies of Operations Research and Data Mining," European Journal of Operational Research, Elsevier, vol. 206(1), pages 1-10, October.
    14. Lessmann, Stefan & Voß, Stefan, 2009. "A reference model for customer-centric data mining with support vector machines," European Journal of Operational Research, Elsevier, vol. 199(2), pages 520-530, December.
    15. Coussement, Kristof & De Bock, Koen W., 2013. "Customer churn prediction in the online gambling industry: The beneficial effect of ensemble learning," Journal of Business Research, Elsevier, vol. 66(9), pages 1629-1636.
    16. Li, Ranran & Jin, Yu, 2018. "A wind speed interval prediction system based on multi-objective optimization for machine learning method," Applied Energy, Elsevier, vol. 228(C), pages 2207-2220.
    17. Kirchner-Bossi, Nicolas & Kathari, Gabriel & Porté-Agel, Fernando, 2024. "A hybrid physics-based and data-driven model for intra-day and day-ahead wind power forecasting considering a drastically expanded predictor search space," Applied Energy, Elsevier, vol. 367(C).
    18. Song, Jingjing & Wang, Jianzhou & Lu, Haiyan, 2018. "A novel combined model based on advanced optimization algorithm for short-term wind speed forecasting," Applied Energy, Elsevier, vol. 215(C), pages 643-658.
    19. Peplinski, McKenna & Dilkina, Bistra & Chen, Mo & Silva, Sam J. & Ban-Weiss, George A. & Sanders, Kelly T., 2024. "A machine learning framework to estimate residential electricity demand based on smart meter electricity, climate, building characteristics, and socioeconomic datasets," Applied Energy, Elsevier, vol. 357(C).
    20. Aly, Hamed H.H., 2020. "A novel deep learning intelligent clustered hybrid models for wind speed and power forecasting," Energy, Elsevier, vol. 213(C).
