Printed from https://ideas.repec.org/a/gam/jmathe/v8y2020i6p900-d366515.html

EvoPreprocess—Data Preprocessing Framework with Nature-Inspired Optimization Algorithms

Author

Listed:
  • Sašo Karakatič

    (Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor 2000, Slovenia)

Abstract

The quality of machine learning models can suffer when they are trained on inappropriate data, a problem that is especially prevalent in high-dimensional and imbalanced data sets. Data preparation and preprocessing can mitigate some of these problems and thus yield better models. Meta-heuristic and nature-inspired methods for data preprocessing have become common, but they are still not readily available to practitioners through a simple and extendable application programming interface (API). This paper presents EvoPreprocess, an open-source Python framework that preprocesses data using evolutionary and nature-inspired optimization algorithms. The framework addresses three main problems in supervised machine learning: data sampling (simultaneous over- and under-sampling of data instances), feature selection, and data weighting. EvoPreprocess provides a simple, object-oriented, and parallelized API for these preprocessing tasks and can be used with the scikit-learn and imbalanced-learn Python machine learning libraries. The framework uses well-known, self-adaptive, nature-inspired meta-heuristic algorithms and can easily be extended with custom optimization and evaluation strategies. The paper presents the architecture of the framework, its use, experimental results, and a comparison with other common preprocessing approaches.
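
To make the idea of feature selection driven by an evolutionary search more concrete, the following is a minimal, standard-library-only sketch. It is not the EvoPreprocess API and does not reproduce the paper's algorithms: the toy data generator, the nearest-centroid evaluator, the per-feature penalty of 0.01, and every function name (`make_toy_data`, `centroid_accuracy`, `evolve_mask`) are assumptions introduced purely for illustration. A simple genetic-style loop evolves a binary mask over the features, and each mask is scored by hold-out accuracy minus a small penalty per selected feature.

```python
import random


def make_toy_data(n=200, n_features=6, seed=0):
    """Hypothetical toy data: the label depends on features 0 and 2 only."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        X.append(row)
        y.append(1 if row[0] + row[2] > 0 else 0)
    return X, y


def centroid_accuracy(X_train, y_train, X_test, y_test, mask):
    """Hold-out accuracy of a nearest-centroid classifier on the masked features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    if not cols:
        return 0.0  # an empty feature set cannot classify anything
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X_test, y_test):
        dist = {
            label: sum((x[j] - c) ** 2 for j, c in zip(cols, centre))
            for label, centre in centroids.items()
        }
        correct += min(dist, key=dist.get) == t
    return correct / len(y_test)


def evolve_mask(fitness, n_features, pop_size=20, generations=30, seed=1):
    """Genetic-style search over binary feature masks (illustrative only)."""
    rng = random.Random(seed)
    cache = {}

    def score(mask):
        # Memoize fitness values, since the same mask is evaluated many times.
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(mask)
        return cache[key]

    # Seed the population with the "keep every feature" mask plus random masks.
    population = [[1] * n_features]
    population += [
        [rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        next_pop = [ranked[0][:]]  # elitism: the best mask always survives
        while len(next_pop) < pop_size:
            # Tournament selection of two parents, three candidates each.
            a = max(rng.sample(ranked, 3), key=score)
            b = max(rng.sample(ranked, 3), key=score)
            # Uniform crossover followed by bit-flip mutation.
            child = [rng.choice(pair) for pair in zip(a, b)]
            child = [1 - g if rng.random() < 1.0 / n_features else g for g in child]
            next_pop.append(child)
        population = next_pop
    return max(population, key=score)


X, y = make_toy_data()
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]


def fitness(mask):
    # Accuracy minus a small, assumed penalty per selected feature.
    return centroid_accuracy(X_train, y_train, X_test, y_test, mask) - 0.01 * sum(mask)


best_mask = evolve_mask(fitness, n_features=6)
print(best_mask, round(fitness(best_mask), 3))
```

Because the all-ones mask is in the initial population and the best mask is carried over unchanged each generation, the returned mask never scores worse than using every feature. In a real workflow, the evaluator would typically be a cross-validated scikit-learn estimator rather than this hand-written classifier.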

Suggested Citation

  • Sašo Karakatič, 2020. "EvoPreprocess—Data Preprocessing Framework with Nature-Inspired Optimization Algorithms," Mathematics, MDPI, vol. 8(6), pages 1-29, June.
  • Handle: RePEc:gam:jmathe:v:8:y:2020:i:6:p:900-:d:366515

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/8/6/900/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/8/6/900/
    Download Restriction: no


