Printed from https://ideas.repec.org/a/gam/jmathe/v12y2024i12p1898-d1417805.html

Exploring Data Augmentation and Active Learning Benefits in Imbalanced Datasets

Author

Listed:
  • Luis Moles

    (TECNALIA, Basque Research and Technology Alliance (BRTA), Parque Científico y Tecnológico de Gipuzkoa, 20009 Donostia-San Sebastián, Spain
    Department of Computer Sciences and Artificial Intelligence, University of the Basque Country (UPV/EHU), 20018 Donostia-San Sebastián, Spain)

  • Alain Andres

    (TECNALIA, Basque Research and Technology Alliance (BRTA), Parque Científico y Tecnológico de Gipuzkoa, 20009 Donostia-San Sebastián, Spain)

  • Goretti Echegaray

    (Department of Computer Sciences and Artificial Intelligence, University of the Basque Country (UPV/EHU), 20018 Donostia-San Sebastián, Spain)

  • Fernando Boto

    (Faculty of Engineering, University of Deusto, 20012 Donostia-San Sebastián, Spain)

Abstract

Despite the increasing availability of vast amounts of data, the challenge of acquiring labeled data persists. This issue is particularly serious in supervised learning scenarios, where labeled data are essential for model training. In addition, the rapid growth in data required by cutting-edge technologies such as deep learning makes the task of labeling large datasets impractical. Active learning methods offer a powerful solution by iteratively selecting the most informative unlabeled instances, thereby reducing the amount of labeled data required. However, active learning faces some limitations with imbalanced datasets, where majority class over-representation can bias sample selection. To address this, combining active learning with data augmentation techniques emerges as a promising strategy. Nonetheless, the best way to combine these techniques is not yet clear. Our research addresses this question by analyzing the effectiveness of combining both active learning and data augmentation techniques under different scenarios. Moreover, we focus on improving the generalization capabilities for minority classes, which tend to be overshadowed by the improvement seen in majority classes. For this purpose, we generate synthetic data using multiple data augmentation methods and evaluate the results considering two active learning strategies across three imbalanced datasets. Our study shows that data augmentation enhances prediction accuracy for minority classes, with approaches based on CTGANs obtaining improvements of nearly 50% in some cases. Moreover, we show that combining data augmentation techniques with active learning can reduce the amount of real data required.
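
The loop described in the abstract (fit a model, query the most informative unlabeled instance, augment the minority class, repeat) can be illustrated with a minimal sketch. This is not the authors' implementation: the nearest-centroid classifier, the margin-based query rule, and the Gaussian-jitter oversampling (a simple stand-in for CTGAN-style generators) are all assumptions chosen to keep the example dependency-free, and every function name here is hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroids(points, labels):
    """Mean feature vector per class (a deliberately simple classifier)."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(cents, p):
    """Label of the closest class centroid."""
    return min(cents, key=lambda y: sq_dist(cents[y], p))

def margin(cents, p):
    """Gap between the two closest centroids; a small gap means high uncertainty."""
    d = sorted(sq_dist(c, p) for c in cents.values())
    return d[1] - d[0]

def augment_minority(points, labels, rng, noise=0.1):
    """Oversample smaller classes with jittered copies (assumed stand-in, not CTGAN)."""
    classes = sorted(set(labels))
    counts = {y: labels.count(y) for y in classes}
    target = max(counts.values())
    new_p, new_y = list(points), list(labels)
    for y in classes:
        members = [p for p, l in zip(points, labels) if l == y]
        for _ in range(target - counts[y]):
            base = rng.choice(members)
            new_p.append([v + rng.gauss(0, noise) for v in base])
            new_y.append(y)
    return new_p, new_y

def fit(pool, oracle, labeled, rng):
    """Augment the labeled subset, then fit centroids on real plus synthetic data."""
    pts = [pool[i] for i in labeled]
    lbl = [oracle[i] for i in labeled]
    return centroids(*augment_minority(pts, lbl, rng))

def active_learning(pool, oracle, seed_idx, rounds, rng):
    """Query one label per round, always for the most uncertain pool point.

    The seed must contain at least two classes so that a margin is defined.
    """
    labeled = list(seed_idx)
    for _ in range(rounds):
        cents = fit(pool, oracle, labeled, rng)
        unlabeled = [i for i in range(len(pool)) if i not in labeled]
        if not unlabeled:
            break
        labeled.append(min(unlabeled, key=lambda i: margin(cents, pool[i])))
    return fit(pool, oracle, labeled, rng), labeled

# Toy imbalanced pool: 40 majority points near (0, 0), 5 minority points near (3, 3).
rng = random.Random(0)
pool = [[rng.gauss(0, 0.3), rng.gauss(0, 0.3)] for _ in range(40)]
pool += [[rng.gauss(3, 0.3), rng.gauss(3, 0.3)] for _ in range(5)]
oracle = [0] * 40 + [1] * 5  # true labels, revealed only when queried

cents, labeled = active_learning(pool, oracle, seed_idx=[0, 40], rounds=5, rng=rng)
print(len(labeled), "labels used; prediction at (3, 3):", predict(cents, [3.0, 3.0]))
```

Swapping `augment_minority` for a learned generator, or `margin` for another query strategy, leaves the rest of the loop unchanged, which is the kind of combination the study compares.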

Suggested Citation

  • Luis Moles & Alain Andres & Goretti Echegaray & Fernando Boto, 2024. "Exploring Data Augmentation and Active Learning Benefits in Imbalanced Datasets," Mathematics, MDPI, vol. 12(12), pages 1-39, June.
  • Handle: RePEc:gam:jmathe:v:12:y:2024:i:12:p:1898-:d:1417805

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/12/12/1898/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/12/12/1898/
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Yong Zhang & Feifei Chen & Xinyi Yang & Yiran Guo & Xinghua Zhang & Hong Dong & Weihua Wang & Feng Lu & Zunming Lu & Hui Liu & Hui Liu & Yao Xiao & Yahui Cheng, 2025. "Electronic metal-support interaction modulates Cu electronic structures for CO2 electroreduction to desired products," Nature Communications, Nature, vol. 16(1), pages 1-12, December.
    2. Jikai Sun & Rui Tu & Yuchun Xu & Hongyan Yang & Tie Yu & Dong Zhai & Xiuqin Ci & Weiqiao Deng, 2024. "Machine learning aided design of single-atom alloy catalysts for methane cracking," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    3. Zhiheng Li & Xin Mao & Desheng Feng & Mengran Li & Xiaoyong Xu & Yadan Luo & Linzhou Zhuang & Rijia Lin & Tianjiu Zhu & Fengli Liang & Zi Huang & Dong Liu & Zifeng Yan & Aijun Du & Zongping Shao & Zho, 2024. "Prediction of perovskite oxygen vacancies for oxygen electrocatalysis at different temperatures," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    4. Xiaojie She & Lingling Zhai & Yifei Wang & Pei Xiong & Molly Meng-Jung Li & Tai-Sing Wu & Man Chung Wong & Xuyun Guo & Zhihang Xu & Huaming Li & Hui Xu & Ye Zhu & Shik Chi Edman Tsang & Shu Ping Lau, 2024. "Pure-water-fed, electrocatalytic CO2 reduction to ethylene beyond 1,000 h stability at 10 A," Nature Energy, Nature, vol. 9(1), pages 81-91, January.
    5. Jing Xue & Xue Dong & Chunxiao Liu & Jiawei Li & Yizhou Dai & Weiqing Xue & Laihao Luo & Yuan Ji & Xiao Zhang & Xu Li & Qiu Jiang & Tingting Zheng & Jianping Xiao & Chuan Xia, 2024. "Turning copper into an efficient and stable CO evolution catalyst beyond noble metals," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
    6. Jin Zhang & Chenxi Guo & Susu Fang & Xiaotong Zhao & Le Li & Haoyang Jiang & Zhaoyang Liu & Ziqi Fan & Weigao Xu & Jianping Xiao & Miao Zhong, 2023. "Accelerating electrochemical CO2 reduction to multi-carbon products via asymmetric intermediate binding at confined nanointerfaces," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    7. Dong Hyeon Mok & Hong Li & Guiru Zhang & Chaehyeon Lee & Kun Jiang & Seoin Back, 2023. "Data-driven discovery of electrocatalysts for CO2 reduction using active motifs-based machine learning," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    8. Simon Rufer & Michael P. Nitzsche & Sanjay Garimella & Jack R. Lake & Kripa K. Varanasi, 2024. "Hierarchically conductive electrodes unlock stable and scalable CO2 electrolysis," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    9. Zhilong Song & Linfeng Fan & Shuaihua Lu & Chongyi Ling & Qionghua Zhou & Jinlan Wang, 2025. "Inverse design of promising electrocatalysts for CO2 reduction via generative models and bird swarm algorithm," Nature Communications, Nature, vol. 16(1), pages 1-10, December.
    10. Kangming Li & Daniel Persaud & Kamal Choudhary & Brian DeCost & Michael Greenwood & Jason Hattrick-Simpers, 2023. "Exploiting redundancy in large materials datasets for efficient machine learning with less data," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
    11. Jiawei Li & Hongliang Zeng & Xue Dong & Yimin Ding & Sunpei Hu & Runhao Zhang & Yizhou Dai & Peixin Cui & Zhou Xiao & Donghao Zhao & Liujiang Zhou & Tingting Zheng & Jianping Xiao & Jie Zeng & Chuan X, 2023. "Selective CO2 electrolysis to CO using isolated antimony alloyed copper," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    12. Yufei Cao & Zhu Chen & Peihao Li & Adnan Ozden & Pengfei Ou & Weiyan Ni & Jehad Abed & Erfan Shirzadi & Jinqiang Zhang & David Sinton & Jun Ge & Edward H. Sargent, 2023. "Surface hydroxide promotes CO2 electrolysis to ethylene in acidic conditions," Nature Communications, Nature, vol. 14(1), pages 1-8, December.
    13. Junmei Chen & Haoran Qiu & Yilin Zhao & Haozhou Yang & Lei Fan & Zhihe Liu & ShiBo Xi & Guangtai Zheng & Jiayi Chen & Lei Chen & Ya Liu & Liejin Guo & Lei Wang, 2024. "Selective and stable CO2 electroreduction at high rates via control of local H2O/CO2 ratio," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    14. Jiawei Zhu & Yu Zhang & Zitao Chen & Zhenbao Zhang & Xuezeng Tian & Minghua Huang & Xuedong Bai & Xue Wang & Yongfa Zhu & Heqing Jiang, 2024. "Superexchange-stabilized long-distance Cu sites in rock-salt-ordered double perovskite oxides for CO2 electromethanation," Nature Communications, Nature, vol. 15(1), pages 1-10, December.
    15. Stefan Ringe, 2023. "The importance of a charge transfer descriptor for screening potential CO2 reduction electrocatalysts," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    16. Xiaohan Yu & Yuting Xu & Le Li & Mingzhe Zhang & Wenhao Qin & Fanglin Che & Miao Zhong, 2024. "Coverage enhancement accelerates acidic CO2 electrolysis at ampere-level current with high energy and carbon efficiencies," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    17. Manu Suvarna & Tangsheng Zou & Sok Ho Chong & Yuzhen Ge & Antonio J. Martín & Javier Pérez-Ramírez, 2024. "Active learning streamlines development of high performance catalysts for higher alcohol synthesis," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    18. Chen, Zhangsen & Zhang, Gaixia & Chen, Hangrong & Prakash, Jai & Zheng, Yi & Sun, Shuhui, 2022. "Multi-metallic catalysts for the electroreduction of carbon dioxide: Recent advances and perspectives," Renewable and Sustainable Energy Reviews, Elsevier, vol. 155(C).
    19. Carina Yi Jing Lim & Meltem Yilmaz & Juan Manuel Arce-Ramos & Albertus D. Handoko & Wei Jie Teh & Yuangang Zheng & Zi Hui Jonathan Khoo & Ming Lin & Mark Isaacs & Teck Lip Dexter Tam & Yang Bai & Chee, 2023. "Surface charge as activity descriptors for electrochemical CO2 reduction to multi-carbon products on organic-functionalised Cu," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    20. Cheng Du & Joel P. Mills & Asfaw G. Yohannes & Wei Wei & Lei Wang & Siyan Lu & Jian-Xiang Lian & Maoyu Wang & Tao Guo & Xiyang Wang & Hua Zhou & Cheng-Jun Sun & John Z. Wen & Brian Kendall & Martin Co, 2023. "Cascade electrocatalysis via AgCu single-atom alloy and Ag nanoparticles in CO2 electroreduction toward multicarbon products," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
