
Exploring Data Augmentation and Active Learning Benefits in Imbalanced Datasets

Author

Listed:
  • Luis Moles

    (TECNALIA, Basque Research and Technology Alliance (BRTA), Parque Científico y Tecnológico de Gipuzkoa, 20009 Donostia-San Sebastián, Spain
    Department of Computer Sciences and Artificial Intelligence, University of the Basque Country (UPV/EHU), 20018 Donostia-San Sebastián, Spain)

  • Alain Andres

    (TECNALIA, Basque Research and Technology Alliance (BRTA), Parque Científico y Tecnológico de Gipuzkoa, 20009 Donostia-San Sebastián, Spain)

  • Goretti Echegaray

    (Department of Computer Sciences and Artificial Intelligence, University of the Basque Country (UPV/EHU), 20018 Donostia-San Sebastián, Spain)

  • Fernando Boto

    (Faculty of Engineering, University of Deusto, 20012 Donostia-San Sebastián, Spain)

Abstract

Despite the increasing availability of vast amounts of data, the challenge of acquiring labeled data persists. This issue is particularly serious in supervised learning scenarios, where labeled data are essential for model training. In addition, the rapid growth in data required by cutting-edge technologies such as deep learning makes the task of labeling large datasets impractical. Active learning methods offer a powerful solution by iteratively selecting the most informative unlabeled instances, thereby reducing the amount of labeled data required. However, active learning faces some limitations with imbalanced datasets, where majority class over-representation can bias sample selection. To address this, combining active learning with data augmentation techniques emerges as a promising strategy. Nonetheless, the best way to combine these techniques is not yet clear. Our research addresses this question by analyzing the effectiveness of combining both active learning and data augmentation techniques under different scenarios. Moreover, we focus on improving the generalization capabilities for minority classes, which tend to be overshadowed by the improvement seen in majority classes. For this purpose, we generate synthetic data using multiple data augmentation methods and evaluate the results considering two active learning strategies across three imbalanced datasets. Our study shows that data augmentation enhances prediction accuracy for minority classes, with approaches based on CTGANs obtaining improvements of nearly 50% in some cases. Moreover, we show that combining data augmentation techniques with active learning can reduce the amount of real data required.
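As a rough illustration of the two ingredients the abstract combines, the sketch below ranks unlabeled instances by predictive uncertainty and creates synthetic minority-class rows. Everything here is an assumption made for illustration: the abstract does not name its two active learning strategies, so entropy-based uncertainty sampling is used as a common example, and Gaussian jitter stands in for the CTGAN-based and other augmentation methods the study evaluates. The function names and toy data are hypothetical.

```python
import math
import random


def entropy(probs):
    """Shannon entropy of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs, k):
    """Return indices of the k pool instances with the highest predictive entropy.

    Assumption: uncertainty sampling; the paper's strategies may differ.
    """
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]


def jitter_oversample(minority_rows, n_new, scale=0.05, seed=0):
    """Create n_new synthetic rows by adding small Gaussian noise to
    randomly chosen minority-class rows.

    Assumption: a simple stand-in for augmentation, not a CTGAN.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_rows)
        synthetic.append([x + rng.gauss(0.0, scale) for x in base])
    return synthetic


# Toy predicted probabilities for four unlabeled instances (two classes).
pool_probs = [[0.98, 0.02], [0.55, 0.45], [0.80, 0.20], [0.50, 0.50]]
query_indices = select_most_uncertain(pool_probs, k=2)

# Toy minority-class feature rows to augment.
minority = [[1.0, 2.0], [1.2, 1.8]]
synthetic_rows = jitter_oversample(minority, n_new=3)
```

In an iterative loop, the instances at `query_indices` would be sent for labeling and added to the training set together with `synthetic_rows`, and the model would then be retrained before the next selection round.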

Suggested Citation

  • Luis Moles & Alain Andres & Goretti Echegaray & Fernando Boto, 2024. "Exploring Data Augmentation and Active Learning Benefits in Imbalanced Datasets," Mathematics, MDPI, vol. 12(12), pages 1-39, June.
  • Handle: RePEc:gam:jmathe:v:12:y:2024:i:12:p:1898-:d:1417805

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/12/12/1898/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/12/12/1898/
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Cheng Du & Joel P. Mills & Asfaw G. Yohannes & Wei Wei & Lei Wang & Siyan Lu & Jian-Xiang Lian & Maoyu Wang & Tao Guo & Xiyang Wang & Hua Zhou & Cheng-Jun Sun & John Z. Wen & Brian Kendall & Martin Co, 2023. "Cascade electrocatalysis via AgCu single-atom alloy and Ag nanoparticles in CO2 electroreduction toward multicarbon products," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
    2. Xiaoyun Lin & Xiaowei Du & Shican Wu & Shiyu Zhen & Wei Liu & Chunlei Pei & Peng Zhang & Zhi-Jian Zhao & Jinlong Gong, 2024. "Machine learning-assisted dual-atom sites design with interpretable descriptors unifying electrocatalytic reactions," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    3. SJ, Balaji & Babu, Suresh Chandra & Pal, Suresh, 2021. "Understanding Science and Policy Making in Agriculture: A Machine Learning Application for India," 2021 Conference, August 17-31, 2021, Virtual 315227, International Association of Agricultural Economists.
    4. Bo Peng & Ye Wei & Yu Qin & Jiabao Dai & Yue Li & Aobo Liu & Yun Tian & Liuliu Han & Yufeng Zheng & Peng Wen, 2023. "Machine learning-enabled constrained multi-objective design of architected materials," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    5. Hefei Li & Pengfei Wei & Tianfu Liu & Mingrun Li & Chao Wang & Rongtan Li & Jinyu Ye & Zhi-You Zhou & Shi-Gang Sun & Qiang Fu & Dunfeng Gao & Guoxiong Wang & Xinhe Bao, 2024. "CO electrolysis to multicarbon products over grain boundary-rich Cu nanoparticles in membrane electrode assembly electrolyzers," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
    6. Jikai Sun & Rui Tu & Yuchun Xu & Hongyan Yang & Tie Yu & Dong Zhai & Xiuqin Ci & Weiqiao Deng, 2024. "Machine learning aided design of single-atom alloy catalysts for methane cracking," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    7. Kaihang Yue & Yanyang Qin & Honghao Huang & Zhuoran Lv & Mingzhi Cai & Yaqiong Su & Fuqiang Huang & Ya Yan, 2024. "Stabilized Cu0-Cu1+ dual sites in a cyanamide framework for selective CO2 electroreduction to ethylene," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    8. Jiaqi Feng & Limin Wu & Xinning Song & Libing Zhang & Shunhan Jia & Xiaodong Ma & Xingxing Tan & Xinchen Kang & Qinggong Zhu & Xiaofu Sun & Buxing Han, 2024. "CO2 electrolysis to multi-carbon products in strong acid at ampere-current levels on La-Cu spheres with channels," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
    9. Tim Möller & Michael Filippi & Sven Brückner & Wen Ju & Peter Strasser, 2023. "A CO2 electrolyzer tandem cell system for CO2-CO co-feed valorization in a Ni-N-C/Cu-catalyzed reaction cascade," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
    10. Yizhou Dai & Huan Li & Chuanhao Wang & Weiqing Xue & Menglu Zhang & Donghao Zhao & Jing Xue & Jiawei Li & Laihao Luo & Chunxiao Liu & Xu Li & Peixin Cui & Qiu Jiang & Tingting Zheng & Songqi Gu & Yao , 2023. "Manipulating local coordination of copper single atom catalyst enables efficient CO2-to-CH4 conversion," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    11. Xiaojie She & Lingling Zhai & Yifei Wang & Pei Xiong & Molly Meng-Jung Li & Tai-Sing Wu & Man Chung Wong & Xuyun Guo & Zhihang Xu & Huaming Li & Hui Xu & Ye Zhu & Shik Chi Edman Tsang & Shu Ping Lau, 2024. "Pure-water-fed, electrocatalytic CO2 reduction to ethylene beyond 1,000 h stability at 10 A," Nature Energy, Nature, vol. 9(1), pages 81-91, January.
    12. Jing Xue & Xue Dong & Chunxiao Liu & Jiawei Li & Yizhou Dai & Weiqing Xue & Laihao Luo & Yuan Ji & Xiao Zhang & Xu Li & Qiu Jiang & Tingting Zheng & Jianping Xiao & Chuan Xia, 2024. "Turning copper into an efficient and stable CO evolution catalyst beyond noble metals," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
    13. Tian, Di & Wu, Ruobing & Qu, Zhiguo & Wang, Hui, 2024. "A systematic analysis and optimization of bicarbonate electrolysis based on a bipolar membrane through multiscale simulation," Applied Energy, Elsevier, vol. 364(C).
    14. Jin Zhang & Chenxi Guo & Susu Fang & Xiaotong Zhao & Le Li & Haoyang Jiang & Zhaoyang Liu & Ziqi Fan & Weigao Xu & Jianping Xiao & Miao Zhong, 2023. "Accelerating electrochemical CO2 reduction to multi-carbon products via asymmetric intermediate binding at confined nanointerfaces," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    15. Dong Hyeon Mok & Hong Li & Guiru Zhang & Chaehyeon Lee & Kun Jiang & Seoin Back, 2023. "Data-driven discovery of electrocatalysts for CO2 reduction using active motifs-based machine learning," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    16. Gong Zhang & Tuo Wang & Mengmeng Zhang & Lulu Li & Dongfang Cheng & Shiyu Zhen & Yongtao Wang & Jian Qin & Zhi-Jian Zhao & Jinlong Gong, 2022. "Selective CO2 electroreduction to methanol via enhanced oxygen bonding," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    17. Kangming Li & Daniel Persaud & Kamal Choudhary & Brian DeCost & Michael Greenwood & Jason Hattrick-Simpers, 2023. "Exploiting redundancy in large materials datasets for efficient machine learning with less data," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
    18. Adarsh Dave & Jared Mitchell & Sven Burke & Hongyi Lin & Jay Whitacre & Venkatasubramanian Viswanathan, 2022. "Autonomous optimization of non-aqueous Li-ion battery electrolytes via robotic experimentation and machine learning coupling," Nature Communications, Nature, vol. 13(1), pages 1-9, December.
    19. Jiawei Li & Hongliang Zeng & Xue Dong & Yimin Ding & Sunpei Hu & Runhao Zhang & Yizhou Dai & Peixin Cui & Zhou Xiao & Donghao Zhao & Liujiang Zhou & Tingting Zheng & Jianping Xiao & Jie Zeng & Chuan X, 2023. "Selective CO2 electrolysis to CO using isolated antimony alloyed copper," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    20. Yufei Cao & Zhu Chen & Peihao Li & Adnan Ozden & Pengfei Ou & Weiyan Ni & Jehad Abed & Erfan Shirzadi & Jinqiang Zhang & David Sinton & Jun Ge & Edward H. Sargent, 2023. "Surface hydroxide promotes CO2 electrolysis to ethylene in acidic conditions," Nature Communications, Nature, vol. 14(1), pages 1-8, December.
