A Novel Low-Query-Budget Active Learner with Pseudo-Labels for Imbalanced Data

Author

Listed:
  • Alaa Tharwat

    (Center for Applied Data Science Gütersloh (CfADS), FH Bielefeld-University of Applied Sciences, 33619 Bielefeld, Germany)

  • Wolfram Schenck

    (Center for Applied Data Science Gütersloh (CfADS), FH Bielefeld-University of Applied Sciences, 33619 Bielefeld, Germany)

Abstract

Despite the availability of a large amount of free unlabeled data, collecting sufficient training data for supervised learning models is challenging due to the time and cost involved in the labeling process. The active learning technique we present here provides a solution by querying a small but highly informative set of unlabeled data. It ensures high generalizability across the instance space, improving classification performance on previously unseen test data. Most active learners query either the most informative or the most representative data points for annotation. The proposed algorithm combines these two criteria in two phases: exploration and exploitation. The exploration phase aims to explore the instance space by visiting new regions at each iteration. The exploitation phase attempts to select highly informative points in uncertain regions. Without any predefined knowledge, such as initial training data, these two phases improve the search strategy of the proposed algorithm so that it can explore the minority class space with imbalanced data using a small query budget. Further, some pseudo-labeled points geometrically located in trusted explored regions around the newly labeled points are added to the training data, but with lower weights than the original labeled points. These pseudo-labeled points play several roles in our model, such as (i) increasing the size of the training data and (ii) decreasing the size of the version space by reducing the number of hypotheses that are consistent with the training data. Experiments on synthetic and real datasets with different imbalance ratios and dimensions show that the proposed algorithm has significant advantages over various well-known active learners.
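
The abstract does not spell out the algorithm, so the following is only a rough sketch of the general idea, not the authors' method. It assumes farthest-first selection as a stand-in for the exploration phase, a distance-weighted class-score margin as a stand-in for uncertainty in the exploitation phase, and a fixed radius around each queried point as the "trusted region" for pseudo-labels. All function names, the `radius` and `pseudo_weight` parameters, and the toy data are hypothetical.

```python
import math

def explore(pool, queried):
    """Exploration (assumed: farthest-first). Pick the pool point that is
    farthest from every point queried so far, i.e. a not-yet-visited region."""
    if not queried:
        return pool[0]
    return max(pool, key=lambda p: min(math.dist(p, q) for q in queried))

def exploit(pool, train):
    """Exploitation (assumed: smallest margin). Pick the pool point whose two
    best distance-weighted class scores are closest to each other."""
    def margin(p):
        score = {}
        for x, y, w in train:
            score[y] = score.get(y, 0.0) + w / (1e-9 + math.dist(p, x))
        top = sorted(score.values(), reverse=True)
        return top[0] - top[1] if len(top) > 1 else float("inf")
    return min(pool, key=margin)

def active_learn(points, oracle, budget, radius=0.5, pseudo_weight=0.3):
    """Start with no labeled data; return (train, queried), where train holds
    (point, label, weight) triples and pseudo-labels carry a lower weight."""
    pool, train, queried = list(points), [], []
    n_explore = budget // 2  # hypothetical split between the two phases
    for t in range(budget):
        if not pool:
            break
        seen_classes = {y for _, y, _ in train}
        if t < n_explore or len(seen_classes) < 2:
            x = explore(pool, queried)
        else:
            x = exploit(pool, train)
        pool.remove(x)
        y = oracle(x)                  # the only place a true label is requested
        queried.append(x)
        train.append((x, y, 1.0))      # queried points get full weight
        # Pseudo-label unlabeled neighbors inside the assumed trusted radius.
        for p in [p for p in pool if math.dist(p, x) <= radius]:
            pool.remove(p)
            train.append((p, y, pseudo_weight))
    return train, queried

# Toy imbalanced data: six majority points near the origin, two minority points.
points = [(0, 0), (0.2, 0.1), (0.1, 0.3), (1, 1), (1.2, 0.9), (2, 2),
          (5, 5), (5.1, 5.2)]
oracle = lambda p: 1 if p[0] > 4 else 0
train, queried = active_learn(points, oracle, budget=4)
```

The resulting `train` triples could be passed to any classifier that accepts per-sample weights. In this sketch, pseudo-labeled points are removed from the pool, which is a simplification and may not match how the paper treats them.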

Suggested Citation

  • Alaa Tharwat & Wolfram Schenck, 2022. "A Novel Low-Query-Budget Active Learner with Pseudo-Labels for Imbalanced Data," Mathematics, MDPI, vol. 10(7), pages 1-32, March.
  • Handle: RePEc:gam:jmathe:v:10:y:2022:i:7:p:1068-:d:780108

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/10/7/1068/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/10/7/1068/
    Download Restriction: no

    Citations



    Cited by:

    1. Alaa Tharwat & Wolfram Schenck, 2023. "A Survey on Active Learning: State-of-the-Art, Practical Challenges and Research Directions," Mathematics, MDPI, vol. 11(4), pages 1-38, February.
