
Improving Intent Classification Using Unlabeled Data from Large Corpora

Authors

  • Gabriel Bercaru

    (SoftTehnica, RO-030128 Bucharest, Romania
    Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, RO-060042 Bucharest, Romania)

  • Ciprian-Octavian Truică

    (SoftTehnica, RO-030128 Bucharest, Romania
    Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, RO-060042 Bucharest, Romania)

  • Costin-Gabriel Chiru

    (SoftTehnica, RO-030128 Bucharest, Romania
    Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, RO-060042 Bucharest, Romania)

  • Traian Rebedea

    (Computer Science and Engineering Department, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, RO-060042 Bucharest, Romania)

Abstract

Intent classification is a central component of a Natural Language Understanding (NLU) pipeline for conversational agents. The quality of such a component depends on the quality of the training data; however, in many conversational scenarios the data are scarce, and data augmentation techniques are used to compensate. General data augmentation methods that transfer to many datasets are therefore highly desirable. The work presented in this paper is centered on two main components. First, we explore the influence of various feature vectors on the task of intent classification using RASA’s text classification capabilities. Second, we propose a generic method for efficiently augmenting textual corpora using large datasets of unlabeled data. The proposed method efficiently mines examples similar to the ones already present in standard natural language corpora. The experimental results show that our corpus augmentation methods increase text classification accuracy in few-shot settings. In particular, the accuracy gains reach up to 16% when the number of labeled examples is very low (e.g., two examples). We believe that our method is relevant to any Natural Language Processing (NLP) or NLU task in which labeled training data are scarce or expensive to obtain. Lastly, we give some insights into future work, which aims to combine the proposed method with a semi-supervised learning approach.
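
The abstract does not say which similarity measure or feature vectors the mining step uses, so the following is only a minimal sketch of the general idea: pull sentences from an unlabeled pool that resemble a few labeled seed examples and give them the seed's intent label. The bag-of-words cosine similarity, the function names (`bag_of_words`, `cosine`, `augment`), the `threshold` parameter, and the toy sentences are all assumptions made for illustration and are not taken from the paper.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def augment(seeds, unlabeled, threshold=0.5):
    """Mine (sentence, label) pairs from an unlabeled pool.

    seeds: list of (sentence, intent_label) pairs.
    Each unlabeled sentence takes the label of its most similar seed,
    but only if that similarity reaches `threshold` (an assumed cutoff).
    """
    seed_vectors = [(bag_of_words(text), label) for text, label in seeds]
    mined = []
    for sentence in unlabeled:
        vector = bag_of_words(sentence)
        score, label = max(
            ((cosine(vector, seed_vec), seed_label)
             for seed_vec, seed_label in seed_vectors),
            key=lambda pair: pair[0],
        )
        if score >= threshold:
            mined.append((sentence, label))
    return mined


# Hypothetical few-shot setting: one labeled example per intent.
seeds = [
    ("book a flight to paris", "book_flight"),
    ("what is the weather today", "get_weather"),
]
unlabeled = [
    "book a flight to london",
    "what is the weather in rome",
    "my cat sleeps all day",
]
mined = augment(seeds, unlabeled)
```

In this toy run the first two unlabeled sentences are kept with the labels of their nearest seeds, and the unrelated third sentence is discarded. A realistic setup would likely swap the word counts for dense sentence embeddings and use an approximate nearest-neighbor index to scale to large corpora; those choices are likewise assumptions here.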

Suggested Citation

  • Gabriel Bercaru & Ciprian-Octavian Truică & Costin-Gabriel Chiru & Traian Rebedea, 2023. "Improving Intent Classification Using Unlabeled Data from Large Corpora," Mathematics, MDPI, vol. 11(3), pages 1-20, February.
  • Handle: RePEc:gam:jmathe:v:11:y:2023:i:3:p:769-:d:1056585

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/11/3/769/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/11/3/769/
    Download Restriction: no

