Printed from https://ideas.repec.org/a/gam/jmathe/v11y2022i1p123-d1016598.html

Robust Data Augmentation for Neural Machine Translation through EVALNET

Author

Listed:
  • Yo-Han Park

    (Department of Radio and Information Communications Engineering, ChungNam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea)

  • Yong-Seok Choi

    (Department of Radio and Information Communications Engineering, ChungNam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea)

  • Seung Yun

    (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute (ETRI), 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Republic of Korea)

  • Sang-Hun Kim

    (Artificial Intelligence Research Laboratory, Electronics and Telecommunications Research Institute (ETRI), 218 Gajeong-ro, Yuseong-gu, Daejeon 34129, Republic of Korea)

  • Kong-Joo Lee

    (Department of Radio and Information Communications Engineering, ChungNam National University, 99 Daehak-ro, Yuseong-gu, Daejeon 34134, Republic of Korea)

Abstract

Building Neural Machine Translation (NMT) systems requires a large parallel corpus, so various data augmentation techniques have been adopted, especially for low-resource languages. To achieve the best performance through data augmentation, an NMT system should be able to evaluate the quality of the augmented data. Several studies have addressed data weighting techniques to assess data quality. The basic idea of data weighting in previous studies rests on the loss value that a system calculates when learning from training data. A weight derived from that loss value, through simple heuristic rules or neural models, adjusts the loss used in the next step of the learning process. In this study, we propose EvalNet, a data evaluation network, to assess parallel data for NMT. EvalNet uses a loss value, a cross-attention map, and the semantic similarity between parallel data as its features. The cross-attention map is an encoded representation of the cross-attention layers of the Transformer, the base architecture of the NMT system. The semantic similarity is a cosine distance between the semantic embeddings of a source sentence and a target sentence. Owing to the parallelism of the data, the combination of the cross-attention map and the semantic similarity proved to be an effective set of features for data quality evaluation, in addition to the loss value. EvalNet is the first NMT data evaluator network that introduces the cross-attention map and the semantic similarity as its features. Through various experiments, we conclude that EvalNet is simple yet beneficial for robust training of an NMT system and outperforms the methods of previous studies as a data evaluator.
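
The data weighting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' EvalNet: the abstract describes a learned network that also consumes a cross-attention map, whereas the hypothetical `toy_weight` function below is a hand-written heuristic standing in for that evaluator. The function names, the weighting formula, and the example vectors are all assumptions made for illustration. The sketch computes the cosine similarity of two sentence embeddings (the cosine distance is one minus this value) and uses per-pair weights to rescale the training loss.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: a zero vector carries no similarity signal.
        return 0.0
    return dot / (norm_u * norm_v)


def toy_weight(loss: float, similarity: float) -> float:
    """Hypothetical heuristic, NOT EvalNet: favour pairs with low loss
    and high source/target similarity. Negative similarity maps to 0."""
    return max(0.0, similarity) / (1.0 + loss)


def weighted_batch_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of per-example losses; low-weight pairs contribute less."""
    if len(losses) != len(weights):
        raise ValueError("losses and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0.0:
        return 0.0
    return sum(l * w for l, w in zip(losses, weights)) / total_weight


if __name__ == "__main__":
    # Illustrative embeddings for one clean pair and one noisy augmented pair.
    clean_sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
    noisy_sim = cosine_similarity([1.0, 0.0], [0.0, 1.0])
    losses = [1.0, 3.0]
    weights = [toy_weight(losses[0], clean_sim), toy_weight(losses[1], noisy_sim)]
    print(weights, weighted_batch_loss(losses, weights))
```

In a real system the weights would come from a trained evaluator rather than a fixed formula, and the per-example losses would be produced by the NMT model during training.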

Suggested Citation

  • Yo-Han Park & Yong-Seok Choi & Seung Yun & Sang-Hun Kim & Kong-Joo Lee, 2022. "Robust Data Augmentation for Neural Machine Translation through EVALNET," Mathematics, MDPI, vol. 11(1), pages 1-15, December.
  • Handle: RePEc:gam:jmathe:v:11:y:2022:i:1:p:123-:d:1016598

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/11/1/123/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/11/1/123/
    Download Restriction: no
