
Introducing Various Semantic Models for Amharic: Experimentation and Evaluation with Multiple Tasks and Datasets

Authors

Listed:
  • Seid Muhie Yimam

    (Language Technology Group, Universität Hamburg, Grindelallee 117, 20146 Hamburg, Germany)

  • Abinew Ali Ayele

    (Language Technology Group, Universität Hamburg, Grindelallee 117, 20146 Hamburg, Germany;
    Faculty of Computing, Bahir Dar Institute of Technology, Bahir Dar University, Bahir Dar 6000, Ethiopia)

  • Gopalakrishnan Venkatesh

    (International Institute of Information Technology, Bangalore 560100, India)

  • Ibrahim Gashaw

    (College of Informatics, University of Gondar, Gondar 6200, Ethiopia)

  • Chris Biemann

    (Language Technology Group, Universität Hamburg, Grindelallee 117, 20146 Hamburg, Germany)

Abstract

The availability of pre-trained semantic models has enabled the rapid development of machine learning components for downstream applications. However, even when text is abundant for a low-resource language, very few semantic models are publicly available for it. Most publicly available pre-trained models are built as multilingual models, which do not fit the needs of low-resource languages well. We introduce several semantic models for Amharic, a morphologically complex Ethio-Semitic language. After investigating the publicly available pre-trained semantic models, we fine-tune two pre-trained models and train seven new models. The models include Word2Vec embeddings, a distributional thesaurus (DT), BERT-like contextual embeddings, and DT embeddings obtained via network embedding algorithms. We then apply these models to different NLP tasks and study their impact. We find that the newly trained models perform better than the pre-trained multilingual models. Furthermore, models based on contextual embeddings from FLAIR and RoBERTa perform better than Word2Vec models on the NER and POS tagging tasks, while DT-based network embeddings are suitable for the sentiment classification task. We publicly release all the semantic models, machine learning components, and several benchmark datasets, including datasets for NER, POS tagging, and sentiment classification, as well as Amharic versions of WordSim353 and SimLex999.

Suggested Citation

  • Seid Muhie Yimam & Abinew Ali Ayele & Gopalakrishnan Venkatesh & Ibrahim Gashaw & Chris Biemann, 2021. "Introducing Various Semantic Models for Amharic: Experimentation and Evaluation with Multiple Tasks and Datasets," Future Internet, MDPI, vol. 13(11), pages 1-18, October.
  • Handle: RePEc:gam:jftint:v:13:y:2021:i:11:p:275-:d:666378

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/13/11/275/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/13/11/275/
    Download Restriction: no