
The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset

Authors

Listed:
  • Mamtimin Qasim

    (School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China)

  • Wushour Silamu

    (School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China;
    Key Multi-Lingual Laboratory of Xinjiang, Urumqi 830046, China)

  • Minghui Qiu

    (School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China)

Abstract

Script identification (SI) is easier to implement than language identification (LI), and its identification rate is very high. The fewer languages a language identification algorithm must distinguish, the higher its identification rate. However, no systematic study has examined script identification across multiple languages or how to construct the corresponding language identification datasets. In this paper, we therefore design a script identification algorithm and describe the construction of a language identification dataset based on script groups. The data sources comprise text corpora in 261 languages from the Leipzig Corpora Collection, grouped into 23 script groups. In the Unicode encoding scheme, different scripts occupy different code regions. Exploiting this property, we propose a script identification algorithm based on regular expression matching; its micro F-score reaches 0.9929 in sentence-level script identification experiments. To reduce noise when constructing the language identification dataset for each script, the script identification algorithm is used to filter out other-script content from each text.
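
The general idea of matching Unicode code regions with regular expressions can be illustrated with a minimal Python sketch. This is not the authors' implementation: the five scripts and their code ranges below are an illustrative assumption (the paper uses 23 script groups whose exact ranges are not given here), and the names `SCRIPT_PATTERNS`, `identify_script`, and `filter_other_script` are hypothetical.

```python
import re

# Assumed, simplified code ranges for a handful of scripts.
# The paper's actual 23 script groups and ranges are not reproduced here.
SCRIPT_PATTERNS = {
    "Latin": re.compile(r"[A-Za-z\u00C0-\u024F]"),
    "Cyrillic": re.compile(r"[\u0400-\u04FF]"),
    "Arabic": re.compile(r"[\u0600-\u06FF]"),
    "Devanagari": re.compile(r"[\u0900-\u097F]"),
    "Han": re.compile(r"[\u4E00-\u9FFF]"),
}


def identify_script(sentence):
    """Return the script with the most matching characters, or None."""
    counts = {name: len(p.findall(sentence)) for name, p in SCRIPT_PATTERNS.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def filter_other_script(text, script):
    """Drop whitespace-separated tokens containing characters of another known script."""
    others = [p for name, p in SCRIPT_PATTERNS.items() if name != script]
    kept = [tok for tok in text.split() if not any(p.search(tok) for p in others)]
    return " ".join(kept)


if __name__ == "__main__":
    print(identify_script("Привет, мир"))
    print(filter_other_script("Привет hello мир", "Cyrillic"))
```

Choosing the script by majority character count is one simple decision rule for sentence-level identification; a token-level filter such as `filter_other_script` is one possible way to remove other-script noise before building a per-script language identification dataset.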

Suggested Citation

  • Mamtimin Qasim & Wushour Silamu & Minghui Qiu, 2024. "The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset," Data, MDPI, vol. 9(11), pages 1-11, November.
  • Handle: RePEc:gam:jdataj:v:9:y:2024:i:11:p:134-:d:1518130

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/9/11/134/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/9/11/134/
    Download Restriction: no