The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset

Author

Mamtimin Qasim
(School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China)
Wushour Silamu
(School of Information Science and Engineering, Xinjiang University, Urumqi 830046, China; Key Multi-Lingual Laboratory of Xinjiang, Urumqi 830046, China)
Minghui Qiu
(School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China)

Abstract

Script identification (SI) is easier to implement than language identification, and its identification rate is very high. The fewer languages a language identification algorithm has to distinguish, the higher its identification rate. However, no systematic study has examined SI across multiple languages or how to construct the corresponding language identification datasets. Therefore, in this paper, we design a script identification algorithm and discuss the construction of a language identification dataset based on script groups. The data sources comprise text corpora in 261 languages from the Leipzig Corpora Collection, which are grouped into 23 script groups. In the Unicode encoding scheme, different scripts are assigned to different code regions. Based on this feature, we propose a script identification algorithm based on regular expression matching, whose micro F-score reaches 0.9929 in sentence-level script identification experiments. To reduce noise when constructing the language identification dataset for each script, the script identification algorithm is used to filter out other-script content in each text.
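The idea described in the abstract, matching characters against Unicode code regions with regular expressions and then filtering out other-script content, can be illustrated with a minimal sketch. This is not the authors' implementation: the five scripts, their code ranges (basic blocks only), the majority-count decision rule, and the function names `identify_script` and `filter_sentences` are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately incomplete table of Unicode code regions.
# Only the basic block(s) of each script are listed; the paper's 23 script
# groups would need many more ranges than shown here.
SCRIPT_PATTERNS = {
    "Latin": re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]"),
    "Cyrillic": re.compile(r"[\u0400-\u04FF]"),
    "Arabic": re.compile(r"[\u0600-\u06FF]"),
    "Devanagari": re.compile(r"[\u0900-\u097F]"),
    "Han": re.compile(r"[\u4E00-\u9FFF]"),
}


def count_scripts(text):
    """Count how many characters of `text` fall in each script's ranges."""
    return Counter(
        {name: len(pattern.findall(text)) for name, pattern in SCRIPT_PATTERNS.items()}
    )


def identify_script(sentence):
    """Return the script with the most matching characters (assumed rule).

    Returns "Unknown" when no character matches any listed range,
    for example for a sentence made only of digits and punctuation.
    """
    script, matches = count_scripts(sentence).most_common(1)[0]
    return script if matches > 0 else "Unknown"


def filter_sentences(sentences, target_script):
    """Keep only the sentences identified as `target_script`.

    Sentence-level filtering is an assumption; the abstract only says that
    other-script content is filtered out of each text.
    """
    return [s for s in sentences if identify_script(s) == target_script]


if __name__ == "__main__":
    corpus = ["Hello world", "Привет мир", "Hi там все"]
    for sentence in corpus:
        print(sentence, "->", identify_script(sentence))
    print(filter_sentences(corpus, "Latin"))
```

In this sketch, a mixed sentence is assigned to whichever script contributes the most characters, and ties fall to the script listed first in the table. A fuller version would need complete code ranges for every script group and an explicit policy for shared characters such as digits and punctuation.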

Suggested Citation

Mamtimin Qasim & Wushour Silamu & Minghui Qiu, 2024. "The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset," Data, MDPI, vol. 9(11), pages 1-11, November.

Handle: RePEc:gam:jdataj:v:9:y:2024:i:11:p:134-:d:1518130

More about this item

Keywords

script; script identification; language identification; language identification dataset
