
Improved Script Identification Algorithm Using Unicode-Based Regular Expression Matching Strategy

Author

Listed:
  • Mamtimin Qasim

    (School of Information Technology and Engineering, Guangzhou College of Commerce, Guangzhou 511363, China)

  • Wushour Silamu

    (School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China;
    Key Multi-Lingual Laboratory of Xinjiang, Urumqi 830046, China)

Abstract

Script identification is the first step in many natural language processing and text mining tasks, yet at present there is no open-source script identification algorithm for text. In this study, we therefore analyze the Unicode encoding of each script and construct regular expressions in order to design an improved script identification algorithm. Because some scripts share common characters, it is impossible to count and summarize these characters. As a result, some extracted scripts are incomplete, which affects subsequent text processing tasks; furthermore, whenever a new script identification feature is required, the regular expression for each script must be readjusted. To improve the performance and scalability of script identification, we analyze the encoding range of each script provided on the official Unicode website and identify the shared characters, which allows us to design an improved script identification algorithm. This approach takes all 169 Unicode script types into account. The proposed method is scalable and does not require numbers, punctuation marks, or other symbols to be filtered out during script identification; instead, these items are included in the script identification results, which preserves the integrity of the information in the text. The experimental results show that the proposed algorithm performs almost as well as our previous script identification algorithm while building improvements on top of it.
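
To make the idea in the abstract concrete, the following is a minimal Python sketch of regular-expression-based script identification. It is not the authors' implementation. The names `SCRIPT_RANGES`, `build_pattern`, and `identify_scripts` are hypothetical, and the code-point ranges are rough, hand-picked approximations for four scripts only, not the full ranges for all 169 scripts published by Unicode. The policy of attaching shared characters (digits, punctuation, spaces) to the preceding script run is likewise an assumption made for illustration.

```python
import re

# Hypothetical, deliberately incomplete range table. The ranges are
# approximations chosen for illustration, not taken from the paper.
SCRIPT_RANGES = {
    "Latin": r"A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F",
    "Cyrillic": r"\u0400-\u04FF",
    "Arabic": r"\u0621-\u063A\u0641-\u064A\u0671-\u06D3",
    "Han": r"\u4E00-\u9FFF",
}


def build_pattern(ranges):
    """Build one regex with a named group per script plus a fallback group."""
    parts = [f"(?P<{name}>[{chars}]+)" for name, chars in ranges.items()]
    # Any character not claimed by a script is treated as shared ("Common").
    parts.append(r"(?P<Common>.)")
    return re.compile("|".join(parts), re.DOTALL)


def identify_scripts(text, ranges=SCRIPT_RANGES):
    """Split text into (script, substring) runs without dropping any character."""
    pattern = build_pattern(ranges)
    runs = []      # list of [script, substring]
    pending = ""   # shared characters seen before the first script run
    for match in pattern.finditer(text):
        script, chunk = match.lastgroup, match.group()
        if script == "Common":
            if runs:
                runs[-1][1] += chunk   # assumed policy: attach to previous run
            else:
                pending += chunk
        elif runs and runs[-1][0] == script:
            runs[-1][1] += chunk       # same script continues after shared chars
        else:
            runs.append([script, pending + chunk])
            pending = ""
    if pending:                        # text contained shared characters only
        runs.append(["Common", pending])
    return [(script, chunk) for script, chunk in runs]


if __name__ == "__main__":
    for script, chunk in identify_scripts("Hello, мир! 123"):
        print(script, repr(chunk))
```

In this sketch, supporting an additional script means adding one entry to `SCRIPT_RANGES` rather than adjusting the expressions for the existing scripts, and concatenating the returned substrings reproduces the input text, so numbers and punctuation are kept rather than filtered out.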

Suggested Citation

  • Mamtimin Qasim & Wushour Silamu, 2025. "Improved Script Identification Algorithm Using Unicode-Based Regular Expression Matching Strategy," Data, MDPI, vol. 10(4), pages 1-22, March.
  • Handle: RePEc:gam:jdataj:v:10:y:2025:i:4:p:43-:d:1619994

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/10/4/43/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/10/4/43/
    Download Restriction: no
