Authors
- Azilawati Azizan
(College of Computing, Informatics and Mathematics, Universiti Teknologi MARA (UiTM), Perak Branch, Tapah Campus, Malaysia.)
- Nurkhairizan Khairuddin
(College of Computing, Informatics and Mathematics, Universiti Teknologi MARA (UiTM), Perak Branch, Tapah Campus, Malaysia.)
- Nur Husna Anuar
(Yayasan Warisan Anak Selangor, Syarikat Pengurusan Projek TAWAS, Kompleks Belia & Kebudayaan Negeri Selangor, Shah Alam Selangor, Malaysia.)
- Rohana Ismail
(Faculty of Informatics and Computing, Universiti Sultan Zainal Abidin, Besut Campus, Terengganu, Malaysia.)
Abstract
The rise of digital communication via mobile phones and the Internet has led to the widespread use of short-form words and abbreviations in text messaging. This trend poses challenges for data mining activities involving text processing and analysis, particularly on social media platforms, where users employ a wide variety of abbreviations, slang, misspellings, and grammatical errors. To address this challenge, this study aimed to develop an algorithm for normalizing noisy Malay text using Levenshtein Distance (LD) and rule-based techniques. LD is used to transform misspelled Malay words into their standard form, while rule-based techniques improve the conversion success rate for three categories of noisy terms: slang, common Malay noisy text, and mixed language. The project was implemented in the Python programming language and demonstrated the effectiveness of LD and rule-based techniques in normalizing noisy social media text. The approach successfully normalized 80% of the noisy Malay text into standard form, which provides a strong foundation for further study. Furthermore, this work opens opportunities for introducing new approaches and rules to improve the normalization success rate, which can facilitate the analysis of text data on social media platforms. It is recommended that future studies focus on expanding the dataset and applying statistical validation methods to ensure the robustness and accuracy of the normalization model.
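The combination described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the rule table `RULES`, the toy word list `LEXICON`, the `max_distance` threshold, and the function names are all hypothetical, chosen only to show how a rule lookup for slang and short forms might be paired with a Levenshtein Distance fallback for misspellings.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Hypothetical rule table for short forms and slang (illustration only).
RULES = {"sy": "saya", "x": "tidak", "yg": "yang", "tq": "terima kasih"}

# Hypothetical toy lexicon of standard Malay words (illustration only).
LEXICON = ["makan", "minum", "saya", "tidak", "yang", "sekolah"]


def normalize_token(token: str, max_distance: int = 2) -> str:
    """Apply the rule table first, then fall back to the nearest lexicon word."""
    token = token.lower()
    if token in RULES:
        return RULES[token]
    if token in LEXICON:
        return token
    best = min(LEXICON, key=lambda word: levenshtein(token, word))
    # Leave the token unchanged when no lexicon word is close enough.
    return best if levenshtein(token, best) <= max_distance else token


def normalize(text: str) -> str:
    """Normalize a whitespace-separated message token by token."""
    return " ".join(normalize_token(t) for t in text.split())


print(normalize("sy x makn"))
```

In this sketch the rule lookup runs before the distance search, so very short forms such as "x" are never matched against the lexicon by edit distance, where they would be ambiguous. The distance threshold is an assumed safeguard that keeps unrelated tokens, such as names or foreign words, from being rewritten.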
Suggested Citation
Azilawati Azizan & Nurkhairizan Khairuddin & Nur Husna Anuar & Rohana Ismail, 2024.
"Normalization of Malay Noisy Text in Social Media using Levenshtein Distance and Rule-Based Techniques,"
International Journal of Research and Innovation in Social Science (IJRISS), vol. 8(9), pages 1535-1544, September.
Handle:
RePEc:bcp:journl:v:8:y:2024:i:9:p:1535-1544