Author
Abstract
A lexicon-based ensemble web data classification approach is designed for classic machine learning techniques to emphasize the accuracy and efficiency of textual data from the web. As the volume of internet material expands dramatically, effective and scalable techniques for classifying it are crucial. Traditional classifiers such as Support Vector Machines (SVM), Naive Bayes (NB), and Decision Trees (DT) rely on statistical learning from labeled datasets, which necessitates a huge quantity of training data and processing resources. Lexicon-based techniques, on the other hand, employ prepared collections of words (lexicons) linked with certain classes or sentiments, eliminating the need for extensive training but frequently lacking generalizability. This comprehensive paper suggests a lexicon-based ensemble classification system that incorporates several lexicons, each optimized for particular features of web data, and compares it to conventional approaches in terms of accuracy, scalability, and performance in order to overcome the drawbacks of both lexicon-based and traditional classifiers. By using the benefits of many lexicons, the ensemble technique reduces individual biases and boosts robustness. Additionally, the use of ensemble approaches enhances classification accuracy by adding a layer of decision-making, especially when dealing with noisy and unstructured online data like news articles, blogs, and social media postings. Through a series of tests, the paper compares the ensemble lexicon-based approach to SVM, NB, DT, and Random Forests (RF) using a number of benchmark datasets. Performance is evaluated using metrics including accuracy, recall, F1 score, and computational efficiency. The findings demonstrate that the lexicon-based ensemble approach provides more precision in sentiment and topic classification tasks and performs better than conventional classifiers in situations with sparse or noisy labeled data. However, when large, high-quality labeled datasets are available, classical classifiers perform better, showing stronger recall and generalization ability. By showing that lexicon-based models, when appropriately adjusted and combined, can compete with or even surpass traditional classifiers in particular situations, this study adds to the expanding corpus of research on hybrid and ensemble learning approaches and makes them a useful tool in the larger field of web data analysis.
Suggested Citation
Download full text from publisher
Corrections
All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:aac:ijirss:v:8:y:2025:i:2:p:1736-1745:id:5537. See general information about how to correct material in RePEc.
If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.
We have no bibliographic references for this item. You can help adding them by using this form .
If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.
For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Natalie Jean (email available below). General contact details of provider: https://ijirss.com/index.php/ijirss/ .
Please note that corrections may take a couple of weeks to filter through
the various RePEc services.