
Binned Term Count: An Alternative to Term Frequency for Text Categorization

Author

Listed:
  • Farhan Shehzad

    (Department of Computer Science, University of Gujrat, Gujrat 50700, Pakistan)

  • Abdur Rehman

    (Department of Computer Science, University of Gujrat, Gujrat 50700, Pakistan)

  • Kashif Javed

    (Department of Electrical Engineering, University of Engineering and Technology, Lahore 54890, Pakistan)

  • Khalid A. Alnowibet

    (Statistics and Operations Research Department, College of Science, King Saud University, Riyadh 11451, Saudi Arabia)

  • Haroon A. Babri

    (Department of Electrical Engineering, University of Engineering and Technology, Lahore 54890, Pakistan)

  • Hafiz Tayyab Rauf

    (Centre for Smart Systems, AI and Cybersecurity, Staffordshire University, Stoke-on-Trent ST4 2DE, UK)

Abstract

In text categorization, a well-known problem related to document length is that larger term counts in longer documents cause classification algorithms to become biased. The effect of document length can be eliminated by normalizing term counts, thus reducing the bias towards longer documents. This gives us term frequency (TF), which in conjunction with inverse document frequency (IDF) became the most commonly used term weighting scheme to capture the importance of a term in a document and corpus. However, normalization may cause the term frequency of a term in a related document to become equal to or smaller than its term frequency in an unrelated document, thus perturbing a term’s strength from its true worth. In this paper, we solve this problem by introducing a non-linear mapping of term frequency. This alternative to TF is called binned term count (BTC). The newly proposed term frequency factor trims large term counts before normalization, thus moderating the normalization effect on large documents. To investigate the effectiveness of BTC, we compare it against the original TF and its more recently proposed alternative named modified term frequency (MTF). In our experiments, each of these term frequency factors (BTC, TF, and MTF) is combined with four well-known collection frequency factors (IDF, RF, IGM, and MONO), and the performance of each of the resulting term weighting schemes is evaluated on three standard datasets (Reuters (R8-21578), 20-Newsgroups, and WebKB) using support vector machines and K-nearest neighbor classifiers. To determine whether BTC is statistically better than TF and MTF, we have applied the paired two-sided t-test on the macro F1 results. Overall, BTC is found to be statistically significantly better than TF and MTF in 52% of the comparisons. Furthermore, the highest macro F1 value on the three datasets was achieved by BTC-based term weighting schemes.
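
The normalization problem described in the abstract can be illustrated with a small sketch. The abstract does not give the BTC binning function, so the code below substitutes a simple clip at a hypothetical `cap` value before normalizing; this is an assumption made for illustration only and is not the authors' formula. The two toy documents, the function names, and the choice of sum normalization are likewise invented for this example.

```python
from collections import Counter


def term_frequency(tokens):
    """Length-normalized term frequency: raw count divided by document length."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def clipped_term_frequency(tokens, cap):
    """Hypothetical stand-in for trimming: clip each raw count at `cap`,
    then normalize by the sum of the clipped counts.
    This is NOT the BTC formula from the paper, only an illustration."""
    counts = Counter(tokens)
    clipped = {term: min(count, cap) for term, count in counts.items()}
    total = sum(clipped.values())
    return {term: count / total for term, count in clipped.items()}


# Toy documents (made up): a long document that mentions "wheat" four times
# but is dominated by one very frequent filler word, and a short document
# that mentions "wheat" once in passing.
related = ["wheat"] * 4 + ["the"] * 56 + [f"w{i}" for i in range(40)]
unrelated = ["wheat"] + [f"u{i}" for i in range(19)]

tf_related = term_frequency(related)["wheat"]
tf_unrelated = term_frequency(unrelated)["wheat"]
clip_related = clipped_term_frequency(related, cap=4)["wheat"]
clip_unrelated = clipped_term_frequency(unrelated, cap=4)["wheat"]

print("plain TF:  ", tf_related, tf_unrelated)
print("clipped TF:", clip_related, clip_unrelated)
```

With these toy inputs, plain TF scores "wheat" lower in the long document (4/100) than in the short one (1/20), which is the inversion the abstract describes. Clipping the dominant count before normalizing raises the long document's score to 4/48 and leaves the short document unchanged, so the ordering flips back. Whether the actual BTC mapping behaves this way depends on the binning defined in the full paper.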

Suggested Citation

  • Farhan Shehzad & Abdur Rehman & Kashif Javed & Khalid A. Alnowibet & Haroon A. Babri & Hafiz Tayyab Rauf, 2022. "Binned Term Count: An Alternative to Term Frequency for Text Categorization," Mathematics, MDPI, vol. 10(21), pages 1-25, November.
  • Handle: RePEc:gam:jmathe:v:10:y:2022:i:21:p:4124-:d:963760

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/10/21/4124/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/10/21/4124/
    Download Restriction: no

    References listed on IDEAS

    1. Dogan, Turgut & Uysal, Alper Kursat, 2020. "A novel term weighting scheme for text classification: TF-MONO," Journal of Informetrics, Elsevier, vol. 14(4).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Kitti Nagy & Jozef Kapusta, 2021. "Improving fake news classification using dependency grammar," PLOS ONE, Public Library of Science, vol. 16(9), pages 1-22, September.
    2. Xuan Liu & Tianyi Shi & Guohui Zhou & Mingzhe Liu & Zhengtong Yin & Lirong Yin & Wenfeng Zheng, 2023. "Emotion classification for short texts: an improved multi-label method," Palgrave Communications, Palgrave Macmillan, vol. 10(1), pages 1-9, December.
    3. Masood, Muhammad Ali & Abbasi, Rabeeh Ayaz, 2021. "Using graph embedding and machine learning to identify rebels on twitter," Journal of Informetrics, Elsevier, vol. 15(1).
