Finding Patient Zero and Tracking Narrative Changes in the Context of Online Disinformation Using Semantic Similarity Analysis

Author

Listed:
  • Codruț-Georgian Artene

    (Department of Computer Science and Engineering, “Gheorghe Asachi” Technical University of Iasi, 700050 Iasi, Romania)

  • Ciprian Oprișa

    (Computer Science Department, Technical University of Cluj-Napoca, 400114 Cluj-Napoca, Romania)

  • Cristian Nicolae Buțincu

    (Department of Computer Science and Engineering, “Gheorghe Asachi” Technical University of Iasi, 700050 Iasi, Romania)

  • Florin Leon

    (Department of Computer Science and Engineering, “Gheorghe Asachi” Technical University of Iasi, 700050 Iasi, Romania)

Abstract

Disinformation in the form of news articles, also called fake news, is used by multiple actors for nefarious purposes, such as gaining political advantages. A key component of fake news detection is the ability to find similar articles in a large document corpus, in order to track narrative changes and identify the root source (patient zero) of a particular piece of information. This paper presents new techniques based on textual and semantic similarity that were adapted to achieve this goal on large datasets of news articles. The aim is to determine which of the implemented text similarity techniques is more suitable for this task. For text similarity, Locality-Sensitive Hashing is applied to n-grams extracted from the text to produce representations that are then indexed to facilitate the quick discovery of similar articles. The semantic textual similarity technique is based on sentence embeddings from pre-trained language models, such as BERT, and on Named Entity Recognition. The proposed techniques are evaluated on a collection of Romanian articles to determine their performance in terms of quality of results and scalability. The presented techniques produce competitive results. The experimental results show that the proposed semantic textual similarity technique is better at identifying similar text documents, while the Locality-Sensitive Hashing text similarity technique outperforms it in terms of execution time and scalability. Although the techniques were evaluated only on Romanian texts, and some of them rely on pre-trained models for the Romanian language, the underlying methods allow their extension to other languages with few to no changes, provided that pre-trained models exist for those languages as well. A cross-lingual setup would require further changes, along with tests to demonstrate this capability. Based on the obtained results, one may conclude that the presented techniques are suitable for integration into a decentralized anti-disinformation platform for fact-checking and trust assessment.
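
To make the text similarity idea in the abstract more concrete, the following is a minimal sketch of Locality-Sensitive Hashing over n-grams. It is not the authors' implementation: the abstract does not specify the LSH family, so this sketch assumes MinHash with banding, word-level 3-grams, and whitespace tokenization. All function names and parameter values (`word_ngrams`, `minhash_signature`, `candidate_pairs`, `num_hashes=64`, `bands=32`) are hypothetical and chosen only for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def word_ngrams(text, n=3):
    """Return the set of word-level n-grams (assumed tokenization: whitespace)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def _hash(seed, gram):
    """Deterministic 64-bit hash of an n-gram, parameterized by a seed."""
    data = f"{seed}:{gram}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(grams, num_hashes=64):
    """MinHash signature: the minimum hash value per seeded hash function."""
    if not grams:
        raise ValueError("cannot build a signature for an empty document")
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(num_hashes))


def candidate_pairs(signatures, bands=32):
    """Index signatures by band; documents sharing any band become candidates."""
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    # Toy documents, invented for this illustration only.
    docs = {
        "a": "the minister announced a new policy on energy prices today",
        "b": "the minister announced a new policy on energy prices yesterday",
        "c": "local football club wins the national cup after dramatic final",
    }
    sigs = {doc_id: minhash_signature(word_ngrams(t)) for doc_id, t in docs.items()}
    for a, b in sorted(candidate_pairs(sigs)):
        print(a, b, estimated_jaccard(sigs[a], sigs[b]))
```

The banded index is what gives this kind of technique its speed: only documents that collide in at least one band are compared, so the full corpus is never scanned pairwise. More bands with fewer rows each make the index more permissive, at the cost of more candidates to verify.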

Suggested Citation

  • Codruț-Georgian Artene & Ciprian Oprișa & Cristian Nicolae Buțincu & Florin Leon, 2023. "Finding Patient Zero and Tracking Narrative Changes in the Context of Online Disinformation Using Semantic Similarity Analysis," Mathematics, MDPI, vol. 11(9), pages 1-26, April.
  • Handle: RePEc:gam:jmathe:v:11:y:2023:i:9:p:2053-:d:1133795
    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/11/9/2053/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/11/9/2053/
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Kajal Lahiri & Cheng Yang, 2023. "ROC and PRC Approaches to Evaluate Recession Forecasts," Journal of Business Cycle Research, Springer;Centre for International Research on Economic Tendency Surveys (CIRET), vol. 19(2), pages 119-148, September.
    2. Esmeli, Ramazan & Bader-El-Den, Mohamed & Abdullahi, Hassana, 2022. "An analyses of the effect of using contextual and loyalty features on early purchase prediction of shoppers in e-commerce domain," Journal of Business Research, Elsevier, vol. 147(C), pages 420-434.
    3. Zachary Wojtowicz, 2024. "When and Why is Persuasion Hard? A Computational Complexity Result," Papers 2408.07923, arXiv.org.
