Systematic Characterizations of Text Similarity in Full Text Biomedical Publications

Author

Listed:
  • Zhaohui Sun
  • Mounir Errami
  • Tara Long
  • Chris Renard
  • Nishant Choradia
  • Harold Garner

Abstract

Background: Computational methods have been used to find duplicate biomedical publications in MEDLINE. Full text articles are becoming increasingly available, yet the similarities among them have not been systematically studied. Here, we quantitatively investigated the full text similarity of biomedical publications in PubMed Central.
Methodology/Principal Findings: 72,011 full text articles from PubMed Central (PMC) were parsed to generate three different datasets: full texts, sections, and paragraphs. Text similarity comparisons were performed on these datasets using the text similarity algorithm eTBLAST. We measured the frequency of similar text pairs and compared it among different datasets. We found that high abstract similarity can be used to predict high full text similarity with a specificity of 20.1% (95% CI [17.3%, 23.1%]) and sensitivity of 99.999%. Abstract similarity and full text similarity have a moderate correlation (Pearson correlation coefficient: −0.423) when the similarity ratio is above 0.4. Among pairs of articles in PMC, method sections are found to be the most repetitive (frequency of similar pairs, methods: 0.029, introduction: 0.0076, results: 0.0043). In contrast, among a set of manually verified duplicate articles, results are the most repetitive sections (frequency of similar pairs, results: 0.94, methods: 0.89, introduction: 0.82). Repetition of introduction and methods sections is more likely to be committed by the same authors (odds of a highly similar pair having at least one shared author, introduction: 2.31, methods: 1.83, results: 1.03). There is also significantly more similarity in pairs of review articles than in pairs containing one review and one nonreview paper (frequency of similar pairs: 0.0167 and 0.0023, respectively).
Conclusion/Significance: While quantifying abstract similarity is an effective approach for finding duplicate citations, a comprehensive full text analysis is necessary to uncover all potential duplicate citations in the scientific literature and is helpful when establishing ethical guidelines for scientific publications.
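
For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal illustrative sketch of how sensitivity, specificity, and a "frequency of similar pairs" are conventionally computed. It is not the authors' code and does not use eTBLAST; the class name, function names, and all counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Outcome counts when 'high abstract similarity' is used to predict
    'high full text similarity'. Hypothetical structure, not from the paper."""
    true_pos: int   # abstract flagged, full text really similar
    false_pos: int  # abstract flagged, full text not similar
    true_neg: int   # abstract not flagged, full text not similar
    false_neg: int  # abstract not flagged, full text really similar


def sensitivity(c: ConfusionCounts) -> float:
    """Share of truly similar full-text pairs that the predictor flags."""
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: ConfusionCounts) -> float:
    """Share of non-similar full-text pairs that the predictor leaves unflagged."""
    return c.true_neg / (c.true_neg + c.false_pos)


def similar_pair_frequency(n_similar_pairs: int, n_compared_pairs: int) -> float:
    """Fraction of compared pairs that exceed the chosen similarity threshold."""
    return n_similar_pairs / n_compared_pairs


# Made-up counts for illustration only; they are NOT results from the study.
toy = ConfusionCounts(true_pos=90, false_pos=30, true_neg=70, false_neg=10)
toy_sensitivity = sensitivity(toy)
toy_specificity = specificity(toy)
toy_frequency = similar_pair_frequency(29, 1000)
```

The same definitions apply regardless of how "similar" is decided; in the study that decision comes from eTBLAST similarity scores, which this sketch does not attempt to reproduce.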

Suggested Citation

  • Zhaohui Sun & Mounir Errami & Tara Long & Chris Renard & Nishant Choradia & Harold Garner, 2010. "Systematic Characterizations of Text Similarity in Full Text Biomedical Publications," PLOS ONE, Public Library of Science, vol. 5(9), pages 1-6, September.
  • Handle: RePEc:plo:pone00:0012704
    DOI: 10.1371/journal.pone.0012704

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0012704
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0012704&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Mounir Errami & Harold Garner, 2008. "A tale of two citations," Nature, Nature, vol. 451(7177), pages 397-399, January.

    Citations



    Cited by:

    1. Mercedes Echeverria & David Stuart & Tobias Blanke, 2015. "Medical theses and derivative articles: dissemination of contents and publication patterns," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(1), pages 559-586, January.
    2. Vanja Pupovac, 2021. "The frequency of plagiarism identified by text-matching software in scientific articles: a systematic review and meta-analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(11), pages 8981-9003, November.
    3. Antonio García-Romero & José Manuel Estrada-Lorenzo, 2014. "A bibliometric analysis of plagiarism and self-plagiarism through Déjà vu," Scientometrics, Springer;Akadémiai Kiadó, vol. 101(1), pages 381-396, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jerome K. Vanclay, 2012. "Impact factor: outdated artefact or stepping-stone to journal certification?," Scientometrics, Springer;Akadémiai Kiadó, vol. 92(2), pages 211-238, August.
    2. Michael McAleer & Judit Olah & Jozsef Popp, 2018. "Pros and Cons of the Impact Factor in a Rapidly Changing Digital World," Tinbergen Institute Discussion Papers 18-014/III, Tinbergen Institute.
    3. Antonio García-Romero & José Manuel Estrada-Lorenzo, 2014. "A bibliometric analysis of plagiarism and self-plagiarism through Déjà vu," Scientometrics, Springer;Akadémiai Kiadó, vol. 101(1), pages 381-396, October.
    4. John M McPartland, 2009. "Obesity, the Endocannabinoid System, and Bias Arising from Pharmaceutical Sponsorship," PLOS ONE, Public Library of Science, vol. 4(3), pages 1-7, March.
    5. Chekhovich, Yury V. & Khazov, Andrey V., 2022. "Analysis of duplicated publications in Russian journals," Journal of Informetrics, Elsevier, vol. 16(1).
