
Protocol for a reproducible experimental survey on biomedical sentence similarity

Author

Listed:
  • Alicia Lara-Clares
  • Juan J Lastra-Díaz
  • Ana Garcia-Serrano

Abstract

Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, sentence similarity methods for the biomedical domain have attracted a lot of attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced, for reasons that include the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup. As a consequence of this reproducibility gap, the state of the problem cannot be elucidated, nor can new lines of research be soundly set. There are also other significant gaps in the literature on biomedical sentence similarity: (1) several unexplored sentence similarity methods that deserve to be studied have not been evaluated; (2) an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR), has not been evaluated; (3) the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of sentence similarity methods has not been studied; and finally, (4) software and data resources for the reproducibility of methods and experiments in this line of research are lacking. To address these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, most up-to-date and, for the first time, reproducible experimental survey on biomedical sentence similarity.
This experimental survey will be based on our own software replication and evaluation of all the methods under study on a single software platform, which will be developed specifically for this work and will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
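
As a rough illustration of the kind of experiment the abstract describes, the sketch below scores sentence pairs with a simple token-overlap (Jaccard) measure and compares the scores with human judgments using Pearson correlation. It is not the authors' software library or protocol: the choice of measure, the function names, the example sentences, and the human scores are all assumptions made up for this illustration.

```python
import math
import re


def tokenize(sentence):
    """Lowercase a sentence and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def jaccard_similarity(first, second):
    """Token-overlap similarity in [0, 1]; a stand-in for any similarity method."""
    a, b = tokenize(first), tokenize(second)
    if not a and not b:
        return 1.0  # two empty sentences are treated as identical
    return len(a & b) / len(a | b)


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (norm_x * norm_y)


def evaluate(pairs, human_scores, measure=jaccard_similarity):
    """Correlate a measure's scores with human similarity judgments."""
    predicted = [measure(first, second) for first, second in pairs]
    return pearson(predicted, human_scores)


if __name__ == "__main__":
    # Hypothetical sentence pairs and human scores, invented for this sketch;
    # they are not taken from CTR or any other benchmark.
    toy_pairs = [
        ("The protein binds DNA.", "The protein binds RNA."),
        ("Gene expression is repressed by the regulator.",
         "The regulator represses gene expression."),
        ("The protein binds DNA.", "Patients received a placebo."),
    ]
    toy_human_scores = [3.0, 4.5, 0.0]
    print(evaluate(toy_pairs, toy_human_scores))
```

Passing the measure as a parameter to `evaluate` lets different similarity methods be run through the same evaluation code, which is one simple way to keep comparisons on a single platform.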

Suggested Citation

  • Alicia Lara-Clares & Juan J Lastra-Díaz & Ana Garcia-Serrano, 2021. "Protocol for a reproducible experimental survey on biomedical sentence similarity," PLOS ONE, Public Library of Science, vol. 16(3), pages 1-28, March.
  • Handle: RePEc:plo:pone00:0248663
    DOI: 10.1371/journal.pone.0248663

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0248663
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0248663&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0248663?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Hamed Hassanzadeh & Tudor Groza & Anthony Nguyen & Jane Hunter, 2015. "A Supervised Approach to Quantifying Sentence Similarity: With Application to Evidence Based Medicine," PLOS ONE, Public Library of Science, vol. 10(6), pages 1-25, June.
    2. Haibin Liu & Lawrence Hunter & Vlado Kešelj & Karin Verspoor, 2013. "Approximate Subgraph Matching-Based Literature Mining for Biomedical Events and Relations," PLOS ONE, Public Library of Science, vol. 8(4), pages 1-16, April.
    3. Tomoyuki Kajiwara & Danushka Bollegala & Yuichi Yoshida & Ken-ichi Kawarabayashi, 2017. "An iterative approach for the global estimation of sentence similarity," PLOS ONE, Public Library of Science, vol. 12(9), pages 1-15, September.
    4. Yue Shang & Yanpeng Li & Hongfei Lin & Zhihao Yang, 2011. "Enhancing Biomedical Text Summarization Using Semantic Relation Extraction," PLOS ONE, Public Library of Science, vol. 6(8), pages 1-10, August.
    5. Kevin W Boyack & David Newman & Russell J Duhon & Richard Klavans & Michael Patek & Joseph R Biberstine & Bob Schijvenaars & André Skupin & Nianli Ma & Katy Börner, 2011. "Clustering More than Two Million Biomedical Publications: Comparing the Accuracies of Nine Text-Based Similarity Approaches," PLOS ONE, Public Library of Science, vol. 6(3), pages 1-11, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Peter Sjögårde & Fereshteh Didegah, 2022. "The association between topic growth and citation impact of research publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(4), pages 1903-1921, April.
    2. Paul Donner, 2021. "Validation of the Astro dataset clustering solutions with external data," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1619-1645, February.
    3. Lin Zhang & Beibei Sun & Fei Shu & Ying Huang, 2022. "Comparing paper level classifications across different methods and systems: an investigation of Nature publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(12), pages 7633-7651, December.
    4. Manuel A. Vázquez & Jorge Pereira-Delgado & Jesús Cid-Sueiro & Jerónimo Arenas-García, 2022. "Validation of scientific topic models using graph analysis and corpus metadata," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(9), pages 5441-5458, September.
    5. Lovro Šubelj & Nees Jan van Eck & Ludo Waltman, 2016. "Clustering Scientific Publications Based on Citation Relations: A Systematic Comparison of Different Methods," PLOS ONE, Public Library of Science, vol. 11(4), pages 1-23, April.
    6. Ballester, Omar & Penner, Orion, 2022. "Robustness, replicability and scalability in topic modelling," Journal of Informetrics, Elsevier, vol. 16(1).
    7. Milad Dehghani & Ki Joon Kim, 2019. "Past and Present Research on Wearable Technologies: Bibliometric and Cluster Analyses of Published Research from 2000 to 2016," International Journal of Innovation and Technology Management (IJITM), World Scientific Publishing Co. Pte. Ltd., vol. 16(01), pages 1-21, February.
    8. Juste Raimbault, 2019. "Exploration of an interdisciplinary scientific landscape," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(2), pages 617-641, May.
    9. Renchu Guan & Chen Yang & Maurizio Marchese & Yanchun Liang & Xiaohu Shi, 2014. "Full Text Clustering and Relationship Network Analysis of Biomedical Publications," PLOS ONE, Public Library of Science, vol. 9(9), pages 1-9, September.
    10. Francesco Giovanni Avallone & Alberto Quagli & Paola Ramassa, 2022. "Interdisciplinary research by accounting scholars: An exploratory study," FINANCIAL REPORTING, FrancoAngeli Editore, vol. 2022(2), pages 5-34.
    11. Michael Rennings & Philipp Baaden & Carolin Block & Marcus John & Stefanie Bröring, 2024. "Assessing emerging sustainability-oriented technologies: the case of precision agriculture," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(6), pages 2969-2998, June.
    12. Yun, Jinhyuk, 2022. "Generalization of bibliographic coupling and co-citation using the node split network," Journal of Informetrics, Elsevier, vol. 16(2).
    13. Rey-Long Liu, 2015. "Passage-Based Bibliographic Coupling: An Inter-Article Similarity Measure for Biomedical Articles," PLOS ONE, Public Library of Science, vol. 10(10), pages 1-22, October.
    14. Hanwen Xu & Addie Woicik & Hoifung Poon & Russ B. Altman & Sheng Wang, 2023. "Multilingual translation for zero-shot biomedical classification using BioTranslator," Nature Communications, Nature, vol. 14(1), pages 1-13, December.
    15. Xu, Shuo & Hao, Liyuan & An, Xin & Yang, Guancan & Wang, Feifei, 2019. "Emerging research topics detection with multiple machine learning models," Journal of Informetrics, Elsevier, vol. 13(4).
    16. Sitaram Devarakonda & Dmitriy Korobskiy & Tandy Warnow & George Chacko, 2020. "Viewing computer science through citation analysis: Salton and Bergmark Redux," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(1), pages 271-287, October.
    17. Urdiales, Cristina & Guzmán, Eduardo, 2024. "An automatic and association-based procedure for hierarchical publication subject categorization," Journal of Informetrics, Elsevier, vol. 18(1).
    18. Chen, Liang & Xu, Shuo & Zhu, Lijun & Zhang, Jing & Xu, Haiyun & Yang, Guancan, 2022. "A semantic main path analysis method to identify multiple developmental trajectories," Journal of Informetrics, Elsevier, vol. 16(2).
    19. Ai Linh Nguyen & Wenyuan Liu & Khiam Aik Khor & Andrea Nanetti & Siew Ann Cheong, 2022. "Strategic differences between regional investments into graphene technology and how corporations and universities manage patent portfolios," Papers 2208.03719, arXiv.org.
    20. Fei Shu & Yue Ma & Junping Qiu & Vincent Larivière, 2020. "Classifications of science and their effects on bibliometric evaluations," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2727-2744, December.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.