Printed from https://ideas.repec.org/a/spr/scient/v127y2022i6d10.1007_s11192-022-04381-y.html

A refinement strategy for identification of scientific software from bioinformatics publications

Author

Listed:
  • Lu Jiang

    (Nanjing Agricultural University
    Chengdu Library and Information Center of Chinese Academy of Sciences)

  • Xinyu Kang

    (Chengdu University of Technology)

  • Shan Huang

    (Sun Yat-Sen University)

  • Bo Yang

    (Nanjing Agricultural University)

Abstract

In bioinformatics, many classical software packages have become necessary research tools. To measure the influence of scientific software as an important kind of intellectual product, a few strategies have been proposed that identify software names in the full texts of papers and so collect usage data for packages in bioinformatics research. However, the performance of these strategies is limited because the data in full texts are highly imbalanced. This study proposes EnsembleSVMs-CRF, a two-step refinement strategy based on ensemble learning that progressively raises the share of sentences containing software mentions in order to improve the performance of named entity recognition. An experiment on a bioinformatics corpus shows that EnsembleSVMs-CRF, with a local F1 of 78.81% and a global F1-A of 73.49%, outperforms both the rule-based bootstrapping method and direct CRF. Applying this strategy to articles published between 2013 and 2017 in 27 bioinformatics journals extracted 8,239 unique packages. Most of the 50 most popular packages thus identified are professional software that generally requires interdisciplinary knowledge rather than programming skill. We also found that researchers in bioinformatics tend to use free scientific software, and that the use of general software is increasing relative to professional software.
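The refinement idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not describe the SVM features, the ensemble rule, or the CRF setup, so the three keyword-based voters, the `min_votes` threshold, and the sample sentences below are all hypothetical stand-ins. The sketch only shows how a majority-vote filter could raise the share of sentences that contain software mentions before a sequence labeller such as a CRF is applied (that second step is omitted here).

```python
from typing import Callable, List, Tuple

# (sentence text, True if it really contains a software mention)
Sentence = Tuple[str, bool]


# Hypothetical weak voters standing in for trained SVM classifiers.
def has_digit(text: str) -> bool:
    """Version numbers often accompany software names."""
    return any(ch.isdigit() for ch in text)


def has_usage_cue(text: str) -> bool:
    """Phrases that often introduce a tool."""
    lowered = text.lower()
    return any(cue in lowered for cue in ("using", "performed with", "implemented in"))


def has_capitalized_token(text: str) -> bool:
    """Skip the first word, which is capitalized in every sentence."""
    return any(tok[:1].isupper() for tok in text.split()[1:])


VOTERS: List[Callable[[str], bool]] = [has_digit, has_usage_cue, has_capitalized_token]


def keep_sentence(text: str, min_votes: int = 2) -> bool:
    """Keep a sentence when at least `min_votes` voters flag it (assumed rule)."""
    return sum(voter(text) for voter in VOTERS) >= min_votes


def positive_ratio(sentences: List[Sentence]) -> float:
    """Share of sentences that contain a software mention."""
    if not sentences:
        return 0.0
    return sum(label for _, label in sentences) / len(sentences)


# Invented example sentences; most carry no software mention (imbalanced data).
corpus: List[Sentence] = [
    ("Reads were aligned using Bowtie 2 with default settings.", True),
    ("Phylogenetic trees were built using MEGA 6.", True),
    ("The samples were stored at low temperature.", False),
    ("We thank the reviewers for helpful comments.", False),
    ("Gene expression varied across the tissues.", False),
    ("Samples were collected in Nanjing in 2015.", False),
]

# Step one of the sketch: filter sentences by ensemble vote.
filtered = [(text, label) for text, label in corpus if keep_sentence(text)]

print(f"before filtering: {positive_ratio(corpus):.2f}")
print(f"after filtering:  {positive_ratio(filtered):.2f}")
```

The filter is deliberately imperfect: one sentence without a software mention survives the vote, which is why a second, finer-grained step is still needed on the retained sentences.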

Suggested Citation

  • Lu Jiang & Xinyu Kang & Shan Huang & Bo Yang, 2022. "A refinement strategy for identification of scientific software from bioinformatics publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(6), pages 3293-3316, June.
  • Handle: RePEc:spr:scient:v:127:y:2022:i:6:d:10.1007_s11192-022-04381-y
    DOI: 10.1007/s11192-022-04381-y

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-022-04381-y
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s11192-022-04381-y?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item