Printed from https://ideas.repec.org/a/spr/scient/v129y2024i6d10.1007_s11192-024-05048-6.html

Extracting problem and method sentence from scientific papers: a context-enhanced transformer using formulaic expression desensitization

Authors

  • Yingyi Zhang

    (Soochow University)

  • Chengzhi Zhang

    (Nanjing University of Science and Technology)

Abstract

The enormous volume of scientific literature creates a need to identify the essential parts of this massive body of text. Scientific research proceeds from posing problems to applying methods, so to capture the main idea of a scientific paper we focus on extracting its problem and method sentences. Annotating sentences within scientific papers is labor-intensive, which results in small-scale datasets that limit how much information models can learn. With so little information, models come to rely heavily on specific forms, which in turn reduces their generalization capability. This paper addresses the problems caused by small-scale datasets from three perspectives: increasing dataset scale, reducing dependence on specific forms, and enriching the information within sentences. To implement the first two ideas, we introduce the concept of formulaic expression (FE) desensitization and propose FE desensitization-based data augmenters that generate synthetic data and reduce models’ reliance on FEs. For the third idea, we propose a context-enhanced transformer that uses context to measure the importance of words in target sentences and reduces noise in the context. We also conduct experiments with in-context learning (ICL) methods based on large language models (LLMs). Quantitative and qualitative experiments demonstrate that our proposed models achieve higher macro F1 scores than the baseline models on two scientific paper datasets, with improvements of 3.71% and 2.67%, respectively. The LLM-based ICL methods are found to be unsuitable for the task of problem and method extraction.
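
The abstract reports its results as macro F1 scores. As a reading aid, the following is a minimal sketch of how macro F1 is computed for a sentence classification task: F1 is calculated per class and then averaged with equal weight, so rare classes count as much as frequent ones. The label names ("problem", "method", "other") and the toy data are assumptions for illustration only; they are not taken from the paper's datasets or code.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # predicted p, but it was wrong
            fn[g] += 1          # missed an instance of class g

    scores = []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for four sentences (illustrative only).
gold = ["problem", "method", "other", "method"]
pred = ["problem", "other", "other", "method"]
print(round(macro_f1(gold, pred), 4))
```

Because every class contributes equally to the average, a model that ignores a small class is penalized more under macro F1 than under accuracy, which is one reason the metric is commonly used when class sizes are imbalanced.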

Suggested Citation

  • Yingyi Zhang & Chengzhi Zhang, 2024. "Extracting problem and method sentence from scientific papers: a context-enhanced transformer using formulaic expression desensitization," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(6), pages 3433-3468, June.
  • Handle: RePEc:spr:scient:v:129:y:2024:i:6:d:10.1007_s11192-024-05048-6
    DOI: 10.1007/s11192-024-05048-6

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-024-05048-6
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Yuan Zhou & Fang Dong & Yufei Liu & Zhaofu Li & JunFei Du & Li Zhang, 2020. "Forecasting emerging technologies using data augmentation and deep learning," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(1), pages 1-29, April.
    2. Luo, Zhuoran & Lu, Wei & He, Jiangen & Wang, Yuqi, 2022. "Combination of research questions and methods: A new measurement of scientific novelty," Journal of Informetrics, Elsevier, vol. 16(2).
    3. Yingyi Zhang & Chengzhi Zhang & Jing Li, 2020. "Joint Modeling of Characters, Words, and Conversation Contexts for Microblog Keyphrase Extraction," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 71(5), pages 553-567, May.
    4. Pan, Xuelian & Yan, Erjia & Wang, Qianqian & Hua, Weina, 2015. "Assessing the impact of software on science: A bootstrapped learning of software entities in full-text papers," Journal of Informetrics, Elsevier, vol. 9(4), pages 860-871.
    5. Iqra Safder & Saeed-Ul Hassan, 2019. "Bibliometric-enhanced information retrieval: a novel deep feature engineering approach for algorithm searching from full-text publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(1), pages 257-277, April.
    6. Mengnan Zhao & Erjia Yan & Kai Li, 2018. "Data set mentions and citations: A content analysis of full-text publications," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 69(1), pages 32-46, January.
    7. Lutz Bornmann & Rüdiger Mutz, 2015. "Growth rates of modern science: A bibliometric analysis based on the number of publications and cited references," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 66(11), pages 2215-2222, November.
    8. Wenhan Chao & Mengyuan Chen & Xian Zhou & Zhunchen Luo, 2023. "A joint framework for identifying the type and arguments of scientific contribution," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(6), pages 3347-3376, June.
    9. Kevin Heffernan & Simone Teufel, 2018. "Identifying problems and solutions in scientific text," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1367-1382, August.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Wang, Yuzhuo & Zhang, Chengzhi, 2020. "Using the full-text content of academic articles to identify and evaluate algorithm entities in the domain of natural language processing," Journal of Informetrics, Elsevier, vol. 14(4).
    2. Yuzhuo Wang & Chengzhi Zhang & Kai Li, 2022. "A review on method entities in the academic literature: extraction, evaluation, and application," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(5), pages 2479-2520, May.
    3. Saeed-Ul Hassan & Naif R. Aljohani & Mudassir Shabbir & Umair Ali & Sehrish Iqbal & Raheem Sarwar & Eugenio Martínez-Cámara & Sebastián Ventura & Francisco Herrera, 2020. "Tweet Coupling: a social media methodology for clustering scientific publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(2), pages 973-991, August.
    4. Iqra Safder & Saeed-Ul Hassan, 2019. "Bibliometric-enhanced information retrieval: a novel deep feature engineering approach for algorithm searching from full-text publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(1), pages 257-277, April.
    5. Dejing Kong & Jianzhong Yang & Lingfeng Li, 2020. "Early identification of technological convergence in numerical control machine tool: a deep learning approach," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 1983-2009, December.
    6. Ramona Weinrich, 2019. "Opportunities for the Adoption of Health-Based Sustainable Dietary Patterns: A Review on Consumer Research of Meat Substitutes," Sustainability, MDPI, vol. 11(15), pages 1-15, July.
    7. Piers Steel & Sjoerd Beugelsdijk & Herman Aguinis, 2021. "The anatomy of an award-winning meta-analysis: Recommendations for authors, reviewers, and readers of meta-analytic reviews," Journal of International Business Studies, Palgrave Macmillan;Academy of International Business, vol. 52(1), pages 23-44, February.
    8. Dunaiski, Marcel & Geldenhuys, Jaco & Visser, Willem, 2019. "On the interplay between normalisation, bias, and performance of paper impact metrics," Journal of Informetrics, Elsevier, vol. 13(1), pages 270-290.
    9. Augusteijn, Hilde Elisabeth Maria & van Aert, Robbie Cornelis Maria & van Assen, Marcel A. L. M., 2021. "Posterior Probabilities of Effect Sizes and Heterogeneity in Meta-Analysis: An Intuitive Approach of Dealing with Publication Bias," OSF Preprints avkgj, Center for Open Science.
    10. Ruhua Huang & Yuting Huang & Fan Qi & Leyi Shi & Baiyang Li & Wei Yu, 2022. "Exploring the characteristics of special issues: distribution, topicality, and citation impact," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(9), pages 5233-5256, September.
    11. Neal R. Haddaway & Max W. Callaghan & Alexandra M. Collins & William F. Lamb & Jan C. Minx & James Thomas & Denny John, 2020. "On the use of computer‐assistance to facilitate systematic mapping," Campbell Systematic Reviews, John Wiley & Sons, vol. 16(4), December.
    12. Vincent Raoult, 2020. "How Many Papers Should Scientists Be Reviewing? An Analysis Using Verified Peer Review Reports," Publications, MDPI, vol. 8(1), pages 1-9, January.
    13. Eloy López-Meneses & Esteban Vázquez-Cano & Mariana-Daniela González-Zamar & Emilio Abad-Segura, 2020. "Socioeconomic Effects in Cyberbullying: Global Research Trends in the Educational Context," IJERPH, MDPI, vol. 17(12), pages 1-31, June.
    14. Zamani, Mehdi & Yalcin, Haydar & Naeini, Ali Bonyadi & Zeba, Gordana & Daim, Tugrul U, 2022. "Developing metrics for emerging technologies: identification and assessment," Technological Forecasting and Social Change, Elsevier, vol. 176(C).
    15. Sam Arts & Nicola Melluso & Reinhilde Veugelers, 2023. "Beyond Citations: Measuring Novel Scientific Ideas and their Impact in Publication Text," Papers 2309.16437, arXiv.org, revised Oct 2024.
    16. Jinseok Kim & Jinmo Kim & Jason Owen-Smith, 2019. "Generating automatically labeled data for author name disambiguation: an iterative clustering method," Scientometrics, Springer;Akadémiai Kiadó, vol. 118(1), pages 253-280, January.
    17. Pan, Xuelian & Yan, Erjia & Cui, Ming & Hua, Weina, 2018. "Examining the usage, citation, and diffusion patterns of bibliometric mapping software: A comparative study of three tools," Journal of Informetrics, Elsevier, vol. 12(2), pages 481-493.
    18. June Young Lee & Sejung Ahn & Dohyun Kim, 2021. "Deep learning-based prediction of future growth potential of technologies," PLOS ONE, Public Library of Science, vol. 16(6), pages 1-16, June.
    19. Yunlei Lin & Yuan Zhou, 2023. "Identification of Hydrogen-Energy-Related Emerging Technologies Based on Text Mining," Sustainability, MDPI, vol. 16(1), pages 1-19, December.
    20. Enrique Orduña-Malea & Rodrigo Costas, 2021. "Link-based approach to study scientific software usage: the case of VOSviewer," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(9), pages 8153-8186, September.
