
SDCF: semi-automatically structured dataset of citation functions

Authors

  • Setio Basuki (Toyohashi University of Technology)
  • Masatoshi Tsuchiya (Toyohashi University of Technology)

Abstract

There is increasing research interest in the automatic detection of citation functions, that is, the reasons why authors of academic papers cite previous works. A machine learning approach to this task requires a large dataset with a varied set of citation function labels. However, existing datasets contain few instances and a limited number of labels. Furthermore, most labeling schemes have been built from narrow research fields. To address these issues, this paper proposes a semiautomatic approach to developing a large dataset of citation functions based on two types of datasets. The first type contains 5,668 manually labeled instances used to develop a new labeling scheme of citation functions, and the second type is the final dataset, which is built automatically. Our labeling scheme covers papers from various areas of computer science, resulting in five coarse labels and 21 fine-grained labels. To validate the scheme, two annotators carried out annotation experiments on 421 instances, which produced Cohen’s Kappa values of 0.85 for coarse labels and 0.71 for fine-grained labels. Following this, we performed two classification stages, filtering and fine-grained classification, to build models using the first dataset. The classification followed several scenarios, including active learning (AL) in a low-resource setting. Our experiments show that AL based on Bidirectional Encoder Representations from Transformers (BERT) achieved 90.29% accuracy, outperforming the other methods in the filtering stage. In the fine-grained stage, the SciBERT-based AL strategy achieved a competitive 81.15% accuracy, which was slightly lower than that of the non-AL strategy. These results show that AL is promising, since it requires less than half of the dataset. Considering the number of labels, this paper releases the largest dataset, consisting of 1,840,815 instances.
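
The agreement figures in the abstract are Cohen's Kappa values, which correct the raw agreement between two annotators for the agreement expected by chance. The following is a minimal sketch of that computation in plain Python, not the authors' code: the function name `cohens_kappa` and the toy label names are hypothetical and chosen only for illustration, and the labels are not taken from the SDCF dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy annotations, invented for illustration (hypothetical label names).
annotator_1 = ["background", "uses", "background", "compares", "uses", "background"]
annotator_2 = ["background", "uses", "compares", "compares", "uses", "background"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the annotators agree on five of six items, and the kappa value is lower than that raw agreement rate because part of the agreement is attributed to chance.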

Suggested Citation

  • Setio Basuki & Masatoshi Tsuchiya, 2022. "SDCF: semi-automatically structured dataset of citation functions," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(8), pages 4569-4608, August.
  • Handle: RePEc:spr:scient:v:127:y:2022:i:8:d:10.1007_s11192-022-04471-x
    DOI: 10.1007/s11192-022-04471-x

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-022-04471-x
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Faiza Qayyum & Harun Jamil & Naeem Iqbal & DoHyeun Kim & Muhammad Tanvir Afzal, 2022. "Toward potential hybrid features evaluation using MLP-ANN binary classification model to tackle meaningful citations," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(11), pages 6471-6499, November.
    2. Naif Radi Aljohani & Ayman Fayoumi & Saeed-Ul Hassan, 2021. "An in-text citation classification predictive model for a scholarly search system," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 5509-5529, July.
    3. Xin An & Xin Sun & Shuo Xu, 2022. "Important citations identification with semi-supervised classification model," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(11), pages 6533-6555, November.
    4. Xiaorui Jiang & Jingqiang Chen, 2023. "Contextualised segment-wise citation function classification," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(9), pages 5117-5158, September.
    5. Sehrish Iqbal & Saeed-Ul Hassan & Naif Radi Aljohani & Salem Alelyani & Raheel Nawaz & Lutz Bornmann, 2021. "A decade of in-text citation analysis based on natural language processing and machine learning techniques: an overview of empirical studies," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(8), pages 6551-6599, August.
    6. Matthias Sebastian Rüdiger & David Antons & Torsten-Oliver Salge, 2021. "The explanatory power of citations: a new approach to unpacking impact in science," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(12), pages 9779-9809, December.
    7. Uttam Bandyopadhyay & Atanu Biswas & Shirsendu Mukherjee, 2009. "Adaptive two-treatment two-period crossover design for binary treatment responses incorporating carry-over effects," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 18(1), pages 13-33, March.
    8. Yi Bu & Binglu Wang & Win-bin Huang & Shangkun Che & Yong Huang, 2018. "Using the appearance of citations in full text on author co-citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(1), pages 275-289, July.
    9. Preety Srivastava & Xueyan Zhao, 2010. "What Do the Bingers Drink? Micro‐Unit Evidence on Negative Externalities and Drinker Characteristics of Alcohol Consumption by Beverage Types," Economic Papers, The Economic Society of Australia, vol. 29(2), pages 229-250, June.
    10. Holger Schwender & Margaret A. Taub & Terri H. Beaty & Mary L. Marazita & Ingo Ruczinski, 2012. "Rapid Testing of SNPs and Gene–Environment Interactions in Case–Parent Trio Data Based on Exact Analytic Parameter Estimation," Biometrics, The International Biometric Society, vol. 68(3), pages 766-773, September.
    11. Matysková, Ludmila & Rogers, Brian & Steiner, Jakub & Sun, Keh-Kuan, 2020. "Habits as adaptations: An experimental study," Games and Economic Behavior, Elsevier, vol. 122(C), pages 391-406.
    12. André, Kévin, 2013. "Applying the Capability Approach to the French Education System: An Assessment of the "Pourquoi pas moi ?"," ESSEC Working Papers WP1316, ESSEC Research Center, ESSEC Business School.
    13. Constantin Bürgi & Klaus Wohlrabe, 2022. "The influence of Covid-19 on publications in economics: bibliometric evidence from five working paper series," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(9), pages 5175-5189, September.
    14. Ruiz-Frau, A. & Krause, T. & Marbà, N., 2018. "The use of sociocultural valuation in sustainable environmental management," Ecosystem Services, Elsevier, vol. 29(PA), pages 158-167.
    15. Busrul Iman & Imam Yuadi & Badri Munir Sukoco & Rudi Purwono & Chih-Chien Hu, 2023. "Mapping Research Trends With Factorial Analysis in Organizational Politics," SAGE Open, vol. 13(4), pages 21582440231, December.
    16. Abramo, Giovanni & D'Angelo, Ciriaco Andrea & Grilli, Leonardo, 2021. "The effects of citation-based research evaluation schemes on self-citation behavior," Journal of Informetrics, Elsevier, vol. 15(4).
    17. AlMalki, Hameeda A. & Durugbo, Christopher M., 2023. "Evaluating critical institutional factors of Industry 4.0 for education reform," Technological Forecasting and Social Change, Elsevier, vol. 188(C).
    18. Guevara, C. Angelo & Fukushi, Mitsuyoshi, 2016. "Modeling the decoy effect with context-RUM Models: Diagrammatic analysis and empirical evidence from route choice SP and mode choice RP case studies," Transportation Research Part B: Methodological, Elsevier, vol. 93(PA), pages 318-337.
    19. Melo, Grace & Palma, Marco A. & Ribera, Luis A., 2024. "Are experts overoptimistic about the success of food market labeling information?," 2024 Annual Meeting, July 28-30, New Orleans, LA 343870, Agricultural and Applied Economics Association.
    20. Mahira Ahmad & Amina Muazzam & Ambreen Anjum & Anna Visvizi & Raheel Nawaz, 2020. "Linking Work-Family Conflict (WFC) and Talent Management: Insights from a Developing Country," Sustainability, MDPI, vol. 12(7), pages 1-17, April.
