SDCF: semi-automatically structured dataset of citation functions

Authors

  • Setio Basuki (Toyohashi University of Technology)
  • Masatoshi Tsuchiya (Toyohashi University of Technology)

Abstract

There is increasing research interest in the automatic detection of citation functions, that is, the reasons why authors of academic papers cite previous works. A machine learning approach to this task requires a large dataset with varied citation function labels. However, existing datasets contain few instances and a limited number of labels. Furthermore, most labels have been derived from narrow research fields. To address these issues, this paper proposes a semiautomatic approach to developing a large dataset of citation functions based on two types of datasets. The first contains 5668 manually labeled instances used to develop a new labeling scheme of citation functions; the second is the final dataset, which is built automatically. Our labeling scheme covers papers from various areas of computer science and comprises five coarse labels and 21 fine-grained labels. To validate the scheme, two annotators performed annotation experiments on 421 instances, which yielded Cohen’s Kappa values of 0.85 for coarse labels and 0.71 for fine-grained labels. We then built models on the first dataset in two classification stages: filtering and fine-grained classification. The classification followed several scenarios, including active learning (AL) in a low-resource setting. Our experiments show that AL based on Bidirectional Encoder Representations from Transformers (BERT) achieved 90.29% accuracy in the filtering stage, outperforming other methods. In the fine-grained stage, the SciBERT-based AL strategy achieved a competitive 81.15% accuracy, slightly lower than the non-AL strategy. These results show that AL is promising, since it requires less than half of the dataset. Considering the number of labels, this paper releases the largest dataset, consisting of 1,840,815 instances.
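
The agreement figures in the abstract are Cohen's Kappa values, which correct the raw agreement between two annotators for the agreement expected by chance. As a minimal sketch of how such a value can be computed (this is the standard textbook formula, not the authors' code; the function name `cohen_kappa` and the toy label names are hypothetical and are not the labels of the SDCF scheme):

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same instances."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set")
    n = len(labels_a)

    # Observed agreement: fraction of instances with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both annotators pick the same
    # label if each labels independently with their own label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    all_labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[l] * freq_b[l] for l in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data with hypothetical label names (illustration only).
annotator_1 = ["background", "use", "use", "compare", "background", "use"]
annotator_2 = ["background", "use", "compare", "compare", "background", "background"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))  # 0.52
```

In this toy case the annotators agree on 4 of 6 instances (0.667), chance agreement is 11/36 (0.306), and kappa is therefore 13/25 = 0.52. A value of 1 means perfect agreement and 0 means agreement no better than chance.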

Suggested Citation

  • Setio Basuki & Masatoshi Tsuchiya, 2022. "SDCF: semi-automatically structured dataset of citation functions," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(8), pages 4569-4608, August.
  • Handle: RePEc:spr:scient:v:127:y:2022:i:8:d:10.1007_s11192-022-04471-x
    DOI: 10.1007/s11192-022-04471-x

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-022-04471-x
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Faiza Qayyum & Harun Jamil & Naeem Iqbal & DoHyeun Kim & Muhammad Tanvir Afzal, 2022. "Toward potential hybrid features evaluation using MLP-ANN binary classification model to tackle meaningful citations," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(11), pages 6471-6499, November.
    2. Naif Radi Aljohani & Ayman Fayoumi & Saeed-Ul Hassan, 2021. "An in-text citation classification predictive model for a scholarly search system," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 5509-5529, July.
    3. Xiaorui Jiang & Jingqiang Chen, 2023. "Contextualised segment-wise citation function classification," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(9), pages 5117-5158, September.
    4. Xin An & Xin Sun & Shuo Xu, 2022. "Important citations identification with semi-supervised classification model," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(11), pages 6533-6555, November.
    5. Sehrish Iqbal & Saeed-Ul Hassan & Naif Radi Aljohani & Salem Alelyani & Raheel Nawaz & Lutz Bornmann, 2021. "A decade of in-text citation analysis based on natural language processing and machine learning techniques: an overview of empirical studies," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(8), pages 6551-6599, August.
    6. Matthias Sebastian Rüdiger & David Antons & Torsten-Oliver Salge, 2021. "The explanatory power of citations: a new approach to unpacking impact in science," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(12), pages 9779-9809, December.
    7. Uttam Bandyopadhyay & Atanu Biswas & Shirsendu Mukherjee, 2009. "Adaptive two-treatment two-period crossover design for binary treatment responses incorporating carry-over effects," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 18(1), pages 13-33, March.
    8. Chao Min & Qingyu Chen & Erjia Yan & Yi Bu & Jianjun Sun, 2021. "Citation cascade and the evolution of topic relevance," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 72(1), pages 110-127, January.
    9. Chacón, José E. & Fernández Serrano, Javier, 2024. "Bayesian taut splines for estimating the number of modes," Computational Statistics & Data Analysis, Elsevier, vol. 196(C).
    10. Bester Tawona Mudereri & Elfatih M. Abdel-Rahman & Shepard Ndlela & Louisa Delfin Mutsa Makumbe & Christabel Chiedza Nyanga & Henri E. Z. Tonnang & Samira A. Mohamed, 2022. "Integrating the Strength of Multi-Date Sentinel-1 and -2 Datasets for Detecting Mango ( Mangifera indica L.) Orchards in a Semi-Arid Environment in Zimbabwe," Sustainability, MDPI, vol. 14(10), pages 1-23, May.
    11. Yi Bu & Binglu Wang & Win-bin Huang & Shangkun Che & Yong Huang, 2018. "Using the appearance of citations in full text on author co-citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(1), pages 275-289, July.
    12. Sergiu Mihai Haţegan, 2021. "A Mapping Of The Literature On Econophysics," Annals of Faculty of Economics, University of Oradea, Faculty of Economics, vol. 1(1), pages 92-100, July.
    13. Dangzhi Zhao & Andreas Strotmann, 2020. "Telescopic and panoramic views of library and information science research 2011–2018: a comparison of four weighting schemes for author co-citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(1), pages 255-270, July.
    14. Nosi, Costanza & D’Agostino, Antonella & Pratesi, Carlo Alberto & Barbarossa, Camilla, 2021. "Evaluating a social marketing campaign on healthy nutrition and lifestyle among primary-school children: A mixed-method research design," Evaluation and Program Planning, Elsevier, vol. 89(C).
    15. John E. Core, 2010. "Discussion of Chief Executive Officer Equity Incentives and Accounting Irregularities," Journal of Accounting Research, Wiley Blackwell, vol. 48(2), pages 273-287, May.
    16. Preety Srivastava & Xueyan Zhao, 2010. "What Do the Bingers Drink? Micro‐Unit Evidence on Negative Externalities and Drinker Characteristics of Alcohol Consumption by Beverage Types," Economic Papers, The Economic Society of Australia, vol. 29(2), pages 229-250, June.
    17. Hanousek Jan & Kočenda Evžen & Novotný Jan, 2012. "The identification of price jumps," Monte Carlo Methods and Applications, De Gruyter, vol. 18(1), pages 53-77, January.
    18. Monnery, Benjamin & Wolff, François-Charles & Henneguelle, Anaïs, 2020. "Prison, semi-liberty and recidivism: Bounding causal effects in a survival model," International Review of Law and Economics, Elsevier, vol. 61(C).
    19. Holger Schwender & Margaret A. Taub & Terri H. Beaty & Mary L. Marazita & Ingo Ruczinski, 2012. "Rapid Testing of SNPs and Gene–Environment Interactions in Case–Parent Trio Data Based on Exact Analytic Parameter Estimation," Biometrics, The International Biometric Society, vol. 68(3), pages 766-773, September.
    20. Yuanyuan Liu & Qiang Wu & Shijie Wu & Yong Gao, 2021. "Weighted citation based on ranking-related contribution: a new index for evaluating article impact," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(10), pages 8653-8672, October.
