
Automatic identification of cited text spans: a multi-classifier approach over imbalanced dataset

Author

Listed:
  • Shutian Ma

    (Nanjing University of Science and Technology)

  • Jin Xu

    (Nanjing University of Science and Technology)

  • Chengzhi Zhang

    (Nanjing University of Science and Technology;
    Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University))

Abstract

Recently, a new form of structured summary of scientific papers has been explored, built by grouping cited text spans from the reference paper. Its primary goal is to generate summaries based on the cited paper itself. Traditional scientific summarization has focused on citation-based methods that aggregate all citances citing a given paper without content-based citation analysis, even though citations of the same paper can differ between researchers or over time. By examining the original text spans that scholars cite, the new method reflects the exact contributions of reference papers more faithfully. Identifying cited text spans accurately is therefore the first important problem to solve. Generally, it can be converted into finding the sentences in the reference paper that are more similar to the citation sentences. Treating this as a classification task, we investigate the potential of four actions to improve identification performance. First, features are selected carefully according to the multiple classifiers used. Second, we apply sampling-based algorithms to preprocess the class-imbalanced datasets. Since we integrate results via a weighted voting system, the third action is tuning parameters, such as the voting weights used to integrate the classifiers or the run settings, to see whether performance can be improved further. Evaluation results show the effectiveness of each action and demonstrate that researchers can take these actions to identify cited text spans more accurately when doing scientific summarization.
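
The abstract names two of its techniques only in general terms: sampling to rebalance the classes and weighted voting to combine classifiers. The sketch below illustrates both ideas under stated assumptions. Random oversampling stands in for the unspecified "sampling-based algorithms", and the function names, the 0/1 label convention, and the 0.5 decision threshold are hypothetical choices for illustration, not details taken from the paper.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes until every
    class has as many samples as the largest one.
    Assumption: the paper's sampling algorithm may differ from this one."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels


def weighted_vote(predictions, weights, threshold=0.5):
    """Combine the 0/1 predictions of several classifiers for one candidate
    sentence. Returns 1 (cited text span) when the weighted share of
    positive votes reaches the threshold, otherwise 0.
    Assumption: the threshold of 0.5 is an illustrative default."""
    positive = sum(w for p, w in zip(predictions, weights) if p == 1)
    return 1 if positive / sum(weights) >= threshold else 0


# Toy data: three non-cited sentences (0) and one cited sentence (1),
# each described by a single hypothetical similarity feature.
features = [[0.1], [0.2], [0.9], [0.3]]
labels = [0, 0, 1, 0]
balanced_features, balanced_labels = random_oversample(features, labels)
print(Counter(balanced_labels))

# Three hypothetical classifiers vote on one sentence; the first is
# trusted three times as much as each of the others.
print(weighted_vote([1, 0, 0], [3, 1, 1]))
```

Tuning the voting weights, the third action described in the abstract, corresponds here to changing the `weights` argument and re-evaluating the combined predictions.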

Suggested Citation

  • Shutian Ma & Jin Xu & Chengzhi Zhang, 2018. "Automatic identification of cited text spans: a multi-classifier approach over imbalanced dataset," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1303-1330, August.
  • Handle: RePEc:spr:scient:v:116:y:2018:i:2:d:10.1007_s11192-018-2754-2
    DOI: 10.1007/s11192-018-2754-2

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-018-2754-2
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Kevin W. Boyack & Henry Small & Richard Klavans, 2013. "Improving the accuracy of co-citation clustering using full text," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 64(9), pages 1759-1767, September.
    2. Aaron Elkiss & Siwei Shen & Anthony Fader & Güneş Erkan & David States & Dragomir Radev, 2008. "Blind men and elephants: What do citation summaries tell us about a research article?," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 59(1), pages 51-62, January.

    Citations

    Citations are extracted by the CitEc Project.


    Cited by:

    1. Pancheng Wang & Shasha Li & Haifang Zhou & Jintao Tang & Ting Wang, 2019. "Cited text spans identification with an improved balanced ensemble model," Scientometrics, Springer;Akadémiai Kiadó, vol. 120(3), pages 1111-1145, September.
    2. Moreno La Quatra & Luca Cagliero & Elena Baralis, 2020. "Exploiting pivot words to classify and summarize discourse facets of scientific papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 3139-3157, December.
    3. Wang, Shiyun & Mao, Jin & Lu, Kun & Cao, Yujie & Li, Gang, 2021. "Understanding interdisciplinary knowledge integration through citance analysis: A case study on eHealth," Journal of Informetrics, Elsevier, vol. 15(4).
    4. Iqra Safder & Saeed-Ul Hassan, 2019. "Bibliometric-enhanced information retrieval: a novel deep feature engineering approach for algorithm searching from full-text publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(1), pages 257-277, April.
    5. Guillaume Cabanac & Ingo Frommholz & Philipp Mayr, 2018. "Bibliometric-enhanced information retrieval: preface," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1225-1227, August.
    6. Moreno La Quatra & Luca Cagliero & Elena Baralis, 2021. "Leveraging full-text article exploration for citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(10), pages 8275-8293, October.
    7. Sehrish Iqbal & Saeed-Ul Hassan & Naif Radi Aljohani & Salem Alelyani & Raheel Nawaz & Lutz Bornmann, 2021. "A decade of in-text citation analysis based on natural language processing and machine learning techniques: an overview of empirical studies," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(8), pages 6551-6599, August.
    8. Naif Radi Aljohani & Ayman Fayoumi & Saeed-Ul Hassan, 2021. "An in-text citation classification predictive model for a scholarly search system," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(7), pages 5509-5529, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Rey-Long Liu, 2017. "A new bibliographic coupling measure with descriptive capability," Scientometrics, Springer;Akadémiai Kiadó, vol. 110(2), pages 915-935, February.
    2. Dangzhi Zhao & Andreas Strotmann, 2020. "Telescopic and panoramic views of library and information science research 2011–2018: a comparison of four weighting schemes for author co-citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(1), pages 255-270, July.
    3. Kim, Ha Jin & Jeong, Yoo Kyung & Song, Min, 2016. "Content- and proximity-based author co-citation analysis using citation sentences," Journal of Informetrics, Elsevier, vol. 10(4), pages 954-966.
    4. Michel Zitt, 2015. "Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2223-2245, March.
    5. Marc Bertin & Iana Atanassova & Cassidy R. Sugimoto & Vincent Lariviere, 2016. "The linguistic patterns and rhetorical structure of citation context: an approach using n-grams," Scientometrics, Springer;Akadémiai Kiadó, vol. 109(3), pages 1417-1434, December.
    6. Kamal Sanguri & Atanu Bhuyan & Sabyasachi Patra, 2020. "A semantic similarity adjusted document co-citation analysis: a case of tourism supply chain," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(1), pages 233-269, October.
    7. Maryam Yaghtin & Hajar Sotudeh & Mahdieh Mirzabeigi & Seyed Mostafa Fakhrahmad & Mehdi Mohammadi, 2019. "In quest of new document relations: evaluating co-opinion relations between co-citations and its impact on Information retrieval effectiveness," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(2), pages 987-1008, May.
    8. Raja Habib & Muhammad Tanvir Afzal, 2019. "Sections-based bibliographic coupling for research paper recommendation," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(2), pages 643-656, May.
    9. Jeong, Yoo Kyung & Song, Min & Ding, Ying, 2014. "Content-based author co-citation analysis," Journal of Informetrics, Elsevier, vol. 8(1), pages 197-211.
    10. Takahiro Kawamura & Katsutaro Watanabe & Naoya Matsumoto & Shusaku Egami & Mari Jibu, 2018. "Funding map using paragraph embedding based on semantic diversity," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 941-958, August.
    11. Masaki Eto, 2013. "Evaluations of context-based co-citation searching," Scientometrics, Springer;Akadémiai Kiadó, vol. 94(2), pages 651-673, February.
    12. Small, Henry & Tseng, Hung & Patek, Mike, 2017. "Discovering discoveries: Identifying biomedical discoveries using citation contexts," Journal of Informetrics, Elsevier, vol. 11(1), pages 46-62.
    13. Kun Sun & Haitao Liu & Wenxin Xiong, 2021. "The evolutionary pattern of language in scientific writings: A case study of Philosophical Transactions of Royal Society (1665–1869)," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1695-1724, February.
    14. Annarelli, Alessandro & Battistella, Cinzia & Nonino, Fabio & Parida, Vinit & Pessot, Elena, 2021. "Literature review on digitalization capabilities: Co-citation analysis of antecedents, conceptualization and consequences," Technological Forecasting and Social Change, Elsevier, vol. 166(C).
    15. Shenghui Wang & Rob Koopman, 2017. "Clustering articles based on semantic similarity," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(2), pages 1017-1031, May.
    16. Shengbo Liu & Chaomei Chen, 2012. "The proximity of co-citation," Scientometrics, Springer;Akadémiai Kiadó, vol. 91(2), pages 495-511, May.
    17. Tahamtan, Iman & Bornmann, Lutz, 2018. "Core elements in the process of citing publications: Conceptual overview of the literature," Journal of Informetrics, Elsevier, vol. 12(1), pages 203-216.
    18. Pancheng Wang & Shasha Li & Haifang Zhou & Jintao Tang & Ting Wang, 2019. "Cited text spans identification with an improved balanced ensemble model," Scientometrics, Springer;Akadémiai Kiadó, vol. 120(3), pages 1111-1145, September.
    19. Mengyu Yu & Mazie Krehbiel & Samantha Thompson & Tatjana Miljkovic, 2020. "An exploration of gender gap using advanced data science tools: actuarial research community," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(2), pages 767-789, May.
    20. Bikun Chen & Dannan Deng & Zhouyan Zhong & Chengzhi Zhang, 2020. "Exploring linguistic characteristics of highly browsed and downloaded academic articles," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(3), pages 1769-1790, March.
