The Automatic Detection of Dataset Names in Scientific Articles

Author

Listed:
  • Jenny Heddes

    (Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 908, 1098 XH Amsterdam, The Netherlands)

  • Pim Meerdink

    (Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 908, 1098 XH Amsterdam, The Netherlands)

  • Miguel Pieters

    (Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 908, 1098 XH Amsterdam, The Netherlands)

  • Maarten Marx

    (Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 908, 1098 XH Amsterdam, The Netherlands)

Abstract

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Because the available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, roughly half of which contain one or more named datasets. A distinguishing feature of this set is the large number of sentences that use enumerations, conjunctions and ellipses, which result in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well, and better than several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the (open-access) articles, together with the annotation guidelines and all code used in the experiments, is available on GitHub.
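
To make the "BI+ tag sequences" mentioned in the abstract concrete, the following is a minimal sketch in Python, not the authors' code. It assumes the usual BIO scheme (B opens a mention, I continues it, O is outside any mention) and assumes that an elliptical enumeration is annotated as a single span. The sentence, its tags and the helper name bio_spans are hypothetical and are not taken from the published dataset.

```python
import re

def bio_spans(tags):
    """Return (start, end) token index pairs for every B I* run in a BIO tag list."""
    # Encode each tag by its first letter so a regex can scan the sequence.
    encoded = "".join(t[0] for t in tags)
    return [(m.start(), m.end()) for m in re.finditer(r"BI*", encoded)]

# Hypothetical sentence with an ellipsis: "-100" stands for "CIFAR-100".
# Assumption: the whole enumeration is tagged as one mention.
tokens = ["We", "evaluate", "on", "CIFAR-10", "and", "-100", "."]
tags = ["O", "O", "O", "B", "I", "I", "O"]

spans = bio_spans(tags)
mentions = [" ".join(tokens[start:end]) for start, end in spans]
longest = max(end - start for start, end in spans)

print(mentions)  # the recovered mention strings
print(longest)   # length, in tokens, of the longest B I* run
```

Under these assumptions a single enumeration already yields a three-token run, and longer lists of dataset names would produce correspondingly longer runs.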

Suggested Citation

  • Jenny Heddes & Pim Meerdink & Miguel Pieters & Maarten Marx, 2021. "The Automatic Detection of Dataset Names in Scientific Articles," Data, MDPI, vol. 6(8), pages 1-19, August.
  • Handle: RePEc:gam:jdataj:v:6:y:2021:i:8:p:84-:d:608256
    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/6/8/84/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/6/8/84/
    Download Restriction: no
