IDEAS home Printed from https://ideas.repec.org/a/spr/scient/v109y2016i3d10.1007_s11192-016-2066-3.html

Using character n-grams to match a list of publications to references in bibliographic databases

Author

Listed:
  • Mehmet Ali Abdulhayoglu

    (KU Leuven)

  • Bart Thijs

    (KU Leuven)

  • Wouter Jeuris

    (KU Leuven)

Abstract

For research evaluation, publication lists need to be matched to entries in large bibliographic databases, such as Thomson Reuters Web of Science. This matching process is often done manually, making it very time-consuming. This paper presents the use of character n-grams as an automated indicator to inform and ease the manual matching process. The similarity of two references was identified by calculating Salton’s cosine for their common character n-grams. As a complementary and confirmatory measure, Kondrak’s Levenshtein distance score, based on character n-grams, was used to re-measure the similarity of the top matches resulting from Salton’s cosine. These automated matches were compared with the results of completely manual matching. Incorrect matches were examined in depth and possible solutions were suggested. The method was applied to two independent datasets to validate the results and the inferences drawn. For both datasets, Salton’s score based on character n-grams proves to be a useful indicator for distinguishing between correct and incorrect matches. The suggested method is compared with a baseline based on word unigrams. The accuracies of the character-based and word-based systems are 96.0% and 94.7%, respectively. Despite the small difference in accuracy, we observed that the character-based system provides more correct matches when the data contain abbreviations, mathematical expressions or erroneous text.
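The core idea in the abstract, scoring two reference strings by Salton's cosine over their character n-gram counts and comparing that with a word-unigram baseline, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the n-gram length (n = 3), the lowercasing and whitespace normalisation, the function names and the sample strings are all assumptions made for illustration, and Kondrak's n-gram Levenshtein re-scoring step is not covered.

```python
import math
import re
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (n = 3 is an assumed default)."""
    # Assumed preprocessing: lowercase and collapse runs of whitespace.
    cleaned = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def word_unigrams(text: str) -> Counter:
    """Count lowercased word tokens (the word-based baseline representation)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Salton's cosine between two count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def best_match(query: str, candidates: list[str], n: int = 3) -> tuple[float, str]:
    """Return (score, candidate) for the candidate most similar to the query.

    Hypothetical helper: expects a non-empty candidate list.
    """
    q = char_ngrams(query, n)
    return max((cosine(q, char_ngrams(c, n)), c) for c in candidates)


if __name__ == "__main__":
    # Illustrative strings only: an abbreviated and a full journal name.
    short = "J. Am. Soc. Inf. Sci."
    full = "Journal of the American Society for Information Science"
    print("character trigrams:", cosine(char_ngrams(short), char_ngrams(full)))
    print("word unigrams:     ", cosine(word_unigrams(short), word_unigrams(full)))
```

In this example the abbreviated and full journal names share no whole words, so the word-unigram cosine is zero, while the character trigrams still overlap (for instance "soc", "inf" and "sci"). That is consistent with the abstract's observation that the character-based system copes better with abbreviations.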

Suggested Citation

  • Mehmet Ali Abdulhayoglu & Bart Thijs & Wouter Jeuris, 2016. "Using character n-grams to match a list of publications to references in bibliographic databases," Scientometrics, Springer;Akadémiai Kiadó, vol. 109(3), pages 1525-1546, December.
  • Handle: RePEc:spr:scient:v:109:y:2016:i:3:d:10.1007_s11192-016-2066-3
    DOI: 10.1007/s11192-016-2066-3

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-016-2066-3
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s11192-016-2066-3?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    References listed on IDEAS

    1. Jonathan D. Cohen, 1995. "Highlights: Language‐ and domain‐independent automatic indexing terms for abstracting," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 46(3), pages 162-174, April.

    Citations



    Cited by:

    1. Mehmet Ali Abdulhayoglu & Bart Thijs, 2018. "Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1229-1245, August.
    2. Mehmet Ali Abdulhayoglu & Bart Thijs, 2017. "Use of ResearchGate and Google CSE for author name disambiguation," Scientometrics, Springer;Akadémiai Kiadó, vol. 111(3), pages 1965-1985, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Leo Egghe, 2000. "The Distribution of N-Grams," Scientometrics, Springer;Akadémiai Kiadó, vol. 47(2), pages 237-252, February.
