
Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus

Author

Listed:
  • Mehmet Ali Abdulhayoglu

    (KU Leuven)

  • Bart Thijs

    (KU Leuven)

Abstract

A novel hashing algorithm is applied to match two prominent bibliographic databases at the paper level. In the literature, such matching has been studied and conducted many times, but relying only on journal-level information because of the massive volume of indexed publications. With a paper-level match, missing or erroneous items in one source can be completed from the other, and the overlap can be measured more reliably. In this context, we focus on measuring the overlap between Clarivate Analytics Web of Science (WoS) and Elsevier’s Scopus at the paper level. Our focus is on detecting exact matches, that is, no false positives are tolerated. To this end, we follow a twofold matching procedure. First, a locality sensitive hashing algorithm, which provides fast approximate nearest neighbours and similarities, is applied to obtain WoS-Scopus pair suggestions. Second, for each suggested pair, different heuristics are applied to identify those pairs of records that indeed refer to the same publication. We observe that at least 74% of WoS publications are also indexed by Scopus. The percentage increases to 92% when only the cited publications are retained. The overlapping WoS records are also broken down by Institute for Scientific Information subject categories (SC). Of those, three large SCs with relatively low overlap ratios are chosen and examined in detail. Last but not least, it takes only about an hour to match 14.2 million versus 19.6 million publications from the publication years 2004–2013 in a high performance computing environment.
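
The twofold procedure described in the abstract (LSH to suggest pairs, then heuristics to confirm them) can be illustrated with a minimal sketch. The abstract does not state which LSH family or which heuristics the authors used, so everything below is an assumption: MinHash over character trigrams of titles with banding for the first step, and a hypothetical rule (same publication year and near-identical title) for the second. All function names, parameters and thresholds are illustrative only.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=3):
    """Character n-grams of a lower-cased, whitespace-normalized title."""
    s = " ".join(text.lower().split())
    return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(tokens, num_perm=64):
    """MinHash signature: the minimum hash value per seeded hash function."""
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(num_perm))


def _band_keys(title, num_perm, bands):
    """Split a title's signature into bands; each band is one bucket key."""
    rows = num_perm // bands
    sig = minhash_signature(shingles(title), num_perm)
    return [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]


def candidate_pairs(wos, scopus, num_perm=64, bands=16):
    """Step 1 (assumed): suggest pairs that share at least one LSH bucket."""
    buckets = defaultdict(list)
    for sid, rec in scopus.items():
        for key in _band_keys(rec["title"], num_perm, bands):
            buckets[key].append(sid)
    pairs = set()
    for wid, rec in wos.items():
        for key in _band_keys(rec["title"], num_perm, bands):
            for sid in buckets.get(key, []):
                pairs.add((wid, sid))
    return pairs


def verify(a, b, threshold=0.9):
    """Step 2 (hypothetical heuristic): same year and near-identical title."""
    if a["year"] != b["year"]:
        return False
    sa, sb = shingles(a["title"]), shingles(b["title"])
    return len(sa & sb) / len(sa | sb) >= threshold


def match(wos, scopus):
    """Run both steps and return the verified (WoS id, Scopus id) pairs."""
    return sorted(
        (w, s) for w, s in candidate_pairs(wos, scopus)
        if verify(wos[w], scopus[s])
    )


# Toy records (illustrative only, not data from the study).
wos = {"W1": {"title": "Overlap in bibliographic databases", "year": 2003}}
scopus = {
    "S1": {"title": "Overlap in Bibliographic  Databases", "year": 2003},
    "S2": {"title": "Informetric studies using databases", "year": 2008},
}
print(match(wos, scopus))
```

Banding trades recall for speed: more bands with fewer rows each suggest more pairs, and the verification step then carries the burden of rejecting false positives, which matters when none are tolerated.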

Suggested Citation

  • Mehmet Ali Abdulhayoglu & Bart Thijs, 2018. "Use of locality sensitive hashing (LSH) algorithm to match Web of Science and Scopus," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1229-1245, August.
  • Handle: RePEc:spr:scient:v:116:y:2018:i:2:d:10.1007_s11192-017-2569-6
    DOI: 10.1007/s11192-017-2569-6

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-017-2569-6
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Myke Gluck, 1990. "A review of journal coverage overlap with an extension to the definition of overlap," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 41(1), pages 43-60, January.
    2. Lokman I. Meho & Yvonne Rogers, 2008. "Citation counting, citation ranking, and h‐index of human‐computer interaction researchers: A comparison of Scopus and Web of Science," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 59(11), pages 1711-1726, September.
    3. Mehmet Ali Abdulhayoglu & Bart Thijs & Wouter Jeuris, 2016. "Using character n-grams to match a list of publications to references in bibliographic databases," Scientometrics, Springer;Akadémiai Kiadó, vol. 109(3), pages 1525-1546, December.
    4. William W. Hood & Concepción S. Wilson, 2003. "Overlap in bibliographic databases," Journal of the American Society for Information Science and Technology, Association for Information Science & Technology, vol. 54(12), pages 1091-1103, October.

    Citations



    Cited by:

    1. Nur Chasanah & Indra Gunawan & Bassam Baroudi, 2024. "International development project success: A literature review," Journal of International Development, John Wiley & Sons, Ltd., vol. 36(1), pages 146-171, January.
    2. Andrea Caputo & Mariya Kargina, 2022. "A user-friendly method to merge Scopus and Web of Science data during bibliometric analysis," Journal of Marketing Analytics, Palgrave Macmillan, vol. 10(1), pages 82-88, March.
    3. Guillaume Cabanac & Ingo Frommholz & Philipp Mayr, 2018. "Bibliometric-enhanced information retrieval: preface," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(2), pages 1225-1227, August.
    4. Matthew Harsh & Ravtosh Bal & Alex Weryha & Justin Whatley & Charles C. Onu & Lisa M. Negro, 2021. "Mapping computer science research in Africa: using academic networking sites for assessing research activity," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(1), pages 305-334, January.
    5. Sahar Mohamadi & Abbas Abbasi & Habib-Allah Ranaei Kordshouli & Kazem Askarifar, 2022. "Conceptualizing sustainable–responsible tourism indicators: an interpretive structural modeling approach," Environment, Development and Sustainability: A Multidisciplinary Approach to the Theory and Practice of Sustainable Development, Springer, vol. 24(1), pages 399-425, January.
    6. Junwen Zhu & Weishu Liu, 2020. "A tale of two databases: the use of Web of Science and Scopus in academic papers," Scientometrics, Springer;Akadémiai Kiadó, vol. 123(1), pages 321-335, April.
    7. Kristina Galjanić & Ivan Marović & Nikša Jajac, 2022. "Decision Support Systems for Managing Construction Projects: A Scientific Evolution Analysis," Sustainability, MDPI, vol. 14(9), pages 1-23, April.
    8. Tanja Mihalic & Sahar Mohamadi & Abbas Abbasi & Lóránt Dénes Dávid, 2021. "Mapping a Sustainable and Responsible Tourism Paradigm: A Bibliometric and Citation Network Analysis," Sustainability, MDPI, vol. 13(2), pages 1-22, January.
    9. Christian Thiele & Gerrit Hirschfeld & Ruth Brachel, 2021. "Clinical trial registries as Scientometric data: A novel solution for linking and deduplicating clinical trials from multiple registries," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(12), pages 9733-9750, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Amador Durán-Sánchez & María de la Cruz del Río-Rama & José Álvarez-García & Cristiana Oliveira, 2022. "Analysis of Worldwide Research on Craft Beer," SAGE Open, vol. 12(2), pages 21582440221, June.
    2. Amador Durán-Sánchez & José Álvarez-García & María de la Cruz del Río-Rama & Beatriz Rosado-Cebrián, 2019. "Science Mapping of the Knowledge Base on Tourism Innovation," Sustainability, MDPI, vol. 11(12), pages 1-17, June.
    3. Deming Lin & Tianhui Gong & Wenbin Liu & Martin Meyer, 2020. "An entropy-based measure for the evolution of h index research," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 2283-2298, December.
    4. De Andrés Fazio, Salvador & Urquía Grande, Elena & Pérez Estébanez, Raquel, 2022. "The “secret life” of the Statement of Cash Flow: A bibliometric analysis," Cuadernos de Gestión, Universidad del País Vasco - Instituto de Economía Aplicada a la Empresa (IEAE).
    5. García-Pérez, Miguel A., 2011. "Strange attractors in the Web of Science database," Journal of Informetrics, Elsevier, vol. 5(1), pages 214-218.
    6. Jakub Rybacki & Dobromił Serwa, 2021. "What Makes a Successful Scientist in a Central Bank? Evidence From the RePEc Database," Central European Journal of Economic Modelling and Econometrics, Central European Journal of Economic Modelling and Econometrics, vol. 13(3), pages 331-357, September.
    7. Mojtaba Ashour & Amir Mahdiyar & Syarmila Hany Haron, 2021. "A Comprehensive Review of Deterrents to the Practice of Sustainable Interior Architecture and Design," Sustainability, MDPI, vol. 13(18), pages 1-19, September.
    8. William W. Hood & Concepción S. Wilson, 2003. "Informetric studies using databases: Opportunities and challenges," Scientometrics, Springer;Akadémiai Kiadó, vol. 58(3), pages 587-608, November.
    9. Marek Gągolewski & Przemysław Grzegorzewski, 2009. "A geometric approach to the construction of scientific impact indices," Scientometrics, Springer;Akadémiai Kiadó, vol. 81(3), pages 617-634, December.
    10. José Álvarez-García & Claudia Patricia Maldonado-Erazo & María de la Cruz Del Río-Rama & Francisco Javier Castellano-Álvarez, 2019. "Cultural Heritage and Tourism Basis for Regional Development: Mapping of Scientific Coverage," Sustainability, MDPI, vol. 11(21), pages 1-21, October.
    11. Bar-Ilan, Judit, 2008. "Informetrics at the beginning of the 21st century—A review," Journal of Informetrics, Elsevier, vol. 2(1), pages 1-52.
    12. Gordana Budimir & Sophia Rahimeh & Sameh Tamimi & Primož Južnič, 2021. "Comparison of self-citation patterns in WoS and Scopus databases based on national scientific production in Slovenia (1996–2020)," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(3), pages 2249-2267, March.
    13. D. Checchi & S. Cicognani & N. Kulic, 2015. "Gender quotas or girls networks? Towards an understanding of recruitment in the research profession in Italy," Working Papers wp1047, Dipartimento Scienze Economiche, Universita' di Bologna.
    14. Shaher H. Zyoud & Ahed H. Zyoud, 2021. "Visualization and Mapping of Knowledge and Science Landscapes in Expert Systems With Applications Journal: A 30 Years’ Bibliometric Analysis," SAGE Open, vol. 11(2), pages 21582440211, June.
    15. Bornmann, Lutz & Marx, Werner & Schier, Hermann & Rahm, Erhard & Thor, Andreas & Daniel, Hans-Dieter, 2009. "Convergent validity of bibliometric Google Scholar data in the field of chemistry—Citation counts for papers that were accepted by Angewandte Chemie International Edition or rejected but published els," Journal of Informetrics, Elsevier, vol. 3(1), pages 27-35.
    16. Waltman, Ludo, 2016. "A review of the literature on citation impact indicators," Journal of Informetrics, Elsevier, vol. 10(2), pages 365-391.
    17. Claudia Patricia Maldonado-Erazo & José Álvarez-García & María de la Cruz del Río-Rama & Amador Durán-Sánchez, 2021. "Scientific Mapping on the Impact of Climate Change on Cultural and Natural Heritage: A Systematic Scientometric Analysis," Land, MDPI, vol. 10(1), pages 1-19, January.
    18. Carolin Michels & Jun-Ying Fu, 2014. "Systematic analysis of coverage and usage of conference proceedings in web of science," Scientometrics, Springer;Akadémiai Kiadó, vol. 100(2), pages 307-327, August.
    19. Meho, Lokman I., 2019. "Using Scopus’s CiteScore for assessing the quality of computer science conferences," Journal of Informetrics, Elsevier, vol. 13(1), pages 419-433.
    20. Miguel A. García-Pérez, 2013. "Limited validity of equations to predict the future h index," Scientometrics, Springer;Akadémiai Kiadó, vol. 96(3), pages 901-909, September.
