
PatCID: an open-access dataset of chemical structures in patent documents

Author

Listed:
  • Lucas Morin

    (IBM Research; ETH Zürich)

  • Valéry Weber

    (IBM Research)

  • Gerhard Ingmar Meijer

    (IBM Research)

  • Fisher Yu

    (ETH Zürich)

  • Peter W. J. Staar

    (IBM Research)

Abstract

The automatic analysis of patent publications has the potential to accelerate research across various domains, including drug discovery and materials science. Within patent documents, crucial information often resides in visual depictions of molecule structures. PatCID (Patent-extracted Chemical-structure Images database for Discovery) provides access to this information at scale. It enables users to search for which molecules are displayed in which documents. PatCID contains 81M chemical-structure images and 14M unique chemical structures. Here, we compare PatCID with state-of-the-art chemical patent databases. On a random set, PatCID retrieves 56.0% of molecules, which is higher than the automatically created databases Google Patents (41.5%) and SureChEMBL (23.5%), as well as the manually created databases Reaxys (53.5%) and SciFinder (49.5%). By leveraging state-of-the-art document-understanding methods, PatCID's high-quality data outperforms currently available automatically generated patent databases. PatCID even competes with proprietary, manually created patent databases. This enables promising applications in automatic literature review and learning-based molecular generation. The dataset is freely accessible for download.

Suggested Citation

  • Lucas Morin & Valéry Weber & Gerhard Ingmar Meijer & Fisher Yu & Peter W. J. Staar, 2024. "PatCID: an open-access dataset of chemical structures in patent documents," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
  • Handle: RePEc:nat:natcom:v:15:y:2024:i:1:d:10.1038_s41467-024-50779-y
    DOI: 10.1038/s41467-024-50779-y

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-024-50779-y
    File Function: Abstract
    Download Restriction: no



