PatCID: an open-access dataset of chemical structures in patent documents

Authors

  • Lucas Morin (IBM Research; ETH Zürich)
  • Valéry Weber (IBM Research)
  • Gerhard Ingmar Meijer (IBM Research)
  • Fisher Yu (ETH Zürich)
  • Peter W. J. Staar (IBM Research)

Abstract

The automatic analysis of patent publications has the potential to accelerate research across various domains, including drug discovery and materials science. Within patent documents, crucial information often resides in visual depictions of molecular structures. PatCID (Patent-extracted Chemical-structure Images database for Discovery) provides access to this information at scale, enabling users to search which molecules are displayed in which documents. PatCID contains 81M chemical-structure images and 14M unique chemical structures. Here, we compare PatCID with state-of-the-art chemical patent databases. On a random set, PatCID retrieves 56.0% of molecules, which is higher than the automatically created databases Google Patents (41.5%) and SureChEMBL (23.5%), as well as the manually created databases Reaxys (53.5%) and SciFinder (49.5%). By leveraging state-of-the-art document-understanding methods, PatCID's high-quality data outperforms currently available automatically generated patent databases and even competes with proprietary, manually created ones. This enables promising applications in automatic literature review and learning-based molecular generation. The dataset is freely accessible for download.

Suggested Citation

  • Lucas Morin & Valéry Weber & Gerhard Ingmar Meijer & Fisher Yu & Peter W. J. Staar, 2024. "PatCID: an open-access dataset of chemical structures in patent documents," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
  • Handle: RePEc:nat:natcom:v:15:y:2024:i:1:d:10.1038_s41467-024-50779-y
    DOI: 10.1038/s41467-024-50779-y

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-024-50779-y
    File Function: Abstract
    Download Restriction: no


    References listed on IDEAS

    1. Kohulan Rajan & Henning Otto Brinkhaus & M. Isabel Agea & Achim Zielesny & Christoph Steinbeck, 2023. "DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications," Nature Communications, Nature, vol. 14(1), pages 1-18, December.
    2. Michael Park & Erin Leahey & Russell J. Funk, 2023. "Papers and patents are becoming less disruptive over time," Nature, Nature, vol. 613(7942), pages 138-144, January.
    3. Kohulan Rajan & Henning Otto Brinkhaus & M. Isabel Agea & Achim Zielesny & Christoph Steinbeck, 2023. "Author Correction: DECIMER.ai: an open platform for automated optical chemical structure identification, segmentation and recognition in scientific publications," Nature Communications, Nature, vol. 14(1), pages 1-1, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Orhan, Mehmet A. & van Rossenberg, Yvonne & Bal, P. Matthijs, 2024. "Authorship inequality and elite dominance in management and organizational research: A review of six decades," OSF Preprints tzx92, Center for Open Science.
    2. Sam Arts & Nicola Melluso & Reinhilde Veugelers, 2023. "Beyond Citations: Measuring Novel Scientific Ideas and their Impact in Publication Text," Papers 2309.16437, arXiv.org, revised Oct 2024.
    3. André Luis Araujo da Fonseca & Paula Castro Pires de Souza Chimenti & Maribel Carvalho Suarez, 2023. "Using deep learning language models as scaffolding tools in interpretive research," RAC - Revista de Administração Contemporânea (Journal of Contemporary Administration), ANPAD - Associação Nacional de Pós-Graduação e Pesquisa em Administração, vol. 27(Vol. 27 N), pages 230021-2300.
    4. Yang, Alex Jie & Wu, Linwei & Zhang, Qi & Wang, Hao & Deng, Sanhong, 2023. "The k-step h-index in citation networks at the paper, author, and institution levels," Journal of Informetrics, Elsevier, vol. 17(4).
    5. Yining Wang & Qiang Wu & Liangyu Li, 2024. "Examining the influence of women scientists on scientific impact and novelty: insights from top business journals," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(6), pages 3517-3542, June.
    6. Wang, Cheng-Jun & Yan, Lihan & Cui, Haochuan, 2023. "Unpacking the essential tension of knowledge recombination: Analyzing the impact of knowledge spanning on citation impact and disruptive innovation," Journal of Informetrics, Elsevier, vol. 17(4).
    7. Zhang, Ming-Ze & Wang, Tang-Rong & Lyu, Peng-Hui & Chen, Qi-Mei & Li, Ze-Xia & Ngai, Eric W.T., 2024. "Impact of gender composition of academic teams on disruptive output," Journal of Informetrics, Elsevier, vol. 18(2).
    8. Jeffrey T. Macher & Christian Rutzer & Rolf Weder, 2023. "The Illusive Slump of Disruptive Patents," Papers 2306.10774, arXiv.org.
    9. Rosalie L. Tung & Gary Knight & Pervez Ghauri & Shameen Prashantham & Tony Fang, 2023. "Disruptive knowledge in international business research: A pipe dream or attainable target?," Journal of International Business Studies, Palgrave Macmillan;Academy of International Business, vol. 54(9), pages 1589-1598, December.
    10. Howell, Bronwyn E. & Potgieter, Petrus H., 2023. "AI-generated lemons: a sour outlook for content producers?," 32nd European Regional ITS Conference, Madrid 2023: Realising the digital decade in the European Union – Easier said than done? 277971, International Telecommunications Society (ITS).
    11. Narayanamurti, Venkatesh & Tsao, Jeffrey Y., 2024. "How technoscientific knowledge advances: A Bell-Labs-inspired architecture," Research Policy, Elsevier, vol. 53(4).
    12. Morrow, Nathan & Borrell, James S. & Mock, Nancy B. & Büchi, Lucie & Gatto, Andrea & Lulekal, Ermias, 2023. "Measure of indigenous perennial staple crop, Ensete ventricosum, associated with positive food security outcomes in southern Ethiopian highlands," Food Policy, Elsevier, vol. 117(C).
    13. Naudé, Wim, 2024. "What They Don't Teach You about Artificial Intelligence at Business School: Stagnation, Oil, and War," IZA Discussion Papers 17306, Institute of Labor Economics (IZA).
    14. Besancenot, Damien & Vranceanu, Radu, 2024. "Reluctance to pursue breakthrough research: A signaling explanation," Research Policy, Elsevier, vol. 53(4).
    15. Houqiang Yu & Yian Liang & Yinghua Xie, 2024. "Predicting Scientific Breakthroughs Based on Structural Dynamic of Citation Cascades," Mathematics, MDPI, vol. 12(11), pages 1-18, June.
    16. Marta Pacheco & Patrícia Moura & Carla Silva, 2023. "A Systematic Review of Syngas Bioconversion to Value-Added Products from 2012 to 2022," Energies, MDPI, vol. 16(7), pages 1-24, April.
    17. Naudé, Wim, 2023. "Melancholy Hues: The Futility of Green Growth and Degrowth, and the Inevitability of Societal Collapse," IZA Discussion Papers 16139, Institute of Labor Economics (IZA).
    18. Yuefen Wang & Lipeng Fan & Lei Wu, 2024. "A validation test of the Uzzi et al. novelty measure of innovation and applications to collaboration patterns between institutions," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(7), pages 4379-4394, July.
    19. Mueller, Elisabeth & Boeing, Philipp, 2024. "Global influence of inventions and technology sovereignty," ZEW Discussion Papers 24-024, ZEW - Leibniz Centre for European Economic Research, revised 2024.
    20. Guderian, Carsten C. & Posth, Jan-Alexander & Grob, Linus, 2023. "Investment decisions and passive portfolio construction utilizing patent analytics: A multi-case study on COVID-19 treatment technologies," The Quarterly Review of Economics and Finance, Elsevier, vol. 92(C), pages 66-87.

