Authors
- Vahe Tshitoyan (Lawrence Berkeley National Laboratory; Google LLC)
- John Dagdelen (Lawrence Berkeley National Laboratory; University of California)
- Leigh Weston (Lawrence Berkeley National Laboratory)
- Alexander Dunn (Lawrence Berkeley National Laboratory; University of California)
- Ziqin Rong (Lawrence Berkeley National Laboratory)
- Olga Kononova (University of California)
- Kristin A. Persson (Lawrence Berkeley National Laboratory; University of California)
- Gerbrand Ceder (Lawrence Berkeley National Laboratory; University of California)
- Anubhav Jain (Lawrence Berkeley National Laboratory)
Abstract
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases [1,2], which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing [3–10], which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings [11–13] (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure–property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
Suggested Citation
Vahe Tshitoyan & John Dagdelen & Leigh Weston & Alexander Dunn & Ziqin Rong & Olga Kononova & Kristin A. Persson & Gerbrand Ceder & Anubhav Jain, 2019.
"Unsupervised word embeddings capture latent knowledge from materials science literature,"
Nature, vol. 571(7763), pages 95-98, July.
Handle: RePEc:nat:nature:v:571:y:2019:i:7763:d:10.1038_s41586-019-1335-8
DOI: 10.1038/s41586-019-1335-8