
Exploiting redundancy in large materials datasets for efficient machine learning with less data

Author

Listed:
  • Kangming Li

    (University of Toronto)

  • Daniel Persaud

    (University of Toronto)

  • Kamal Choudhary

    (National Institute of Standards and Technology)

  • Brian DeCost

    (National Institute of Standards and Technology)

  • Michael Greenwood

    (Natural Resources Canada)

  • Jason Hattrick-Simpers

    (University of Toronto
    Vector Institute for Artificial Intelligence
    Schwartz Reisman Institute for Technology and Society)

Abstract

Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95% of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-distribution samples. In addition, we show that uncertainty-based active learning algorithms can construct much smaller but equally informative datasets. We discuss the effectiveness of informative data in improving prediction performance and robustness and provide insights into efficient data acquisition and machine learning training. This work challenges the “bigger is better” mentality and calls for attention to the information richness of materials data rather than a narrow emphasis on data volume.
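
To make the idea of uncertainty-based data selection concrete, here is a minimal sketch in Python. It is not the authors' implementation: the record above does not describe their models or sampling strategy. The sketch assumes a bootstrap committee of 3-nearest-neighbour regressors as a stand-in uncertainty estimate and a synthetic one-dimensional pool in which 90% of the entries crowd into a narrow input range, mimicking over-represented material types. All names (knn_predict, committee_variance, active_learning_subset, make_pool) and all parameter values are hypothetical.

```python
import math
import random


def knn_predict(train, x, k=3):
    """Average the targets of the k labeled points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def committee_variance(labeled, x, rng, n_models=8):
    """Variance of predictions from models fit on bootstrap resamples."""
    preds = []
    for _ in range(n_models):
        # Resample the labeled set with replacement (assumed uncertainty proxy).
        sample = [rng.choice(labeled) for _ in labeled]
        preds.append(knn_predict(sample, x))
    mean = sum(preds) / len(preds)
    return sum((p - mean) ** 2 for p in preds) / len(preds)


def active_learning_subset(pool, n_init=5, n_queries=20, seed=0):
    """Return indices of a small training subset chosen by uncertainty."""
    rng = random.Random(seed)
    selected = rng.sample(range(len(pool)), n_init)
    for _ in range(n_queries):
        labeled = [pool[i] for i in selected]
        chosen = set(selected)
        candidates = [i for i in range(len(pool)) if i not in chosen]
        # Query the pool entry on which the committee disagrees the most.
        best = max(
            candidates,
            key=lambda i: committee_variance(labeled, pool[i][0], rng),
        )
        selected.append(best)
    return selected


def make_pool(n_dense=180, n_sparse=20, seed=1):
    """Synthetic pool: most inputs fall in [0, 0.2], the rest in [0.2, 1]."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 0.2) for _ in range(n_dense)]
    xs += [rng.uniform(0.2, 1.0) for _ in range(n_sparse)]
    return [(x, math.sin(2 * math.pi * x)) for x in xs]


pool = make_pool()
subset = active_learning_subset(pool)
print(len(subset), "of", len(pool), "pool entries selected")
```

The loop labels one entry per round, so the final subset holds n_init + n_queries entries regardless of pool size. Whether such a subset matches the full pool in accuracy depends on the model and data; the sketch makes no claim about that.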

Suggested Citation

  • Kangming Li & Daniel Persaud & Kamal Choudhary & Brian DeCost & Michael Greenwood & Jason Hattrick-Simpers, 2023. "Exploiting redundancy in large materials datasets for efficient machine learning with less data," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
  • Handle: RePEc:nat:natcom:v:14:y:2023:i:1:d:10.1038_s41467-023-42992-y
    DOI: 10.1038/s41467-023-42992-y

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-023-42992-y
    File Function: Abstract
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Zhilong Song & Linfeng Fan & Shuaihua Lu & Chongyi Ling & Qionghua Zhou & Jinlan Wang, 2025. "Inverse design of promising electrocatalysts for CO2 reduction via generative models and bird swarm algorithm," Nature Communications, Nature, vol. 16(1), pages 1-10, December.
    2. Manu Suvarna & Tangsheng Zou & Sok Ho Chong & Yuzhen Ge & Antonio J. Martín & Javier Pérez-Ramírez, 2024. "Active learning streamlines development of high performance catalysts for higher alcohol synthesis," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    3. Zhiyuan Han & An Chen & Zejian Li & Mengtian Zhang & Zhilong Wang & Lixue Yang & Runhua Gao & Yeyang Jia & Guanjun Ji & Zhoujie Lao & Xiao Xiao & Kehao Tao & Jing Gao & Wei Lv & Tianshuai Wang & Jinji, 2024. "Machine learning-based design of electrocatalytic materials towards high-energy lithium||sulfur batteries development," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    4. Kihoon Bang & Doosun Hong & Youngtae Park & Donghun Kim & Sang Soo Han & Hyuck Mo Lee, 2023. "Machine learning-enabled exploration of the electrochemical stability of real-scale metallic nanoparticles," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    5. Han Li & Ruotian Zhang & Yaosen Min & Dacheng Ma & Dan Zhao & Jianyang Zeng, 2023. "A knowledge-guided pre-training framework for improving molecular representation learning," Nature Communications, Nature, vol. 14(1), pages 1-13, December.
    6. Li, Yi & Liu, Kailong & Foley, Aoife M. & Zülke, Alana & Berecibar, Maitane & Nanini-Maury, Elise & Van Mierlo, Joeri & Hoster, Harry E., 2019. "Data-driven health estimation and lifetime prediction of lithium-ion batteries: A review," Renewable and Sustainable Energy Reviews, Elsevier, vol. 113(C), pages 1-1.
    7. Sarmad Dashti Latif & Ali Najah Ahmed, 2023. "A review of deep learning and machine learning techniques for hydrological inflow forecasting," Environment, Development and Sustainability: A Multidisciplinary Approach to the Theory and Practice of Sustainable Development, Springer, vol. 25(11), pages 12189-12216, November.
    8. Keke Song & Rui Zhao & Jiahui Liu & Yanzhou Wang & Eric Lindgren & Yong Wang & Shunda Chen & Ke Xu & Ting Liang & Penghua Ying & Nan Xu & Zhiqiang Zhao & Jiuyang Shi & Junjie Wang & Shuang Lyu & Zezhu, 2024. "General-purpose machine-learned potential for 16 elemental metals and their alloys," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    9. Xinyu Chen & Shuaihua Lu & Qian Chen & Qionghua Zhou & Jinlan Wang, 2024. "From bulk effective mass to 2D carrier mobility accurate prediction via adversarial transfer learning," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    10. Niklas W. A. Gebauer & Michael Gastegger & Stefaan S. P. Hessmann & Klaus-Robert Müller & Kristof T. Schütt, 2022. "Inverse design of 3d molecular structures with conditional generative neural networks," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    11. Yong Zhang & Feifei Chen & Xinyi Yang & Yiran Guo & Xinghua Zhang & Hong Dong & Weihua Wang & Feng Lu & Zunming Lu & Hui Liu & Hui Liu & Yao Xiao & Yahui Cheng, 2025. "Electronic metal-support interaction modulates Cu electronic structures for CO2 electroreduction to desired products," Nature Communications, Nature, vol. 16(1), pages 1-12, December.
    12. Gang Wang & Shinya Mine & Duotian Chen & Yuan Jing & Kah Wei Ting & Taichi Yamaguchi & Motoshi Takao & Zen Maeno & Ichigaku Takigawa & Koichi Matsushita & Ken-ichi Shimizu & Takashi Toyao, 2023. "Accelerated discovery of multi-elemental reverse water-gas shift catalysts using extrapolative machine learning approach," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    13. Jikai Sun & Rui Tu & Yuchun Xu & Hongyan Yang & Tie Yu & Dong Zhai & Xiuqin Ci & Weiqiao Deng, 2024. "Machine learning aided design of single-atom alloy catalysts for methane cracking," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    14. Huziel E. Sauceda & Luis E. Gálvez-González & Stefan Chmiela & Lauro Oliver Paz-Borbón & Klaus-Robert Müller & Alexandre Tkatchenko, 2022. "BIGDML—Towards accurate quantum machine learning force fields for materials," Nature Communications, Nature, vol. 13(1), pages 1-16, December.
    15. Sukriti Manna & Troy D. Loeffler & Rohit Batra & Suvo Banik & Henry Chan & Bilvin Varughese & Kiran Sasikumar & Michael Sternberg & Tom Peterka & Mathew J. Cherukara & Stephen K. Gray & Bobby G. Sumpt, 2022. "Learning in continuous action space for developing high dimensional potential energy models," Nature Communications, Nature, vol. 13(1), pages 1-10, December.
    16. Ribeiro, Haroldo V. & Lopes, Diego D. & Pessa, Arthur A.B. & Martins, Alvaro F. & da Cunha, Bruno R. & Gonçalves, Sebastián & Lenzi, Ervin K. & Hanley, Quentin S. & Perc, Matjaž, 2023. "Deep learning criminal networks," Chaos, Solitons & Fractals, Elsevier, vol. 172(C).
    17. Luis M. Antunes & Keith T. Butler & Ricardo Grau-Crespo, 2024. "Crystal structure generation with autoregressive large language modeling," Nature Communications, Nature, vol. 15(1), pages 1-16, December.
    18. Zhiheng Li & Xin Mao & Desheng Feng & Mengran Li & Xiaoyong Xu & Yadan Luo & Linzhou Zhuang & Rijia Lin & Tianjiu Zhu & Fengli Liang & Zi Huang & Dong Liu & Zifeng Yan & Aijun Du & Zongping Shao & Zho, 2024. "Prediction of perovskite oxygen vacancies for oxygen electrocatalysis at different temperatures," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    19. Xiaojie She & Lingling Zhai & Yifei Wang & Pei Xiong & Molly Meng-Jung Li & Tai-Sing Wu & Man Chung Wong & Xuyun Guo & Zhihang Xu & Huaming Li & Hui Xu & Ye Zhu & Shik Chi Edman Tsang & Shu Ping Lau, 2024. "Pure-water-fed, electrocatalytic CO2 reduction to ethylene beyond 1,000 h stability at 10 A," Nature Energy, Nature, vol. 9(1), pages 81-91, January.
    20. Jing Xue & Xue Dong & Chunxiao Liu & Jiawei Li & Yizhou Dai & Weiqing Xue & Laihao Luo & Yuan Ji & Xiao Zhang & Xu Li & Qiu Jiang & Tingting Zheng & Jianping Xiao & Chuan Xia, 2024. "Turning copper into an efficient and stable CO evolution catalyst beyond noble metals," Nature Communications, Nature, vol. 15(1), pages 1-11, December.

