IDEAS home Printed from https://ideas.repec.org/a/nat/natcom/v14y2023i1d10.1038_s41467-023-42992-y.html

Exploiting redundancy in large materials datasets for efficient machine learning with less data

Author

Listed:
  • Kangming Li

    (University of Toronto)

  • Daniel Persaud

    (University of Toronto)

  • Kamal Choudhary

    (National Institute of Standards and Technology)

  • Brian DeCost

    (National Institute of Standards and Technology)

  • Michael Greenwood

    (Natural Resources Canada)

  • Jason Hattrick-Simpers

    (University of Toronto
    Vector Institute for Artificial Intelligence
    Schwartz Reisman Institute for Technology and Society)

Abstract

Extensive efforts to gather materials data have largely overlooked potential data redundancy. In this study, we present evidence of a significant degree of redundancy across multiple large datasets for various material properties, by revealing that up to 95% of data can be safely removed from machine learning training with little impact on in-distribution prediction performance. The redundant data is related to over-represented material types and does not mitigate the severe performance degradation on out-of-distribution samples. In addition, we show that uncertainty-based active learning algorithms can construct much smaller but equally informative datasets. We discuss the effectiveness of informative data in improving prediction performance and robustness and provide insights into efficient data acquisition and machine learning training. This work challenges the “bigger is better” mentality and calls for attention to the information richness of materials data rather than a narrow emphasis on data volume.
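The abstract's idea of uncertainty-based active learning can be illustrated with a minimal sketch. This is not the authors' implementation: the toy one-dimensional data, the bootstrap ensemble of nearest-neighbour regressors, and all function names (`make_toy_pool`, `knn_predict`, `ensemble_uncertainty`, `active_learning_select`) are assumptions made here for illustration. The loop repeatedly adds the pool sample on which the ensemble disagrees most.

```python
import math
import random
import statistics


def make_toy_pool():
    """Hypothetical toy pool: 95 points crowded into [0, 1), 5 points spread over [2, 10]."""
    dense = [(i / 95, math.sin(i / 95)) for i in range(95)]
    sparse = [(2.0 * j, math.sin(2.0 * j)) for j in range(1, 6)]
    return dense + sparse


def knn_predict(train, x, k=1):
    """Predict y at x as the mean target of the k nearest training points."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return statistics.fmean([y for _, y in nearest])


def ensemble_uncertainty(train, x, rng, n_members=10):
    """Standard deviation of predictions from models fit on bootstrap resamples."""
    preds = []
    for _ in range(n_members):
        boot = [rng.choice(train) for _ in train]  # resample with replacement
        preds.append(knn_predict(boot, x))
    return statistics.pstdev(preds)


def active_learning_select(pool, n_initial, n_total, seed=0):
    """Grow a training set greedily by adding the most uncertain pool point."""
    rng = random.Random(seed)
    pool = list(pool)
    rng.shuffle(pool)
    selected, remaining = pool[:n_initial], pool[n_initial:]
    while len(selected) < n_total and remaining:
        scores = [ensemble_uncertainty(selected, x, rng) for x, _ in remaining]
        best = max(range(len(remaining)), key=scores.__getitem__)
        selected.append(remaining.pop(best))
    return selected


if __name__ == "__main__":
    subset = active_learning_select(make_toy_pool(), n_initial=5, n_total=20)
    print(f"kept {len(subset)} of 100 samples")
```

In this sketch the ensemble disagreement stands in for whatever uncertainty estimate a real model provides; the selection loop itself does not depend on that choice.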

Suggested Citation

  • Kangming Li & Daniel Persaud & Kamal Choudhary & Brian DeCost & Michael Greenwood & Jason Hattrick-Simpers, 2023. "Exploiting redundancy in large materials datasets for efficient machine learning with less data," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
  • Handle: RePEc:nat:natcom:v:14:y:2023:i:1:d:10.1038_s41467-023-42992-y
    DOI: 10.1038/s41467-023-42992-y

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-023-42992-y
    File Function: Abstract
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Kihoon Bang & Doosun Hong & Youngtae Park & Donghun Kim & Sang Soo Han & Hyuck Mo Lee, 2023. "Machine learning-enabled exploration of the electrochemical stability of real-scale metallic nanoparticles," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    2. Han Li & Ruotian Zhang & Yaosen Min & Dacheng Ma & Dan Zhao & Jianyang Zeng, 2023. "A knowledge-guided pre-training framework for improving molecular representation learning," Nature Communications, Nature, vol. 14(1), pages 1-13, December.
    3. Cheng Du & Joel P. Mills & Asfaw G. Yohannes & Wei Wei & Lei Wang & Siyan Lu & Jian-Xiang Lian & Maoyu Wang & Tao Guo & Xiyang Wang & Hua Zhou & Cheng-Jun Sun & John Z. Wen & Brian Kendall & Martin Co, 2023. "Cascade electrocatalysis via AgCu single-atom alloy and Ag nanoparticles in CO2 electroreduction toward multicarbon products," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
    4. Tian Xie & Arthur France-Lanord & Yanming Wang & Jeffrey Lopez & Michael A. Stolberg & Megan Hill & Graham Michael Leverick & Rafael Gomez-Bombarelli & Jeremiah A. Johnson & Yang Shao-Horn & Jeffrey C, 2022. "Accelerating amorphous polymer electrolyte screening by learning to reduce errors in molecular dynamics simulated properties," Nature Communications, Nature, vol. 13(1), pages 1-10, December.
    5. Li, Yi & Liu, Kailong & Foley, Aoife M. & Zülke, Alana & Berecibar, Maitane & Nanini-Maury, Elise & Van Mierlo, Joeri & Hoster, Harry E., 2019. "Data-driven health estimation and lifetime prediction of lithium-ion batteries: A review," Renewable and Sustainable Energy Reviews, Elsevier, vol. 113(C), pages 1-1.
    6. O. V. Mythreyi & M. Rohith Srinivaas & Tigga Amit Kumar & R. Jayaganthan, 2021. "Machine-Learning-Based Prediction of Corrosion Behavior in Additively Manufactured Inconel 718," Data, MDPI, vol. 6(8), pages 1-16, July.
    7. Sarmad Dashti Latif & Ali Najah Ahmed, 2023. "A review of deep learning and machine learning techniques for hydrological inflow forecasting," Environment, Development and Sustainability: A Multidisciplinary Approach to the Theory and Practice of Sustainable Development, Springer, vol. 25(11), pages 12189-12216, November.
    8. Andreas Erlebach & Martin Šípka & Indranil Saha & Petr Nachtigall & Christopher J. Heard & Lukáš Grajciar, 2024. "A reactive neural network framework for water-loaded acidic zeolites," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    9. Xiaoyun Lin & Xiaowei Du & Shican Wu & Shiyu Zhen & Wei Liu & Chunlei Pei & Peng Zhang & Zhi-Jian Zhao & Jinlong Gong, 2024. "Machine learning-assisted dual-atom sites design with interpretable descriptors unifying electrocatalytic reactions," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    10. Snehi Shrestha & Kieran James Barvenik & Tianle Chen & Haochen Yang & Yang Li & Meera Muthachi Kesavan & Joshua M. Little & Hayden C. Whitley & Zi Teng & Yaguang Luo & Eleonora Tubaldi & Po-Yen Chen, 2024. "Machine intelligence accelerated design of conductive MXene aerogels with programmable properties," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    11. Xinyu Chen & Shuaihua Lu & Qian Chen & Qionghua Zhou & Jinlan Wang, 2024. "From bulk effective mass to 2D carrier mobility accurate prediction via adversarial transfer learning," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    12. SJ, Balaji & Babu, Suresh Chandra & Pal, Suresh, 2021. "Understanding Science and Policy Making in Agriculture: A Machine Learning Application for India," 2021 Conference, August 17-31, 2021, Virtual 315227, International Association of Agricultural Economists.
    13. Niklas W. A. Gebauer & Michael Gastegger & Stefaan S. P. Hessmann & Klaus-Robert Müller & Kristof T. Schütt, 2022. "Inverse design of 3d molecular structures with conditional generative neural networks," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    14. Bo Peng & Ye Wei & Yu Qin & Jiabao Dai & Yue Li & Aobo Liu & Yun Tian & Liuliu Han & Yufeng Zheng & Peng Wen, 2023. "Machine learning-enabled constrained multi-objective design of architected materials," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    15. Hefei Li & Pengfei Wei & Tianfu Liu & Mingrun Li & Chao Wang & Rongtan Li & Jinyu Ye & Zhi-You Zhou & Shi-Gang Sun & Qiang Fu & Dunfeng Gao & Guoxiong Wang & Xinhe Bao, 2024. "CO electrolysis to multicarbon products over grain boundary-rich Cu nanoparticles in membrane electrode assembly electrolyzers," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
    16. Gang Wang & Shinya Mine & Duotian Chen & Yuan Jing & Kah Wei Ting & Taichi Yamaguchi & Motoshi Takao & Zen Maeno & Ichigaku Takigawa & Koichi Matsushita & Ken-ichi Shimizu & Takashi Toyao, 2023. "Accelerated discovery of multi-elemental reverse water-gas shift catalysts using extrapolative machine learning approach," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    17. Jikai Sun & Rui Tu & Yuchun Xu & Hongyan Yang & Tie Yu & Dong Zhai & Xiuqin Ci & Weiqiao Deng, 2024. "Machine learning aided design of single-atom alloy catalysts for methane cracking," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    18. Kaihang Yue & Yanyang Qin & Honghao Huang & Zhuoran Lv & Mingzhi Cai & Yaqiong Su & Fuqiang Huang & Ya Yan, 2024. "Stabilized Cu0-Cu1+ dual sites in a cyanamide framework for selective CO2 electroreduction to ethylene," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    19. Katja-Sophia Csizi & Miguel Steiner & Markus Reiher, 2024. "Nanoscale chemical reaction exploration with a quantum magnifying glass," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    20. Huziel E. Sauceda & Luis E. Gálvez-González & Stefan Chmiela & Lauro Oliver Paz-Borbón & Klaus-Robert Müller & Alexandre Tkatchenko, 2022. "BIGDML—Towards accurate quantum machine learning force fields for materials," Nature Communications, Nature, vol. 13(1), pages 1-16, December.
