Printed from https://ideas.repec.org/a/nat/natcom/v13y2022i1d10.1038_s41467-022-34902-5.html

Accuracy and data efficiency in deep learning models of protein expression

Author

Listed:
  • Evangelos-Marios Nikolados

    (University of Edinburgh)

  • Arin Wongprommoon

    (University of Edinburgh)

  • Oisin Mac Aodha

    (University of Edinburgh
    The Alan Turing Institute)

  • Guillaume Cambray

    (University of Montpellier)

  • Diego A. Oyarzún

    (University of Edinburgh
    The Alan Turing Institute)

Abstract

Synthetic biology often involves engineering microbial strains to express high-value proteins. Thanks to progress in rapid DNA synthesis and sequencing, deep learning has emerged as a promising approach to build sequence-to-expression models for strain optimization. But such models need large and costly training data that create steep entry barriers for many laboratories. Here we study the relation between accuracy and data efficiency in an atlas of machine learning models trained on datasets of varied size and sequence diversity. We show that deep learning can achieve good prediction accuracy with much smaller datasets than previously thought. We demonstrate that controlled sequence diversity leads to substantial gains in data efficiency, and we employ Explainable AI to show that convolutional neural networks can finely discriminate between input DNA sequences. Our results provide guidelines for designing genotype-phenotype screens that balance cost and quality of training data, thus helping promote the wider adoption of deep learning in the biotechnology sector.

Suggested Citation

  • Evangelos-Marios Nikolados & Arin Wongprommoon & Oisin Mac Aodha & Guillaume Cambray & Diego A. Oyarzún, 2022. "Accuracy and data efficiency in deep learning models of protein expression," Nature Communications, Nature, vol. 13(1), pages 1-12, December.
  • Handle: RePEc:nat:natcom:v:13:y:2022:i:1:d:10.1038_s41467-022-34902-5
    DOI: 10.1038/s41467-022-34902-5

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-022-34902-5
    File Function: Abstract
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Raphaël V. Gayet & Katherine Ilia & Shiva Razavi & Nathaniel D. Tippens & Makoto A. Lalwani & Kehan Zhang & Jack X. Chen & Jonathan C. Chen & Jose Vargas-Asencio & James J. Collins, 2023. "Autocatalytic base editing for RNA-responsive translational control," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
    2. Andreas Walbrun & Tianhe Wang & Michael Matthies & Petr Šulc & Friedrich C. Simmel & Matthias Rief, 2024. "Single-molecule force spectroscopy of toehold-mediated strand displacement," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    3. SJ, Balaji & Babu, Suresh Chandra & Pal, Suresh, 2021. "Understanding Science and Policy Making in Agriculture: A Machine Learning Application for India," 2021 Conference, August 17-31, 2021, Virtual 315227, International Association of Agricultural Economists.
    4. Simeon D. Castle & Michiel Stock & Thomas E. Gorochowski, 2024. "Engineering is evolution: a perspective on design processes to engineer biology," Nature Communications, Nature, vol. 15(1), pages 1-10, December.
    5. Lu Wu & Xu-Wen Wang & Zining Tao & Tong Wang & Wenlong Zuo & Yu Zeng & Yang-Yu Liu & Lei Dai, 2024. "Data-driven prediction of colonization outcomes for complex microbial communities," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    6. Noor Radde & Genevieve A. Mortensen & Diya Bhat & Shireen Shah & Joseph J. Clements & Sean P. Leonard & Matthew J. McGuffie & Dennis M. Mishler & Jeffrey E. Barrick, 2024. "Measuring the burden of hundreds of BioBricks defines an evolutionary limit on constructability in synthetic biology," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    7. Simon Höllerer & Laetitia Papaxanthos & Anja Cathrin Gumpinger & Katrin Fischer & Christian Beisel & Karsten Borgwardt & Yaakov Benenson & Markus Jeschek, 2020. "Large-scale DNA-based phenotypic recording and deep learning enable highly accurate sequence-function mapping," Nature Communications, Nature, vol. 11(1), pages 1-15, December.
    8. Naoki Hayashi & Yong Lai & Jay Fuerte-Stone & Mark Mimee & Timothy K. Lu, 2024. "Cas9-assisted biological containment of a genetically engineered human commensal bacterium and genetic elements," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    9. Alicia Broto & Erika Gaspari & Samuel Miravet-Verde & Vitor A. P. Martins Santos & Mark Isalan, 2022. "A genetic toolkit and gene switches to limit Mycoplasma growth for biosafety applications," Nature Communications, Nature, vol. 13(1), pages 1-13, December.
    10. Samuel Miravet-Verde & Rocco Mazzolini & Carolina Segura-Morales & Alicia Broto & Maria Lluch-Senar & Luis Serrano, 2024. "ProTInSeq: transposon insertion tracking by ultra-deep DNA sequencing to identify translated large and small ORFs," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    11. Gi Bae Kim & Ji Yeon Kim & Jong An Lee & Charles J. Norsigian & Bernhard O. Palsson & Sang Yup Lee, 2023. "Functional annotation of enzyme-encoding genes using deep learning with transformer layers," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    12. Pengcheng Zhang & Haochen Wang & Hanwen Xu & Lei Wei & Liyang Liu & Zhirui Hu & Xiaowo Wang, 2023. "Deep flanking sequence engineering for efficient promoter design using DeepSEED," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    13. Jan Zrimec & Xiaozhi Fu & Azam Sheikh Muhammad & Christos Skrekas & Vykintas Jauniskis & Nora K. Speicher & Christoph S. Börlin & Vilhelm Verendel & Morteza Haghir Chehreghani & Devdatt Dubhashi & Ver, 2022. "Controlling gene expression with deep generative design of regulatory DNA," Nature Communications, Nature, vol. 13(1), pages 1-17, December.
    14. Shumin Wang & Xin Jiang & Muhammad Bilawal Khaskheli, 2024. "The Role of Technology in the Digital Economy’s Sustainable Development of Hainan Free Trade Port and Genetic Testing: Cloud Computing and Digital Law," Sustainability, MDPI, vol. 16(14), pages 1-20, July.
