Printed from https://ideas.repec.org/a/nat/natcom/v13y2022i1d10.1038_s41467-022-32007-7.html

ProtGPT2 is a deep unsupervised language model for protein design

Author

Listed:
  • Noelia Ferruz

    (University of Bayreuth; University of Girona)

  • Steffen Schmidt

    (University of Bayreuth)

  • Birte Höcker

    (University of Bayreuth)

Abstract

Protein design aims to build novel proteins customized for specific purposes, thereby holding the potential to tackle many environmental and biomedical problems. Recent progress in Transformer-based architectures has enabled the implementation of language models capable of generating text with human-like capabilities. Here, motivated by this success, we describe ProtGPT2, a language model trained on the protein space that generates de novo protein sequences following the principles of natural ones. The generated proteins display natural amino acid propensities, while disorder predictions indicate that 88% of ProtGPT2-generated proteins are globular, in line with natural sequences. Sensitive sequence searches in protein databases show that ProtGPT2 sequences are distantly related to natural ones, and similarity networks further demonstrate that ProtGPT2 is sampling unexplored regions of protein space. AlphaFold prediction of ProtGPT2-sequences yields well-folded non-idealized structures with embodiments and large loops and reveals topologies not captured in current structure databases. ProtGPT2 generates sequences in a matter of seconds and is freely available.

Suggested Citation

  • Noelia Ferruz & Steffen Schmidt & Birte Höcker, 2022. "ProtGPT2 is a deep unsupervised language model for protein design," Nature Communications, Nature, vol. 13(1), pages 1-10, December.
  • Handle: RePEc:nat:natcom:v:13:y:2022:i:1:d:10.1038_s41467-022-32007-7
    DOI: 10.1038/s41467-022-32007-7

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-022-32007-7
    File Function: Abstract
    Download Restriction: no


    References listed on IDEAS

    1. Kathryn Tunyasuvunakool & Jonas Adler & Zachary Wu & Tim Green & Michal Zielinski & Augustin Žídek & Alex Bridgland & Andrew Cowie & Clemens Meyer & Agata Laydon & Sameer Velankar & Gerard J. Kleywegt, 2021. "Highly accurate protein structure prediction for the human proteome," Nature, Nature, vol. 596(7873), pages 590-596, August.
    2. Andrew W. Senior & Richard Evans & John Jumper & James Kirkpatrick & Laurent Sifre & Tim Green & Chongli Qin & Augustin Žídek & Alexander W. R. Nelson & Alex Bridgland & Hugo Penedones & Stig Petersen, 2020. "Improved protein structure prediction using potentials from deep learning," Nature, Nature, vol. 577(7792), pages 706-710, January.
    3. Chunfu Xu & Peilong Lu & Tamer M. Gamal El-Din & Xue Y. Pei & Matthew C. Johnson & Atsuko Uyeda & Matthew J. Bick & Qi Xu & Daohua Jiang & Hua Bai & Gabriella Reggiano & Yang Hsia & T J Brunette & Jia, 2020. "Computational design of transmembrane pores," Nature, Nature, vol. 585(7823), pages 129-134, September.
    4. Hongyuan Lu & Daniel J. Diaz & Natalie J. Czarnecki & Congzhi Zhu & Wantae Kim & Raghav Shroff & Daniel J. Acosta & Bradley R. Alexander & Hannah O. Cole & Yan Zhang & Nathaniel A. Lynd & Andrew D. El, 2022. "Machine learning-aided engineering of hydrolases for PET depolymerization," Nature, Nature, vol. 604(7907), pages 662-667, April.
    5. Po-Ssu Huang & Scott E. Boyken & David Baker, 2016. "The coming of age of de novo protein design," Nature, Nature, vol. 537(7620), pages 320-327, September.
    6. Namrata Anand & Raphael Eguchi & Irimpan I. Mathews & Carla P. Perez & Alexander Derry & Russ B. Altman & Po-Ssu Huang, 2022. "Protein sequence design with a learned potential," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    7. John Jumper & Richard Evans & Alexander Pritzel & Tim Green & Michael Figurnov & Olaf Ronneberger & Kathryn Tunyasuvunakool & Russ Bates & Augustin Žídek & Anna Potapenko & Alex Bridgland & Clemens Me, 2021. "Highly accurate protein structure prediction with AlphaFold," Nature, Nature, vol. 596(7873), pages 583-589, August.
    8. Marion F Sauer & Alexander M Sevy & James E Crowe Jr. & Jens Meiler, 2020. "Multi-state design of flexible proteins predicts sequences optimal for conformational change," PLOS Computational Biology, Public Library of Science, vol. 16(2), pages 1-29, February.

    Citations


    Cited by:

    1. Palistha Shrestha & Jeevan Kandel & Hilal Tayara & Kil To Chong, 2024. "Post-translational modification prediction via prompt-based fine-tuning of a GPT-2 model," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    2. Sijie Chen & Tong Lin & Ruchira Basu & Jeremy Ritchey & Shen Wang & Yichuan Luo & Xingcan Li & Dehua Pei & Levent Burak Kara & Xiaolin Cheng, 2024. "Design of target specific peptide inhibitors using generative deep learning and molecular dynamics simulations," Nature Communications, Nature, vol. 15(1), pages 1-20, December.
    3. Amir Pandi & David Adam & Amir Zare & Van Tuan Trinh & Stefan L. Schaefer & Marie Burt & Björn Klabunde & Elizaveta Bobkova & Manish Kushwaha & Yeganeh Foroughijabbari & Peter Braun & Christoph Spahn, 2023. "Cell-free biosynthesis combined with deep learning accelerates de novo-development of antimicrobial peptides," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    4. Wenwu Zeng & Yutao Dou & Liangrui Pan & Liwen Xu & Shaoliang Peng, 2024. "Improving prediction performance of general protein language model by domain-adaptive pretraining on DNA-binding protein," Nature Communications, Nature, vol. 15(1), pages 1-18, December.
    5. Kevin E. Wu & Kevin K. Yang & Rianne Berg & Sarah Alamdari & James Y. Zou & Alex X. Lu & Ava P. Amini, 2024. "Protein structure generation via folding diffusion," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    6. David Ding & Ada Y. Shaw & Sam Sinai & Nathan Rollins & Noam Prywes & David F. Savage & Michael T. Laub & Debora S. Marks, 2024. "Protein design using structure-based residue preferences," Nature Communications, Nature, vol. 15(1), pages 1-12, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Agnese I. Curatolo & Ofer Kimchi & Carl P. Goodrich & Ryan K. Krueger & Michael P. Brenner, 2023. "A computational toolbox for the assembly yield of complex and heterogeneous structures," Nature Communications, Nature, vol. 14(1), pages 1-13, December.
    2. Daniel J. Diaz & Chengyue Gong & Jeffrey Ouyang-Zhang & James M. Loy & Jordan Wells & David Yang & Andrew D. Ellington & Alexandros G. Dimakis & Adam R. Klivans, 2024. "Stability Oracle: a structure-based graph-transformer framework for identifying stabilizing mutations," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    3. Zachary C. Drake & Justin T. Seffernick & Steffen Lindert, 2022. "Protein complex prediction using Rosetta, AlphaFold, and mass spectrometry covalent labeling," Nature Communications, Nature, vol. 13(1), pages 1-9, December.
    4. Nicolae Sapoval & Amirali Aghazadeh & Michael G. Nute & Dinler A. Antunes & Advait Balaji & Richard Baraniuk & C. J. Barberan & Ruth Dannenfelser & Chen Dun & Mohammadamin Edrisi & R. A. Leo Elworth &, 2022. "Current progress and open challenges for applying deep learning across the biosciences," Nature Communications, Nature, vol. 13(1), pages 1-12, December.
    5. Niklas W. A. Gebauer & Michael Gastegger & Stefaan S. P. Hessmann & Klaus-Robert Müller & Kristof T. Schütt, 2022. "Inverse design of 3d molecular structures with conditional generative neural networks," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    6. Hajkowicz, Stefan & Naughtin, Claire & Sanderson, Conrad & Schleiger, Emma & Karimi, Sarvnaz & Bratanova, Alexandra & Bednarz, Tomasz, 2022. "Artificial intelligence for science – adoption trends and future development pathways," MPRA Paper 115464, University Library of Munich, Germany.
    7. Qiufen Chen & Yuanzhao Guo & Jiuhong Jiang & Jing Qu & Li Zhang & Han Wang, 2023. "The Relative Distance Prediction of Transmembrane Protein Surface Residue Based on Improved Residual Networks," Mathematics, MDPI, vol. 11(3), pages 1-16, January.
    8. Biao Ruan & Yanan He & Yingwei Chen & Eun Jung Choi & Yihong Chen & Dana Motabar & Tsega Solomon & Richard Simmerman & Thomas Kauffman & D. Travis Gallagher & John Orban & Philip N. Bryan, 2023. "Design and characterization of a protein fold switching network," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    9. Yinglu Cui & Yanchun Chen & Jinyuan Sun & Tong Zhu & Hua Pang & Chunli Li & Wen-Chao Geng & Bian Wu, 2024. "Computational redesign of a hydrolase for nearly complete PET depolymerization at industrially relevant high-solids loading," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    10. Zhuozhi Chen & Rongdi Duan & Yunjie Xiao & Yi Wei & Hanxiao Zhang & Xinzhao Sun & Shen Wang & Yingying Cheng & Xue Wang & Shanwei Tong & Yunxiao Yao & Cheng Zhu & Haitao Yang & Yanyan Wang & Zefang Wa, 2022. "Biodegradation of highly crystallized poly(ethylene terephthalate) through cell surface codisplay of bacterial PETase and hydrophobin," Nature Communications, Nature, vol. 13(1), pages 1-17, December.
    11. Aaron Gupta & Kevin S. Kao & Rachel Yamin & Deena A. Oren & Yehuda Goldgur & Jonathan Du & Pete Lollar & Eric J. Sundberg & Jeffrey V. Ravetch, 2023. "Mechanism of glycoform specificity and in vivo protection by an anti-afucosylated IgG nanobody," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    12. Lei Wang & Jiangguo Zhang & Dali Wang & Chen Song, 2022. "Membrane contact probability: An essential and predictive character for the structural and functional studies of membrane proteins," PLOS Computational Biology, Public Library of Science, vol. 18(3), pages 1-27, March.
    13. Jong Woo Bae & Sangtae Kim & V. Narry Kim & Jong-Seo Kim, 2021. "Photoactivatable ribonucleosides mark base-specific RNA-binding sites," Nature Communications, Nature, vol. 12(1), pages 1-10, December.
    14. Erika Erickson & Japheth E. Gado & Luisana Avilán & Felicia Bratti & Richard K. Brizendine & Paul A. Cox & Raj Gill & Rosie Graham & Dong-Jin Kim & Gerhard König & William E. Michener & Saroj Poudel &, 2022. "Sourcing thermotolerant poly(ethylene terephthalate) hydrolase scaffolds from natural diversity," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    15. Zhiye Guo & Jian Liu & Jeffrey Skolnick & Jianlin Cheng, 2022. "Prediction of inter-chain distance maps of protein complexes with 2D attention-based deep neural networks," Nature Communications, Nature, vol. 13(1), pages 1-10, December.
    16. Nicolas Renaud & Cunliang Geng & Sonja Georgievska & Francesco Ambrosetti & Lars Ridder & Dario F. Marzella & Manon F. Réau & Alexandre M. J. J. Bonvin & Li C. Xue, 2021. "DeepRank: a deep learning framework for data mining 3D protein-protein interfaces," Nature Communications, Nature, vol. 12(1), pages 1-8, December.
    17. Simon d’Oelsnitz & Daniel J. Diaz & Wantae Kim & Daniel J. Acosta & Tyler L. Dangerfield & Mason W. Schechter & Matthew B. Minus & James R. Howard & Hannah Do & James M. Loy & Hal S. Alper & Y. Jessie, 2024. "Biosensor and machine learning-aided engineering of an amaryllidaceae enzyme," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    18. Shuangjia Zheng & Tao Zeng & Chengtao Li & Binghong Chen & Connor W. Coley & Yuedong Yang & Ruibo Wu, 2022. "Deep learning driven biosynthetic pathways navigation for natural products with BioNavi-NP," Nature Communications, Nature, vol. 13(1), pages 1-9, December.
    19. Ye Yuan & Lei Chen & Kexu Song & Miaomiao Cheng & Ling Fang & Lingfei Kong & Lanlan Yu & Ruonan Wang & Zhendong Fu & Minmin Sun & Qian Wang & Chengjun Cui & Haojue Wang & Jiuyang He & Xiaonan Wang & Y, 2024. "Stable peptide-assembled nanozyme mimicking dual antifungal actions," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    20. Ivica Odorčić & Mohamed Belal Hamed & Sam Lismont & Lucía Chávez-Gutiérrez & Rouslan G. Efremov, 2024. "Apo and Aβ46-bound γ-secretase structures provide insights into amyloid-β processing by the APH-1B isoform," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
