Author
Listed:
- Alex Hawkins-Hooker
- Florence Depardieu
- Sebastien Baur
- Guillaume Couairon
- Arthur Chen
- David Bikard
Abstract
The vast expansion of protein sequence databases provides an opportunity for new protein design approaches which seek to learn the sequence-function relationship directly from natural sequence variation. Deep generative models trained on protein sequence data have been shown to learn biologically meaningful representations helpful for a variety of downstream tasks, but their potential for direct use in the design of novel proteins remains largely unexplored. Here we show that variational autoencoders trained on a dataset of almost 70000 luciferase-like oxidoreductases can be used to generate novel, functional variants of the luxA bacterial luciferase. We propose separate VAE models to work with aligned sequence input (MSA VAE) and raw sequence input (AR-VAE), and offer evidence that while both are able to reproduce patterns of amino acid usage characteristic of the family, the MSA VAE is better able to capture long-distance dependencies reflecting the influence of 3D structure. To confirm the practical utility of the models, we used them to generate variants of luxA whose luminescence activity was validated experimentally. We further showed that conditional variants of both models could be used to increase the solubility of luxA without disrupting function. Altogether 6/12 of the variants generated using the unconditional AR-VAE and 9/11 generated using the unconditional MSA VAE retained measurable luminescence, together with all 23 of the less distant variants generated by conditional versions of the models; the most distant functional variant contained 35 differences relative to the nearest training set sequence. 
These results demonstrate the feasibility of using deep generative models to explore the space of possible protein sequences and generate useful variants, providing a method complementary to rational design and directed evolution approaches.
Author summary: The design of novel proteins with specified function and biochemical properties is a longstanding goal in bio-engineering with applications across medicine and nanotechnology. Despite the impressive achievements of traditional approaches, a great deal of scope remains for the development of data-driven methods capable of exploiting the record of natural sequence variation available in protein databases. Deep generative models such as variational autoencoders (VAEs) have shown remarkable success in synthesising realistic data samples across a range of modalities, driving recent interest in developing such models for proteins. However, experimental evidence for the viability of such techniques in practical protein design settings remains scarce. Here we show that VAEs trained on the family of luciferase-like oxidoreductases can be used to generate functional variants of the luxA bacterial luciferase. We compare the use of raw and aligned sequences as input to the model, providing evidence that models trained on aligned data are better able to learn functional constraints. Finally, we demonstrate the possibility of controlling desired properties of the designed sequences, by using conditional versions of the VAE models to increase the solubility of the wild-type luxA sequence from P. luminescens.
Suggested Citation
Alex Hawkins-Hooker & Florence Depardieu & Sebastien Baur & Guillaume Couairon & Arthur Chen & David Bikard, 2021.
"Generating functional protein variants with variational autoencoders,"
PLOS Computational Biology, Public Library of Science, vol. 17(2), pages 1-23, February.
Handle:
RePEc:plo:pcbi00:1008736
DOI: 10.1371/journal.pcbi.1008736
Citations
Cited by:
- Shunshi Kohyama & Béla P. Frohn & Leon Babl & Petra Schwille, 2024.
"Machine learning-aided design and screening of an emergent protein function in synthetic cells,"
Nature Communications, Nature, vol. 15(1), pages 1-14, December.
- Francisco McGee & Sandro Hauri & Quentin Novinger & Slobodan Vucetic & Ronald M. Levy & Vincenzo Carnevale & Allan Haldane, 2021.
"The generative capacity of probabilistic protein sequence models,"
Nature Communications, Nature, vol. 12(1), pages 1-14, December.
- Amir Pandi & David Adam & Amir Zare & Van Tuan Trinh & Stefan L. Schaefer & Marie Burt & Björn Klabunde & Elizaveta Bobkova & Manish Kushwaha & Yeganeh Foroughijabbari & Peter Braun & Christoph Spahn, 2023.
"Cell-free biosynthesis combined with deep learning accelerates de novo-development of antimicrobial peptides,"
Nature Communications, Nature, vol. 14(1), pages 1-14, December.
- Cheyenne Ziegler & Jonathan Martin & Claude Sinner & Faruck Morcos, 2023.
"Latent generative landscapes as maps of functional diversity in protein sequence space,"
Nature Communications, Nature, vol. 14(1), pages 1-15, December.
- Erika Erickson & Japheth E. Gado & Luisana Avilán & Felicia Bratti & Richard K. Brizendine & Paul A. Cox & Raj Gill & Rosie Graham & Dong-Jin Kim & Gerhard König & William E. Michener & Saroj Poudel &, 2022.
"Sourcing thermotolerant poly(ethylene terephthalate) hydrolase scaffolds from natural diversity,"
Nature Communications, Nature, vol. 13(1), pages 1-15, December.
- Chase R. Freschlin & Sarah A. Fahlberg & Pete Heinzelman & Philip A. Romero, 2024.
"Neural network extrapolation to distant regions of the protein fitness landscape,"
Nature Communications, Nature, vol. 15(1), pages 1-13, December.