Printed from https://ideas.repec.org/a/nat/natcom/v13y2022i1d10.1038_s41467-022-33397-4.html

Deciphering microbial gene function using natural language processing

Authors

  • Danielle Miller (Tel-Aviv University)
  • Adi Stern (Tel-Aviv University)
  • David Burstein (Tel-Aviv University)

Abstract

Revealing the function of uncharacterized genes is a fundamental challenge in an era of ever-increasing volumes of sequencing data. Here, we present a concept for tackling this challenge using deep learning methodologies adopted from natural language processing (NLP). We repurpose NLP algorithms to model “gene semantics” based on a biological corpus of more than 360 million microbial genes within their genomic context. We use the language models to predict functional categories for 56,617 genes and find that out of 1369 genes associated with recently discovered defense systems, 98% are inferred correctly. We then systematically evaluate the “discovery potential” of different functional categories, pinpointing those with the most genes yet to be characterized. Finally, we demonstrate our method’s ability to discover systems associated with microbial interaction and defense. Our results highlight that combining microbial genomics and language models is a promising avenue for revealing gene functions in microbes.
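
The idea of modeling "gene semantics" from genomic context can be illustrated with a toy sketch. This is not the authors' pipeline: it swaps the learned language model for simple co-occurrence counts, and the gene-family names, contigs, category labels, and window size are all hypothetical. It only shows the general principle that a gene's neighbors can act like the words surrounding a term in a sentence, so an uncharacterized gene can inherit the category of the annotated gene with the most similar context.

```python
from collections import Counter, defaultdict
from math import sqrt

# Hypothetical toy corpus: each contig is an ordered list of gene-family IDs,
# playing the role of a "sentence" whose "words" are genes.
CONTIGS = [
    ["casA", "casB", "unkX", "casC"],
    ["casA", "unkX", "casB", "casC"],
    ["casB", "casA", "casC", "unkX"],
    ["ribA", "ribB", "unkY", "ribC"],
    ["ribB", "ribA", "ribC", "unkY"],
    ["ribA", "unkY", "ribB", "ribC"],
]

# Hypothetical annotations for the characterized gene families.
KNOWN = {
    "casA": "defense", "casB": "defense", "casC": "defense",
    "ribA": "translation", "ribB": "translation", "ribC": "translation",
}


def context_vectors(contigs, window=2):
    """Count, for every gene, which genes occur within `window` positions of it."""
    vectors = defaultdict(Counter)
    for contig in contigs:
        for i, gene in enumerate(contig):
            start = max(0, i - window)
            stop = min(len(contig), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[gene][contig[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def predict_category(gene, vectors, known):
    """Return the category of the annotated gene whose context is most similar."""
    candidates = [g for g in known if g in vectors and g != gene]
    best = max(candidates, key=lambda g: cosine(vectors[gene], vectors[g]))
    return known[best]


vectors = context_vectors(CONTIGS)
for unknown in ("unkX", "unkY"):
    print(unknown, "->", predict_category(unknown, vectors, KNOWN))
```

In this toy corpus the two unannotated families sit only among genes of a single category, so nearest-neighbor transfer is unambiguous; real genomic data would need a learned embedding and a proper classifier to handle mixed or noisy contexts.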

Suggested Citation

  • Danielle Miller & Adi Stern & David Burstein, 2022. "Deciphering microbial gene function using natural language processing," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
  • Handle: RePEc:nat:natcom:v:13:y:2022:i:1:d:10.1038_s41467-022-33397-4
    DOI: 10.1038/s41467-022-33397-4

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-022-33397-4
    File Function: Abstract
    Download Restriction: no

    Citations

    Cited by:

    1. Sheri Harari & Danielle Miller & Shay Fleishon & David Burstein & Adi Stern, 2024. "Using big sequencing data to identify chronic SARS-Coronavirus-2 infections," Nature Communications, Nature, vol. 15(1), pages 1-12, December.
    2. Yunha Hwang & Andre L. Cornman & Elizabeth H. Kellogg & Sergey Ovchinnikov & Peter R. Girguis, 2024. "Genomic language model predicts protein co-regulation and function," Nature Communications, Nature, vol. 15(1), pages 1-13, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Camila G. C. Lemes & Isabella F. Cordeiro & Camila H. de Paula & Ana K. Silva & Flávio F. do Carmo & Luciana H. Y. Kamino & Flávia M. S. Carvalho & Juan C. Caicedo & Jesus A. Ferro & Leandro M. Moreir, 2021. "Potential Bioinoculants for Sustainable Agriculture Prospected from Ferruginous Caves of the Iron Quadrangle/Brazil," Sustainability, MDPI, vol. 13(16), pages 1-23, August.
    2. Angelina Beavogui & Auriane Lacroix & Nicolas Wiart & Julie Poulain & Tom O. Delmont & Lucas Paoli & Patrick Wincker & Pedro H. Oliveira, 2024. "The defensome of complex bacterial communities," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
    3. Jan D. Brüwer & Chandni Sidhu & Yanlin Zhao & Andreas Eich & Leonard Rößler & Luis H. Orellana & Bernhard M. Fuchs, 2024. "Globally occurring pelagiphage infections create ribosome-deprived cells," Nature Communications, Nature, vol. 15(1), pages 1-9, December.
    4. Bogna J. Smug & Krzysztof Szczepaniak & Eduardo P. C. Rocha & Stanislaw Dunin-Horkawicz & Rafał J. Mostowy, 2023. "Ongoing shuffling of protein fragments diversifies core viral functions linked to interactions with bacterial hosts," Nature Communications, Nature, vol. 14(1), pages 1-16, December.
    5. Natalia Quinones-Olvera & Siân V. Owen & Lucy M. McCully & Maximillian G. Marin & Eleanor A. Rand & Alice C. Fan & Oluremi J. Martins Dosumu & Kay Paul & Cleotilde E. Sanchez Castaño & Rachel Petherbr, 2024. "Diverse and abundant phages exploit conjugative plasmids," Nature Communications, Nature, vol. 15(1), pages 1-16, December.
    6. Feiyu Zhao & Tao Zhang & Xiaodi Sun & Xiyun Zhang & Letong Chen & Hejun Wang & Jinze Li & Peng Fan & Liangxue Lai & Tingting Sui & Zhanjun Li, 2023. "A strategy for Cas13 miniaturization based on the structure and AlphaFold," Nature Communications, Nature, vol. 14(1), pages 1-13, December.
    7. Rubén Barcia-Cruz & David Goudenège & Jorge A. Moura de Sousa & Damien Piel & Martial Marbouty & Eduardo P. C. Rocha & Frédérique Roux, 2024. "Phage-inducible chromosomal minimalist islands (PICMIs), a novel family of small marine satellites of virulent phages," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    8. Daniela S. Aliaga Goltsman & Lisa M. Alexander & Jyun-Liang Lin & Rodrigo Fregoso Ocampo & Benjamin Freeman & Rebecca C. Lamothe & Andres Perez Rivas & Morayma M. Temoche-Diaz & Shailaja Chadha & Nata, 2022. "Compact Cas9d and HEARO enzymes for genome editing discovered from uncultivated microbes," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    9. Shao-Ming Gao & Han-Lan Fei & Qi Li & Li-Ying Lan & Li-Nan Huang & Peng-Fei Fan, 2024. "Eco-evolutionary dynamics of gut phageome in wild gibbons (Hoolock tianxing) with seasonal diet variations," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    10. Natasha K. Dudek & Jesus G. Galaz-Montoya & Handuo Shi & Megan Mayer & Cristina Danita & Arianna I. Celis & Tobias Viehboeck & Gong-Her Wu & Barry Behr & Silvia Bulgheresi & Kerwyn Casey Huang & Wah C, 2023. "Previously uncharacterized rectangular bacterial structures in the dolphin mouth," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
    11. Changchang Xin & Jianhang Yin & Shaopeng Yuan & Liqiong Ou & Mengzhu Liu & Weiwei Zhang & Jiazhi Hu, 2022. "Comprehensive assessment of miniature CRISPR-Cas12f nucleases for gene disruption," Nature Communications, Nature, vol. 13(1), pages 1-10, December.
    12. Xiaoguang Pan & Kunli Qu & Hao Yuan & Xi Xiang & Christian Anthon & Liubov Pashkova & Xue Liang & Peng Han & Giulia I. Corsi & Fengping Xu & Ping Liu & Jiayan Zhong & Yan Zhou & Tao Ma & Hui Jiang & J, 2022. "Massively targeted evaluation of therapeutic CRISPR off-targets in cells," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    13. Matthieu Haudiquet & Julie Bris & Amandine Nucci & Rémy A. Bonnin & Pilar Domingo-Calap & Eduardo P. C. Rocha & Olaya Rendueles, 2024. "Capsules and their traits shape phage susceptibility and plasmid conjugation efficiency," Nature Communications, Nature, vol. 15(1), pages 1-16, December.
    14. Yong Sheng & Hengyu Wang & Yixin Ou & Yingying Wu & Wei Ding & Meifeng Tao & Shuangjun Lin & Zixin Deng & Linquan Bai & Qianjin Kang, 2023. "Insertion sequence transposition inactivates CRISPR-Cas immunity," Nature Communications, Nature, vol. 14(1), pages 1-19, December.
    15. Eugen Pfeifer & Eduardo P. C. Rocha, 2024. "Phage-plasmids promote recombination and emergence of phages and plasmids," Nature Communications, Nature, vol. 15(1), pages 1-13, December.
    16. Katarzyna Kanarek & Chaya Mushka Fridman & Eran Bosis & Dor Salomon, 2023. "The RIX domain defines a class of polymorphic T6SS effectors and secreted adaptors," Nature Communications, Nature, vol. 14(1), pages 1-13, December.
