
A long-context language model for deciphering and generating bacteriophage genomes

Authors

  • Bin Shao (Beijing Institute of Technology; Harvard University)
  • Jiawei Yan (Independent researcher)

Abstract

Inspired by the success of large language models (LLMs), we develop a long-context generative model for genomes. Our multiscale transformer model, megaDNA, is pre-trained on unannotated bacteriophage genomes with nucleotide-level tokenization. We demonstrate the foundational capabilities of our model including the prediction of essential genes, genetic variant effects, regulatory element activity and taxonomy of unannotated sequences. Furthermore, it generates de novo sequences up to 96 K base pairs, which contain potential regulatory elements and annotated proteins with phage-related functions.
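
To illustrate what nucleotide-level tokenization means in practice (one token per base, rather than k-mers or learned subwords), here is a minimal sketch. The vocabulary, the token ids, and the function names `encode` and `decode` are assumptions made for illustration; they are not taken from the megaDNA code base, which may handle padding and ambiguous bases differently.

```python
# Hypothetical vocabulary for illustration only; not the megaDNA vocabulary.
VOCAB = {"<pad>": 0, "A": 1, "C": 2, "G": 3, "T": 4, "N": 5}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}


def encode(sequence: str) -> list[int]:
    """Map each nucleotide to one integer token.

    Any symbol outside A/C/G/T is mapped to the ambiguous base N
    (an assumption of this sketch).
    """
    return [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]


def decode(token_ids: list[int]) -> str:
    """Map token ids back to a nucleotide string, skipping padding."""
    return "".join(ID_TO_TOKEN[i] for i in token_ids if i != VOCAB["<pad>"])


if __name__ == "__main__":
    ids = encode("ATGCGT")
    print(ids)
    print(decode(ids))
```

With this scheme the token sequence is exactly as long as the genome, which is why a model working at this resolution needs a long context window to cover sequences on the scale mentioned in the abstract.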

Suggested Citation

  • Bin Shao & Jiawei Yan, 2024. "A long-context language model for deciphering and generating bacteriophage genomes," Nature Communications, Nature, vol. 15(1), pages 1-7, December.
  • Handle: RePEc:nat:natcom:v:15:y:2024:i:1:d:10.1038_s41467-024-53759-4
    DOI: 10.1038/s41467-024-53759-4

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-024-53759-4
    File Function: Abstract
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Javier Santos-Moreno & Eve Tasiudi & Hadiastri Kusumawardhani & Joerg Stelling & Yolanda Schaerli, 2023. "Robustness and innovation in synthetic genotype networks," Nature Communications, Nature, vol. 14(1), pages 1-17, December.
    2. Michael B. Sheets & Nathan Tague & Mary J. Dunlop, 2023. "An optogenetic toolkit for light-inducible antibiotic resistance," Nature Communications, Nature, vol. 14(1), pages 1-13, December.
    3. Brian D. Huang & Dowan Kim & Yongjoon Yu & Corey J. Wilson, 2024. "Engineering intelligent chassis cells via recombinase-based MEMORY circuits," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    4. Bin Shao & Jiawei Yan & Jing Zhang & Lili Liu & Ye Chen & Allen R. Buskirk, 2024. "Riboformer: a deep learning framework for predicting context-dependent translation dynamics," Nature Communications, Nature, vol. 15(1), pages 1-10, December.
    5. Ruitu Lyu & Yun Gao & Tong Wu & Chang Ye & Pingluan Wang & Chuan He, 2024. "Quantitative analysis of cis-regulatory elements in transcription with KAS-ATAC-seq," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    6. Noor Radde & Genevieve A. Mortensen & Diya Bhat & Shireen Shah & Joseph J. Clements & Sean P. Leonard & Matthew J. McGuffie & Dennis M. Mishler & Jeffrey E. Barrick, 2024. "Measuring the burden of hundreds of BioBricks defines an evolutionary limit on constructability in synthetic biology," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    7. Charlotte Cautereels & Jolien Smets & Peter Bircham & Dries De Ruysscher & Anna Zimmermann & Peter De Rijk & Jan Steensels & Anton Gorkovskiy & Joleen Masschelein & Kevin J. Verstrepen, 2024. "Combinatorial optimization of gene expression through recombinase-mediated promoter and terminator shuffling in yeast," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    8. Travis L. LaFleur & Ayaan Hossain & Howard M. Salis, 2022. "Automated model-predictive design of synthetic promoters to control transcriptional profiles in bacteria," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    9. Daniel P. Cetnar & Ayaan Hossain & Grace E. Vezeau & Howard M. Salis, 2024. "Predicting synthetic mRNA stability using massively parallel kinetic measurements, biophysical modeling, and machine learning," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
    10. Peter J. Diebold & Matthew W. Rhee & Qiaojuan Shi & Nguyen Vinh Trung & Fayaz Umrani & Sheraz Ahmed & Vandana Kulkarni & Prasad Deshpande & Mallika Alexander & Ngo Hoa & Nicholas A. Christakis & Najee, 2023. "Clinically relevant antibiotic resistance genes are linked to a limited set of taxa within gut microbiome worldwide," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
