Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella enterica

Authors

  • Nicole E Wheeler
  • Paul P Gardner
  • Lars Barquist

Abstract

Emerging pathogens are a major threat to public health; however, understanding how pathogens adapt to new niches remains a challenge. New methods are urgently required to provide functional insights into pathogens from the massive genomic data sets now being generated from routine pathogen surveillance for epidemiological purposes. Here, we measure the burden of atypical mutations in protein coding genes across independently evolved Salmonella enterica lineages, and use these as input to train a random forest classifier to identify strains associated with extraintestinal disease. Members of the species fall along a continuum, from pathovars that cause gastrointestinal infection and low mortality, associated with a broad host range, to those that cause invasive infection and high mortality, associated with a narrowed host range. Our random forest classifier learned to perfectly discriminate long-established gastrointestinal and invasive serovars of Salmonella. Additionally, it was able to discriminate recently emerged Salmonella Enteritidis and Typhimurium lineages associated with invasive disease in immunocompromised populations in sub-Saharan Africa, and to detect within-host adaptation to invasive infection. We dissect the architecture of the model to identify the genes that were most informative of phenotype, revealing a common theme of degradation of metabolic pathways in extraintestinal lineages. This approach accurately identifies patterns of gene degradation and diversifying selection specific to invasive serovars that have been captured by more labour-intensive investigations, but can be readily scaled to larger analyses.

Author summary

Researchers are now collecting a wealth of genomic data from bacterial pathogens, and this will continue to grow with the introduction of routine sequencing for disease surveillance. However, our ability to use this data to predict how changes in genome sequence lead to differences in disease is limited. Here, we have used machine learning to detect an enrichment in functionally significant mutations in genes associated with a shift in pathogenic niche. This approach captures convergence in functional outcomes that does not necessarily result in a convergence in sequence, facilitating the inclusion of rare variants of large effect in an analysis, and allowing for complex interactions between genes. We apply this approach to Salmonella, showing that we can detect changes associated with disease phenotype in emerging lineages associated with the HIV epidemic. This approach should be applicable to other bacterial species with lineages independently adapting to similar niches. We provide open-source implementations of both the predictive model and the workflow used to build it.
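
The general shape of the analysis described in the abstract (per-gene mutation burden scores used as features for a tree ensemble, followed by a ranking of the most informative genes) can be illustrated with a small, self-contained sketch. This is not the authors' implementation or their published workflow. It is a toy stand-in for a random forest, built from bootstrapped one-split trees with random feature subsets, and it runs on synthetic data. The burden scores, the gene indices, the choice of genes 0 and 1 as "informative", and every function name below are assumptions made purely for illustration.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def fit_stump(X, y, features):
    """Best single (gene, threshold) split among the candidate genes."""
    parent = gini(y)
    best = None  # (gain, gene, threshold, left_label, right_label)
    for f in features:
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            gain = parent - weighted
            if best is None or gain > best[0]:
                left_label = int(sum(left) * 2 >= len(left))
                right_label = int(sum(right) * 2 >= len(right))
                best = (gain, f, t, left_label, right_label)
    return best

def fit_forest(X, y, n_trees=100, n_sub=4, seed=0):
    """Bagged stumps; importance is the normalised total impurity decrease per gene."""
    rng = random.Random(seed)
    n, g = len(X), len(X[0])
    stumps, importance = [], [0.0] * g
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample of strains
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        feats = rng.sample(range(g), n_sub)          # random subset of genes
        stump = fit_stump(Xb, yb, feats)
        if stump is None:                            # no valid split in this sample
            continue
        gain, f, t, left_label, right_label = stump
        stumps.append((f, t, left_label, right_label))
        importance[f] += gain
    total = sum(importance) or 1.0
    return stumps, [v / total for v in importance]

def predict(stumps, row):
    """Majority vote over all stumps: 1 = invasive, 0 = gastrointestinal."""
    votes = sum(ll if row[f] <= t else rl for f, t, ll, rl in stumps)
    return int(votes * 2 > len(stumps))

def make_toy_data(n_per_class=20, n_genes=8, informative=(0, 1), seed=1):
    """Synthetic burden scores: invasive strains are shifted in the informative genes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
            if label == 1:
                for gene in informative:
                    row[gene] += 5.0                 # hypothetical elevated burden
            X.append(row)
            y.append(label)
    return X, y

X, y = make_toy_data()
stumps, importance = fit_forest(X, y)
top_genes = sorted(range(len(importance)), key=lambda g: importance[g], reverse=True)[:2]
train_accuracy = sum(predict(stumps, r) == lab for r, lab in zip(X, y)) / len(y)
print("most informative genes:", top_genes)
print("training accuracy:", train_accuracy)
```

In practice a full random forest library with deeper trees and out-of-bag or held-out evaluation would replace this toy ensemble. The sketch only shows how a gene-by-strain score matrix can feed a classifier and how the fitted model can then be inspected for the genes that drive its predictions.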

Suggested Citation

  • Nicole E Wheeler & Paul P Gardner & Lars Barquist, 2018. "Machine learning identifies signatures of host adaptation in the bacterial pathogen Salmonella enterica," PLOS Genetics, Public Library of Science, vol. 14(5), pages 1-20, May.
  • Handle: RePEc:plo:pgen00:1007333
    DOI: 10.1371/journal.pgen.1007333

    Download full text from publisher

    File URL: https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1007333
    Download Restriction: no

    File URL: https://journals.plos.org/plosgenetics/article/file?id=10.1371/journal.pgen.1007333&type=printable
    Download Restriction: no


    Citations

    Cited by:

    1. Erki Aun & Age Brauer & Veljo Kisand & Tanel Tenson & Maido Remm, 2018. "A k-mer-based method for the identification of phenotype-associated genomic biomarkers and predicting phenotypes of sequenced bacteria," PLOS Computational Biology, Public Library of Science, vol. 14(10), pages 1-17, October.
    2. Danesh Moradigaravand & Martin Palm & Anne Farewell & Ville Mustonen & Jonas Warringer & Leopold Parts, 2018. "Prediction of antibiotic resistance in Escherichia coli from large-scale pan-genome data," PLOS Computational Biology, Public Library of Science, vol. 14(12), pages 1-17, December.
