
Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships

Author

Listed:
  • Florian Huber
  • Lars Ridder
  • Stefan Verhoeven
  • Jurriaan H Spaaks
  • Faruk Diblen
  • Simon Rogers
  • Justin J J van der Hooft

Abstract

Spectral similarity is used as a proxy for structural similarity in many tandem mass spectrometry (MS/MS) based metabolomics analyses such as library matching and molecular networking. Although weaknesses in the relationship between spectral similarity scores and the true structural similarities have been described, little development of alternative scores has been undertaken. Here, we introduce Spec2Vec, a novel spectral similarity score inspired by a natural language processing algorithm, Word2Vec. Spec2Vec learns fragmental relationships within a large set of spectral data to derive abstract spectral embeddings that can be used to assess spectral similarities. Using data derived from GNPS MS/MS libraries including spectra for nearly 13,000 unique molecules, we show how Spec2Vec scores correlate better with structural similarity than cosine-based scores. We demonstrate the advantages of Spec2Vec in library matching and molecular networking. Spec2Vec is also computationally more scalable, allowing structural analogue searches in large databases within seconds.

Author summary: Most metabolomics analyses rely upon matching observed fragmentation mass spectra to library spectra for structural annotation, or compare spectra with each other through network analysis. As a key part of such processes, scoring functions are used to assess the similarity between pairs of fragment spectra. No studies have so far proposed scores fundamentally different to the popular cosine-based similarity score, despite the fact that its limitations are well understood. We propose a novel spectral similarity score known as Spec2Vec, which adapts algorithms from natural language processing to learn relationships between peaks from their co-occurrences across large spectral datasets. We find that similarities computed with Spec2Vec i) correlate better with structural similarity than cosine-based scores, ii) subsequently give better performance in library matching tasks, and iii) are computationally more scalable than cosine-based scores. Given the central place of similarity scoring in key metabolomics analysis tasks such as library matching and spectral networking, we expect Spec2Vec to make a broad impact in all fields that rely upon untargeted metabolomics.
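
The contrast between a cosine-based score and an embedding-based score can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the `peak@<m/z>` word format, the square-root intensity weighting, and the hand-written two-dimensional `toy_vectors` are all assumptions made for illustration. In practice the word vectors would be learned by a Word2Vec model from peak co-occurrences across many spectra, and a real cosine score would match peaks within an m/z tolerance rather than by exact binned value.

```python
import math


def spectrum_to_document(peaks, decimals=2):
    """Turn (m/z, intensity) pairs into (word, intensity) pairs, e.g. 'peak@105.07'."""
    return [(f"peak@{mz:.{decimals}f}", intensity) for mz, intensity in peaks]


def cosine(u, v):
    """Plain cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def binned_cosine(peaks_a, peaks_b, decimals=2):
    """Simplified baseline: cosine over intensities of peaks with identical binned m/z."""
    a, b = {}, {}
    for word, intensity in spectrum_to_document(peaks_a, decimals):
        a[word] = a.get(word, 0.0) + intensity
    for word, intensity in spectrum_to_document(peaks_b, decimals):
        b[word] = b.get(word, 0.0) + intensity
    words = sorted(set(a) | set(b))
    return cosine([a.get(w, 0.0) for w in words], [b.get(w, 0.0) for w in words])


def spectrum_embedding(peaks, word_vectors, decimals=2):
    """Intensity-weighted sum of word vectors (the weighting is an assumption)."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word, intensity in spectrum_to_document(peaks, decimals):
        vec = word_vectors.get(word)
        if vec is None:
            continue  # peaks unknown to the model are skipped in this sketch
        weight = math.sqrt(intensity)
        for k in range(dim):
            total[k] += weight * vec[k]
    return total


def embedding_similarity(peaks_a, peaks_b, word_vectors):
    """Cosine similarity between the two spectrum embeddings."""
    return cosine(
        spectrum_embedding(peaks_a, word_vectors),
        spectrum_embedding(peaks_b, word_vectors),
    )


# Hypothetical vectors: the first two peaks are placed close together, as if a
# trained model had seen them co-occur often. The values are invented.
toy_vectors = {
    "peak@77.04": [1.0, 0.1],
    "peak@105.07": [0.9, 0.2],
    "peak@200.10": [0.0, 1.0],
}

spectrum_a = [(77.04, 1.0)]
spectrum_b = [(105.07, 1.0)]

baseline = binned_cosine(spectrum_a, spectrum_b)
learned = embedding_similarity(spectrum_a, spectrum_b, toy_vectors)
print(f"binned cosine: {baseline:.2f}, embedding similarity: {learned:.2f}")
```

The two toy spectra share no peak, so the baseline score is zero, while the embedding score is high because the invented vectors for their peaks point in similar directions. This is the kind of learned relationship between peaks that the abstract describes.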

Suggested Citation

  • Florian Huber & Lars Ridder & Stefan Verhoeven & Jurriaan H Spaaks & Faruk Diblen & Simon Rogers & Justin J J van der Hooft, 2021. "Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships," PLOS Computational Biology, Public Library of Science, vol. 17(2), pages 1-18, February.
  • Handle: RePEc:plo:pcbi00:1008724
    DOI: 10.1371/journal.pcbi.1008724

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008724
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1008724&type=printable
    Download Restriction: no


    Citations

    Cited by:

    1. Niek F. de Jonge & Joris J. R. Louwen & Elena Chekmeneva & Stephane Camuzeaux & Femke J. Vermeir & Robert S. Jansen & Florian Huber & Justin J. J. van der Hooft, 2023. "MS2Query: reliable and scalable MS2 mass spectra-based analogue search," Nature Communications, Nature, vol. 14(1), pages 1-12, December.
    2. Nicholas J. Morehouse & Trevor N. Clark & Emily J. McMann & Jeffrey A. Santen & F. P. Jake Haeckl & Christopher A. Gray & Roger G. Linington, 2023. "Annotation of natural product compound families using molecular networking topology and structural similarity fingerprinting," Nature Communications, Nature, vol. 14(1), pages 1-10, December.
    3. Daniel G. C. Treen & Mingxun Wang & Shipei Xing & Katherine B. Louie & Tao Huan & Pieter C. Dorrestein & Trent R. Northen & Benjamin P. Bowen, 2022. "SIMILE enables alignment of tandem mass spectra with statistical significance," Nature Communications, Nature, vol. 13(1), pages 1-10, December.
    4. Zhiwei Zhou & Mingdu Luo & Haosong Zhang & Yandong Yin & Yuping Cai & Zheng-Jiang Zhu, 2022. "Metabolite annotation from knowns to unknowns through knowledge-guided multi-layer metabolic networking," Nature Communications, Nature, vol. 13(1), pages 1-15, December.
    5. Qiong Yang & Hongchao Ji & Zhenbo Xu & Yiming Li & Pingshan Wang & Jinyu Sun & Xiaqiong Fan & Hailiang Zhang & Hongmei Lu & Zhimin Zhang, 2023. "Ultra-fast and accurate electron ionization mass spectrum matching for compound identification with million-scale in-silico library," Nature Communications, Nature, vol. 14(1), pages 1-11, December.
    6. Wout Bittremieux & Nicole E. Avalon & Sydney P. Thomas & Sarvar A. Kakhkhorov & Alexander A. Aksenov & Paulo Wender P. Gomes & Christine M. Aceves & Andrés Mauricio Caraballo-Rodríguez & Julia M. Gaug, 2023. "Open access repository-scale propagated nearest neighbor suspect spectral library for untargeted metabolomics," Nature Communications, Nature, vol. 14(1), pages 1-15, December.
