
Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics

Authors
  • Lisa M Breckels
  • Sean B Holden
  • David Wojnar
  • Claire M Mulvey
  • Andy Christoforou
  • Arnoud Groen
  • Matthew W B Trotter
  • Oliver Kohlbacher
  • Kathryn S Lilley
  • Laurent Gatto

Abstract

Sub-cellular localisation of proteins is an essential post-translational regulatory mechanism that can be assayed using high-throughput mass spectrometry (MS). These MS-based spatial proteomics experiments enable us to pinpoint the sub-cellular distribution of thousands of proteins in a specific system under controlled conditions. Recent advances in high-throughput MS methods have yielded a plethora of experimental spatial proteomics data for the cell biology community. Yet there are many third-party data sources, such as immunofluorescence microscopy or protein annotations and sequences, which represent a rich and vast source of complementary information. We present a unique transfer learning classification framework that utilises a nearest-neighbour or support vector machine system to integrate heterogeneous data sources and considerably improve on the quantity and quality of sub-cellular protein assignment. We demonstrate the utility of our algorithms through evaluation of five experimental datasets from four different species, in conjunction with four different auxiliary data sources, to classify proteins to tens of sub-cellular compartments with high generalisation accuracy. We further apply the method to an experiment on pluripotent mouse embryonic stem cells to classify a set of previously unknown proteins, and validate our findings against a recent high-resolution map of the mouse stem cell proteome. The methodology is distributed as part of the open-source Bioconductor pRoloc suite for spatial proteomics data analysis.

Author Summary

Sub-cellular localisation of proteins is critical to their function in all cellular processes; proteins localising to their intended micro-environment, e.g. organelles, vesicles or macro-molecular complexes, will meet the interaction partners and biochemical conditions suitable to pursue their molecular function. Sound data and methods that the cell biology community can rely on to study protein localisation reliably and systematically, and hence also mis-localisation and the disruption of protein trafficking, are therefore essential. Here we present a method to infer protein localisation relying on the optimal integration of experimental mass spectrometry-based data and auxiliary sources, such as GO annotation, outputs from third-party software, protein-protein interactions or immunocytochemistry data. We found that the application of transfer learning algorithms across these diverse data sources considerably improves on the quantity and reliability of sub-cellular protein assignment, compared to single-data classifiers previously applied to infer sub-cellular localisation using experimental data only. We show that our method does not compromise biologically relevant, experiment-specific signal after integration with heterogeneous, freely available third-party resources. The integration of different data sources is an important challenge in the data-intensive world of biology, and we anticipate that the transfer learning methods presented here will prove useful in many areas of biology to unify data obtained from different but complementary sources.
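To make the idea of nearest-neighbour transfer learning concrete, the following Python sketch combines neighbour votes from a primary (experimental) source and an auxiliary source using one weight per class. It is an illustration only and is not the pRoloc implementation: the function names, the toy data, and the specific weighting scheme (a per-class weight `theta[c]` between 0 and 1) are assumptions made for this example, and the abstract itself does not specify these details.

```python
from collections import Counter
from math import dist

def knn_votes(query, points, labels, k):
    """Count class labels among the k nearest points (Euclidean distance)."""
    nearest = sorted(range(len(points)), key=lambda i: dist(query, points[i]))[:k]
    return Counter(labels[i] for i in nearest)

def transfer_knn_predict(q_primary, q_auxiliary, primary, auxiliary, labels,
                         theta, k_primary=3, k_auxiliary=3):
    """Combine neighbour votes from two data sources with per-class weights.

    Hypothetical weighting: theta[c] = 1 trusts only the primary source for
    class c, theta[c] = 0 trusts only the auxiliary source.
    """
    votes_p = knn_votes(q_primary, primary, labels, k_primary)
    votes_a = knn_votes(q_auxiliary, auxiliary, labels, k_auxiliary)
    scores = {c: theta[c] * votes_p[c] + (1 - theta[c]) * votes_a[c]
              for c in sorted(set(labels))}
    return max(scores, key=scores.get), scores

# Toy labelled proteins (invented values): each has a primary profile
# (e.g. abundance across two fractions) and an auxiliary binary vector
# (e.g. presence of two annotation terms).
primary = [(0.9, 0.1), (0.8, 0.2), (0.85, 0.15),
           (0.1, 0.9), (0.2, 0.8), (0.15, 0.85)]
auxiliary = [(1, 0), (1, 0), (1, 0), (0, 1), (0, 1), (0, 1)]
labels = ["mito", "mito", "mito", "ER", "ER", "ER"]

# An unlabelled protein whose two sources disagree.
query_primary, query_auxiliary = (0.4, 0.6), (1, 0)

primary_only = transfer_knn_predict(query_primary, query_auxiliary, primary,
                                    auxiliary, labels, {"mito": 1.0, "ER": 1.0})
auxiliary_only = transfer_knn_predict(query_primary, query_auxiliary, primary,
                                      auxiliary, labels, {"mito": 0.0, "ER": 0.0})
print(primary_only[0], auxiliary_only[0])
```

In this sketch the class weights decide which source dominates when the two disagree. In practice such weights would have to be tuned on labelled marker proteins, for example by a cross-validated grid search; that tuning step is omitted here.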

Suggested Citation

  • Lisa M Breckels & Sean B Holden & David Wojnar & Claire M Mulvey & Andy Christoforou & Arnoud Groen & Matthew W B Trotter & Oliver Kohlbacher & Kathryn S Lilley & Laurent Gatto, 2016. "Learning from Heterogeneous Data Sources: An Application in Spatial Proteomics," PLOS Computational Biology, Public Library of Science, vol. 12(5), pages 1-26, May.
  • Handle: RePEc:plo:pcbi00:1004920
    DOI: 10.1371/journal.pcbi.1004920

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004920
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1004920&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Jens S. Andersen & Christopher J. Wilkinson & Thibault Mayor & Peter Mortensen & Erich A. Nigg & Matthias Mann, 2003. "Proteomic characterization of the human centrosome by protein correlation profiling," Nature, Nature, vol. 426(6966), pages 570-574, December.
    2. Morik, Katharina & Brockhausen, Peter & Joachims, Thorsten, 1999. "Combining statistical learning with a knowledge-based approach: A case study in intensive care monitoring," Technical Reports 1999,24, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    3. Andy Christoforou & Claire M. Mulvey & Lisa M. Breckels & Aikaterini Geladaki & Tracey Hurrell & Penelope C. Hayward & Thomas Naake & Laurent Gatto & Rosa Viner & Alfonso Martinez Arias & Kathryn S. L, 2016. "A draft map of the mouse pluripotent stem cell spatial proteome," Nature Communications, Nature, vol. 7(1), pages 1-12, April.

    Citations


    Cited by:

    1. Oliver M Crook & Aikaterini Geladaki & Daniel J H Nightingale & Owen L Vennard & Kathryn S Lilley & Laurent Gatto & Paul D W Kirk, 2020. "A semi-supervised Bayesian approach for simultaneous protein sub-cellular localisation assignment and novelty detection," PLOS Computational Biology, Public Library of Science, vol. 16(11), pages 1-21, November.
    2. Oliver M. Crook & Colin T. R. Davies & Lisa M. Breckels & Josie A. Christopher & Laurent Gatto & Paul D. W. Kirk & Kathryn S. Lilley, 2022. "Inferring differential subcellular localisation in comparative spatial proteomics using BANDLE," Nature Communications, Nature, vol. 13(1), pages 1-21, December.
    3. Octavio R. Salazar & Ke Chen & Vanessa J. Melino & Muppala P. Reddy & Eva Hřibová & Jana Čížková & Denisa Beránková & Juan Pablo Arciniegas Vega & Lina María Cáceres Leal & Manuel Aranda & Lukasz Jare, 2024. "SOS1 tonoplast neo-localization and the RGG protein SALTY are important in the extreme salinity tolerance of Salicornia bigelovii," Nature Communications, Nature, vol. 15(1), pages 1-21, December.
    4. Oliver M Crook & Claire M Mulvey & Paul D W Kirk & Kathryn S Lilley & Laurent Gatto, 2018. "A Bayesian mixture modelling approach for spatial proteomics," PLOS Computational Biology, Public Library of Science, vol. 14(11), pages 1-29, November.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Oliver M Crook & Aikaterini Geladaki & Daniel J H Nightingale & Owen L Vennard & Kathryn S Lilley & Laurent Gatto & Paul D W Kirk, 2020. "A semi-supervised Bayesian approach for simultaneous protein sub-cellular localisation assignment and novelty detection," PLOS Computational Biology, Public Library of Science, vol. 16(11), pages 1-21, November.
    2. Oliver M Crook & Claire M Mulvey & Paul D W Kirk & Kathryn S Lilley & Laurent Gatto, 2018. "A Bayesian mixture modelling approach for spatial proteomics," PLOS Computational Biology, Public Library of Science, vol. 14(11), pages 1-29, November.
    3. Aurélien Naldi & Romain M Larive & Urszula Czerwinska & Serge Urbach & Philippe Montcourrier & Christian Roy & Jérôme Solassol & Gilles Freiss & Peter J Coopman & Ovidiu Radulescu, 2017. "Reconstruction and signal propagation analysis of the Syk signaling network in breast cancer cells," PLOS Computational Biology, Public Library of Science, vol. 13(3), pages 1-27, March.
    4. Ying Zhu & Kerem Can Akkaya & Julia Ruta & Nanako Yokoyama & Cong Wang & Max Ruwolt & Diogo Borges Lima & Martin Lehmann & Fan Liu, 2024. "Cross-link assisted spatial proteomics to map sub-organelle proteomes and membrane protein topologies," Nature Communications, Nature, vol. 15(1), pages 1-18, December.
    5. Ana Martinez-Val & Dorte B. Bekker-Jensen & Sophia Steigerwald & Claire Koenig & Ole Østergaard & Adi Mehta & Trung Tran & Krzysztof Sikorski & Estefanía Torres-Vega & Ewa Kwasniewicz & Sólveig Hlín B, 2021. "Spatial-proteomics reveals phospho-signaling dynamics at subcellular resolution," Nature Communications, Nature, vol. 12(1), pages 1-17, December.
    6. Hans J C T Wessels & Rutger O Vogel & Robert N Lightowlers & Johannes N Spelbrink & Richard J Rodenburg & Lambert P van den Heuvel & Alain J van Gool & Jolein Gloerich & Jan A M Smeitink & Leo G Nijtm, 2013. "Analysis of 953 Human Proteins from a Mitochondrial HEK293 Fraction by Complexome Profiling," PLOS ONE, Public Library of Science, vol. 8(7), pages 1-14, July.
    7. Bo Huang & Chenglin Xie & Richard Tay & Bo Wu, 2009. "Land-Use-Change Modeling Using Unbalanced Support-Vector Machines," Environment and Planning B, , vol. 36(3), pages 398-416, June.
    8. Oliver M. Crook & Colin T. R. Davies & Lisa M. Breckels & Josie A. Christopher & Laurent Gatto & Paul D. W. Kirk & Kathryn S. Lilley, 2022. "Inferring differential subcellular localisation in comparative spatial proteomics using BANDLE," Nature Communications, Nature, vol. 13(1), pages 1-21, December.
    9. Nicola M. Moloney & Konstantin Barylyuk & Eelco Tromer & Oliver M. Crook & Lisa M. Breckels & Kathryn S. Lilley & Ross F. Waller & Paula MacGregor, 2023. "Mapping diversity in African trypanosomes using high resolution spatial proteomics," Nature Communications, Nature, vol. 14(1), pages 1-16, December.
