Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference

Author

Listed:
  • Hui Peng

    (Nanyang Technological University)

  • He Wang

    (Nanyang Technological University)

  • Weijia Kong

    (Nanyang Technological University)

  • Jinyan Li

    (Chinese Academy of Sciences)

  • Wilson Wen Bin Goh

    (Nanyang Technological University)

Abstract

Identification of differentially expressed proteins in a proteomics workflow typically encompasses five key steps: raw data quantification, expression matrix construction, matrix normalization, missing value imputation (MVI), and differential expression analysis. The plethora of options in each step makes it challenging to identify optimal workflows that maximize the identification of differentially expressed proteins. To identify optimal workflows and their common properties, we conduct an extensive study involving 34,576 combinatoric experiments on 24 gold standard spike-in datasets. Applying frequent pattern mining techniques to top-ranked workflows, we uncover high-performing rules that demonstrate that optimality has conserved properties. Via machine learning, we confirm that optimal workflows are indeed predictable, with average cross-validation F1 scores and Matthews correlation coefficients surpassing 0.84. We introduce ensemble inference, which integrates results from individual top-performing workflows to expand differential proteome coverage and resolve inconsistencies. Ensemble inference provides gains in pAUC (up to 4.61%) and G-mean (up to 11.14%) and facilitates effective aggregation of information across varied quantification approaches such as topN, directLFQ, MaxLFQ intensities, and spectral counts. However, further development and evaluation are needed to establish acceptable frameworks for conducting ensemble inference on multiple proteomics workflows.
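
As a reader's aid, the sketch below shows how three of the metrics named in the abstract (F1 score, Matthews correlation coefficient, and G-mean) are commonly computed from confusion-matrix counts. It uses the standard textbook definitions, with G-mean taken as the geometric mean of sensitivity and specificity; the abstract does not state the exact definitions used in the paper, so this is an assumption. The function name `classification_metrics` and the example counts are hypothetical and are not taken from the study.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Return F1, MCC and G-mean from confusion-matrix counts.

    Assumes standard definitions; zero denominators yield 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # F1: harmonic mean of precision and recall.
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0

    # MCC: correlation between predicted and true binary labels.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # G-mean: geometric mean of sensitivity and specificity.
    g_mean = math.sqrt(recall * specificity)

    return {"f1": f1, "mcc": mcc, "g_mean": g_mean}


# Hypothetical spike-in example: 50 truly changed proteins among 1,000 tested.
example = classification_metrics(tp=40, fp=10, tn=940, fn=10)
```

In a spike-in benchmark, the truly changed proteins are known by design, so counts like these can be tallied directly from a workflow's list of proteins called differentially expressed.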

Suggested Citation

  • Hui Peng & He Wang & Weijia Kong & Jinyan Li & Wilson Wen Bin Goh, 2024. "Optimizing differential expression analysis for proteomics data via high-performing rules and ensemble inference," Nature Communications, Nature, vol. 15(1), pages 1-18, December.
  • Handle: RePEc:nat:natcom:v:15:y:2024:i:1:d:10.1038_s41467-024-47899-w
    DOI: 10.1038/s41467-024-47899-w

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-024-47899-w
    File Function: Abstract
    Download Restriction: no


    References listed on IDEAS

    1. Ferreira José A., 2007. "The Benjamini-Hochberg Method in the Case of Discrete Test Statistics," The International Journal of Biostatistics, De Gruyter, vol. 3(1), pages 1-18, July.
    2. Ronghui Lou & Ye Cao & Shanshan Li & Xiaoyu Lang & Yunxia Li & Yaoyang Zhang & Wenqing Shui, 2023. "Benchmarking commonly used software suites and analysis workflows for DIA proteomics and phosphoproteomics," Nature Communications, Nature, vol. 14(1), pages 1-17, December.
    3. Crookston, Nicholas L. & Finley, Andrew O., 2008. "yaImpute: An R Package for kNN Imputation," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 23(i10).
    4. Kevin L. Yang & Fengchao Yu & Guo Ci Teo & Kai Li & Vadim Demichev & Markus Ralser & Alexey I. Nesvizhskii, 2023. "MSBooster: improving peptide identification rates using deep learning-based features," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    5. Joseph G. Ibrahim & Ming-Hui Chen & Stuart R. Lipsitz & Amy H. Herring, 2005. "Missing-Data Methods for Generalized Linear Models: A Comparative Review," Journal of the American Statistical Association, American Statistical Association, vol. 100, pages 332-346, March.
    6. Ying Jiang & Aihua Sun & Yang Zhao & Wantao Ying & Huichuan Sun & Xinrong Yang & Baocai Xing & Wei Sun & Liangliang Ren & Bo Hu & Chaoying Li & Li Zhang & Guangrong Qin & Menghuan Zhang & Ning Chen & , 2019. "Proteomics identifies new therapeutic targets of early-stage hepatocellular carcinoma," Nature, Nature, vol. 567(7747), pages 257-261, March.
    7. Fengchao Yu & Guo Ci Teo & Andy T. Kong & Klemens Fröhlich & Ginny Xiaohe Li & Vadim Demichev & Alexey I. Nesvizhskii, 2023. "Analysis of DIA proteomics data using MSFragger-DIA and FragPipe computational platform," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    8. Tomi Suomi & Fatemeh Seyednasrollah & Maria K Jaakkola & Thomas Faux & Laura L Elo, 2017. "ROTS: An R package for reproducibility-optimized statistical testing," PLOS Computational Biology, Public Library of Science, vol. 13(5), pages 1-10, May.
    9. Mathias Kalxdorf & Torsten Müller & Oliver Stegle & Jeroen Krijgsveld, 2021. "IceR improves proteome coverage and data completeness in global and single-cell proteomics," Nature Communications, Nature, vol. 12(1), pages 1-15, December.
