
Development of an experiment-split method for benchmarking the generalization of a PTM site predictor: Lysine methylome as an example

Author

Listed:
  • Guoyang Zou
  • Yang Zou
  • Chenglong Ma
  • Jiaojiao Zhao
  • Lei Li

Abstract

Many computational classifiers have been developed to predict different types of post-translational modification (PTM) sites. Their performance is usually measured by cross-validation or an independent test, in which experimental data from different sources are pooled and randomly split into training and test sets. However, the self-reported performance of most classifiers under this protocol is generally higher than their performance on new experimental data, which suggests that cross-validation overestimates a classifier's generalization ability. Here, we propose a generalization estimate, dubbed the experiment-split test, in which the experimental sources of the training set differ from those of the test set, so that the test set simulates data derived from a new experiment. Taking the prediction of the lysine methylome (Kme) as an example, we developed a deep learning-based Kme site predictor, DeepKme, with outstanding performance. We assessed the experiment-split test by comparing it with cross-validation and found that the performance measured by the experiment-split test is lower than that measured by cross-validation. Because the test data in the experiment-split method come from an independent experimental source, this method can reflect the generalization of the predictor. We therefore believe that the experiment-split method can be applied to benchmark the practical performance of a given PTM model. DeepKme is freely accessible at https://github.com/guoyangzou/DeepKme.

Author summary: The performance of a model for predicting post-translational modification sites is commonly evaluated by cross-validation, in which data derived from different experimental sources are pooled and randomly separated into a training dataset and a validation dataset. However, the performance measured by cross-validation is generally higher than the performance on new experimental data, indicating that cross-validation overestimates the generalization of a model. In this study, we propose a generalization estimate, dubbed the experiment-split test, in which the experimental sources of the training set differ from those of the test set, so that the test set simulates data derived from a new experiment. Taking the prediction of the lysine methylome as an example, we developed a deep learning-based Kme site predictor, DeepKme, with outstanding performance. We found that the performance measured by the experiment-split method is lower than that measured by cross-validation. Because the test data in the experiment-split method come from an independent experimental source, this method can reflect the generalization of the prediction model. The experiment-split method can therefore be applied to benchmark practical prediction performance.
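The difference between the two evaluation protocols described in the abstract can be illustrated with a minimal sketch. This is not the DeepKme code: the record layout, the site identifiers, the source labels (`exp_0` to `exp_3`), and the function names are all hypothetical, and the sketch ignores the case where one site is reported by several experiments. It only shows that a random split leaves every experimental source on both sides of the split, whereas an experiment split keeps the held-out source entirely out of training.

```python
import random

def random_split(records, test_fraction=0.2, seed=0):
    """Conventional protocol: pool all sources, shuffle, then cut."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def experiment_split(records, held_out_sources):
    """Experiment-split protocol: all records from the held-out
    experimental sources go to the test set, the rest to training."""
    held_out = set(held_out_sources)
    train = [r for r in records if r["source"] not in held_out]
    test = [r for r in records if r["source"] in held_out]
    return train, test

def shared_sources(train, test):
    """Experimental sources that appear on both sides of a split."""
    return {r["source"] for r in train} & {r["source"] for r in test}

# Toy data (hypothetical): 40 candidate sites from 4 experiments.
records = [
    {"site": f"P{i:03d}_K{10 + i}", "source": f"exp_{i % 4}", "label": i % 2}
    for i in range(40)
]

rand_train, rand_test = random_split(records)
exp_train, exp_test = experiment_split(records, held_out_sources=["exp_3"])

print("random split, shared sources:", sorted(shared_sources(rand_train, rand_test)))
print("experiment split, shared sources:", sorted(shared_sources(exp_train, exp_test)))
```

In the random split the test set shares experimental sources with the training set, so any source-specific bias is visible to the model during training. In the experiment split the test set behaves like data from an experiment the model has never seen, which is the situation the abstract argues a benchmark should simulate.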

Suggested Citation

  • Guoyang Zou & Yang Zou & Chenglong Ma & Jiaojiao Zhao & Lei Li, 2021. "Development of an experiment-split method for benchmarking the generalization of a PTM site predictor: Lysine methylome as an example," PLOS Computational Biology, Public Library of Science, vol. 17(12), pages 1-14, December.
  • Handle: RePEc:plo:pcbi00:1009682
    DOI: 10.1371/journal.pcbi.1009682

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009682
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1009682&type=printable
    Download Restriction: no


