Authors
- Domonkos Tikk
- Philippe Thomas
- Peter Palaga
- Jörg Hakenberg
- Ulf Leser
Abstract
The most important way of conveying new findings in biomedical research is scientific publication. Extraction of protein–protein interactions (PPIs) reported in scientific publications is one of the core topics of text mining in the life sciences. Recently, a new class of such methods has been proposed: convolution kernels that identify PPIs using deep parses of sentences. However, comparing published results of different PPI extraction methods is impossible due to the use of different evaluation corpora, different evaluation metrics, different tuning procedures, etc. In this paper, we study whether the reported performance metrics are robust across different corpora and learning settings and whether the use of deep parsing actually leads to an increase in extraction quality. Our ultimate goal is to identify the one method that performs best in real-life scenarios, where information extraction is performed on unseen text and not on specifically prepared evaluation data. We performed a comprehensive benchmarking of nine different methods for PPI extraction that use convolution kernels on rich linguistic information. Methods were evaluated on five different public corpora using cross-validation, cross-learning, and cross-corpus evaluation. Our study confirms that kernels using dependency trees generally outperform kernels based on syntax trees. However, our study also shows that only the best kernel methods can compete with a simple rule-based approach when the evaluation prevents information leakage between training and test corpora. Our results further reveal that the F-score of many approaches drops significantly if no corpus-specific parameter optimization is applied and that methods reaching a good AUC score often perform much worse in terms of F-score. We conclude that for most kernels no sensible estimation of PPI extraction performance on new text is possible, given the current heterogeneity in evaluation data.
Nevertheless, our study shows that three kernels are clearly superior to the other methods.
Author Summary: The most important way of conveying new findings in biomedical research is scientific publication. In turn, the most recent and most important findings can only be found by carefully reading the scientific literature, which becomes more and more of a problem because of the enormous number of published articles. This situation has led to the development of various computational approaches to the automatic extraction of important facts from articles, mostly concentrating on the recognition of protein names and on interactions between proteins (PPI). However, so far there is little agreement on which methods perform best for which task. Our paper reports on an extensive comparison of nine recent PPI extraction tools. We studied their performance in various settings on a set of five different text collections containing articles describing PPIs, which for the first time allows for an unbiased comparison of their respective effectiveness. Our results show that the tools' performance depends largely on the collection they are trained on and the collection they are then evaluated on, which means that extrapolating their measured performance to arbitrary text is still highly problematic. We also show that certain classes of methods for extracting PPIs are clearly superior to other classes.
Suggested Citation
Domonkos Tikk & Philippe Thomas & Peter Palaga & Jörg Hakenberg & Ulf Leser, 2010.
"A Comprehensive Benchmark of Kernel Methods to Extract Protein–Protein Interactions from Literature,"
PLOS Computational Biology, Public Library of Science, vol. 6(7), pages 1-19, July.
Handle:
RePEc:plo:pcbi00:1000837
DOI: 10.1371/journal.pcbi.1000837
Citations
Cited by:
- Behrouz Bokharaeian & Alberto Diaz & Hamidreza Chitsaz, 2016.
"Enhancing Extraction of Drug-Drug Interaction from Literature Using Neutral Candidates, Negation, and Clause Dependency,"
PLOS ONE, Public Library of Science, vol. 11(10), pages 1-20, October.
- Shandar Ahmad & Kenji Mizuguchi, 2011.
"Partner-Aware Prediction of Interacting Residues in Protein-Protein Complexes from Sequence Data,"
PLOS ONE, Public Library of Science, vol. 6(12), pages 1-11, December.
- Haibin Liu & Lawrence Hunter & Vlado Kešelj & Karin Verspoor, 2013.
"Approximate Subgraph Matching-Based Literature Mining for Biomedical Events and Relations,"
PLOS ONE, Public Library of Science, vol. 8(4), pages 1-16, April.
- Peng Su & Gang Li & Cathy Wu & K Vijay-Shanker, 2019.
"Using distant supervision to augment manually annotated data for relation extraction,"
PLOS ONE, Public Library of Science, vol. 14(7), pages 1-17, July.
- Kersten Döring & Ammar Qaseem & Michael Becer & Jianyu Li & Pankaj Mishra & Mingjie Gao & Pascal Kirchner & Florian Sauter & Kiran K Telukunta & Aurélien F A Moumbock & Philippe Thomas & Stefan Günthe, 2020.
"Automated recognition of functional compound-protein relationships in literature,"
PLOS ONE, Public Library of Science, vol. 15(3), pages 1-14, March.