Using Rule-Based Machine Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer Gene Expression Data

Author

Listed:
  • Enrico Glaab
  • Jaume Bacardit
  • Jonathan M Garibaldi
  • Natalio Krasnogor

Abstract

Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL’s classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.
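The abstract names pointwise mutual information (PMI) as the score used to compare gene rankings against the literature, but does not spell out the computation. The following is a minimal sketch of the standard PMI definition, PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ), estimated from document co-occurrence counts. The function names, the placeholder gene labels, and all counts are hypothetical and chosen only for illustration; the authors' actual text-mining pipeline may differ.

```python
import math


def pmi(n_joint, n_gene, n_term, n_docs):
    """Pointwise mutual information (base 2) from document counts.

    n_joint: documents mentioning both the gene and the disease term
    n_gene:  documents mentioning the gene
    n_term:  documents mentioning the disease term
    n_docs:  total documents in the corpus
    """
    if n_gene <= 0 or n_term <= 0 or n_docs <= 0:
        raise ValueError("marginal counts and corpus size must be positive")
    if n_joint == 0:
        # The pair never co-occurs, so the log ratio is unbounded below.
        return float("-inf")
    # p(x,y) / (p(x) p(y)) simplifies to n_joint * n_docs / (n_gene * n_term).
    return math.log2(n_joint * n_docs / (n_gene * n_term))


def mean_pmi(genes, joint_counts, gene_counts, n_term, n_docs):
    """Average the finite PMI scores over a list of top-ranked genes."""
    scores = [
        pmi(joint_counts[g], gene_counts[g], n_term, n_docs) for g in genes
    ]
    finite = [s for s in scores if math.isfinite(s)]
    return sum(finite) / len(finite) if finite else float("-inf")


# Hypothetical toy corpus: 1000 documents, 100 of which mention the disease term.
N_DOCS = 1000
N_TERM = 100
GENE_COUNTS = {"GENE_A": 100, "GENE_B": 50, "GENE_C": 200}
JOINT_COUNTS = {"GENE_A": 40, "GENE_B": 5, "GENE_C": 20}

# Two hypothetical top-2 rankings produced by different selection methods.
ranking_rules = ["GENE_A", "GENE_B"]
ranking_ensemble = ["GENE_B", "GENE_C"]

score_rules = mean_pmi(ranking_rules, JOINT_COUNTS, GENE_COUNTS, N_TERM, N_DOCS)
score_ensemble = mean_pmi(ranking_ensemble, JOINT_COUNTS, GENE_COUNTS, N_TERM, N_DOCS)
```

In this toy setup, a ranking with a higher mean PMI places genes at the top that co-occur with the disease term more often than chance would predict. Averaging only the finite scores is a simplification made here; a real analysis would need an explicit policy for gene-term pairs that never co-occur.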

Suggested Citation

  • Enrico Glaab & Jaume Bacardit & Jonathan M Garibaldi & Natalio Krasnogor, 2012. "Using Rule-Based Machine Learning for Candidate Disease Gene Prioritization and Sample Classification of Cancer Gene Expression Data," PLOS ONE, Public Library of Science, vol. 7(7), pages 1-18, July.
  • Handle: RePEc:plo:pone00:0039932
    DOI: 10.1371/journal.pone.0039932

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0039932
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0039932&type=printable
    Download Restriction: no


    Citations

    Cited by:

    1. Makoto Aoshima & Kazuyoshi Yata, 2019. "Distance-based classifier by data transformation for high-dimension, strongly spiked eigenvalue models," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 71(3), pages 473-503, June.
    2. Gao, Zhenguo & Wang, Xinye & Kang, Xiaoning, 2023. "Ensemble LDA via the modified Cholesky decomposition," Computational Statistics & Data Analysis, Elsevier, vol. 188(C).
    3. Nikos Vlassis & Enrico Glaab, 2015. "GenePEN: analysis of network activity alterations in complex diseases via the pairwise elastic net," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 14(2), pages 221-224, April.
    4. Kang, Xiaoning & Wang, Mingqiu, 2021. "Ensemble sparse estimation of covariance structure for exploring genetic disease data," Computational Statistics & Data Analysis, Elsevier, vol. 159(C).
    5. Patrick Murigu Kamau Njage & Clementine Henri & Pimlapas Leekitcharoenphon & Michel‐Yves Mistou & Rene S. Hendriksen & Tine Hald, 2019. "Machine Learning Methods as a Tool for Predicting Risk of Illness Applying Next‐Generation Sequencing Data," Risk Analysis, John Wiley & Sons, vol. 39(6), pages 1397-1413, June.
