Printed from https://ideas.repec.org/a/gam/jdataj/v1y2016i3p19-d85033.html

Application of Taxonomic Modeling to Microbiota Data Mining for Detection of Helminth Infection in Global Populations

Author

Listed:
  • Mahbaneh Eshaghzadeh Torbati

    (Department of Computer Science, University of Pittsburgh, 6135 Sennott Square, 210 S Bouquet St, Pittsburgh, PA 15260-9161, USA)

  • Makedonka Mitreva

    (Department of Medicine, Washington University School of Medicine, 660 S Euclid Ave, St. Louis, MO 63110, USA)

  • Vanathi Gopalakrishnan

    (Department of Biomedical Informatics, University of Pittsburgh, 5607 Baum Boulevard, Suite 500, Pittsburgh, PA 15206-3701, USA)

Abstract

Human microbiome data from genomic sequencing technologies are accumulating rapidly, giving us insights into bacterial taxa that contribute to health and disease. Predictive modeling of such microbiota count data to classify human infection by parasitic worms (helminths) can help in the detection and management of infection across global populations. Real-world microbiome datasets are typically sparse: they contain hundreds of measurements for bacterial species, of which only a few are detected in the bio-specimens that are analyzed. This sparsity creates a need for more observations for accurate predictive modeling, and it has previously been addressed with various feature-reduction methods. To our knowledge, integrative methods such as transfer learning have not yet been explored in the microbiome domain as a way to deal with data sparsity by incorporating knowledge from different but related datasets. One way of incorporating this knowledge is through a meaningful mapping among the features of these datasets. In this paper, we claim that such a mapping exists among the members of each cluster, where clusters are formed on the basis of phylogenetic dependency among taxa and their association with the phenotype. We validate our claim by showing that models incorporating associations in such a grouped feature space show no deterioration in performance on the given classification task. We test our hypothesis using classification models that detect helminth infection in the microbiota of human fecal samples obtained from Indonesia and Liberia. In our experiments, we first learn binary classifiers for helminth infection detection using Naive Bayes, Support Vector Machines, Multilayer Perceptrons, and Random Forest methods. In the next step, we add taxonomic modeling by using the SMART-scan module to group the data, and learn classifiers with the same four methods to test the validity of the resulting groupings. We observed performance improvements of 6% to 23% in the area under the receiver operating characteristic (ROC) curve (AUC) and 7% to 26% in balanced accuracy (Bacc), over 10 runs of 10-fold cross-validation. These results show that using phylogenetic dependency to group our microbiota data yields a noticeable improvement in classification performance for helminth infection detection. The promising results of this feasibility study demonstrate that methods such as SMART-scan can be used in the future for knowledge transfer between different but related microbiome datasets through phylogenetically related functional mapping, enabling novel integrative biomarker discovery.
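
The general idea of grouping sparse taxon counts and scoring a binary classifier with AUC and balanced accuracy can be illustrated with a minimal Python sketch. This is not the SMART-scan module and not the authors' pipeline: the taxon names, the taxon-to-group mapping, and the helper names (`group_counts`, `balanced_accuracy`, `auc`) are hypothetical and chosen only for illustration. The abstract does not describe how SMART-scan derives its groups, so the mapping is simply assumed to be given.

```python
from collections import defaultdict


def group_counts(sample, taxon_to_group):
    """Sum per-taxon counts into per-group counts for one sample.

    `taxon_to_group` is an assumed, externally supplied mapping; taxa that
    are missing from it are kept as singleton groups under their own name.
    """
    grouped = defaultdict(int)
    for taxon, count in sample.items():
        grouped[taxon_to_group.get(taxon, taxon)] += count
    return dict(grouped)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = infected)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correctly ranked pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical sparse sample: most taxa have low or zero counts.
sample = {"taxon_a": 3, "taxon_b": 0, "taxon_c": 5, "taxon_d": 2}
mapping = {"taxon_a": "group_1", "taxon_b": "group_1", "taxon_c": "group_2"}
grouped = group_counts(sample, mapping)

# Hypothetical labels, hard predictions, and scores from some classifier.
bacc = balanced_accuracy([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1])
area = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Balanced accuracy is reported alongside AUC because it averages the per-class recall, so it is not inflated when infected and uninfected samples are unevenly represented. In a repeated cross-validation setting such as the one described above, these metrics would be computed per fold and then averaged over folds and runs.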

Suggested Citation

  • Mahbaneh Eshaghzadeh Torbati & Makedonka Mitreva & Vanathi Gopalakrishnan, 2016. "Application of Taxonomic Modeling to Microbiota Data Mining for Detection of Helminth Infection in Global Populations," Data, MDPI, vol. 1(3), pages 1-14, December.
  • Handle: RePEc:gam:jdataj:v:1:y:2016:i:3:p:19-:d:85033

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/1/3/19/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/1/3/19/
    Download Restriction: no


    Citations

    Cited by:

    1. Simone Fiori & Andrea Vitali, 2019. "Statistical Modeling of Trivariate Static Systems: Isotonic Models," Data, MDPI, vol. 4(1), pages 1-29, January.

