Authors
- Lucija Brezočnik
(Faculty of Electrical Engineering and Computer Science, University of Maribor, SI-2000 Maribor, Slovenia)
- Tanja Žlender
(National Laboratory of Health, Environment and Food, Centre for Medical Microbiology, Department for Microbiological Research, SI-2000 Maribor, Slovenia)
- Maja Rupnik
(National Laboratory of Health, Environment and Food, Centre for Medical Microbiology, Department for Microbiological Research, SI-2000 Maribor, Slovenia; Faculty of Medicine, University of Maribor, SI-2000 Maribor, Slovenia)
- Vili Podgorelec
(Faculty of Electrical Engineering and Computer Science, University of Maribor, SI-2000 Maribor, Slovenia)
Abstract
Microbiota analysis can provide valuable insights in various fields, including diet and nutrition, health and disease, and environmental contexts, such as understanding the role of microorganisms in different ecosystems. Its results can support targeted therapies, personalized medicine, or the detection of environmental contaminants. In our research, we examined the gut microbiota of 16 animal taxa, including humans, as well as the microbiota of cattle and pig manure, focusing on the V3-V4 hypervariable regions of 16S rRNA. Analyzing these regions is common in microbiome studies but can be challenging because the results are high-dimensional. Thus, we utilized machine learning techniques and demonstrated their applicability in processing microbial sequence data. Moreover, we showed that techniques commonly employed in natural language processing can be adapted for analyzing microbial text vectors. We obtained these vectors through frequency analyses and applied the proposed hierarchical clustering method to them. All steps in this study were gathered in a proposed microbial sequence data processing pipeline. The results demonstrate that we not only found similarities between samples but also grouped the samples into semantically related clusters. We also compared our method with other well-known algorithms, namely K-means and Spectral Clustering, using clustering evaluation metrics. On these metrics, the proposed method outperformed both. Moreover, the proposed microbial sequence data pipeline can be utilized for different types of microbiota, such as oral, gut, and skin, demonstrating its reusability and robustness.
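The general idea in the abstract (sequences treated as text, frequency vectors, then hierarchical clustering) can be illustrated with a minimal sketch. This is not the authors' pipeline: the use of overlapping k-mers as "words", the value k=3, cosine distance, average linkage, the function names, and the toy reads are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations
import math


def kmer_counts(sequence, k=3):
    """Count overlapping k-mers (treated as 'words') in one sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def sample_vector(sequences, k=3):
    """Sum k-mer counts over all reads of one sample (the 'document')."""
    total = Counter()
    for seq in sequences:
        total.update(kmer_counts(seq, k))
    return total


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def average_linkage(vectors, n_clusters):
    """Naive agglomerative clustering; returns a list of sets of indices."""
    clusters = [{i} for i in range(len(vectors))]
    dist = {
        (i, j): cosine_distance(vectors[i], vectors[j])
        for i, j in combinations(range(len(vectors)), 2)
    }

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        pairs = [dist[min(i, j), max(i, j)] for i in c1 for j in c2]
        return sum(pairs) / len(pairs)

    while len(clusters) > n_clusters:
        # Merge the two closest clusters.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[a] | clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters


# Hypothetical toy reads, not data from the study.
samples = {
    "s1": ["ACGTACGTAC", "ACGTACGGAC"],
    "s2": ["ACGTACGTAA", "ACGTTCGTAC"],
    "s3": ["GGGCCCGGGC", "GGCCCGGGCC"],
    "s4": ["GGGCCCGGGA", "CCCGGGCCCG"],
}
names = list(samples)
vectors = [sample_vector(samples[name]) for name in names]
clusters = average_linkage(vectors, n_clusters=2)
for cluster in clusters:
    print(sorted(names[i] for i in cluster))
```

On real data, a library implementation of hierarchical clustering and of the evaluation metrics would normally replace the naive loop above, which is kept dependency-free here only so the sketch is self-contained.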
Suggested Citation
Lucija Brezočnik & Tanja Žlender & Maja Rupnik & Vili Podgorelec, 2024.
"Using Machine Learning and Natural Language Processing for Unveiling Similarities between Microbial Data,"
Mathematics, MDPI, vol. 12(17), pages 1-20, August.
Handle:
RePEc:gam:jmathe:v:12:y:2024:i:17:p:2717-:d:1468218