TreeCluster: Clustering biological sequences using phylogenetic trees

Authors

  • Metin Balaban
  • Niema Moshiri
  • Uyen Mai
  • Xingfan Jia
  • Siavash Mirarab

Abstract

Clustering homologous sequences based on their similarity is a problem that appears in many bioinformatics applications. The fact that sequences cluster is ultimately the result of their phylogenetic relationships. Despite this observation and the natural ways in which a tree can define clusters, most applications of sequence clustering do not use a phylogenetic tree and instead operate on pairwise sequence distances. Due to advances in large-scale phylogenetic inference, we argue that tree-based clustering is under-utilized. We define a family of optimization problems that, given an arbitrary tree, return the minimum number of clusters such that all clusters adhere to constraints on their heterogeneity. We study three specific constraints, limiting (1) the diameter of each cluster, (2) the sum of its branch lengths, or (3) chains of pairwise distances. These three problems can be solved in time that increases linearly with the size of the tree, and for two of the three criteria, the algorithms have been known in the theoretical computer science literature. We implement these algorithms in a tool called TreeCluster, which we test on three applications: OTU clustering for microbiome data, HIV transmission clustering, and divide-and-conquer multiple sequence alignment. We show that, by using tree-based distances, TreeCluster generates more internally consistent clusters than alternatives and improves the effectiveness of downstream applications. TreeCluster is available at https://github.com/niemasd/TreeCluster.
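
To make the first constraint (limiting the diameter of each cluster) concrete, the following is a minimal sketch of a greedy bottom-up cut rule, not TreeCluster's own code or API. It assumes a rooted tree stored as a plain dictionary, measures diameter as the largest leaf-to-leaf path length within a cluster, and uses hypothetical names throughout (`max_diameter_clusters`, `EXAMPLE_TREE`, `threshold`). Because it sorts the children of each node, it is only linear-time for trees of bounded degree.

```python
from typing import Dict, List, Tuple

# Hypothetical tree encoding: node -> list of (child, branch_length).
# Nodes that do not appear as keys (or map to an empty list) are leaves.
Tree = Dict[str, List[Tuple[str, float]]]


def max_diameter_clusters(tree: Tree, root: str, threshold: float) -> List[List[str]]:
    """Greedily cut edges so every cluster's leaf-to-leaf diameter is <= threshold."""
    # Pre-order traversal; reversing it visits children before their parents.
    order: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(child for child, _ in tree.get(node, []))

    height: Dict[str, float] = {}  # farthest still-attached leaf below each node
    cut = set()                    # children whose parent edge has been cut
    for node in reversed(order):
        children = tree.get(node, [])
        if not children:
            height[node] = 0.0
            continue
        # Distance from this node to the farthest leaf through each child.
        reach = sorted(((height[c] + w, c) for c, w in children), reverse=True)
        i = 0
        # While the two deepest branches form a path longer than the
        # threshold, detach the deepest one.
        while i + 1 < len(reach) and reach[i][0] + reach[i + 1][0] > threshold:
            cut.add(reach[i][1])
            i += 1
        height[node] = reach[i][0]

    # Label connected components top-down and collect their leaves.
    component = {root: root}
    clusters: Dict[str, List[str]] = {}
    for node in order:
        children = tree.get(node, [])
        if not children:
            clusters.setdefault(component[node], []).append(node)
        for child, _ in children:
            component[child] = child if child in cut else component[node]
    return list(clusters.values())


# Illustrative tree, in Newick notation: ((A:1,B:1)X:1,(C:1,D:1)Y:5)R;
EXAMPLE_TREE: Tree = {
    "R": [("X", 1.0), ("Y", 5.0)],
    "X": [("A", 1.0), ("B", 1.0)],
    "Y": [("C", 1.0), ("D", 1.0)],
}
```

With a threshold of 3.0, `max_diameter_clusters(EXAMPLE_TREE, "R", 3.0)` separates the leaves into the groups {A, B} and {C, D}, because the path from A to C has length 8 while each sibling pair is only 2 apart. A threshold of 10.0 keeps all four leaves in one cluster.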

Suggested Citation

  • Metin Balaban & Niema Moshiri & Uyen Mai & Xingfan Jia & Siavash Mirarab, 2019. "TreeCluster: Clustering biological sequences using phylogenetic trees," PLOS ONE, Public Library of Science, vol. 14(8), pages 1-20, August.
  • Handle: RePEc:plo:pone00:0221068
    DOI: 10.1371/journal.pone.0221068

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0221068
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0221068&type=printable
    Download Restriction: no


    Citations



    Cited by:

    1. Sung Hee Ko & Pierce Radecki & Frida Belinky & Jinal N. Bhiman & Susan Meiring & Jackie Kleynhans & Daniel Amoako & Vanessa Guerra Canedo & Margaret Lucas & Dikeledi Kekana & Neil Martinson & Limakats, 2024. "Rapid intra-host diversification and evolution of SARS-CoV-2 in advanced HIV infection," Nature Communications, Nature, vol. 15(1), pages 1-14, December.
    2. Paul O. Sheridan & Yiyu Meng & Tom A. Williams & Cécile Gubry-Rangin, 2023. "Genomics of soil depth niche partitioning in the Thaumarchaeota family Gagatemarchaeaceae," Nature Communications, Nature, vol. 14(1), pages 1-14, December.
    3. Ning Zhang & Luuk Harbers & Michele Simonetti & Constantin Diekmann & Quentin Verron & Enrico Berrino & Sara E. Bellomo & Gabriel M. C. Longo & Michael Ratz & Niklas Schultz & Firas Tarish & Peng Su &, 2024. "High clonal diversity and spatial genetic admixture in early prostate cancer and surrounding normal tissue," Nature Communications, Nature, vol. 15(1), pages 1-17, December.
    4. Matteo Ciciani & Michele Demozzi & Eleonora Pedrazzoli & Elisabetta Visentin & Laura Pezzè & Lorenzo Federico Signorini & Aitor Blanco-Miguez & Moreno Zolfo & Francesco Asnicar & Antonio Casini & Anna, 2022. "Automated identification of sequence-tailored Cas9 proteins using massive metagenomic data," Nature Communications, Nature, vol. 13(1), pages 1-8, December.
