
Convex Clustering: An Attractive Alternative to Hierarchical Clustering

Author

Listed:
  • Gary K Chen
  • Eric C Chi
  • John Michael O Ranola
  • Kenneth Lange

Abstract

The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/

Author Summary: Pattern discovery is one of the most important goals of data-driven research. In the biological sciences hierarchical clustering has achieved a position of pre-eminence due to its ability to capture multiple levels of data granularity. Hierarchical clustering’s visual displays of phylogenetic trees and gene-expression modules are indeed seductive. Despite its merits, hierarchical clustering is greedy by nature and often produces spurious clusters, particularly in the presence of substantial noise. This paper presents a relatively new alternative to hierarchical clustering known as convex clustering. Although convex clustering is more computationally demanding, it enjoys several advantages over hierarchical clustering and other traditional methods of clustering. Convex clustering delivers a uniquely defined clustering path that partially obviates the need for choosing an optimal number of clusters. Along the path small clusters gradually coalesce to form larger clusters. Clustering can be guided by external information through appropriately defined similarity weights. Comparisons to hierarchical clustering demonstrate the superior robustness of convex clustering to noise. Our genetics examples include inference of the demographic history of 52 populations across the world, a more detailed analysis of European demography, and a re-analysis of a well-known breast cancer expression dataset. We also introduce a new algorithm for solving the convex clustering problem. This algorithm belongs to a subclass of MM (minimization-majorization) algorithms known as proximal distance algorithms. The proximal distance convex clustering algorithm is inherently parallelizable and readily maps to modern many-core devices such as graphics processing units (GPUs). Our freely available software, convexcluster, exploits OpenCL routines that ensure compatibility across a variety of hardware environments.
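
To make the idea of a clustering path concrete, here is a minimal Python sketch. It assumes the standard convex clustering objective from the literature, which this page does not spell out: one half of the sum of squared distances between each point x_i and its centroid u_i, plus gamma times the weighted sum of distances between pairs of centroids. The function names are hypothetical, and the code is neither the proximal distance algorithm nor the CONVEXCLUSTER program. It only evaluates the assumed objective and uses the closed-form minimizer available for the special case of two scalar points.

```python
import math


def convex_clustering_objective(points, centers, weights, gamma):
    """Assumed objective: 0.5 * sum_i ||x_i - u_i||^2
    + gamma * sum_{i<j} w_ij * ||u_i - u_j||.

    points, centers: lists of equal-length coordinate tuples.
    weights: symmetric matrix (list of lists) of nonnegative similarity weights.
    """
    fit = 0.5 * sum(math.dist(x, u) ** 2 for x, u in zip(points, centers))
    n = len(points)
    fusion = sum(
        weights[i][j] * math.dist(centers[i], centers[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return fit + gamma * fusion


def two_point_path(x1, x2, w, gamma):
    """Closed-form minimizer for two scalar points (illustrative special case).

    The centroid sum stays equal to x1 + x2, while the centroid difference
    is the data difference soft-thresholded at 2 * gamma * w.
    """
    d = x1 - x2
    shrunk = math.copysign(max(abs(d) - 2.0 * gamma * w, 0.0), d)
    s = x1 + x2
    return (s + shrunk) / 2.0, (s - shrunk) / 2.0


if __name__ == "__main__":
    # Trace the path for two points at 0 and 4 with unit weight.
    for gamma in (0.0, 1.0, 2.0):
        u1, u2 = two_point_path(0.0, 4.0, 1.0, gamma)
        value = convex_clustering_objective(
            [(0.0,), (4.0,)], [(u1,), (u2,)], [[0.0, 1.0], [1.0, 0.0]], gamma
        )
        print(gamma, (u1, u2), value)
```

In this toy case the centroids start at the data points when gamma is 0, move toward each other as gamma grows, and coincide once gamma * w reaches half the distance between the points. That is the coalescing behavior the summary describes, and a larger similarity weight makes the pair fuse earlier along the path.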

Suggested Citation

  • Gary K Chen & Eric C Chi & John Michael O Ranola & Kenneth Lange, 2015. "Convex Clustering: An Attractive Alternative to Hierarchical Clustering," PLOS Computational Biology, Public Library of Science, vol. 11(5), pages 1-31, May.
  • Handle: RePEc:plo:pcbi00:1004228
    DOI: 10.1371/journal.pcbi.1004228

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004228
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1004228&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Hunter D.R. & Lange K., 2004. "A Tutorial on MM Algorithms," The American Statistician, American Statistical Association, vol. 58, pages 30-37, February.
    2. Charles M. Perou & Therese Sørlie & Michael B. Eisen & Matt van de Rijn & Stefanie S. Jeffrey & Christian A. Rees & Jonathan R. Pollack & Douglas T. Ross & Hilde Johnsen & Lars A. Akslen & Øystein Flu, 2000. "Molecular portraits of human breast tumours," Nature, Nature, vol. 406(6797), pages 747-752, August.
    3. Su, Yu-Sung & Gelman, Andrew & Hill, Jennifer & Yajima, Masanao, 2011. "Multiple Imputation with Diagnostics (mi) in R: Opening Windows into the Black Box," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 45(i02).

    Citations


    Cited by:

    1. Hirose, Kei & Miura, Kanta & Koie, Atori, 2023. "Hierarchical clustered multiclass discriminant analysis via cross-validation," Computational Statistics & Data Analysis, Elsevier, vol. 178(C).
