
Convex Clustering: An Attractive Alternative to Hierarchical Clustering

Author

Listed:
  • Gary K Chen
  • Eric C Chi
  • John Michael O Ranola
  • Kenneth Lange

Abstract

The primary goal in cluster analysis is to discover natural groupings of objects. The field of cluster analysis is crowded with diverse methods that make special assumptions about data and address different scientific aims. Despite its shortcomings in accuracy, hierarchical clustering is the dominant clustering method in bioinformatics. Biologists find the trees constructed by hierarchical clustering visually appealing and in tune with their evolutionary perspective. Hierarchical clustering operates on multiple scales simultaneously. This is essential, for instance, in transcriptome data, where one may be interested in making qualitative inferences about how lower-order relationships like gene modules lead to higher-order relationships like pathways or biological processes. The recently developed method of convex clustering preserves the visual appeal of hierarchical clustering while ameliorating its propensity to make false inferences in the presence of outliers and noise. The solution paths generated by convex clustering reveal relationships between clusters that are hidden by static methods such as k-means clustering. The current paper derives and tests a novel proximal distance algorithm for minimizing the objective function of convex clustering. The algorithm separates parameters, accommodates missing data, and supports prior information on relationships. Our program CONVEXCLUSTER incorporating the algorithm is implemented on ATI and nVidia graphics processing units (GPUs) for maximal speed. Several biological examples illustrate the strengths of convex clustering and the ability of the proximal distance algorithm to handle high-dimensional problems. CONVEXCLUSTER can be freely downloaded from the UCLA Human Genetics web site at http://www.genetics.ucla.edu/software/

Author Summary

Pattern discovery is one of the most important goals of data-driven research. In the biological sciences hierarchical clustering has achieved a position of pre-eminence due to its ability to capture multiple levels of data granularity. Hierarchical clustering’s visual displays of phylogenetic trees and gene-expression modules are indeed seductive. Despite its merits, hierarchical clustering is greedy by nature and often produces spurious clusters, particularly in the presence of substantial noise. This paper presents a relatively new alternative to hierarchical clustering known as convex clustering. Although convex clustering is more computationally demanding, it enjoys several advantages over hierarchical clustering and other traditional methods of clustering. Convex clustering delivers a uniquely defined clustering path that partially obviates the need for choosing an optimal number of clusters. Along the path small clusters gradually coalesce to form larger clusters. Clustering can be guided by external information through appropriately defined similarity weights. Comparisons to hierarchical clustering demonstrate the superior robustness of convex clustering to noise. Our genetics examples include inference of the demographic history of 52 populations across the world, a more detailed analysis of European demography, and a re-analysis of a well-known breast cancer expression dataset. We also introduce a new algorithm for solving the convex clustering problem. This algorithm belongs to a subclass of MM (minimization-majorization) algorithms known as proximal distance algorithms. The proximal distance convex clustering algorithm is inherently parallelizable and readily maps to modern many-core devices such as graphics processing units (GPUs). Our freely available software, convexcluster, exploits OpenCL routines that ensure compatibility across a variety of hardware environments.
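
As a rough illustration of the objective function mentioned in the abstract, the sketch below evaluates the form commonly used in the convex clustering literature: half the squared distance between each point and its cluster center, plus a tuning constant gamma times the weighted sum of distances between pairs of centers. This page does not state the objective, so the exact form, the function name `convex_clustering_objective`, the uniform default weights, and the toy data are all assumptions made for illustration. It is not the CONVEXCLUSTER code and does not implement the proximal distance algorithm.

```python
import math
from itertools import combinations


def convex_clustering_objective(X, U, gamma, weights=None):
    """Evaluate 0.5 * sum_i ||x_i - u_i||^2 + gamma * sum_{i<j} w_ij * ||u_i - u_j||.

    X: list of data points, U: list of cluster centers (one per point),
    gamma: penalty strength, weights: optional matrix of similarity weights.
    Assumed standard form; uniform weights of 1.0 are used when none are given.
    """
    # Fidelity term: how far each center has moved from its data point.
    fit = 0.5 * sum(
        (x - u) ** 2 for xi, ui in zip(X, U) for x, u in zip(xi, ui)
    )
    # Fusion penalty: weighted Euclidean distance between every pair of centers.
    penalty = 0.0
    for i, j in combinations(range(len(U)), 2):
        w = 1.0 if weights is None else weights[i][j]
        penalty += w * math.dist(U[i], U[j])
    return fit + gamma * penalty


# Toy data (hypothetical): two points, compared under two candidate solutions.
X = [[0.0, 0.0], [0.0, 2.0]]
separate = X                        # each point keeps its own center
merged = [[0.0, 1.0], [0.0, 1.0]]   # both centers fused at the mean

for gamma in (0.25, 1.0):
    print(
        gamma,
        convex_clustering_objective(X, separate, gamma),
        convex_clustering_objective(X, merged, gamma),
    )
```

On this toy data the separate solution has the lower objective at the small gamma and the merged solution has the lower objective at the large gamma, which is the coalescing behavior along the clustering path that the Author Summary describes. The true minimizer at intermediate gamma generally lies between these two candidates.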

Suggested Citation

  • Gary K Chen & Eric C Chi & John Michael O Ranola & Kenneth Lange, 2015. "Convex Clustering: An Attractive Alternative to Hierarchical Clustering," PLOS Computational Biology, Public Library of Science, vol. 11(5), pages 1-31, May.
  • Handle: RePEc:plo:pcbi00:1004228
    DOI: 10.1371/journal.pcbi.1004228

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004228
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1004228&type=printable
    Download Restriction: no



    Citations


    Cited by:

    1. Hirose, Kei & Miura, Kanta & Koie, Atori, 2023. "Hierarchical clustered multiclass discriminant analysis via cross-validation," Computational Statistics & Data Analysis, Elsevier, vol. 178(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Yang, Xi & Hoadley, Katherine A. & Hannig, Jan & Marron, J.S., 2023. "Jackstraw inference for AJIVE data integration," Computational Statistics & Data Analysis, Elsevier, vol. 180(C).
    2. Joost Ginkel & Pieter Kroonenberg, 2014. "Using Generalized Procrustes Analysis for Multiple Imputation in Principal Component Analysis," Journal of Classification, Springer;The Classification Society, vol. 31(2), pages 242-269, July.
    3. Egashira, Kento & Yata, Kazuyoshi & Aoshima, Makoto, 2024. "Asymptotic properties of hierarchical clustering in high-dimensional settings," Journal of Multivariate Analysis, Elsevier, vol. 199(C).
    4. María Elena Martínez & Jonathan T Unkart & Li Tao & Candyce H Kroenke & Richard Schwab & Ian Komenaka & Scarlett Lin Gomez, 2017. "Prognostic significance of marital status in breast cancer survival: A population-based study," PLOS ONE, Public Library of Science, vol. 12(5), pages 1-14, May.
    5. Yishai Shimoni, 2018. "Association between expression of random gene sets and survival is evident in multiple cancer types and may be explained by sub-classification," PLOS Computational Biology, Public Library of Science, vol. 14(2), pages 1-15, February.
    6. Gerko Vink & Laurence E. Frank & Jeroen Pannekoek & Stef Buuren, 2014. "Predictive mean matching imputation of semicontinuous variables," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 68(1), pages 61-90, February.
    7. Sakyajit Bhattacharya & Paul McNicholas, 2014. "A LASSO-penalized BIC for mixture model selection," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 8(1), pages 45-61, March.
    8. Yoo-Ah Kim & Stefan Wuchty & Teresa M Przytycka, 2011. "Identifying Causal Genes and Dysregulated Pathways in Complex Diseases," PLOS Computational Biology, Public Library of Science, vol. 7(3), pages 1-13, March.
    9. Utkarsh J. Dang & Michael P.B. Gallaugher & Ryan P. Browne & Paul D. McNicholas, 2023. "Model-Based Clustering and Classification Using Mixtures of Multivariate Skewed Power Exponential Distributions," Journal of Classification, Springer;The Classification Society, vol. 40(1), pages 145-167, April.
    10. Radhakrishnan Nagarajan & Marco Scutari, 2013. "Impact of Noise on Molecular Network Inference," PLOS ONE, Public Library of Science, vol. 8(12), pages 1-12, December.
    11. R Joseph Bender & Feilim Mac Gabhann, 2013. "Expression of VEGF and Semaphorin Genes Define Subgroups of Triple Negative Breast Cancer," PLOS ONE, Public Library of Science, vol. 8(5), pages 1-15, May.
    12. Deepak Poduval & Zuzana Sichmanova & Anne Hege Straume & Per Eystein Lønning & Stian Knappskog, 2020. "The novel microRNAs hsa-miR-nov7 and hsa-miR-nov3 are over-expressed in locally advanced breast cancer," PLOS ONE, Public Library of Science, vol. 15(4), pages 1-23, April.
    13. V. Maume-Deschamps & D. Rullière & A. Usseglio-Carleve, 2018. "Spatial Expectile Predictions for Elliptical Random Fields," Methodology and Computing in Applied Probability, Springer, vol. 20(2), pages 643-671, June.
    14. Sanjeena Subedi & Drew Neish & Stephen Bak & Zeny Feng, 2020. "Cluster analysis of microbiome data by using mixtures of Dirichlet–multinomial regression models," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 69(5), pages 1163-1187, November.
    15. Lian, Heng & Meng, Jie & Fan, Zengyan, 2015. "Simultaneous estimation of linear conditional quantiles with penalized splines," Journal of Multivariate Analysis, Elsevier, vol. 141(C), pages 1-21.
    16. Zhiguang Huo & Li Zhu & Tianzhou Ma & Hongcheng Liu & Song Han & Daiqing Liao & Jinying Zhao & George Tseng, 2020. "Two-Way Horizontal and Vertical Omics Integration for Disease Subtype Discovery," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 12(1), pages 1-22, April.
    17. Christian Seiler, 2013. "Nonresponse in Business Tendency Surveys: Theoretical Discourse and Empirical Evidence," ifo Beiträge zur Wirtschaftsforschung, ifo Institute - Leibniz Institute for Economic Research at the University of Munich, number 52, March.
    18. Markus Ringnér & Erik Fredlund & Jari Häkkinen & Åke Borg & Johan Staaf, 2011. "GOBO: Gene Expression-Based Outcome for Breast Cancer Online," PLOS ONE, Public Library of Science, vol. 6(3), pages 1-11, March.
    19. Cheng, Xiaoyue & Cook, Dianne & Hofmann, Heike, 2015. "Visually Exploring Missing Values in Multivariable Data Using a Graphical User Interface," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 68(i06).
    20. Tian, Guo-Liang & Tang, Man-Lai & Liu, Chunling, 2012. "Accelerating the quadratic lower-bound algorithm via optimizing the shrinkage parameter," Computational Statistics & Data Analysis, Elsevier, vol. 56(2), pages 255-265.
