IDEAS home Printed from https://ideas.repec.org/a/eee/jmvana/v124y2014icp42-56.html
   My bibliography  Save this article

A statistical view of clustering performance through the theory of U-processes

Author

Listed:
  • Clémençon, Stéphan

Abstract

Many clustering techniques aim at optimizing empirical criteria that are of the form of a U-statistic of degree two. Given a measure of dissimilarity between pairs of observations, the goal is to minimize the within cluster point scatter over a class of partitions of the feature space. It is the purpose of this paper to define a general statistical framework, relying on the theory of U-processes, for studying the performance of such clustering methods. In this setup, under adequate assumptions on the complexity of the subsets forming the partition candidates, the excess of clustering risk of the empirical minimizer is proved to be of the order OP(1/n). A lower bound result shows that the rate obtained is optimal in a minimax sense. Based on recent results related to the tail behavior of degenerate U-processes, it is also shown how to establish tighter, and even faster, rate bounds under additional assumptions. Model selection issues, related to the number of clusters forming the data partition in particular, are also considered. Finally, it is explained how the theoretical results developed here can provide statistical guarantees for empirical clustering aggregation.

Suggested Citation

  • Clémençon, Stéphan, 2014. "A statistical view of clustering performance through the theory of U-processes," Journal of Multivariate Analysis, Elsevier, vol. 124(C), pages 42-56.
  • Handle: RePEc:eee:jmvana:v:124:y:2014:i:c:p:42-56
    DOI: 10.1016/j.jmva.2013.10.001
    as

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0047259X1300208X
    Download Restriction: Full text for ScienceDirect subscribers only

    File URL: https://libkey.io/10.1016/j.jmva.2013.10.001?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    As the access to this document is restricted, you may want to search for a different version of it.

    References listed on IDEAS

    as
    1. Lawrence Hubert & Phipps Arabie, 1985. "Comparing partitions," Journal of Classification, Springer;The Classification Society, vol. 2(1), pages 193-218, December.
    2. Biau Gérard & Bleakley Kevin, 2006. "Statistical inference on graphs," Statistics & Risk Modeling, De Gruyter, vol. 24(2), pages 209-232, December.
    3. Robert Tibshirani & Guenther Walther & Trevor Hastie, 2001. "Estimating the number of clusters in a data set via the gap statistic," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 63(2), pages 411-423.
    4. Witten, Daniela M. & Tibshirani, Robert, 2010. "A Framework for Feature Selection in Clustering," Journal of the American Statistical Association, American Statistical Association, vol. 105(490), pages 713-726.
    Full references (including those not matched with items on IDEAS)

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Yaeji Lim & Hee-Seok Oh & Ying Kuen Cheung, 2019. "Multiscale Clustering for Functional Data," Journal of Classification, Springer;The Classification Society, vol. 36(2), pages 368-391, July.
    2. J. Fernando Vera & Rodrigo Macías, 2021. "On the Behaviour of K-Means Clustering of a Dissimilarity Matrix by Means of Full Multidimensional Scaling," Psychometrika, Springer;The Psychometric Society, vol. 86(2), pages 489-513, June.
    3. Zhiguang Huo & Li Zhu & Tianzhou Ma & Hongcheng Liu & Song Han & Daiqing Liao & Jinying Zhao & George Tseng, 2020. "Two-Way Horizontal and Vertical Omics Integration for Disease Subtype Discovery," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 12(1), pages 1-22, April.
    4. Lingsong Meng & Dorina Avram & George Tseng & Zhiguang Huo, 2022. "Outcome‐guided sparse K‐means for disease subtype discovery via integrating phenotypic data with high‐dimensional transcriptomic data," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 71(2), pages 352-375, March.
    5. Li, Pai-Ling & Chiou, Jeng-Min, 2011. "Identifying cluster number for subspace projected functional data clustering," Computational Statistics & Data Analysis, Elsevier, vol. 55(6), pages 2090-2103, June.
    6. Yujia Li & Xiangrui Zeng & Chien‐Wei Lin & George C. Tseng, 2022. "Simultaneous estimation of cluster number and feature sparsity in high‐dimensional cluster analysis," Biometrics, The International Biometric Society, vol. 78(2), pages 574-585, June.
    7. Dong Liu & Changwei Zhao & Yong He & Lei Liu & Ying Guo & Xinsheng Zhang, 2023. "Simultaneous cluster structure learning and estimation of heterogeneous graphs for matrix‐variate fMRI data," Biometrics, The International Biometric Society, vol. 79(3), pages 2246-2259, September.
    8. Jeffrey Andrews & Paul McNicholas, 2014. "Variable Selection for Clustering and Classification," Journal of Classification, Springer;The Classification Society, vol. 31(2), pages 136-153, July.
    9. Charles Bouveyron & Camille Brunet-Saumard, 2014. "Discriminative variable selection for clustering with the sparse Fisher-EM algorithm," Computational Statistics, Springer, vol. 29(3), pages 489-513, June.
    10. Hosik Choi & Seokho Lee, 2019. "Convex clustering for binary data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(4), pages 991-1018, December.
    11. Nathalia Castellanos & Dhruv Desai & Sebastian Frank & Stefano Pasquali & Dhagash Mehta, 2024. "Can an unsupervised clustering algorithm reproduce a categorization system?," Papers 2408.10340, arXiv.org.
    12. Pi, J. & Wang, Honggang & Pardalos, Panos M., 2021. "A dual reformulation and solution framework for regularized convex clustering problems," European Journal of Operational Research, Elsevier, vol. 290(3), pages 844-856.
    13. Nilsen Gro & Borgan Ørnulf & LiestØl Knut & Lingjærde Ole Christian, 2013. "Identifying clusters in genomics data by recursive partitioning," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 12(5), pages 637-652, October.
    14. Weinand, J.M. & McKenna, R. & Fichtner, W., 2019. "Developing a municipality typology for modelling decentralised energy systems," Utilities Policy, Elsevier, vol. 57(C), pages 75-96.
    15. Seoung Bum Kim & Jung Woo Lee & Sin Young Kim & Deok Won Lee, 2013. "Dental Informatics to Characterize Patients with Dentofacial Deformities," PLOS ONE, Public Library of Science, vol. 8(8), pages 1-8, August.
    16. Ja‐Yoon Jang & Hee‐Seok Oh & Yaeji Lim & Ying Kuen Cheung, 2021. "Ensemble clustering for step data via binning," Biometrics, The International Biometric Society, vol. 77(1), pages 293-304, March.
    17. Arias-Castro, Ery & Pu, Xiao, 2017. "A simple approach to sparse clustering," Computational Statistics & Data Analysis, Elsevier, vol. 105(C), pages 217-228.
    18. Šárka Brodinová & Peter Filzmoser & Thomas Ortner & Christian Breiteneder & Maia Rohm, 2019. "Robust and sparse k-means clustering for high-dimensional data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(4), pages 905-932, December.
    19. Douglas Steinley, 2007. "Validating Clusters with the Lower Bound for Sum-of-Squares Error," Psychometrika, Springer;The Psychometric Society, vol. 72(1), pages 93-106, March.
    20. Crook Oliver M. & Gatto Laurent & Kirk Paul D. W., 2019. "Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 18(6), pages 1-20, December.

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:eee:jmvana:v:124:y:2014:i:c:p:42-56. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    If CitEc recognized a bibliographic reference but did not link an item in RePEc to it, you can help with this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Catherine Liu (email available below). General contact details of provider: http://www.elsevier.com/wps/find/journaldescription.cws_home/622892/description#description .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.