Printed from https://ideas.repec.org/a/spr/jclass/v32y2015i2p268-284.html

On the Added Value of Bootstrap Analysis for K-Means Clustering

Author

Listed:
  • Joeri Hofmans
  • Eva Ceulemans
  • Douglas Steinley
  • Iven Van Mechelen

Abstract

Because of its deterministic nature, K-means does not yield confidence information about centroids and estimated cluster memberships, although this could be useful for inferential purposes. In this paper we propose to arrive at such information by means of a non-parametric bootstrap procedure, the performance of which is tested in an extensive simulation study. Results show that the coverage of hyper-ellipsoid bootstrap confidence regions for the centroids is in general close to the nominal coverage probability. For the cluster memberships, we found that probabilistic membership information derived from the bootstrap analysis can be used to improve the cluster assignment of individual objects, albeit only in the case of a very large number of clusters. However, in the case of smaller numbers of clusters, the probabilistic membership information still appeared to be useful as it indicates for which objects the cluster assignment resulting from the analysis of the original data is likely to be correct; hence, this information can be used to construct a partial clustering in which the latter objects only are assigned to clusters.

Copyright Classification Society of North America 2015
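
The bootstrap idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (kmeans, align, bootstrap_kmeans, partial_clustering), the toy data, the 0.9 threshold, and the choice to start every bootstrap run from the original centroids are assumptions made purely for illustration. The sketch resamples objects with replacement, reruns K-means on each resample, relabels the bootstrap centroids to match the original ones, and records how often each original object falls into each cluster.

```python
import random
from itertools import permutations

def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda k: sqdist(point, centroids[k]))

def kmeans(data, init, iters=100):
    """Lloyd's algorithm started from the given centroids."""
    centroids = [tuple(c) for c in init]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in data]
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(data, labels) if lab == k]
            # An empty cluster keeps its previous centroid.
            updated.append(tuple(sum(c) / len(members) for c in zip(*members))
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids

def align(centroids, reference):
    """Reorder centroids to match reference as closely as possible
    (brute force over permutations, so only suitable for small K)."""
    best = min(permutations(range(len(centroids))),
               key=lambda perm: sum(sqdist(centroids[j], reference[k])
                                    for k, j in enumerate(perm)))
    return [centroids[j] for j in best]

def bootstrap_kmeans(data, init, n_boot=200, seed=0):
    """Return original centroids, aligned bootstrap centroids, and
    per-object membership proportions across bootstrap runs."""
    rng = random.Random(seed)
    original = kmeans(data, init)
    counts = [[0] * len(original) for _ in data]
    boot_centroids = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]  # resample with replacement
        # Assumption: each bootstrap run starts from the original centroids.
        cents = align(kmeans(sample, original), original)
        boot_centroids.append(cents)
        for i, p in enumerate(data):
            counts[i][nearest(p, cents)] += 1
    probs = [[c / n_boot for c in row] for row in counts]
    return original, boot_centroids, probs

def partial_clustering(probs, threshold=0.9):
    """Assign an object only if its top membership proportion reaches the
    (hypothetical) threshold; otherwise leave it unassigned (None)."""
    return [row.index(max(row)) if max(row) >= threshold else None
            for row in probs]

# Toy data: two well-separated groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11)]
original, boot_centroids, probs = bootstrap_kmeans(data, init=[(0, 0), (10, 10)])
labels = partial_clustering(probs)
```

The list boot_centroids holds one aligned centroid set per bootstrap run. Under the same assumptions, the spread of these centroids (for instance, their covariance per cluster) is the kind of information from which a confidence region for a centroid could be built, while labels shows a partial clustering that leaves low-confidence objects unassigned.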

Suggested Citation

  • Joeri Hofmans & Eva Ceulemans & Douglas Steinley & Iven Van Mechelen, 2015. "On the Added Value of Bootstrap Analysis for K-Means Clustering," Journal of Classification, Springer;The Classification Society, vol. 32(2), pages 268-284, July.
  • Handle: RePEc:spr:jclass:v:32:y:2015:i:2:p:268-284
    DOI: 10.1007/s00357-015-9178-y

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1007/s00357-015-9178-y
    Download Restriction: Access to full text is restricted to subscribers.


    References listed on IDEAS

    1. Hand, David J. & Krzanowski, Wojtek J., 2005. "Optimising k-means clustering results with standard software packages," Computational Statistics & Data Analysis, Elsevier, vol. 49(4), pages 969-973, June.
    2. Hennig, Christian, 2007. "Cluster-wise assessment of cluster stability," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 258-271, September.
    3. Douglas Steinley & Michael J. Brusco, 2007. "Initializing K-means Batch Clustering: A Critical Evaluation of Several Techniques," Journal of Classification, Springer;The Classification Society, vol. 24(1), pages 99-121, June.
    4. Lawrence Hubert & Phipps Arabie, 1985. "Comparing partitions," Journal of Classification, Springer;The Classification Society, vol. 2(1), pages 193-218, December.
    5. Glenn Milligan, 1985. "An algorithm for generating artificial test clusters," Psychometrika, Springer;The Psychometric Society, vol. 50(1), pages 123-127, March.
    6. Ranjan Maitra & Volodymyr Melnykov & Soumendra N. Lahiri, 2012. "Bootstrapping for Significance of Compact Clusters in Multidimensional Datasets," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 107(497), pages 378-392, March.

    Citations


    Cited by:

    1. Daniel Aloise & Nielsen Castelo Damasceno & Nenad Mladenović & Daniel Nobre Pinheiro, 2017. "On Strategies to Fix Degenerate k-means Solutions," Journal of Classification, Springer;The Classification Society, vol. 34(2), pages 165-190, July.
    2. Lauri Varmann & Helena Mouriño, 2024. "Clustering Empirical Bootstrap Distribution Functions Parametrized by Galton–Watson Branching Processes," Mathematics, MDPI, vol. 12(15), pages 1-25, August.
    3. Aurora Torrente & Juan Romo, 2021. "Initializing k-means Clustering by Bootstrap and Data Depth," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 232-256, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Michael Brusco & Douglas Steinley, 2007. "A Comparison of Heuristic Procedures for Minimum Within-Cluster Sums of Squares Partitioning," Psychometrika, Springer;The Psychometric Society, vol. 72(4), pages 583-600, December.
    2. Aurora Torrente & Juan Romo, 2021. "Initializing k-means Clustering by Bootstrap and Data Depth," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 232-256, July.
    3. Jerzy Korzeniewski, 2013. "Empirical Evaluation of OCLUS and GenRandomClust Algorithms of Generating Cluster Structures," Statistics in Transition new series, Główny Urząd Statystyczny (Polska), vol. 14(3), pages 487-494, September.
    4. Ana Alina Tudoran, 2022. "A machine learning approach to identifying decision-making styles for managing customer relationships," Electronic Markets, Springer;IIM University of St. Gallen, vol. 32(1), pages 351-374, March.
    5. Wu, Han-Ming, 2011. "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics & Data Analysis, Elsevier, vol. 55(5), pages 1969-1979, May.
    6. J. Fernando Vera & Rodrigo Macías, 2021. "On the Behaviour of K-Means Clustering of a Dissimilarity Matrix by Means of Full Multidimensional Scaling," Psychometrika, Springer;The Psychometric Society, vol. 86(2), pages 489-513, June.
    7. Pieter Schoonees & Michel Velden & Patrick Groenen, 2015. "Constrained Dual Scaling for Detecting Response Styles in Categorical Data," Psychometrika, Springer;The Psychometric Society, vol. 80(4), pages 968-994, December.
    8. Michael Brusco & Douglas Steinley, 2015. "Affinity Propagation and Uncapacitated Facility Location Problems," Journal of Classification, Springer;The Classification Society, vol. 32(3), pages 443-480, October.
    9. Tom Wilderjans & Dirk Depril & Iven Van Mechelen, 2013. "Additive Biclustering: A Comparison of One New and Two Existing ALS Algorithms," Journal of Classification, Springer;The Classification Society, vol. 30(1), pages 56-74, April.
    10. Jerzy Korzeniewski, 2016. "New Method Of Variable Selection For Binary Data Cluster Analysis," Statistics in Transition new series, Główny Urząd Statystyczny (Polska), vol. 17(2), pages 295-304, June.
    11. Pfenninger, Stefan, 2017. "Dealing with multiple decades of hourly wind and PV time series in energy models: A comparison of methods to reduce time resolution and the planning implications of inter-annual variability," Applied Energy, Elsevier, vol. 197(C), pages 1-13.
    12. Jaehong Yu & Hua Zhong & Seoung Bum Kim, 2020. "An Ensemble Feature Ranking Algorithm for Clustering Analysis," Journal of Classification, Springer;The Classification Society, vol. 37(2), pages 462-489, July.
    13. Ekaterina Kovaleva & Boris Mirkin, 2015. "Bisecting K-Means and 1D Projection Divisive Clustering: A Unified Framework and Experimental Comparison," Journal of Classification, Springer;The Classification Society, vol. 32(3), pages 414-442, October.
    14. Douglas Steinley, 2007. "Validating Clusters with the Lower Bound for Sum-of-Squares Error," Psychometrika, Springer;The Psychometric Society, vol. 72(1), pages 93-106, March.
    15. Tsai, Chieh-Yuan & Chiu, Chuang-Cheng, 2008. "Developing a feature weight self-adjustment mechanism for a K-means clustering algorithm," Computational Statistics & Data Analysis, Elsevier, vol. 52(10), pages 4658-4672, June.
    16. Schepers, Jan & van Mechelen, Iven & Ceulemans, Eva, 2006. "Three-mode partitioning," Computational Statistics & Data Analysis, Elsevier, vol. 51(3), pages 1623-1642, December.
    17. Javier Albert-Smet & Aurora Torrente & Juan Romo, 2023. "Band depth based initialization of K-means for functional data clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(2), pages 463-484, June.
    18. Jerzy Korzeniewski, 2016. "New Method Of Variable Selection For Binary Data Cluster Analysis," Statistics in Transition New Series, Polish Statistical Association, vol. 17(2), pages 295-304, June.
    19. Ricardo Fraiman & Badih Ghattas & Marcela Svarc, 2013. "Interpretable clustering using unsupervised binary trees," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 7(2), pages 125-145, June.
    20. Alessandro Albano & José Luis García-Lapresta & Antonella Plaia & Mariangela Sciandra, 2023. "A family of distances for preference–approvals," Annals of Operations Research, Springer, vol. 323(1), pages 1-29, April.
