
Clustering algorithms: A comparative approach

Authors

  • Mayra Z Rodriguez
  • Cesar H Comin
  • Dalcimar Casanova
  • Odemir M Bruno
  • Diego R Amancio
  • Luciano da F Costa
  • Francisco A Rodrigues

Abstract

Many real-world systems can be studied in terms of pattern recognition tasks, so proper use (and understanding) of machine learning methods in practical applications is essential. While many classification methods have been proposed, there is no consensus on which methods are more suitable for a given dataset. As a consequence, it is important to compare methods comprehensively across many possible scenarios. In this context, we performed a systematic comparison of 9 well-known clustering methods available in the R language, assuming normally distributed data. To account for the many possible variations of data, we considered artificial datasets with several tunable properties (number of classes, separation between classes, etc.). We also evaluated the sensitivity of the clustering methods to their parameter configuration. The results revealed that, with the default configurations of the adopted methods, the spectral approach tended to perform particularly well. We also found that the default configuration of the adopted implementations was not always accurate. In these cases, a simple approach based on random selection of parameter values proved to be a good way to improve performance. All in all, the reported approach provides guidance for choosing clustering algorithms.
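The evaluation loop described in the abstract can be illustrated with a small sketch. It is not the authors' code: the study used R implementations of nine methods, while this example uses Python with only the standard library and a single method. Every name here is hypothetical. The data generator (unit-variance Gaussian classes spaced along one axis), the k-means variant with farthest-point initialization, and the search range for k are all assumptions made for illustration. Agreement with the known classes is scored with the adjusted Rand index of Hubert and Arabie (1985), which is one common choice and not necessarily the only measure used in the paper.

```python
import random
from collections import Counter
from math import comb, dist


def make_gaussian_clusters(n_classes, n_per_class, n_features, separation, seed=0):
    """Artificial data: unit-variance Gaussian classes whose centers sit
    `separation` apart along the first axis (a simplifying assumption)."""
    rng = random.Random(seed)
    points, labels = [], []
    for c in range(n_classes):
        center = [c * separation] + [0.0] * (n_features - 1)
        for _ in range(n_per_class):
            points.append([rng.gauss(mu, 1.0) for mu in center])
            labels.append(c)
    return points, labels


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = []
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same n >= 2 items."""
    def pairs(counts):
        return sum(comb(v, 2) for v in counts.values())

    index = pairs(Counter(zip(truth, pred)))
    sum_a, sum_b = pairs(Counter(truth)), pairs(Counter(pred))
    expected = sum_a * sum_b / comb(len(truth), 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (index - expected) / (max_index - expected)


def random_search_k(points, truth, k_range=(2, 8), n_trials=5, seed=0):
    """Try randomly drawn values of k and return the best (score, k) pair."""
    rng = random.Random(seed)
    best = (float("-inf"), None)
    for _ in range(n_trials):
        k = rng.randint(*k_range)
        score = adjusted_rand_index(truth, kmeans(points, k))
        best = max(best, (score, k))
    return best
```

A typical use would generate a dataset with `make_gaussian_clusters(3, 30, 2, separation=3.0)`, score a fixed configuration with `adjusted_rand_index(truth, kmeans(points, 3))`, and compare it against `random_search_k(points, truth)`. Varying `n_classes` and `separation` mimics the tunable dataset properties mentioned in the abstract.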

Suggested Citation

  • Mayra Z Rodriguez & Cesar H Comin & Dalcimar Casanova & Odemir M Bruno & Diego R Amancio & Luciano da F Costa & Francisco A Rodrigues, 2019. "Clustering algorithms: A comparative approach," PLOS ONE, Public Library of Science, vol. 14(1), pages 1-34, January.
  • Handle: RePEc:plo:pone00:0210236
    DOI: 10.1371/journal.pone.0210236

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0210236
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0210236&type=printable
    Download Restriction: no

    References listed on IDEAS

    1. Mangiameli, Paul & Chen, Shaw K. & West, David, 1996. "A comparison of SOM neural network and hierarchical clustering methods," European Journal of Operational Research, Elsevier, vol. 93(2), pages 402-417, September.
    2. Yordan P Raykov & Alexis Boukouvalas & Fahd Baig & Max A Little, 2016. "What to Do When K-Means Clustering Fails: A Simple yet Principled Alternative Algorithm," PLOS ONE, Public Library of Science, vol. 11(9), pages 1-28, September.
    3. Leila M Naeni & Hugh Craig & Regina Berretta & Pablo Moscato, 2016. "A Novel Clustering Methodology Based on Modularity Optimisation for Detecting Authorship Affinities in Shakespearean Era Plays," PLOS ONE, Public Library of Science, vol. 11(8), pages 1-27, August.
    4. de Arruda, Guilherme F. & Costa, Luciano da Fontoura & Rodrigues, Francisco A., 2012. "A complex networks approach for data clustering," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 391(23), pages 6174-6183.
    5. Chris Fraley & Adrian E. Raftery, 2003. "Enhanced Model-Based Clustering, Density Estimation, and Discriminant Analysis Software: MCLUST," Journal of Classification, Springer;The Classification Society, vol. 20(2), pages 263-286, September.
    6. Diego Raphael Amancio & Cesar Henrique Comin & Dalcimar Casanova & Gonzalo Travieso & Odemir Martinez Bruno & Francisco Aparecido Rodrigues & Luciano da Fontoura Costa, 2014. "A Systematic Comparison of Supervised Classifiers," PLOS ONE, Public Library of Science, vol. 9(4), pages 1-14, April.
    7. Amancio, Diego Raphael & Oliveira, Osvaldo Novais & da Fontoura Costa, Luciano, 2012. "Three-feature model to reproduce the topology of citation networks and the effects from authors’ visibility on their h-index," Journal of Informetrics, Elsevier, vol. 6(3), pages 427-434.
    8. Chris Fraley & Adrian E. Raftery, 1999. "MCLUST: Software for Model-Based Cluster Analysis," Journal of Classification, Springer;The Classification Society, vol. 16(2), pages 297-306, July.
    9. Johan Bollen & Herbert Van de Sompel & Aric Hagberg & Ryan Chute, 2009. "A Principal Component Analysis of 39 Scientific Impact Measures," PLOS ONE, Public Library of Science, vol. 4(6), pages 1-11, June.
    10. Colavizza, Giovanni & Franceschet, Massimo, 2016. "Clustering citation histories in the Physical Review," Journal of Informetrics, Elsevier, vol. 10(4), pages 1037-1051.
    11. Bouveyron, C. & Girard, S. & Schmid, C., 2007. "High-dimensional data clustering," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 502-519, September.
    12. Lawrence Hubert & Phipps Arabie, 1985. "Comparing partitions," Journal of Classification, Springer;The Classification Society, vol. 2(1), pages 193-218, December.
    13. Bergé, Laurent & Bouveyron, Charles & Girard, Stéphane, 2012. "HDclassif: An R Package for Model-Based Clustering and Discriminant Analysis of High-Dimensional Data," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 46(i06).
    14. Mingoti, Sueli A. & Lima, Joab O., 2006. "Comparing SOM neural network with Fuzzy c-means, K-means and traditional hierarchical clustering algorithms," European Journal of Operational Research, Elsevier, vol. 174(3), pages 1742-1759, November.
    15. Hirschberger, Markus & Qi, Yue & Steuer, Ralph E., 2007. "Randomly generating portfolio-selection covariance matrices with specified distributional characteristics," European Journal of Operational Research, Elsevier, vol. 177(3), pages 1610-1625, March.
    16. Carlos Garcia, 2016. "BoCluSt: Bootstrap Clustering Stability Algorithm for Community Detection," PLOS ONE, Public Library of Science, vol. 11(6), pages 1-15, June.
    17. Viana, Matheus P. & Amancio, Diego R. & da F. Costa, Luciano, 2013. "On time-varying collaboration networks," Journal of Informetrics, Elsevier, vol. 7(2), pages 371-378.

    Citations

    Citations are extracted by the CitEc Project.

    Cited by:

    1. Hossam M J Mustafa & Masri Ayob & Mohd Zakree Ahmad Nazri & Graham Kendall, 2019. "An improved adaptive memetic differential evolution optimization algorithms for data clustering problems," PLOS ONE, Public Library of Science, vol. 14(5), pages 1-28, May.
    2. Christian Hennig, 2022. "An empirical comparison and characterisation of nine popular clustering methods," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 16(1), pages 201-229, March.
    3. Ebba Mark & Ryan Rafaty & Moritz Schwarz, 2022. "Spatial-temporal dynamics of employment shocks in declining coal mining regions and potentialities of the 'just transition'," Papers 2211.12619, arXiv.org.
    4. Narjes Vara & Mahdieh Mirzabeigi & Hajar Sotudeh & Seyed Mostafa Fakhrahmad, 2022. "Application of k-means clustering algorithm to improve effectiveness of the results recommended by journal recommender system," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(6), pages 3237-3252, June.
    5. Mikhail Kanevski, 2021. "Unsupervised learning of Swiss population spatial distribution," PLOS ONE, Public Library of Science, vol. 16(2), pages 1-24, February.
    6. Fernandez Martinez, Roberto & Lostado Lorza, Ruben & Santos Delgado, Ana Alexandra & Piedra, Nelson, 2021. "Use of classification trees and rule-based models to optimize the funding assignment to research projects: A case study of UTPL," Journal of Informetrics, Elsevier, vol. 15(1).
    7. Corrêa, Edilson A. & Marinho, Vanessa Q. & Amancio, Diego R., 2020. "Semantic flow in language networks discriminates texts by genre and publication date," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 557(C).
    8. Simon Crase & Suresh N Thennadil, 2022. "An analysis framework for clustering algorithm selection with applications to spectroscopy," PLOS ONE, Public Library of Science, vol. 17(3), pages 1-24, March.
    9. Alfred Kume & Stephen G Walker, 2021. "The utility of clusters and a Hungarian clustering algorithm," PLOS ONE, Public Library of Science, vol. 16(8), pages 1-23, August.
    10. Hossam M J Mustafa & Masri Ayob & Dheeb Albashish & Sawsan Abu-Taleb, 2020. "Solving text clustering problem using a memetic differential evolution algorithm," PLOS ONE, Public Library of Science, vol. 15(6), pages 1-18, June.
    11. Trotta, Gianluca, 2020. "An empirical analysis of domestic electricity load profiles: Who consumes how much and when?," Applied Energy, Elsevier, vol. 275(C).
    12. K. S. Sablin & E. S. Kagan & E. S. Chernova, 2020. "Clustering of the Russian coal mining regions: Investment and innovation activity," Journal of New Economy, Ural State University of Economics, vol. 21(1), pages 89-106, March.
    13. Chong, Woon Kian & Chang, Chiachi, 2024. "Information exploitation of human resource data with persistent homology," Journal of Business Research, Elsevier, vol. 172(C).
    14. Quispe, Laura V.C. & Tohalino, Jorge A.V. & Amancio, Diego R., 2021. "Using virtual edges to improve the discriminability of co-occurrence text networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 562(C).
    15. Sultan Mahmud & Ferdausi Mahojabin Sumana & Md Mohsin & Md. Hasinur Rahaman Khan, 2022. "Redefining homogeneous climate regions in Bangladesh using multivariate clustering approaches," Natural Hazards: Journal of the International Society for the Prevention and Mitigation of Natural Hazards, Springer;International Society for the Prevention and Mitigation of Natural Hazards, vol. 111(2), pages 1863-1884, March.
    16. Tohalino, Jorge A.V. & Amancio, Diego R., 2022. "On predicting research grants productivity via machine learning," Journal of Informetrics, Elsevier, vol. 16(2).
    17. Ioannis Mikrou & Nickolas S. Sapidis, 2024. "Enhancing operational research in mechatronic systems via modularization: comparative analysis of four clustering algorithms using validation indices," Operational Research, Springer, vol. 24(4), pages 1-44, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Carlo Cavicchia & Maurizio Vichi & Giorgia Zaccaria, 2022. "Gaussian mixture model with an extended ultrametric covariance structure," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 16(2), pages 399-427, June.
    2. Andrews, Jeffrey L. & McNicholas, Paul D. & Subedi, Sanjeena, 2011. "Model-based classification via mixtures of multivariate t-distributions," Computational Statistics & Data Analysis, Elsevier, vol. 55(1), pages 520-529, January.
    3. Kim, Nam-Hwui & Browne, Ryan P., 2021. "In the pursuit of sparseness: A new rank-preserving penalty for a finite mixture of factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 160(C).
    4. Alex Sharp & Glen Chalatov & Ryan P. Browne, 2023. "A dual subspace parsimonious mixture of matrix normal distributions," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(3), pages 801-822, September.
    5. Diego R. Amancio & Osvaldo N. Oliveira jr & Luciano F. Costa, 2015. "Topological-collaborative approach for disambiguating authors’ names in collaborative networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(1), pages 465-485, January.
    6. Bouveyron, Charles & Brunet-Saumard, Camille, 2014. "Model-based clustering of high-dimensional data: A review," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 52-78.
    7. Galimberti, Giuliano & Montanari, Angela & Viroli, Cinzia, 2009. "Penalized factor mixture analysis for variable selection in clustered data," Computational Statistics & Data Analysis, Elsevier, vol. 53(12), pages 4301-4310, October.
    8. Cathy Maugis & Gilles Celeux & Marie-Laure Martin-Magniette, 2009. "Variable Selection for Clustering with Gaussian Mixture Models," Biometrics, The International Biometric Society, vol. 65(3), pages 701-709, September.
    9. Jeffrey Andrews & Paul McNicholas, 2014. "Variable Selection for Clustering and Classification," Journal of Classification, Springer;The Classification Society, vol. 31(2), pages 136-153, July.
    10. Alessandro Casa & Andrea Cappozzo & Michael Fop, 2022. "Group-Wise Shrinkage Estimation in Penalized Model-Based Clustering," Journal of Classification, Springer;The Classification Society, vol. 39(3), pages 648-674, November.
    11. repec:jss:jstsof:18:i06 is not listed on IDEAS
    12. Hennig, Christian, 2008. "Dissolution point and isolation robustness: Robustness criteria for general cluster analysis methods," Journal of Multivariate Analysis, Elsevier, vol. 99(6), pages 1154-1176, July.
    13. Corrêa Jr., Edilson A. & Silva, Filipi N. & da F. Costa, Luciano & Amancio, Diego R., 2017. "Patterns of authors contribution in scientific manuscripts," Journal of Informetrics, Elsevier, vol. 11(2), pages 498-510.
    14. Andreas Wunsch & Tanja Liesch & Stefan Broda, 2022. "Feature-based Groundwater Hydrograph Clustering Using Unsupervised Self-Organizing Map-Ensembles," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 36(1), pages 39-54, January.
    15. McNicholas, P.D. & Murphy, T.B. & McDaid, A.F. & Frost, D., 2010. "Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models," Computational Statistics & Data Analysis, Elsevier, vol. 54(3), pages 711-723, March.
    16. Sanjeena Subedi & Paul McNicholas, 2014. "Variational Bayes approximations for clustering via mixtures of normal inverse Gaussian distributions," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 8(2), pages 167-193, June.
    17. O’Hagan, Adrian & Murphy, Thomas Brendan & Gormley, Isobel Claire & McNicholas, Paul D. & Karlis, Dimitris, 2016. "Clustering with the multivariate normal inverse Gaussian distribution," Computational Statistics & Data Analysis, Elsevier, vol. 93(C), pages 18-30.
    18. Adilson Vital & Diego R. Amancio, 2022. "A comparative analysis of local similarity metrics and machine learning approaches: application to link prediction in author citation networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(10), pages 6011-6028, October.
    19. Pérez-Campuzano, Darío & Rubio Andrada, Luis & Morcillo Ortega, Patricio & López-Lázaro, Antonio, 2022. "Visualizing the historical COVID-19 shock in the US airline industry: A Data Mining approach for dynamic market surveillance," Journal of Air Transport Management, Elsevier, vol. 101(C).
    20. Jorge A. V. Tohalino & Laura V. C. Quispe & Diego R. Amancio, 2021. "Analyzing the relationship between text features and grants productivity," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(5), pages 4255-4275, May.
    21. Hien D. Nguyen & Geoffrey J. McLachlan & Jeremy F. P. Ullmann & Andrew L. Janke, 2016. "Spatial clustering of time series via mixture of autoregressions models and Markov random fields," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 70(4), pages 414-439, November.
