Clustering of modal-valued symbolic data

Author

Listed:
  • Nataša Kejžar

    (University of Ljubljana)

  • Simona Korenjak-Černe

    (University of Ljubljana)

  • Vladimir Batagelj

    (Institute of Mathematics, Physics and Mechanics
    University of Primorska
    National Research University Higher School of Economics)

Abstract

Symbolic data analysis is based on special descriptions of data known as symbolic objects (SOs). Such descriptions preserve more detailed information about units and their clusters than the usual representations with mean values. A special type of SO is a representation with frequency or probability distributions (modal values). This representation enables us to simultaneously consider variables of all measurement types during the clustering process. In this paper, we present the theoretical basis for compatible leaders and agglomerative clustering methods with alternative dissimilarities for modal-valued SOs. The leaders method efficiently solves clustering problems with large numbers of units, while the agglomerative method can be applied either alone to a small data set, or to leaders, obtained from the compatible leaders clustering method. We focus on (a) the inclusion of weights that enables clustering representatives to retain the same structure as if clustering only first order units and (b) the selection of relative dissimilarities that produce more interpretable, i.e., meaningful optimal clustering representatives. The usefulness of the proposed methods with adaptations was assessed and substantiated by carefully constructed simulation settings and demonstrated on three different real-world data sets gaining in interpretability from the use of weights (population pyramids and ESS data) or relative dissimilarity (US patents data).
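
The role of weights in point (a) can be illustrated with a small example. The following is a minimal sketch, not the authors' implementation: it assumes each unit is a single discrete distribution paired with a weight (here, the number of first order units it summarizes), uses the squared Euclidean distance as the dissimilarity, and takes the weighted mean as the cluster leader. The function names and the toy data are hypothetical, and the dissimilarities studied in the paper may differ.

```python
def sq_dist(p, t):
    """Squared Euclidean distance between two discrete distributions."""
    return sum((a - b) ** 2 for a, b in zip(p, t))


def weighted_leader(members):
    """Weighted mean of the member distributions.

    Each member is a (distribution, weight) pair. With weights equal to
    the number of first order units, the result equals the distribution
    obtained by pooling those first order units directly.
    """
    total = sum(w for _, w in members)
    size = len(members[0][0])
    return [sum(w * p[j] for p, w in members) / total for j in range(size)]


def leaders_clustering(units, initial_leaders, max_iter=100):
    """Alternate between assigning units to the nearest leader and
    recomputing leaders until the assignment stops changing."""
    leaders = [list(t) for t in initial_leaders]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(leaders)), key=lambda c: sq_dist(p, leaders[c]))
            for p, _ in units
        ]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for c in range(len(leaders)):
            members = [u for u, a in zip(units, assignment) if a == c]
            if members:  # keep the old leader if a cluster becomes empty
                leaders[c] = weighted_leader(members)
    return assignment, leaders


# Toy data (hypothetical): distributions over three categories, each
# paired with the number of first order units it summarizes.
units = [
    ([0.8, 0.1, 0.1], 100),
    ([0.6, 0.2, 0.2], 300),
    ([0.1, 0.1, 0.8], 50),
    ([0.2, 0.2, 0.6], 150),
]
initial = [units[0][0], units[2][0]]
assignment, leaders = leaders_clustering(units, initial)
print(assignment)
print(leaders)
```

In this toy run the leader of the first cluster is the weighted mean of its two members, so it is itself a distribution and matches what pooling the 400 underlying first order units would give. The resulting leaders, together with the total weight of each cluster, could then serve as input to an agglomerative step, as the abstract describes.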

Suggested Citation

  • Nataša Kejžar & Simona Korenjak-Černe & Vladimir Batagelj, 2021. "Clustering of modal-valued symbolic data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 15(2), pages 513-541, June.
  • Handle: RePEc:spr:advdac:v:15:y:2021:i:2:d:10.1007_s11634-020-00425-4
    DOI: 10.1007/s11634-020-00425-4
