
On multivariate binary data clustering and feature weighting

Author

Listed:
  • Bouguila, Nizar

Abstract

This paper presents an approach that partitions data sets of unlabeled binary vectors without a priori information about the number of clusters or the saliency of the features. The unsupervised binary feature selection problem is approached using finite mixture models of multivariate Bernoulli distributions. Using stochastic complexity, the proposed model simultaneously determines the number of clusters in a given data set of binary vectors and the saliency of the features used. We present several applications involving real data, document classification, and image categorization to show the merits of the proposed approach.
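
To make the modeling idea in the abstract concrete, here is a minimal sketch of clustering binary vectors with a finite mixture of multivariate Bernoulli distributions, fitted by plain expectation-maximization (EM). It is an illustration under simplifying assumptions, not the paper's algorithm: the number of clusters is fixed in advance, and neither the stochastic-complexity criterion nor the feature-saliency weights described in the abstract are implemented. The function name `fit_bernoulli_mixture`, its parameters, and the toy data are hypothetical.

```python
import math
import random


def fit_bernoulli_mixture(data, n_clusters, n_iter=50, seed=0, eps=1e-6):
    """Fit a mixture of multivariate Bernoulli distributions with EM.

    Assumption: n_clusters is given. The paper instead selects it
    (and the feature saliencies) via stochastic complexity.

    data: list of equal-length lists of 0/1 values.
    Returns (weights, probs, resp), where resp[i][k] is the posterior
    probability that vector i belongs to cluster k.
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    weights = [1.0 / n_clusters] * n_clusters
    # Random initial Bernoulli parameters, kept away from 0 and 1.
    probs = [[rng.uniform(0.25, 0.75) for _ in range(d)]
             for _ in range(n_clusters)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities, computed in log space.
        resp = []
        for x in data:
            log_joint = []
            for k in range(n_clusters):
                ll = math.log(weights[k])
                for j in range(d):
                    p = probs[k][j]
                    ll += math.log(p) if x[j] else math.log(1.0 - p)
                log_joint.append(ll)
            top = max(log_joint)
            unnorm = [math.exp(v - top) for v in log_joint]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-step: re-estimate weights and per-feature probabilities.
        # The small eps keeps every parameter strictly inside (0, 1).
        for k in range(n_clusters):
            nk = sum(r[k] for r in resp)
            weights[k] = (nk + eps) / (n + n_clusters * eps)
            for j in range(d):
                ones = sum(r[k] * x[j] for r, x in zip(resp, data))
                probs[k][j] = (ones + eps) / (nk + 2.0 * eps)
    return weights, probs, resp


# Toy data (hypothetical): two groups of binary vectors with opposite patterns.
data = [[1, 1, 1, 0, 0, 0]] * 10 + [[0, 0, 0, 1, 1, 1]] * 10
weights, probs, resp = fit_bernoulli_mixture(data, n_clusters=2)
# Hard assignment: pick the most probable cluster for each vector.
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
```

The E-step works in log space so that products of many per-feature probabilities do not underflow on high-dimensional binary data. Cluster indices are arbitrary, so only the grouping in `labels` is meaningful, not which cluster is numbered 0 or 1.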

Suggested Citation

  • Bouguila, Nizar, 2010. "On multivariate binary data clustering and feature weighting," Computational Statistics & Data Analysis, Elsevier, vol. 54(1), pages 120-134, January.
  • Handle: RePEc:eee:csdana:v:54:y:2010:i:1:p:120-134

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167-9473(09)00261-8
    Download Restriction: Full text for ScienceDirect subscribers only.

    References listed on IDEAS

    1. Bengt Muthén, 1978. "Contributions to factor analysis of dichotomous variables," Psychometrika, Springer;The Psychometric Society, vol. 43(4), pages 551-560, December.
    2. D. R. Cox, 1972. "The Analysis of Multivariate Binary Data," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 21(2), pages 113-120, June.
    3. Gyllenberg, Mats & Koski, Timo & Verlaan, Martin, 1997. "Classification of Binary Vectors by Stochastic Complexity," Journal of Multivariate Analysis, Elsevier, vol. 63(1), pages 47-72, October.
    4. Mats Gyllenberg & Timo Koski, 1996. "Numerical taxonomy and the principle of maximum entropy," Journal of Classification, Springer;The Classification Society, vol. 13(2), pages 213-229, September.
    5. J. D. Wilbur & J. K. Ghosh & C. H. Nakatsu & S. M. Brouder & R. W. Doerge, 2002. "Variable Selection in High-Dimensional Multivariate Binary Data with Application to the Analysis of Microbial Community DNA Fingerprints," Biometrics, The International Biometric Society, vol. 58(2), pages 378-386, June.
    6. Govaert, G. & Nadif, M., 1996. "Comparison of the mixture and the classification maximum likelihood in cluster analysis with binary data," Computational Statistics & Data Analysis, Elsevier, vol. 23(1), pages 65-81, November.
    7. Gilles Celeux & Gérard Govaert, 1991. "Clustering criteria for discrete data and latent class models," Journal of Classification, Springer;The Classification Society, vol. 8(2), pages 157-176, December.
    8. Anders Christoffersson, 1975. "Factor analysis of dichotomized variables," Psychometrika, Springer;The Psychometric Society, vol. 40(1), pages 5-32, March.

    Citations

    Cited by:

    1. Guillaume Gautreau & Adelme Bazin & Mathieu Gachet & Rémi Planel & Laura Burlot & Mathieu Dubois & Amandine Perrin & Claudine Médigue & Alexandra Calteau & Stéphane Cruveiller & Catherine Matias & Chr, 2020. "PPanGGOLiN: Depicting microbial diversity via a partitioned pangenome graph," PLOS Computational Biology, Public Library of Science, vol. 16(3), pages 1-27, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Yang Yixin & Lü Xin & Ma Jian & Qiao Han, 2014. "A Robust Factor Analysis Model for Dichotomous Data," Journal of Systems Science and Information, De Gruyter, vol. 2(5), pages 437-450, October.
    2. Beth Reboussin & Kung-Yee Liang, 1998. "An estimating equations approach for the LISCOMP model," Psychometrika, Springer;The Psychometric Society, vol. 63(2), pages 165-182, June.
    3. Wayne DeSarbo & Jaewun Cho, 1989. "A stochastic multidimensional scaling vector threshold model for the spatial representation of “pick any/n” data," Psychometrika, Springer;The Psychometric Society, vol. 54(1), pages 105-129, March.
    4. Edward Haertel, 1990. "Continuous and discrete latent structure models for item response data," Psychometrika, Springer;The Psychometric Society, vol. 55(3), pages 477-494, September.
    5. Albert Maydeu-Olivares & Harry Joe, 2006. "Limited Information Goodness-of-fit Testing in Multidimensional Contingency Tables," Psychometrika, Springer;The Psychometric Society, vol. 71(4), pages 713-732, December.
    6. Mark Reiser, 1996. "Analysis of residuals for the multinomial item response model," Psychometrika, Springer;The Psychometric Society, vol. 61(3), pages 509-528, September.
    7. Kamel Jedidi & Wayne DeSarbo, 1991. "A stochastic multidimensional scaling procedure for the spatial representation of three-mode, three-way pick any/J data," Psychometrika, Springer;The Psychometric Society, vol. 56(3), pages 471-494, September.
    8. Park, Junyong, 2009. "Independent rule in classification of multivariate binary data," Journal of Multivariate Analysis, Elsevier, vol. 100(10), pages 2270-2286, November.
    9. Gyllenberg, Mats & Koski, Timo & Verlaan, Martin, 1997. "Classification of Binary Vectors by Stochastic Complexity," Journal of Multivariate Analysis, Elsevier, vol. 63(1), pages 47-72, October.
    10. Govaert, Gérard & Nadif, Mohamed, 2008. "Block clustering with Bernoulli mixture models: Comparison of different approaches," Computational Statistics & Data Analysis, Elsevier, vol. 52(6), pages 3233-3245, February.
    11. Maydeu-Olivares, Albert, 2002. "Limited information estimation and testing of Thurstonian models for preference data," Mathematical Social Sciences, Elsevier, vol. 43(3), pages 467-483, July.
    12. Kromidha, Endrit & Li, Matthew C., 2019. "Determinants of leadership in online social trading: A signaling theory perspective," Journal of Business Research, Elsevier, vol. 97(C), pages 184-197.
    13. Kim, Chul & Jun, Duk Bin & Park, Sungho, 2018. "Capturing flexible correlations in multiple-discrete choice outcomes using copulas," International Journal of Research in Marketing, Elsevier, vol. 35(1), pages 34-59.
    14. Richards, Timothy J. & Hamilton, Stephen F. & Yonezawa, Koichi, 2018. "Retail Market Power in a Shopping Basket Model of Supermarket Competition," Journal of Retailing, Elsevier, vol. 94(3), pages 328-342.
    15. Timothy Tyler Brown & Juan Pablo Atal, 2019. "How robust are reference pricing studies on outpatient medical procedures? Three different preprocessing techniques applied to difference‐in differences," Health Economics, John Wiley & Sons, Ltd., vol. 28(2), pages 280-298, February.
    16. L. Sun & M. K. Clayton, 2008. "Bayesian Analysis of Crossclassified Spatial Data with Autocorrelation," Biometrics, The International Biometric Society, vol. 64(1), pages 74-84, March.
    17. Heimeriks, K. & Duysters, G.M. & Vanhaverbeke, W.P.M., 2004. "The evolution of alliance capabilities," Working Papers 04.20, Eindhoven Center for Innovation Studies.
    18. Francesco Bartolucci & Claudia Pigini, 2018. "Partial effects estimation for fixed-effects logit panel data models," Working Papers 431, Universita' Politecnica delle Marche (I), Dipartimento di Scienze Economiche e Sociali.
    19. Monica Billio & Roberto Casarin & Matteo Iacopini, 2024. "Bayesian Markov-Switching Tensor Regression for Time-Varying Networks," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 119(545), pages 109-121, January.
    20. Brajendra C. Sutradhar, 2022. "Fixed versus Mixed Effects Based Marginal Models for Clustered Correlated Binary Data: an Overview on Advances and Challenges," Sankhya B: The Indian Journal of Statistics, Springer;Indian Statistical Institute, vol. 84(1), pages 259-302, May.
