
Sparse Bayesian hierarchical modeling of high-dimensional clustering problems

Author

Listed:
  • Lian, Heng

Abstract

Clustering is one of the most widely used procedures in the analysis of microarray data, for example with the goal of discovering cancer subtypes based on observed heterogeneity of genetic marks between different tissues. It is well known that in such high-dimensional settings, the presence of many noise variables can overwhelm the few signals embedded in the high-dimensional space. We propose a novel Bayesian approach based on a Dirichlet process with a sparsity prior that simultaneously performs variable selection and clustering, and also discovers variables that distinguish only a subset of the cluster components. Unlike previous Bayesian formulations, we use the Dirichlet process (DP) both for clustering the samples and for regularizing the high-dimensional mean/variance structure. To address the computational challenge brought by this double use of the DP, we propose a sequential sampling scheme embedded within Markov chain Monte Carlo (MCMC) updates, which improves on the naive implementation of existing algorithms for DP mixture models. Our method is demonstrated in a simulation study and illustrated with the leukemia gene expression dataset.
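
For readers unfamiliar with how a Dirichlet process induces a clustering, the following is a minimal sketch of the Chinese restaurant process, the standard sequential representation of the DP prior over partitions. It is not the algorithm of the article: the function name sample_crp_assignments, the concentration value alpha=1.0, and the sample size of 50 are illustrative assumptions, and the sketch draws only from the prior, without any likelihood, sparsity prior, or MCMC updates.

```python
import random


def sample_crp_assignments(n_items, alpha, rng):
    """Draw cluster labels for n_items (>= 1) from a Chinese restaurant process.

    Hypothetical helper for illustration only; alpha is the DP concentration.
    """
    labels = [0]   # the first item always opens cluster 0
    counts = [1]   # counts[k] = number of items currently in cluster k
    for _ in range(1, n_items):
        # An existing cluster k is chosen with weight counts[k];
        # a brand-new cluster is chosen with weight alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        labels.append(k)
    return labels, counts


rng = random.Random(0)  # fixed seed so the draw is reproducible
labels, counts = sample_crp_assignments(50, alpha=1.0, rng=rng)
print("number of clusters:", len(counts))
print("cluster sizes:", counts)
```

The number of clusters is not fixed in advance: larger values of alpha tend to open more clusters, which is the property that lets a DP mixture infer the number of components from the data.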

Suggested Citation

  • Lian, Heng, 2010. "Sparse Bayesian hierarchical modeling of high-dimensional clustering problems," Journal of Multivariate Analysis, Elsevier, vol. 101(7), pages 1728-1737, August.
  • Handle: RePEc:eee:jmvana:v:101:y:2010:i:7:p:1728-1737

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0047-259X(10)00068-0
    Download Restriction: Full text for ScienceDirect subscribers only


    References listed on IDEAS

    1. Nott, David J., 2008. "Predictive performance of Dirichlet process shrinkage methods in linear regression," Computational Statistics & Data Analysis, Elsevier, vol. 52(7), pages 3658-3669, March.
    2. Ibrahim J. G. & Chen M-H. & Gray R. J., 2002. "Bayesian Models for Gene Expression With DNA Microarray Data," Journal of the American Statistical Association, American Statistical Association, vol. 97, pages 88-99, March.
    3. Dudoit S. & Fridlyand J. & Speed T. P, 2002. "Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data," Journal of the American Statistical Association, American Statistical Association, vol. 97, pages 77-87, March.
    4. Rodríguez, Abel & Dunson, David B & Gelfand, Alan E, 2008. "The Nested Dirichlet Process," Journal of the American Statistical Association, American Statistical Association, vol. 103(483), pages 1131-1154.
    5. Sinae Kim & Mahlet G. Tadesse & Marina Vannucci, 2006. "Variable selection in clustering via Dirichlet process mixture models," Biometrika, Biometrika Trust, vol. 93(4), pages 877-893, December.
    6. Ma, Ping & Zhong, Wenxuan, 2008. "Penalized Clustering of Large-Scale Functional Data With Multiple Covariates," Journal of the American Statistical Association, American Statistical Association, vol. 103, pages 625-636, June.
    7. Tadesse, Mahlet G. & Sha, Naijun & Vannucci, Marina, 2005. "Bayesian Variable Selection in Clustering High-Dimensional Data," Journal of the American Statistical Association, American Statistical Association, vol. 100, pages 602-617, June.
    8. Matthew Stephens, 2000. "Dealing with label switching in mixture models," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 62(4), pages 795-809.
    9. van Dyk, David A. & Park, Taeyoung, 2008. "Partially Collapsed Gibbs Samplers: Theory and Methods," Journal of the American Statistical Association, American Statistical Association, vol. 103, pages 790-796, June.
    10. Fraley C. & Raftery A.E., 2002. "Model-Based Clustering, Discriminant Analysis, and Density Estimation," Journal of the American Statistical Association, American Statistical Association, vol. 97, pages 611-631, June.
    11. Jerome H. Friedman & Jacqueline J. Meulman, 2004. "Clustering objects on subsets of attributes (with discussion)," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 66(4), pages 815-849, November.
    12. Carvalho, Carlos M. & Chang, Jeffrey & Lucas, Joseph E. & Nevins, Joseph R. & Wang, Quanli & West, Mike, 2008. "High-Dimensional Sparse Factor Modeling: Applications in Gene Expression Genomics," Journal of the American Statistical Association, American Statistical Association, vol. 103(484), pages 1438-1456.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Chakraborty, Sounak, 2009. "Bayesian binary kernel probit model for microarray based cancer classification and gene selection," Computational Statistics & Data Analysis, Elsevier, vol. 53(12), pages 4198-4209, October.
    2. Benhuai Xie & Wei Pan & Xiaotong Shen, 2008. "Variable Selection in Penalized Model‐Based Clustering Via Regularization on Grouped Parameters," Biometrics, The International Biometric Society, vol. 64(3), pages 921-930, September.
    3. Sun Jiehuan & Warren Joshua L. & Zhao Hongyu, 2017. "A Bayesian semiparametric factor analysis model for subtype identification," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 16(2), pages 145-158, April.
    4. Cathy Maugis & Gilles Celeux & Marie-Laure Martin-Magniette, 2009. "Variable Selection for Clustering with Gaussian Mixture Models," Biometrics, The International Biometric Society, vol. 65(3), pages 701-709, September.
    5. Jian Guo & Elizaveta Levina & George Michailidis & Ji Zhu, 2010. "Pairwise Variable Selection for High-Dimensional Model-Based Clustering," Biometrics, The International Biometric Society, vol. 66(3), pages 793-804, September.
    6. Aßmann, Christian & Boysen-Hogrefe, Jens & Pape, Markus, 2012. "The directional identification problem in Bayesian factor analysis: An ex-post approach," Kiel Working Papers 1799, Kiel Institute for the World Economy (IfW Kiel).
    7. Conti, Gabriella & Frühwirth-Schnatter, Sylvia & Heckman, James J. & Piatek, Rémi, 2014. "Bayesian exploratory factor analysis," Journal of Econometrics, Elsevier, vol. 183(1), pages 31-57.
    8. Montanari, Angela & Viroli, Cinzia, 2011. "Maximum likelihood estimation of mixtures of factor analyzers," Computational Statistics & Data Analysis, Elsevier, vol. 55(9), pages 2712-2723, September.
    9. De la Cruz-Mesia, Rolando & Quintana, Fernando A. & Marshall, Guillermo, 2008. "Model-based clustering for longitudinal data," Computational Statistics & Data Analysis, Elsevier, vol. 52(3), pages 1441-1457, January.
    10. Aßmann, Christian & Boysen-Hogrefe, Jens, 2009. "A bayesian approach to model-based clustering for panel probit models," Economics Working Papers 2009-03, Christian-Albrechts-University of Kiel, Department of Economics.
    11. Bassetti, Federico & Casarin, Roberto & Leisen, Fabrizio, 2014. "Beta-product dependent Pitman–Yor processes for Bayesian inference," Journal of Econometrics, Elsevier, vol. 180(1), pages 49-72.
    12. Crook Oliver M. & Gatto Laurent & Kirk Paul D. W., 2019. "Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 18(6), pages 1-20, December.
    13. Thierry Chekouo & Alejandro Murua, 2018. "High-dimensional variable selection with the plaid mixture model for clustering," Computational Statistics, Springer, vol. 33(3), pages 1475-1496, September.
    14. Juarez, Miguel A. & Steel, Mark F. J., 2006. "Model-based Clustering of non-Gaussian Panel Data," MPRA Paper 880, University Library of Munich, Germany.
    15. Sijian Wang & Ji Zhu, 2008. "Variable Selection for Model-Based High-Dimensional Clustering and Its Application to Microarray Data," Biometrics, The International Biometric Society, vol. 64(2), pages 440-448, June.
    16. Brian J. Reich & Howard D. Bondell, 2011. "A Spatial Dirichlet Process Mixture Model for Clustering Population Genetics Data," Biometrics, The International Biometric Society, vol. 67(2), pages 381-390, June.
    17. Shaikh Mateen & McNicholas Paul D & Desmond Anthony F, 2010. "A Pseudo-EM Algorithm for Clustering Incomplete Longitudinal Data," The International Journal of Biostatistics, De Gruyter, vol. 6(1), pages 1-17, March.
    18. Aßmann, Christian & Boysen-Hogrefe, Jens & Pape, Markus, 2014. "Bayesian analysis of dynamic factor models: An ex-post approach towards the rotation problem," Kiel Working Papers 1902, Kiel Institute for the World Economy (IfW Kiel).
    19. Aßmann, Christian & Boysen-Hogrefe, Jens & Pape, Markus, 2016. "Bayesian analysis of static and dynamic factor models: An ex-post approach towards the rotation problem," Journal of Econometrics, Elsevier, vol. 192(1), pages 190-206.
    20. Krzanowski, Wojtek J. & Hand, David J., 2009. "A simple method for screening variables before clustering microarray data," Computational Statistics & Data Analysis, Elsevier, vol. 53(7), pages 2747-2753, May.
