
Bayesian Hierarchical Clustering for Studying Cancer Gene Expression Data with Unknown Statistics

Author

Listed:
  • Korsuk Sirinukunwattana
  • Richard S Savage
  • Muhammad F Bari
  • David R J Snead
  • Nasir M Rajpoot

Abstract

Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC on 11 cancer and 3 synthetic datasets. The results on the cancer datasets show that, in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than those of several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc/
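
To illustrate the conjugacy the abstract refers to, the following is a minimal sketch (not the authors' implementation) of the closed-form log marginal likelihood of one-dimensional Gaussian data under a normal-gamma prior on the mean and precision, together with a simple log Bayes factor for merging two clusters. The function names and the hyperparameter defaults (mu0, kappa0, alpha0, beta0) are assumptions chosen for illustration. The full BHC procedure also weights the merge hypothesis by a prior merge probability, which this sketch omits.

```python
import math


def log_marginal_normal_gamma(xs, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Log marginal likelihood of 1-D data xs under a Gaussian likelihood
    with a normal-gamma prior NG(mu0, kappa0, alpha0, beta0) on (mean, precision).
    Hyperparameter defaults are illustrative assumptions."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    # Standard conjugate posterior updates.
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n)
    return (
        math.lgamma(alpha_n) - math.lgamma(alpha0)
        + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
        + 0.5 * (math.log(kappa0) - math.log(kappa_n))
        - 0.5 * n * math.log(2.0 * math.pi)
    )


def log_bayes_factor_merge(a, b, **prior):
    """Log Bayes factor of 'a and b share one Gaussian' versus
    'a and b come from two separate Gaussians' (simplified, no merge prior)."""
    return (
        log_marginal_normal_gamma(a + b, **prior)
        - log_marginal_normal_gamma(a, **prior)
        - log_marginal_normal_gamma(b, **prior)
    )


if __name__ == "__main__":
    close = log_bayes_factor_merge([0.1, -0.1, 0.0], [0.05, -0.05, 0.0])
    far = log_bayes_factor_merge([0.0, 0.1, -0.1], [10.0, 10.1, 9.9])
    print("similar clusters, log BF:", close)
    print("separated clusters, log BF:", far)
```

A positive log Bayes factor favors merging the two groups, and a negative one favors keeping them apart. In a hierarchical scheme, comparisons of this kind are what allow the number of clusters to be inferred rather than fixed in advance.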

Suggested Citation

  • Korsuk Sirinukunwattana & Richard S Savage & Muhammad F Bari & David R J Snead & Nasir M Rajpoot, 2013. "Bayesian Hierarchical Clustering for Studying Cancer Gene Expression Data with Unknown Statistics," PLOS ONE, Public Library of Science, vol. 8(10), pages 1-11, October.
  • Handle: RePEc:plo:pone00:0075748
    DOI: 10.1371/journal.pone.0075748

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0075748
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0075748&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Brock, Guy & Pihur, Vasyl & Datta, Susmita & Datta, Somnath, 2008. "clValid: An R Package for Cluster Validation," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 25(i04).
    2. Lawrence Hubert & Phipps Arabie, 1985. "Comparing partitions," Journal of Classification, Springer;The Classification Society, vol. 2(1), pages 193-218, December.

    Citations



    Cited by:

    1. Guillaume Marrelec & Arnaud Messé & Pierre Bellec, 2015. "A Bayesian Alternative to Mutual Information for the Hierarchical Clustering of Dependent Random Variables," PLOS ONE, Public Library of Science, vol. 10(9), pages 1-26, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Ana Alina Tudoran, 2022. "A machine learning approach to identifying decision-making styles for managing customer relationships," Electronic Markets, Springer;IIM University of St. Gallen, vol. 32(1), pages 351-374, March.
    2. Wu, Han-Ming, 2011. "On biological validity indices for soft clustering algorithms for gene expression data," Computational Statistics & Data Analysis, Elsevier, vol. 55(5), pages 1969-1979, May.
    3. Volodymyr Melnykov & Xuwen Zhu, 2019. "An extension of the K-means algorithm to clustering skewed data," Computational Statistics, Springer, vol. 34(1), pages 373-394, March.
    4. Robert Darkins & Emma J Cooke & Zoubin Ghahramani & Paul D W Kirk & David L Wild & Richard S Savage, 2013. "Accelerating Bayesian Hierarchical Clustering of Time Series Data with a Randomised Algorithm," PLOS ONE, Public Library of Science, vol. 8(4), pages 1-9, April.
    5. Johann Kraus & Christoph Müssel & Günther Palm & Hans Kestler, 2011. "Multi-objective selection for collecting cluster alternatives," Computational Statistics, Springer, vol. 26(2), pages 341-353, June.
    6. Patrick Zschech & Kai Heinrich & Raphael Bink & Janis S. Neufeld, 2019. "Prognostic Model Development with Missing Labels," Business & Information Systems Engineering: The International Journal of WIRTSCHAFTSINFORMATIK, Springer;Gesellschaft für Informatik e.V. (GI), vol. 61(3), pages 327-343, June.
    7. Wu, Han-Ming & Tien, Yin-Jing & Chen, Chun-houh, 2010. "GAP: A graphical environment for matrix visualization and cluster analysis," Computational Statistics & Data Analysis, Elsevier, vol. 54(3), pages 767-778, March.
    8. José E. Chacón, 2021. "Explicit Agreement Extremes for a 2 × 2 Table with Given Marginals," Journal of Classification, Springer;The Classification Society, vol. 38(2), pages 257-263, July.
    9. Roberto Rocci & Stefano Antonio Gattone & Roberto Di Mari, 2018. "A data driven equivariant approach to constrained Gaussian mixture modeling," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(2), pages 235-260, June.
    10. Redivo, Edoardo & Nguyen, Hien D. & Gupta, Mayetri, 2020. "Bayesian clustering of skewed and multimodal data using geometric skewed normal distributions," Computational Statistics & Data Analysis, Elsevier, vol. 152(C).
    11. Zhu, Xuwen & Melnykov, Volodymyr, 2018. "Manly transformation in finite mixture modeling," Computational Statistics & Data Analysis, Elsevier, vol. 121(C), pages 190-208.
    12. Amiri, Babak & Karimianghadim, Ramin, 2024. "A novel text clustering model based on topic modelling and social network analysis," Chaos, Solitons & Fractals, Elsevier, vol. 181(C).
    13. Li, Pai-Ling & Chiou, Jeng-Min, 2011. "Identifying cluster number for subspace projected functional data clustering," Computational Statistics & Data Analysis, Elsevier, vol. 55(6), pages 2090-2103, June.
    14. A van Giessen & K G M Moons & G A de Wit & W M M Verschuren & J M A Boer & H Koffijberg, 2015. "Tailoring the Implementation of New Biomarkers Based on Their Added Predictive Value in Subgroups of Individuals," PLOS ONE, Public Library of Science, vol. 10(1), pages 1-14, January.
    15. Yaeji Lim & Hee-Seok Oh & Ying Kuen Cheung, 2019. "Multiscale Clustering for Functional Data," Journal of Classification, Springer;The Classification Society, vol. 36(2), pages 368-391, July.
    16. Stefano Tonellato & Andrea Pastore, 2013. "On the comparison of model-based clustering solutions," Working Papers 2013:05, Department of Economics, University of Venice "Ca' Foscari".
    17. Gainbi Park & Zengwang Xu, 2022. "The constituent components and local indicator variables of social vulnerability index," Natural Hazards: Journal of the International Society for the Prevention and Mitigation of Natural Hazards, Springer;International Society for the Prevention and Mitigation of Natural Hazards, vol. 110(1), pages 95-120, January.
    18. Elvira Pelle & Roberta Pappadà, 2021. "A clustering procedure for mixed-type data to explore ego network typologies: an application to elderly people living alone in Italy," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(5), pages 1507-1533, December.
    19. Renato Cordeiro Amorim, 2016. "A Survey on Feature Weighting Based K-Means Algorithms," Journal of Classification, Springer;The Classification Society, vol. 33(2), pages 210-242, July.
    20. Tom Wilderjans & Eva Ceulemans & Iven Mechelen, 2008. "The CHIC Model: A Global Model for Coupled Binary Data," Psychometrika, Springer;The Psychometric Society, vol. 73(4), pages 729-751, December.
