
H-CLAP: hierarchical clustering within a linear array with an application in genetics

Author

Listed:
  • Ghosh Samiran

    (Department of Family Medicine and Public Health Sciences and Center of Molecular Medicine and Genetics, Wayne State University School of Medicine, 3127 Scott Hall, 540 East Canfield, Detroit, MI, USA)

  • Townsend Jeffrey P.

    (Department of Biostatistics and Program in Computational Biology and Bioinformatics, Yale University, 135 College Street, Suite 200, New Haven, CT 06510, USA)

Abstract

In most cases where clustering of data is desirable, the underlying data distribution to be clustered is unconstrained. However, clustering of site types in a discretely structured linear array, as is often desired in studies of linear sequences such as DNA, RNA or proteins, represents a problem where data points are not necessarily exchangeable and are directionally constrained within the array. Each position in the linear array is fixed, and could be either “marked” (i.e., of interest, such as polymorphic or substitute sites) or “non-marked.” Here we describe a method for clustering those marked sites. Since the cluster-generating process is constrained by discrete locality inside such an array, traditional clustering methods need adjustment to be appropriate. We develop a hierarchical Bayesian approach. We adopt a Markov clustering algorithm, revealing any natural partitioning in the pattern of marked sites. The resulting recursive partitioning and clustering algorithm is named hierarchical clustering in a linear array (H-CLAP). It employs domain-specific directional constraints directly in the likelihood construction. Our method, being fully Bayesian, is more flexible in cluster discovery than a standard agglomerative hierarchical clustering algorithm. It provides not only a hierarchical clustering but also cluster boundaries, which may have their own biological significance. We have tested the efficacy of our method on several data sets, including two biological and several simulated ones.
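
To make the setting concrete, the following is a minimal sketch of recursive partitioning of a linear array of marked and non-marked sites. It is not the authors' H-CLAP algorithm: it swaps the hierarchical Bayesian model for a greedy maximum-likelihood split with one Bernoulli marking rate per contiguous segment. The function names and the min_gain threshold are hypothetical choices made for illustration only.

```python
import math


def bernoulli_loglik(marked, total):
    """Maximised Bernoulli log-likelihood of `marked` hits among `total` sites."""
    if total == 0 or marked == 0 or marked == total:
        return 0.0  # a pure (or empty) segment is fitted perfectly
    p = marked / total
    return marked * math.log(p) + (total - marked) * math.log(1 - p)


def best_split(sites, lo, hi):
    """Find the boundary k in (lo, hi) that most improves the fit of sites[lo:hi].

    Returns (gain, k); k is None when no boundary improves the log-likelihood.
    """
    total_marked = sum(sites[lo:hi])
    base = bernoulli_loglik(total_marked, hi - lo)
    best_gain, best_k = 0.0, None
    left_marked = 0
    for k in range(lo + 1, hi):
        left_marked += sites[k - 1]
        split_ll = (bernoulli_loglik(left_marked, k - lo)
                    + bernoulli_loglik(total_marked - left_marked, hi - k))
        gain = split_ll - base
        if gain > best_gain:
            best_gain, best_k = gain, k
    return best_gain, best_k


def partition(sites, lo=0, hi=None, min_gain=3.0):
    """Recursively split the array; return the sorted list of cluster boundaries.

    `min_gain` is an assumed, ad hoc stopping threshold (in log-likelihood
    units), standing in for the model-based criterion a Bayesian method would use.
    """
    if hi is None:
        hi = len(sites)
    gain, k = best_split(sites, lo, hi)
    if k is None or gain < min_gain:
        return []
    return partition(sites, lo, k, min_gain) + [k] + partition(sites, k, hi, min_gain)


# Toy array: 1 = marked site, 0 = non-marked site.
sites = [0] * 20 + [1] * 10 + [0] * 20
boundaries = partition(sites)
```

Because every segment is a contiguous run of positions, the sketch respects the directional constraint described in the abstract: sites are never regrouped across the array, and the output is a set of boundaries rather than arbitrary cluster labels.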

Suggested Citation

  • Ghosh Samiran & Townsend Jeffrey P., 2015. "H-CLAP: hierarchical clustering within a linear array with an application in genetics," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 14(2), pages 125-141, April.
  • Handle: RePEc:bpj:sagmbi:v:14:y:2015:i:2:p:125-141:n:2
    DOI: 10.1515/sagmb-2013-0076



