IDEAS home Printed from https://ideas.repec.org/p/zbw/sfb475/200625.html

Similarity Measures for Clustering SNP and Epidemiological Data

Author

Listed:
  • Selinski, Silvia

Abstract

The issue of suitable similarity measures for a joint consideration of so-called SNP data and epidemiological variables arises from the GENICA (Interdisciplinary Study Group on Gene Environment Interaction and Breast Cancer in Germany) case-control study of sporadic breast cancer. The GENICA study aims to investigate the influence and interaction of single nucleotide polymorphic (SNP) loci and exogenous risk factors. A single nucleotide polymorphism is a point mutation that is present in at least 1% of a population. SNPs are the most common form of human genetic variation. In particular, we consider 43 SNP loci in genes involved in the metabolism of hormones, xenobiotics and drugs as well as in the repair of DNA. Assuming that these single nucleotide changes may lead, for instance, to altered enzymes or to a reduced or enhanced amount of the original enzymes, with each alteration alone having minor effects, the aim is to detect combinations of SNPs that under certain environmental conditions increase the risk of sporadic breast cancer. The search for patterns in the present data set may be performed by a variety of clustering and classification approaches. I consider here the problem of suitable measures of proximity of two variables or subjects as an indispensable basis for a further cluster analysis. In the present data situation these measures have to be able to handle different numbers and meanings of categories of nominally scaled data as well as data of different scales. Generally, clustering approaches are a useful tool to detect structures and to generate hypotheses about potential relationships in complex data situations. When searching for patterns in the data, there are two possible objectives: the identification of groups of similar objects or subjects, or the identification of groups of similar variables within the whole population or within subpopulations. The different objectives imply different requirements on the measures of similarity.
Comparing the individual genetic profiles as well as comparing the genetic information across subpopulations, I discuss possible choices of similarity measures suitable for genetic and epidemiological data, in particular measures based on the χ²-statistic, Flexible Matching Coefficients and combinations of similarity measures.
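
To make the two objectives named in the abstract concrete, the following is a minimal Python sketch, not the paper's own definitions. It assumes genotypes are coded as categories such as 0, 1, 2. For comparing variables it uses Cramér's V, one common χ²-based similarity. For comparing subjects it uses the simple matching coefficient together with a weighted variant. The function names, the genotype coding and the weighting scheme (`match_weight`) are illustrative assumptions; the report's Flexible Matching Coefficients may be defined differently.

```python
from collections import Counter
from math import sqrt


def chi2_statistic(x, y):
    """Pearson's chi-square statistic of the contingency table of x and y."""
    n = len(x)
    joint = Counter(zip(x, y))
    row, col = Counter(x), Counter(y)
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n  # count expected under independence
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(x, y):
    """Chi-square-based similarity of two categorical variables, in [0, 1]."""
    k = min(len(set(x)), len(set(y)))
    if k < 2:
        return 0.0  # a constant variable carries no association
    return sqrt(chi2_statistic(x, y) / (len(x) * (k - 1)))


def simple_matching(s, t):
    """Share of variables on which two subjects have the same category."""
    return sum(a == b for a, b in zip(s, t)) / len(s)


def weighted_matching(s, t, match_weight):
    """Matching coefficient in which a match on category c counts match_weight[c].

    Hypothetical weighting scheme for illustration: unlisted categories get
    weight 1 and every mismatch counts 1 in the denominator.
    """
    matches = sum(match_weight.get(a, 1.0) for a, b in zip(s, t) if a == b)
    mismatches = sum(a != b for a, b in zip(s, t))
    total = matches + mismatches
    return matches / total if total else 0.0
```

With all weights equal to 1, `weighted_matching` reduces to `simple_matching`. Because Cramér's V is normalised by the smaller number of categories, it can compare variables with different numbers of categories, which is one of the requirements the abstract names.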

Suggested Citation

  • Selinski, Silvia, 2006. "Similarity Measures for Clustering SNP and Epidemiological Data," Technical Reports 2006,25, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
  • Handle: RePEc:zbw:sfb475:200625

    Download full text from publisher

    File URL: https://www.econstor.eu/bitstream/10419/22668/1/tr25-06.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Ickstadt, Katja & Selinski, Silvia & Müller, Tina, 2005. "Cluster Analysis : A Comparison of Different Similarity Measures for SNP Data," Technical Reports 2005,14, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    2. Ickstadt, Katja & Selinski, Silvia, 2005. "Similarity Measures for Clustering SNP Data," Technical Reports 2005,27, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    3. Jerome H. Friedman & Jacqueline J. Meulman, 2004. "Clustering objects on subsets of attributes (with discussion)," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 66(4), pages 815-849, November.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Zhaoyu Xing & Yang Wan & Juan Wen & Wei Zhong, 2024. "GOLFS: feature selection via combining both global and local information for high dimensional clustering," Computational Statistics, Springer, vol. 39(5), pages 2651-2675, July.
    2. Jian Guo & Elizaveta Levina & George Michailidis & Ji Zhu, 2010. "Pairwise Variable Selection for High-Dimensional Model-Based Clustering," Biometrics, The International Biometric Society, vol. 66(3), pages 793-804, September.
    3. Cathy Maugis & Gilles Celeux & Marie-Laure Martin-Magniette, 2009. "Variable Selection for Clustering with Gaussian Mixture Models," Biometrics, The International Biometric Society, vol. 65(3), pages 701-709, September.
    4. Nicoleta Serban, 2008. "Estimating and clustering curves in the presence of heteroscedastic errors," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 20(7), pages 553-571.
    5. Floriello, Davide & Vitelli, Valeria, 2017. "Sparse clustering of functional data," Journal of Multivariate Analysis, Elsevier, vol. 154(C), pages 1-18.
    6. Maarten M. Kampert & Jacqueline J. Meulman & Jerome H. Friedman, 2017. "rCOSA: A Software Package for Clustering Objects on Subsets of Attributes," Journal of Classification, Springer;The Classification Society, vol. 34(3), pages 514-547, October.
    7. Peter D. Hoff, 2005. "Subset Clustering of Binary Sequences, with an Application to Genomic Abnormality Data," Biometrics, The International Biometric Society, vol. 61(4), pages 1027-1036, December.
    8. Schwender, Holger, 2007. "A note on the simultaneous computation of thousands of Pearson's X2-Statistics," Technical Reports 2007,19, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    9. Lian, Heng, 2010. "Sparse Bayesian hierarchical modeling of high-dimensional clustering problems," Journal of Multivariate Analysis, Elsevier, vol. 101(7), pages 1728-1737, August.
    10. Gaynor, Sheila & Bair, Eric, 2017. "Identification of relevant subtypes via preweighted sparse clustering," Computational Statistics & Data Analysis, Elsevier, vol. 116(C), pages 139-154.
    11. Nikulin, V., 2006. "Threshold-based clustering with merging and regularization in application to network intrusion detection," Computational Statistics & Data Analysis, Elsevier, vol. 51(2), pages 1184-1196, November.
    12. Beibei Yuan & Willem Heiser & Mark Rooij, 2019. "The δ-Machine: Classification Based on Distances Towards Prototypes," Journal of Classification, Springer;The Classification Society, vol. 36(3), pages 442-470, October.
    13. Ronglai Shen & Qianxing Mo & Nikolaus Schultz & Venkatraman E Seshan & Adam B Olshen & Jason Huse & Marc Ladanyi & Chris Sander, 2012. "Integrative Subtype Discovery in Glioblastoma Using iCluster," PLOS ONE, Public Library of Science, vol. 7(4), pages 1-9, April.
    14. Grün, Bettina & Leisch, Friedrich, 2009. "Dealing with label switching in mixture models under genuine multimodality," Journal of Multivariate Analysis, Elsevier, vol. 100(5), pages 851-861, May.
    15. Benhuai Xie & Wei Pan & Xiaotong Shen, 2008. "Variable Selection in Penalized Model‐Based Clustering Via Regularization on Grouped Parameters," Biometrics, The International Biometric Society, vol. 64(3), pages 921-930, September.
    16. Francisco de A. T. Carvalho & Antonio Irpino & Rosanna Verde & Antonio Balzanella, 2022. "Batch Self-Organizing Maps for Distributional Data with an Automatic Weighting of Variables and Components," Journal of Classification, Springer;The Classification Society, vol. 39(2), pages 343-375, July.
    17. Arias-Castro, Ery & Pu, Xiao, 2017. "A simple approach to sparse clustering," Computational Statistics & Data Analysis, Elsevier, vol. 105(C), pages 217-228.
    18. Yang, Aijun & Jiang, Xuejun & Liu, Pengfei & Lin, Jinguan, 2016. "Sparse Bayesian multinomial probit regression model with correlation prior for high-dimensional data classification," Statistics & Probability Letters, Elsevier, vol. 119(C), pages 241-247.
    19. Ickstadt, Katja & Selinski, Silvia, 2005. "Similarity Measures for Clustering SNP Data," Technical Reports 2005,27, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    20. Ickstadt, Katja & Selinski, Silvia & Müller, Tina, 2005. "Cluster Analysis : A Comparison of Different Similarity Measures for SNP Data," Technical Reports 2005,14, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
