
Similarity Measures for Clustering SNP Data

Author

Listed:
  • Ickstadt, Katja
  • Selinski, Silvia

Abstract

The question of suitable similarity measures for a particular kind of genetic data, so-called SNP data, arises from the GENICA (Interdisciplinary Study Group on Gene Environment Interaction and Breast Cancer in Germany) case-control study of sporadic breast cancer. The GENICA study aims to investigate the influence and interaction of single nucleotide polymorphic (SNP) loci and exogenous risk factors. A single nucleotide polymorphism is a point mutation that is present in at least 1% of a population. SNPs are the most common form of human genetic variation. In particular, we consider 65 SNP loci and 2 insertions of longer sequences in genes involved in the metabolism of hormones, xenobiotics and drugs as well as in DNA repair and signal transduction. Assuming that these single nucleotide changes may lead, for instance, to altered enzymes or to a reduced or enhanced amount of the original enzymes, with each alteration alone having minor effects, we aim to detect combinations of SNPs that increase the risk of sporadic breast cancer under certain environmental conditions. The search for patterns in the present data set may be performed by a variety of clustering and classification approaches. Here we consider the problem of suitable measures of proximity between two variables or subjects as an indispensable basis for a subsequent cluster analysis. Generally, clustering approaches are a useful tool to detect structures and to generate hypotheses about potential relationships in complex data situations. A search for patterns in the data has two possible objectives: the identification of groups of similar objects or subjects, or the identification of groups of similar variables, within the whole population or within subpopulations. Comparing individual genetic profiles as well as genetic information across subpopulations, we discuss possible choices of similarity measures, in particular those based on the counts of matches and mismatches.
New matching coefficients with a more flexible weighting scheme are introduced to account for the general problem in comparing SNP data: the large proportion of homozygous reference sequences relative to the homo- and heterozygous SNPs masks the agreements and differences of interest.
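
The masking effect described above can be illustrated with a minimal sketch. It is not the coefficients introduced in the report; it is a generic weighted matching coefficient written under stated assumptions: genotypes are coded 0 (homozygous reference), 1 (heterozygous SNP) and 2 (homozygous SNP), and the function names and the weights `w_ref`, `w_var` and `w_mis` are hypothetical.

```python
from typing import Sequence, Tuple


def match_counts(x: Sequence[int], y: Sequence[int]) -> Tuple[int, int, int]:
    """Count joint reference matches, joint variant matches, and mismatches.

    Assumed coding (not taken from the report):
    0 = homozygous reference, 1 = heterozygous SNP, 2 = homozygous SNP.
    """
    if len(x) != len(y):
        raise ValueError("genotype profiles must have the same length")
    ref = var = mis = 0
    for a, b in zip(x, y):
        if a != b:
            mis += 1      # genotypes differ at this locus
        elif a == 0:
            ref += 1      # both homozygous reference
        else:
            var += 1      # both carry the same variant genotype
    return ref, var, mis


def weighted_matching(
    x: Sequence[int],
    y: Sequence[int],
    w_ref: float = 1.0,   # hypothetical weight for reference/reference matches
    w_var: float = 1.0,   # hypothetical weight for matching variant genotypes
    w_mis: float = 1.0,   # hypothetical weight for mismatches
) -> float:
    """Weighted share of matches among all weighted comparisons."""
    ref, var, mis = match_counts(x, y)
    numerator = w_ref * ref + w_var * var
    denominator = numerator + w_mis * mis
    return numerator / denominator if denominator > 0 else 0.0


# Two illustrative profiles dominated by homozygous reference genotypes.
a = [0, 0, 0, 0, 1, 2, 0, 0]
b = [0, 0, 0, 0, 1, 0, 0, 0]

simple = weighted_matching(a, b)                    # all weights equal
down_weighted = weighted_matching(a, b, w_ref=0.0)  # ignore reference matches
print(simple, down_weighted)
```

With equal weights the score is the simple matching coefficient, which here is driven mostly by the six loci where both profiles carry the reference genotype. Setting `w_ref` to zero gives a Jaccard-type coefficient that depends only on the loci where at least one profile carries a variant, so the single mismatch counts for much more.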

Suggested Citation

  • Ickstadt, Katja & Selinski, Silvia, 2005. "Similarity Measures for Clustering SNP Data," Technical Reports 2005,27, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
  • Handle: RePEc:zbw:sfb475:200527

    Download full text from publisher

    File URL: https://www.econstor.eu/bitstream/10419/22617/1/tr27-05.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Ickstadt, Katja & Selinski, Silvia & Müller, Tina, 2005. "Cluster Analysis : A Comparison of Different Similarity Measures for SNP Data," Technical Reports 2005,14, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.

    Citations

    Cited by:

    1. Ickstadt, Katja & Selinski, Silvia & Müller, Tina, 2005. "Cluster Analysis : A Comparison of Different Similarity Measures for SNP Data," Technical Reports 2005,14, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    2. Selinski, Silvia, 2006. "Similarity Measures for Clustering SNP and Epidemiological Data," Technical Reports 2006,25, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Selinski, Silvia, 2006. "Similarity Measures for Clustering SNP and Epidemiological Data," Technical Reports 2006,25, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    2. Schwender, Holger, 2007. "A note on the simultaneous computation of thousands of Pearson's χ²-Statistics," Technical Reports 2007,19, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    3. Schwender, Holger & Ickstadt, Katja, 2008. "Imputing missing genotypes with weighted k nearest neighbors," Technical Reports 2008,03, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
