Printed from https://ideas.repec.org/a/spr/annopr/v263y2018i1d10.1007_s10479-014-1589-3.html

Information-theoretic feature selection with discrete k-median clustering

Author

Listed:
  • Onur Şeref

    (Virginia Polytechnic Institute and State University)

  • Ya-Ju Fan

    (Lawrence Livermore National Laboratory)

  • Elan Borenstein

    (Rutgers University)

  • Wanpracha A. Chaovalitwongse

    (University of Washington)

Abstract

We propose a novel computational framework that integrates information-theoretic feature selection with discrete k-median clustering (DKM). DKM is a domain-independent clustering algorithm which requires a pairwise distance matrix between samples that can be defined arbitrarily as input. In the proposed DKM clustering, the center of each cluster is represented by a set of samples, which induce a separate set of clusters for each feature dimension. We evaluate the relevance of each feature by the normalized mutual information (NMI) scores between the base clusters using all features and the induced clusters for that feature dimension. We propose a spectral cluster analysis (SCA) method to determine the number of clusters using the average of the relevance NMI scores. We introduce filter- and wrapper-based feature selection algorithms that produce a ranked list of features using the relevance NMI scores. We create an information gain curve and calculate the normalized area under this curve to quantify information gain and identify the contributing features. We study the properties of our information-theoretic framework for clustering, SCA and feature selection on simulated data. We demonstrate that SCA can accurately identify the number of clusters in simulated data and public benchmark datasets. We also compare the clustering and feature selection performance of our framework to other domain-dependent and domain-independent algorithms on public benchmark datasets and a real-life neural time series dataset. We show that DKM runs comparably fast with better performance.
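
To make the relevance scoring described in the abstract concrete, the following is a minimal sketch of ranking features by the NMI between a base clustering and per-feature induced clusterings. It is not the authors' implementation: the function names, the toy labels, and the choice of normalization (mutual information divided by the arithmetic mean of the two entropies) are assumptions made for illustration, and the abstract does not say which NMI variant the paper uses.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (natural log) of a cluster label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same samples."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (pa[x] * pb[y]))
        for (x, y), c in pab.items()
    )


def nmi(a, b):
    """NMI normalized by the mean entropy (an assumed convention)."""
    ha, hb = entropy(a), entropy(b)
    if ha + hb == 0:
        return 1.0  # both labelings put every sample in one cluster
    return 2.0 * mutual_information(a, b) / (ha + hb)


def rank_features(base, induced):
    """Return (feature, score) pairs sorted by decreasing relevance NMI."""
    scores = {name: nmi(base, labels) for name, labels in induced.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: a base clustering over six samples and the
# clusterings induced by three individual feature dimensions.
base = [0, 0, 0, 1, 1, 1]
induced = {
    "f1": [1, 1, 1, 0, 0, 0],  # same partition as base, labels swapped
    "f2": [0, 0, 1, 1, 1, 1],  # agrees with base on five of six samples
    "f3": [0, 1, 2, 0, 1, 2],  # statistically independent of base
}
ranking = rank_features(base, induced)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy case the feature whose induced clusters match the base clusters up to relabeling scores highest and the independent feature scores lowest, which is the ordering a filter-style ranking by relevance NMI would use.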

Suggested Citation

  • Onur Şeref & Ya-Ju Fan & Elan Borenstein & Wanpracha A. Chaovalitwongse, 2018. "Information-theoretic feature selection with discrete k-median clustering," Annals of Operations Research, Springer, vol. 263(1), pages 93-118, April.
  • Handle: RePEc:spr:annopr:v:263:y:2018:i:1:d:10.1007_s10479-014-1589-3
    DOI: 10.1007/s10479-014-1589-3

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s10479-014-1589-3
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    File URL: https://libkey.io/10.1007/s10479-014-1589-3?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item


    References listed on IDEAS

    1. W. Art Chaovalitwongse & Ya-Ju Fan & Rajesh C. Sachdeo, 2008. "Novel Optimization Models for Abnormal Brain Activity Classification," Operations Research, INFORMS, vol. 56(6), pages 1450-1460, December.
    2. Onur Seref & O. Erhun Kundakcioglu & Oleg A. Prokopyev & Panos M. Pardalos, 2009. "Selective support vector machines," Journal of Combinatorial Optimization, Springer, vol. 17(1), pages 3-20, January.
    3. Onur Seref & Ya-Ju Fan & Wanpracha Art Chaovalitwongse, 2014. "Mathematical Programming Formulations and Algorithms for Discrete k-Median Clustering of Time-Series Data," INFORMS Journal on Computing, INFORMS, vol. 26(1), pages 160-172, February.
    4. Dilip Chhajed & Timothy J. Lowe, 1992. "m-Median and m-Center Problems with Mutual Communication: Solvable Special Cases," Operations Research, INFORMS, vol. 40(1-supplem), pages 56-66, February.

    Citations

    Cited by:

    1. F. Benedetto & L. Mastroeni & P. Vellucci, 2021. "Modeling the flow of information between financial time-series by an entropy-based approach," Annals of Operations Research, Springer, vol. 299(1), pages 1235-1252, April.
    2. Tai Vovan & Dinh Phamtoan & Le Hoang Tuan & Thao Nguyentrang, 2021. "An automatic clustering for interval data using the genetic algorithm," Annals of Operations Research, Springer, vol. 303(1), pages 359-380, August.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Wanpracha Chaovalitwongse, 2009. "Comments on: Optimization and data mining in medicine," TOP: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 17(2), pages 247-249, December.
    2. Aykin, Turgut, 1995. "The hub location and routing problem," European Journal of Operational Research, Elsevier, vol. 83(1), pages 200-219, May.
    3. Carrizosa, Emilio & Nogales-Gómez, Amaya & Romero Morales, Dolores, 2017. "Clustering categories in support vector machines," Omega, Elsevier, vol. 66(PA), pages 28-37.
    4. Carrizosa, Emilio & Kurishchenko, Kseniia & Marín, Alfredo & Romero Morales, Dolores, 2022. "Interpreting clusters via prototype optimization," Omega, Elsevier, vol. 107(C).
    5. Onur Seref & Ya-Ju Fan & Wanpracha Art Chaovalitwongse, 2014. "Mathematical Programming Formulations and Algorithms for Discrete k-Median Clustering of Time-Series Data," INFORMS Journal on Computing, INFORMS, vol. 26(1), pages 160-172, February.
