
Dendrograms, minimum spanning trees and feature selection

Author

Listed:
  • Labbé, Martine
  • Landete, Mercedes
  • Leal, Marina

Abstract

Feature selection is a fundamental process for avoiding overfitting and for reducing the size of databases without significant loss of information, and it applies to hierarchical clustering. Dendrograms are graphical representations of hierarchical clustering algorithms; for single linkage clustering, they can be interpreted as minimum spanning trees in the complete network defined by the database. In this work, we introduce the problem of jointly determining a set of features and a dendrogram according to the single linkage method. We propose different formulations that include the minimum spanning tree problem constraints as well as the feature selection constraints. Different bounds on the objective function are studied. For one of the models, several families of valid inequalities are proposed and the problem of separating them is studied. For another formulation, a decomposition algorithm is designed. In an extensive computational study, the effectiveness of the different models is discussed, and the model with valid inequalities is compared with the decomposition algorithm. The computational results also illustrate that integrating feature selection into the optimization model makes it possible to retain a satisfactory percentage of information.
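
The correspondence between single linkage dendrograms and minimum spanning trees mentioned in the abstract can be illustrated with a minimal Python sketch. It runs Kruskal's algorithm on the complete graph of observations, with distances computed only on a chosen subset of features; the accepted edges, in increasing weight order, give the merge heights of the single linkage dendrogram. The function name `single_linkage_merges`, the use of Euclidean distance, and the sample data are assumptions made for illustration. This is not the optimization model studied in the paper: here the feature subset is supplied as input rather than selected jointly with the tree.

```python
from itertools import combinations
from math import dist


def single_linkage_merges(points, features):
    """Return the MST edges (weight, i, j) in increasing weight order.

    Distances use only the coordinates listed in `features`
    (an illustrative choice, not the paper's formulation).
    """
    # Project every observation onto the selected features.
    proj = [[p[f] for f in features] for p in points]

    # All edges of the complete graph, sorted by distance.
    edges = sorted(
        (dist(proj[i], proj[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )

    # Union-find structure tracking the current clusters.
    parent = list(range(len(points)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # edge joins two clusters: a dendrogram merge
            parent[ri] = rj
            merges.append((w, i, j))
    return merges


# Hypothetical data: four observations with two features each.
data = [(0, 0), (1, 5), (3, 0), (7, 1)]
print(single_linkage_merges(data, features=[0]))     # first feature only
print(single_linkage_merges(data, features=[0, 1]))  # both features
```

Comparing the two calls shows how the choice of features changes both the tree and the merge heights, which is the interaction the joint problem has to account for.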

Suggested Citation

  • Labbé, Martine & Landete, Mercedes & Leal, Marina, 2023. "Dendrograms, minimum spanning trees and feature selection," European Journal of Operational Research, Elsevier, vol. 308(2), pages 555-567.
  • Handle: RePEc:eee:ejores:v:308:y:2023:i:2:p:555-567
    DOI: 10.1016/j.ejor.2022.11.031

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0377221722008906
    Download Restriction: Full text for ScienceDirect subscribers only



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Díaz, Verónica & Montoya, Ricardo & Maldonado, Sebastián, 2023. "Preference estimation under bounded rationality: Identification of attribute non-attendance in stated-choice data using a support vector machines approach," European Journal of Operational Research, Elsevier, vol. 304(2), pages 797-812.
    2. Chen, Huadun & Du, Qianxi & Huo, Tengfei & Liu, Peiran & Cai, Weiguang & Liu, Bingsheng, 2023. "Spatiotemporal patterns and driving mechanism of carbon emissions in China's urban residential building sector," Energy, Elsevier, vol. 263(PE).
    3. Yaeji Lim & Hee-Seok Oh & Ying Kuen Cheung, 2019. "Multiscale Clustering for Functional Data," Journal of Classification, Springer;The Classification Society, vol. 36(2), pages 368-391, July.
    4. Yujia Li & Xiangrui Zeng & Chien‐Wei Lin & George C. Tseng, 2022. "Simultaneous estimation of cluster number and feature sparsity in high‐dimensional cluster analysis," Biometrics, The International Biometric Society, vol. 78(2), pages 574-585, June.
    5. Dong Liu & Changwei Zhao & Yong He & Lei Liu & Ying Guo & Xinsheng Zhang, 2023. "Simultaneous cluster structure learning and estimation of heterogeneous graphs for matrix‐variate fMRI data," Biometrics, The International Biometric Society, vol. 79(3), pages 2246-2259, September.
    6. Jeffrey Andrews & Paul McNicholas, 2014. "Variable Selection for Clustering and Classification," Journal of Classification, Springer;The Classification Society, vol. 31(2), pages 136-153, July.
    7. Lifang Zhang & Jianzhou Wang & Zhenkun Liu, 2023. "Power grid operation optimization and forecasting using a combined forecasting system," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 42(1), pages 124-153, January.
    8. Jiang, Ping & Liu, Zhenkun & Wang, Jianzhou & Zhang, Lifang, 2021. "Decomposition-selection-ensemble forecasting system for energy futures price forecasting based on multi-objective version of chaos game optimization algorithm," Resources Policy, Elsevier, vol. 73(C).
    9. J. Fernando Vera & Rodrigo Macías, 2021. "On the Behaviour of K-Means Clustering of a Dissimilarity Matrix by Means of Full Multidimensional Scaling," Psychometrika, Springer;The Psychometric Society, vol. 86(2), pages 489-513, June.
    10. Seiya Maki & Satoshi Ohnishi & Minoru Fujii & Naohiro Goto & Lu Sun, 2022. "Using waste to supply steam for industry transition: Selection of target industries through economic evaluation and statistical analysis," Journal of Industrial Ecology, Yale University, vol. 26(4), pages 1475-1486, August.
    11. Zhiguang Huo & Li Zhu & Tianzhou Ma & Hongcheng Liu & Song Han & Daiqing Liao & Jinying Zhao & George Tseng, 2020. "Two-Way Horizontal and Vertical Omics Integration for Disease Subtype Discovery," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 12(1), pages 1-22, April.
    12. Cheng, Shulei & Fan, Wei & Zhang, Jian & Wang, Ning & Meng, Fanxin & Liu, Gengyuan, 2021. "Multi-sectoral determinants of carbon emission inequality in Chinese clustering cities," Energy, Elsevier, vol. 214(C).
    13. Charles Bouveyron & Camille Brunet-Saumard, 2014. "Discriminative variable selection for clustering with the sparse Fisher-EM algorithm," Computational Statistics, Springer, vol. 29(3), pages 489-513, June.
    14. Ozcan, Erhan C. & Görgülü, Berk & Baydogan, Mustafa G., 2024. "Column generation-based prototype learning for optimizing area under the receiver operating characteristic curve," European Journal of Operational Research, Elsevier, vol. 314(1), pages 297-307.
    15. Corinna Kleinert & Alexander Vosseler & Uwe Blien, 2018. "Classifying vocational training markets," The Annals of Regional Science, Springer;Western Regional Science Association, vol. 61(1), pages 31-48, July.
    16. Hosik Choi & Seokho Lee, 2019. "Convex clustering for binary data," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 13(4), pages 991-1018, December.
    17. Pi, J. & Wang, Honggang & Pardalos, Panos M., 2021. "A dual reformulation and solution framework for regularized convex clustering problems," European Journal of Operational Research, Elsevier, vol. 290(3), pages 844-856.
    18. Yang, Dongchuan & Guo, Ju-e & Li, Yanzhao & Sun, Shaolong & Wang, Shouyang, 2023. "Short-term load forecasting with an improved dynamic decomposition-reconstruction-ensemble approach," Energy, Elsevier, vol. 263(PA).
    19. Wang, Shaobin & Zhao, Chao & Liu, Hanbin & Tian, Xinglei, 2021. "Exploring the spatial spillover effects of low-grade coal consumption and influencing factors in China," Resources Policy, Elsevier, vol. 70(C).
    20. He Jiang, 2023. "Robust forecasting in spatial autoregressive model with total variation regularization," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 42(2), pages 195-211, March.
