IDEAS home Printed from https://ideas.repec.org/a/eee/csdana/v71y2014icp832-848.html

Estimating mutual information for feature selection in the presence of label noise

Authors

  • Frénay, Benoît
  • Doquire, Gauthier
  • Verleysen, Michel

Abstract

A method for feature selection in classification problems polluted by label noise is proposed. The performance of traditional feature selection algorithms often decreases sharply when some samples are wrongly labelled. The method combines a probabilistic label noise model with a nearest-neighbour entropy estimator to robustly evaluate the mutual information, a popular relevance criterion for feature selection. A backward greedy search procedure is used with this criterion to find relevant sets of features. Experiments establish that (i) possible label noise needs to be taken into account when selecting features and (ii) the proposed methodology effectively reduces the negative impact of mislabelled data points on the feature selection process.
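
The two generic ingredients named in the abstract, a nearest-neighbour entropy estimator and a backward greedy search driven by mutual information, can be illustrated with a small sketch. This is not the authors' implementation: it uses the standard Kozachenko-Leonenko estimator as an assumed stand-in, it leaves out the probabilistic label noise model entirely, and the function names (`knn_entropy`, `mutual_information`, `backward_selection`) and the toy data are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, n))


def knn_entropy(points, k=3):
    """Kozachenko-Leonenko k-NN estimate of differential entropy, in nats."""
    n, d = len(points), len(points[0])
    # Log-volume of the d-dimensional Euclidean unit ball.
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    total = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Distance to the k-th nearest neighbour; guard against log(0).
        total += math.log(max(dists[k - 1], 1e-12))
    return digamma_int(n) - digamma_int(k) + log_unit_ball + d * total / n


def mutual_information(X, y, features, k=3):
    """I(X_S; Y) = H(X_S) - sum_c p(c) H(X_S | Y = c) for a feature subset S."""
    sub = [[row[f] for f in features] for row in X]
    mi = knn_entropy(sub, k)
    for c in set(y):
        rows = [s for s, label in zip(sub, y) if label == c]
        mi -= len(rows) / len(sub) * knn_entropy(rows, k)
    return mi


def backward_selection(X, y, n_keep, k=3):
    """Greedily drop the feature whose removal leaves the highest MI."""
    selected = list(range(len(X[0])))
    while len(selected) > n_keep:
        drop = max(
            selected,
            key=lambda f: mutual_information(
                X, y, [g for g in selected if g != f], k
            ),
        )
        selected.remove(drop)
    return selected


# Toy data (an assumption for illustration): feature 0 depends on the
# class label, features 1 and 2 are pure noise. Labels are clean here.
rng = random.Random(0)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(4.0 * label, 1.0), rng.gauss(0, 1), rng.gauss(0, 1)]
     for label in y]
selected = backward_selection(X, y, n_keep=1)
print(selected)
```

Because this sketch treats the observed labels as correct, flipping a fraction of the entries in `y` would bias the class-conditional entropies; that is the situation the noise model described in the abstract is meant to address.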

Suggested Citation

  • Frénay, Benoît & Doquire, Gauthier & Verleysen, Michel, 2014. "Estimating mutual information for feature selection in the presence of label noise," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 832-848.
  • Handle: RePEc:eee:csdana:v:71:y:2014:i:c:p:832-848
    DOI: 10.1016/j.csda.2013.05.001

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S016794731300159X
    Download Restriction: Full text for ScienceDirect subscribers only.



    Citations

    Cited by:

    1. Abpeykar, Shadi & Ghatee, Mehdi & Zare, Hadi, 2019. "Ensemble decision forest of RBF networks via hybrid feature clustering approach for high-dimensional data classification," Computational Statistics & Data Analysis, Elsevier, vol. 131(C), pages 12-36.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Pires, Ana M. & Branco, João A., 2010. "Projection-pursuit approach to robust linear discriminant analysis," Journal of Multivariate Analysis, Elsevier, vol. 101(10), pages 2464-2485, November.
    2. Wenya Liu & Qi Li, 2017. "An Efficient Elastic Net with Regression Coefficients Method for Variable Selection of Spectrum Data," PLOS ONE, Public Library of Science, vol. 12(2), pages 1-13, February.
    3. Brendan P. W. Ames & Mingyi Hong, 2016. "Alternating direction method of multipliers for penalized zero-variance discriminant analysis," Computational Optimization and Applications, Springer, vol. 64(3), pages 725-754, July.
    4. Herbert Pang & Tiejun Tong & Hongyu Zhao, 2009. "Shrinkage-based Diagonal Discriminant Analysis and Its Applications in High-Dimensional Data," Biometrics, The International Biometric Society, vol. 65(4), pages 1021-1029, December.
    5. Lambert-Lacroix, Sophie & Peyre, Julie, 2006. "Local likelihood regression in generalized linear single-index models with applications to microarray data," Computational Statistics & Data Analysis, Elsevier, vol. 51(3), pages 2091-2113, December.
    6. Kubokawa, Tatsuya & Hyodo, Masashi & Srivastava, Muni S., 2013. "Asymptotic expansion and estimation of EPMC for linear classification rules in high dimension," Journal of Multivariate Analysis, Elsevier, vol. 115(C), pages 496-515.
    7. Bang, Sungwan & Jhun, Myoungshic, 2012. "Simultaneous estimation and factor selection in quantile regression via adaptive sup-norm regularization," Computational Statistics & Data Analysis, Elsevier, vol. 56(4), pages 813-826.
    8. Jong Victor L. & Novianti Putri W. & Roes Kit C.B. & Eijkemans Marinus J.C., 2014. "Exploring homogeneity of correlation structures of gene expression datasets within and between etiological disease categories," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 13(6), pages 1-16, December.
    9. Yang, Tae Young, 2009. "Efficient multi-class cancer diagnosis algorithm, using a global similarity pattern," Computational Statistics & Data Analysis, Elsevier, vol. 53(3), pages 756-765, January.
    10. Sung, Bongjung & Lee, Jaeyong, 2023. "Covariance structure estimation with Laplace approximation," Journal of Multivariate Analysis, Elsevier, vol. 198(C).
    11. Dennis Kostka & Rainer Spang, 2008. "Microarray Based Diagnosis Profits from Better Documentation of Gene Expression Signatures," PLOS Computational Biology, Public Library of Science, vol. 4(2), pages 1-6, February.
    12. Mohammad S. Uddin & Guotai Chi & Mazin A. M. Al Janabi & Tabassum Habib, 2022. "Leveraging random forest in micro‐enterprises credit risk modelling for accuracy and interpretability," International Journal of Finance & Economics, John Wiley & Sons, Ltd., vol. 27(3), pages 3713-3729, July.
    13. Pedro Duarte Silva, A., 2011. "Two-group classification with high-dimensional correlated data: A factor model approach," Computational Statistics & Data Analysis, Elsevier, vol. 55(11), pages 2975-2990, November.
    14. Scrucca, Luca, 2007. "Class prediction and gene selection for DNA microarrays using regularized sliced inverse regression," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 438-451, September.
    15. Ruiyan Luo & Xin Qi, 2017. "Asymptotic Optimality of Sparse Linear Discriminant Analysis with Arbitrary Number of Classes," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 44(3), pages 598-616, September.
    16. Shen, Yanfeng & Lin, Zhengyan & Zhu, Jun, 2011. "Shrinkage-based regularization tests for high-dimensional data with application to gene set analysis," Computational Statistics & Data Analysis, Elsevier, vol. 55(7), pages 2221-2233, July.
    17. Timothy I. Cannings & Richard J. Samworth, 2017. "Random-projection ensemble classification," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 79(4), pages 959-1035, September.
    18. Parrish, Rudolph S. & Spencer III, Horace J. & Xu, Ping, 2009. "Distribution modeling and simulation of gene expression data," Computational Statistics & Data Analysis, Elsevier, vol. 53(5), pages 1650-1660, March.
    19. Ivana Krtolica & Dragan Savić & Bojana Bajić & Snežana Radulović, 2022. "Machine Learning for Water Quality Assessment Based on Macrophyte Presence," Sustainability, MDPI, vol. 15(1), pages 1-13, December.
    20. Alan R Dabney & John D Storey, 2007. "Optimality Driven Nearest Centroid Classification from Genomic Data," PLOS ONE, Public Library of Science, vol. 2(10), pages 1-7, October.
