Printed from https://ideas.repec.org/a/spr/jclass/v40y2023i1d10.1007_s00357-023-09430-6.html

Classification Trees with Mismeasured Responses

Authors

  • Liqun Diao (University of Waterloo)
  • Grace Y. Yi (University of Western Ontario)

Abstract

Classification trees are a popular machine learning tool for studying a variety of problems, including prediction, inference, risk factor identification, and risk group classification. They are typically developed under the assumption that the response and covariate variables are accurately measured. This assumption, however, is often violated in practice, and ignoring mismeasurement commonly yields invalid analysis results. In this paper, we study the impact of mismeasured responses on the performance of standard classification trees and propose a novel classification tree algorithm that accommodates mismeasured responses. Our study focuses on settings with binary responses that are subject to mismeasurement. To address the effects of mismeasured responses, we modify the decision rules that are valid for tree building in mismeasurement-free settings by introducing new measures of node impurity and misclassification cost. To characterize the magnitude of mismeasurement in responses, we consider two data scenarios. In the first scenario, the mismeasurement rates are known, either from previous studies of the same nature or because they are set by researchers who wish to conduct sensitivity analyses assessing the impact of mismeasured responses. In the second scenario, the mismeasurement rates are unknown and are estimated from a validation dataset that contains both accurate measurements and mismeasurements of the responses. We conduct a variety of simulation studies to assess the performance of the proposed classification tree algorithm in comparison with the usual classification tree algorithms, which ignore response mismeasurement. The results demonstrate that ignoring response mismeasurement can yield seriously erroneous results and that the proposed method provides superior performance by accommodating the mismeasurement effects. To illustrate the use of the proposed method, we analyze data from the National Health and Nutrition Examination Surveys (NHANES), conducting sensitivity analyses to assess how classification results may be affected by different misclassification costs.
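
The abstract does not spell out the new impurity measure, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a binary response with misclassification rates gamma01 = P(observed 1 | true 0) and gamma10 = P(observed 0 | true 1), and it uses the standard identity P(observed 1) = p(1 - gamma10) + (1 - p)gamma01 to recover the true positive proportion p in a node before computing a Gini impurity. The function names (estimate_rates, corrected_proportion, gini) and the clipping to [0, 1] are illustrative assumptions. The first function mirrors the second scenario, where the rates are estimated from a validation sample with both true and observed labels.

```python
from typing import Sequence, Tuple


def estimate_rates(true_y: Sequence[int], observed_y: Sequence[int]) -> Tuple[float, float]:
    """Estimate (gamma01, gamma10) from a validation sample (hypothetical helper).

    gamma01 = P(observed 1 | true 0), gamma10 = P(observed 0 | true 1).
    """
    n0 = sum(1 for t in true_y if t == 0)
    n1 = sum(1 for t in true_y if t == 1)
    if n0 == 0 or n1 == 0:
        raise ValueError("validation sample needs both true classes")
    false_pos = sum(1 for t, o in zip(true_y, observed_y) if t == 0 and o == 1)
    false_neg = sum(1 for t, o in zip(true_y, observed_y) if t == 1 and o == 0)
    return false_pos / n0, false_neg / n1


def corrected_proportion(observed_y: Sequence[int], gamma01: float, gamma10: float) -> float:
    """Recover the true positive proportion in a node from mismeasured labels.

    Inverts P(observed 1) = p * (1 - gamma10) + (1 - p) * gamma01.
    Clipping to [0, 1] is an assumption made here for small-sample stability.
    """
    denom = 1.0 - gamma01 - gamma10
    if denom <= 0:
        raise ValueError("rates must satisfy gamma01 + gamma10 < 1")
    p_star = sum(observed_y) / len(observed_y)  # naive observed proportion
    p = (p_star - gamma01) / denom
    return min(1.0, max(0.0, p))


def gini(p: float) -> float:
    """Gini impurity of a binary node with positive proportion p."""
    return 2.0 * p * (1.0 - p)


if __name__ == "__main__":
    # Scenario 2: rates estimated from a small, made-up validation sample.
    val_true = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
    val_obs = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
    g01, g10 = estimate_rates(val_true, val_obs)

    # A made-up node with 4 observed positives out of 10.
    node_obs = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    naive = gini(sum(node_obs) / len(node_obs))
    adjusted = gini(corrected_proportion(node_obs, g01, g10))
    print(f"naive Gini: {naive:.4f}, adjusted Gini: {adjusted:.4f}")
```

In the first scenario, known rates (or a grid of rates for a sensitivity analysis) would be passed to corrected_proportion directly instead of being estimated.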

Suggested Citation

  • Liqun Diao & Grace Y. Yi, 2023. "Classification Trees with Mismeasured Responses," Journal of Classification, Springer;The Classification Society, vol. 40(1), pages 168-191, April.
  • Handle: RePEc:spr:jclass:v:40:y:2023:i:1:d:10.1007_s00357-023-09430-6
    DOI: 10.1007/s00357-023-09430-6