
Predicting missing values: a comparative study on non-parametric approaches for imputation

Author

Listed:
  • Burim Ramosaj

    (Technical University of Dortmund)

  • Markus Pauly

    (Technical University of Dortmund)

Abstract

Missing data is an expected issue when large amounts of data are collected, and several imputation techniques have been proposed to tackle this problem. Besides classical approaches such as MICE, the application of Machine Learning techniques is tempting. Here, the recently proposed missForest imputation method has shown high imputation accuracy under the Missing (Completely) at Random scheme with various missing rates. At its core, it is based on random forests for classification and regression. In this paper we study whether this approach can be further enhanced by other methods such as the stochastic gradient tree boosting method, the C5.0 algorithm, BART or modified random forest procedures. In particular, other resampling strategies within the random forest protocol are suggested. In an extensive simulation study, we analyze their performance for continuous, categorical and mixed-type data. An empirical analysis focusing on credit information and Facebook data complements our investigations.
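
As an illustration of the kind of simulation the abstract describes, the following minimal Python sketch deletes values completely at random at a chosen rate, imputes them, and scores the result against the known truth. It is not the authors' code: the function names (mask_mcar, impute_column_means, nrmse) are hypothetical, column-mean imputation is only a simple stand-in for the imputers compared in the paper (missForest, boosting, C5.0, BART), and the normalized root mean squared error is an assumed accuracy measure that the abstract does not name.

```python
import math
import random


def mask_mcar(rows, rate, rng):
    """Return a copy of rows where each cell is set to None with probability rate."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]


def impute_column_means(rows):
    """Stand-in imputer: replace None by the mean of the observed values in that column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        # Fall back to 0.0 if a column happens to be fully missing.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if v is None else v for j, v in enumerate(row)] for row in rows]


def nrmse(truth, masked, imputed):
    """RMSE over the masked cells, divided by the standard deviation of the true values there."""
    pairs = [
        (t, p)
        for t_row, m_row, p_row in zip(truth, masked, imputed)
        for t, m, p in zip(t_row, m_row, p_row)
        if m is None
    ]
    if not pairs:
        return 0.0
    mse = sum((t - p) ** 2 for t, p in pairs) / len(pairs)
    mean_t = sum(t for t, _ in pairs) / len(pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs) / len(pairs)
    return math.sqrt(mse / var_t) if var_t > 0 else float("inf")


# Synthetic complete data (an assumption for illustration): two correlated columns.
rng = random.Random(0)
complete = []
for _ in range(200):
    x = rng.gauss(0.0, 1.0)
    complete.append([x, 2.0 * x + rng.gauss(0.0, 0.1)])

# Repeat the mask / impute / score cycle for several missing rates.
for rate in (0.1, 0.2, 0.3):
    masked = mask_mcar(complete, rate, rng)
    imputed = impute_column_means(masked)
    print(f"missing rate {rate:.1f}: NRMSE = {nrmse(complete, masked, imputed):.3f}")
```

To compare methods in the spirit of the study, impute_column_means could be swapped for any other imputer with the same input and output shape, while the masking and scoring steps stay unchanged.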

Suggested Citation

  • Burim Ramosaj & Markus Pauly, 2019. "Predicting missing values: a comparative study on non-parametric approaches for imputation," Computational Statistics, Springer, vol. 34(4), pages 1741-1764, December.
  • Handle: RePEc:spr:compst:v:34:y:2019:i:4:d:10.1007_s00180-019-00900-3
    DOI: 10.1007/s00180-019-00900-3

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s00180-019-00900-3
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. van Buuren, Stef & Groothuis-Oudshoorn, Karin, 2011. "mice: Multivariate Imputation by Chained Equations in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 45(i03).
    2. Strobl, Carolin & Boulesteix, Anne-Laure & Augustin, Thomas, 2007. "Unbiased split selection for classification trees based on the Gini Index," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 483-501, September.
    3. Konietschke, F. & Harrar, S.W. & Lange, K. & Brunner, E., 2012. "Ranking procedures for matched pairs with missing data — Asymptotic theory and a small sample approximation," Computational Statistics & Data Analysis, Elsevier, vol. 56(5), pages 1090-1102.
    4. Claudio Conversano & Roberta Siciliano, 2009. "Incremental Tree-Based Missing Data Imputation with Lexicographic Ordering," Journal of Classification, Springer;The Classification Society, vol. 26(3), pages 361-379, December.
    5. Xu, Li-Wen & Yang, Fang-Qin & Abula, Aji’erguli & Qin, Shuang, 2013. "A parametric bootstrap approach for two-way ANOVA in presence of possible interactions with unequal variances," Journal of Multivariate Analysis, Elsevier, vol. 115(C), pages 172-180.
    6. Friedman, Jerome H., 2002. "Stochastic gradient boosting," Computational Statistics & Data Analysis, Elsevier, vol. 38(4), pages 367-378, February.
    7. Konietschke, Frank & Bathke, Arne C. & Harrar, Solomon W. & Pauly, Markus, 2015. "Parametric and nonparametric bootstrap methods for general MANOVA," Journal of Multivariate Analysis, Elsevier, vol. 140(C), pages 291-301.

    Citations


    Cited by:

    1. Yang, Yadong & Shahbeik, Hossein & Shafizadeh, Alireza & Masoudnia, Nima & Rafiee, Shahin & Zhang, Yijia & Pan, Junting & Tabatabaei, Meisam & Aghbashlo, Mortaza, 2022. "Biomass microwave pyrolysis characterization by machine learning for sustainable rural biorefineries," Renewable Energy, Elsevier, vol. 201(P2), pages 70-86.
    2. Mohamed Lamine Sidibé & Roland Yonaba & Fowé Tazen & Héla Karoui & Ousmane Koanda & Babacar Lèye & Harinaivo Anderson Andrianisa & Harouna Karambiri, 2023. "Understanding the COVID-19 pandemic prevalence in Africa through optimal feature selection and clustering: evidence from a statistical perspective," Environment, Development and Sustainability: A Multidisciplinary Approach to the Theory and Practice of Sustainable Development, Springer, vol. 25(11), pages 13565-13593, November.
    3. Christoph Stach & Clémentine Gritti & Julia Bräcker & Michael Behringer & Bernhard Mitschang, 2022. "Protecting Sensitive Data in the Information Age: State of the Art and Future Prospects," Future Internet, MDPI, vol. 14(11), pages 1-43, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Friedrich, Sarah & Pauly, Markus, 2018. "MATS: Inference for potentially singular and heteroscedastic MANOVA," Journal of Multivariate Analysis, Elsevier, vol. 165(C), pages 166-179.
    2. Huang Lin & Merete Eggesbø & Shyamal Das Peddada, 2022. "Linear and nonlinear correlation estimators unveil undescribed taxa interactions in microbiome data," Nature Communications, Nature, vol. 13(1), pages 1-16, December.
    3. Ali B. Barlas & Seda Guler Mert & Berk Orkun Isa & Alvaro Ortiz & Tomasa Rodrigo & Baris Soybilgen & Ege Yazgan, 2024. "Big data financial transactions and GDP nowcasting: The case of Turkey," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 43(2), pages 227-248, March.
    4. Milica Maricic & Jose A. Egea & Veljko Jeremic, 2019. "A Hybrid Enhanced Scatter Search—Composite I-Distance Indicator (eSS-CIDI) Optimization Approach for Determining Weights Within Composite Indicators," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 144(2), pages 497-537, July.
    5. Hapfelmeier, A. & Ulm, K., 2014. "Variable selection by Random Forests using data with missing values," Computational Statistics & Data Analysis, Elsevier, vol. 80(C), pages 129-139.
    6. Lukasz Struski & Marek Śmieja & Jacek Tabor, 2020. "Pointed Subspace Approach to Incomplete Data," Journal of Classification, Springer;The Classification Society, vol. 37(1), pages 42-57, April.
    7. Mondal, Anjana & Sattler, Paavo & Kumar, Somesh, 2023. "Testing against ordered alternatives in a two-way model without interaction under heteroscedasticity," Journal of Multivariate Analysis, Elsevier, vol. 196(C).
    8. Mansoor, Umer & Jamal, Arshad & Su, Junbiao & Sze, N.N. & Chen, Anthony, 2023. "Investigating the risk factors of motorcycle crash injury severity in Pakistan: Insights and policy recommendations," Transport Policy, Elsevier, vol. 139(C), pages 21-38.
    9. Noémi Kreif & Richard Grieve & Iván Díaz & David Harrison, 2015. "Evaluation of the Effect of a Continuous Treatment: A Machine Learning Approach with an Application to Treatment for Traumatic Brain Injury," Health Economics, John Wiley & Sons, Ltd., vol. 24(9), pages 1213-1228, September.
    10. Abhilash Bandam & Eedris Busari & Chloi Syranidou & Jochen Linssen & Detlef Stolten, 2022. "Classification of Building Types in Germany: A Data-Driven Modeling Approach," Data, MDPI, vol. 7(4), pages 1-23, April.
    11. Boonstra Philip S. & Little Roderick J.A. & West Brady T. & Andridge Rebecca R. & Alvarado-Leiton Fernanda, 2021. "A Simulation Study of Diagnostics for Selection Bias," Journal of Official Statistics, Sciendo, vol. 37(3), pages 751-769, September.
    12. Bissan Ghaddar & Ignacio Gómez-Casares & Julio González-Díaz & Brais González-Rodríguez & Beatriz Pateiro-López & Sofía Rodríguez-Ballesteros, 2023. "Learning for Spatial Branching: An Algorithm Selection Approach," INFORMS Journal on Computing, INFORMS, vol. 35(5), pages 1024-1043, September.
    13. Akash Malhotra, 2018. "A hybrid econometric-machine learning approach for relative importance analysis: Prioritizing food policy," Papers 1806.04517, arXiv.org, revised Aug 2020.
    14. Christopher J Greenwood & George J Youssef & Primrose Letcher & Jacqui A Macdonald & Lauryn J Hagg & Ann Sanson & Jenn Mcintosh & Delyse M Hutchinson & John W Toumbourou & Matthew Fuller-Tyszkiewicz &, 2020. "A comparison of penalised regression methods for informing the selection of predictive markers," PLOS ONE, Public Library of Science, vol. 15(11), pages 1-14, November.
    15. Liangyuan Hu & Lihua Li, 2022. "Using Tree-Based Machine Learning for Health Studies: Literature Review and Case Series," IJERPH, MDPI, vol. 19(23), pages 1-13, December.
    16. Norah Alyabs & Sy Han Chiou, 2022. "The Missing Indicator Approach for Accelerated Failure Time Model with Covariates Subject to Limits of Detection," Stats, MDPI, vol. 5(2), pages 1-13, May.
    17. Feldkircher, Martin, 2014. "The determinants of vulnerability to the global financial crisis 2008 to 2009: Credit growth and other sources of risk," Journal of International Money and Finance, Elsevier, vol. 43(C), pages 19-49.
    18. Nahushananda Chakravarthy H G & Karthik M Seenappa & Sujay Raghavendra Naganna & Dayananda Pruthviraja, 2023. "Machine Learning Models for the Prediction of the Compressive Strength of Self-Compacting Concrete Incorporating Incinerated Bio-Medical Waste Ash," Sustainability, MDPI, vol. 15(18), pages 1-22, September.
    19. Tim Voigt & Martin Kohlhase & Oliver Nelles, 2021. "Incremental DoE and Modeling Methodology with Gaussian Process Regression: An Industrially Applicable Approach to Incorporate Expert Knowledge," Mathematics, MDPI, vol. 9(19), pages 1-26, October.
    20. Eunsil Seok & Akhgar Ghassabian & Yuyan Wang & Mengling Liu, 2024. "Statistical Methods for Modeling Exposure Variables Subject to Limit of Detection," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 16(2), pages 435-458, July.
