
Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018)

Author

Listed:
  • Ahmad R. Alsaber

    (Department of Mathematics and Statistics, University of Strathclyde, Glasgow G1 1XH, UK
    Current address: Livingstone Tower (Level 9), 26 Richmond Street, Glasgow G1 1XH, UK.)

  • Jiazhu Pan

    (Department of Mathematics and Statistics, University of Strathclyde, Glasgow G1 1XH, UK)

  • Adeeba Al-Hurban

    (Department of Earth and Environmental Sciences, Faculty of Science, Kuwait University, P.O. Box 5969, Safat 13060, Kuwait)

Abstract

In environmental research, missing data are often a challenge for statistical modeling. This paper addresses some advanced techniques for dealing with missing values in an air quality data set using a multiple imputation (MI) approach. Three missing data mechanisms (MCAR, MAR, and NMAR) are applied to the data set. Five missing data levels are considered: 5%, 10%, 20%, 30%, and 40%. The imputation method used in this paper is missForest, an iterative imputation method based on the random forest approach. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. A logarithm transformation was applied to all pollutant data in order to normalize their distributions and to minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%) data. Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism had the lowest RMSE and MAE. We conclude that MI using the missForest approach has a high level of accuracy in estimating missing values. MissForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and, thus, can be considered appropriate for analyzing air quality data.
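The evaluation protocol described in the abstract (remove known values under a chosen mechanism and level, impute them, then score the imputations with RMSE and MAE) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic, all function names (`make_data`, `pick_missing`, `impute_regression`, `evaluate`) are hypothetical, and a simple one-predictor regression stands in for missForest so that the sketch needs only the Python standard library.

```python
import math
import random

# Missing-data levels taken from the abstract.
LEVELS = [0.05, 0.10, 0.20, 0.30, 0.40]


def make_data(n, rng):
    """Synthetic (temperature, log-pollutant) rows; purely illustrative."""
    rows = []
    for _ in range(n):
        temp = rng.uniform(10.0, 45.0)
        log_pollutant = 2.0 + 0.03 * temp + rng.gauss(0.0, 0.3)
        rows.append((temp, log_pollutant))
    return rows


def pick_missing(weights, level, rng):
    """Pick round(level * n) row indices without replacement.

    Rows with larger weights are more likely to be picked
    (Efraimidis-Spirakis keys: u ** (1 / w), keep the largest keys).
    """
    k = round(level * len(weights))
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return {i for _, i in keys[:k]}


def impute_regression(rows, missing):
    """Stand-in imputer (NOT missForest): least-squares fit of the
    log-pollutant on temperature, using only the observed rows."""
    obs = [rows[i] for i in range(len(rows)) if i not in missing]
    mean_t = sum(t for t, _ in obs) / len(obs)
    mean_y = sum(y for _, y in obs) / len(obs)
    sxx = sum((t - mean_t) ** 2 for t, _ in obs)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in obs)
    slope = sxy / sxx
    return {i: mean_y + slope * (rows[i][0] - mean_t) for i in missing}


def rmse(truth, guess):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(truth, guess)) / len(truth))


def mae(truth, guess):
    return sum(abs(a - b) for a, b in zip(truth, guess)) / len(truth)


def evaluate(rows, level, mechanism, rng):
    """Mask values under one mechanism, impute, return (RMSE, MAE)."""
    if mechanism == "MCAR":    # missingness independent of all values
        weights = [1.0] * len(rows)
    elif mechanism == "MAR":   # depends on an observed covariate
        weights = [t for t, _ in rows]
    elif mechanism == "NMAR":  # depends on the value that goes missing
        weights = [math.exp(y) for _, y in rows]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    missing = pick_missing(weights, level, rng)
    imputed = impute_regression(rows, missing)
    idx = sorted(missing)
    truth = [rows[i][1] for i in idx]
    guess = [imputed[i] for i in idx]
    return rmse(truth, guess), mae(truth, guess)


rng = random.Random(0)
data = make_data(1000, rng)
for mechanism in ("MCAR", "MAR", "NMAR"):
    for level in LEVELS:
        r, a = evaluate(data, level, mechanism, rng)
        print(f"{mechanism:5s} {level:4.0%}  RMSE={r:.3f}  MAE={a:.3f}")
```

Because the errors are computed only on values that were deliberately removed, the true values are known and the comparison across mechanisms and levels is fair. Swapping `impute_regression` for a random-forest-based imputer would bring the sketch closer to the method named in the abstract.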

Suggested Citation

  • Ahmad R. Alsaber & Jiazhu Pan & Adeeba Al-Hurban, 2021. "Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018)," IJERPH, MDPI, vol. 18(3), pages 1-25, February.
  • Handle: RePEc:gam:jijerp:v:18:y:2021:i:3:p:1333-:d:491512

    Download full text from publisher

    File URL: https://www.mdpi.com/1660-4601/18/3/1333/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1660-4601/18/3/1333/
    Download Restriction: no


