IDEAS home Printed from https://ideas.repec.org/p/arx/papers/2402.07521.html

A step towards the integration of machine learning and small area estimation

Author

Listed:
  • Tomasz Żądło
  • Adam Chwila

Abstract

The use of machine-learning techniques has grown in numerous research areas. They are now also widely used in statistics, including official statistics, both for data collection (e.g. satellite imagery, web scraping and text mining, data cleaning, integration and imputation) and for data analysis. However, the use of these methods in survey sampling, including small area estimation, is still very limited. Therefore, we propose a predictor supported by these algorithms, which can be used to predict any population or subpopulation characteristic based on cross-sectional and longitudinal data. Machine learning methods have already been shown to be very powerful in identifying and modelling complex and nonlinear relationships between variables, which means that they have very good properties in the case of strong departures from the classic assumptions. Therefore, we analyse the performance of our proposal under a different set-up, which is in our opinion of greater importance in real-life surveys. We study only small departures from the assumed model, to show that our proposal is a good alternative in this case as well, even in comparison with methods that are optimal under the model. What is more, we propose a method for estimating the accuracy of machine learning predictors, which makes it possible to compare their accuracy with that of classic methods, where accuracy is measured as in survey sampling practice. The solution to this problem is indicated in the literature as one of the key issues in the integration of these approaches. The simulation studies are based on a real, longitudinal dataset, freely available from the Polish Local Data Bank, in which the problem of predicting subpopulation characteristics in the last period, with "borrowing strength" from other subpopulations and time periods, is considered.
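The abstract does not spell out the form of the proposed predictor, so the following is only an illustrative sketch and not the authors' method. It assumes a plug-in predictor of a subpopulation total: observed values are summed for sampled units, and a learner fitted on the whole sample supplies predictions for the non-sampled units. The learner here is a toy nearest-neighbour regressor standing in for any machine learning algorithm; the names `knn_predict`, `predict_domain_total` and the example data are hypothetical.

```python
from typing import Dict, List, Optional, Sequence


def knn_predict(train_x: Sequence[float], train_y: Sequence[float],
                x: float, k: int = 1) -> float:
    """Predict y at x as the mean response of the k nearest training units."""
    nearest = sorted(zip(train_x, train_y), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def predict_domain_total(units: List[Dict], domain: str, k: int = 1) -> float:
    """Plug-in prediction of the total of y in one subpopulation (domain).

    Each unit is a dict with keys 'domain', 'x' and 'y'; y is None for
    units outside the sample. The learner is trained on sampled units from
    ALL domains, which is how strength is borrowed from other subpopulations.
    """
    sampled = [u for u in units if u["y"] is not None]
    train_x = [u["x"] for u in sampled]
    train_y = [u["y"] for u in sampled]

    total = 0.0
    for u in units:
        if u["domain"] != domain:
            continue
        y: Optional[float] = u["y"]
        # Observed value for sampled units, model prediction otherwise.
        total += y if y is not None else knn_predict(train_x, train_y, u["x"], k)
    return total


# Hypothetical toy population: two domains, one non-sampled unit in each.
units = [
    {"domain": "A", "x": 1.0, "y": 2.0},
    {"domain": "A", "x": 2.0, "y": 4.0},
    {"domain": "A", "x": 3.0, "y": None},
    {"domain": "B", "x": 3.0, "y": 6.0},
    {"domain": "B", "x": 4.0, "y": 8.0},
    {"domain": "B", "x": 5.0, "y": None},
]

total_a = predict_domain_total(units, "A", k=1)
total_b = predict_domain_total(units, "B", k=1)
print(total_a, total_b)
```

In this toy example the non-sampled unit in domain A is predicted from a sampled unit in domain B, which is the sense in which the sketch borrows strength across subpopulations. Estimating the accuracy of such a predictor, which the abstract names as a contribution of the paper, is not covered by this sketch.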

Suggested Citation

  • Tomasz Żądło & Adam Chwila, 2024. "A step towards the integration of machine learning and small area estimation," Papers 2402.07521, arXiv.org.
  • Handle: RePEc:arx:papers:2402.07521

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2402.07521
    File Function: Latest version
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Rebecca C. Steorts & Timo Schmid & Nikos Tzavidis, 2020. "Smoothing and Benchmarking for Small Area Estimation," International Statistical Review, International Statistical Institute, vol. 88(3), pages 580-598, December.
    2. Tomasz Żądło, 2015. "On longitudinal moving average model for prediction of subpopulation total," Statistical Papers, Springer, vol. 56(3), pages 749-771, August.
    3. Katarzyna Reluga & María‐José Lombardía & Stefan Sperlich, 2023. "Simultaneous inference for linear mixed model parameters with an application to small area estimation," International Statistical Review, International Statistical Institute, vol. 91(2), pages 193-217, August.
    4. Schmid, Timo & Tzavidis, Nikos & Münnich, Ralf & Chambers, Ray, 2015. "Outlier robust small area estimation under spatial correlation," Discussion Papers 2015/8, Free University Berlin, School of Business & Economics.
    5. Timo Schmid & Nikos Tzavidis & Ralf Münnich & Ray Chambers, 2016. "Outlier Robust Small-Area Estimation Under Spatial Correlation," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 43(3), pages 806-826, September.
    6. Patrick Krennmair & Timo Schmid, 2022. "Flexible domain prediction using mixed effects random forests," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 71(5), pages 1865-1894, November.
    7. repec:csb:stintr:v:17:y:2016:i:1:p:9-24 is not listed on IDEAS
    8. Erciulescu Andreea L. & Fuller Wayne A., 2016. "Small Area Prediction Under Alternative Model Specifications," Statistics in Transition New Series, Statistics Poland, vol. 17(1), pages 9-24, March.
    9. Dian Handayani & Henk Folmer & Anang Kurnia & Khairil Anwar Notodiputro, 2018. "The spatial empirical Bayes predictor of the small area mean for a lognormal variable of interest and spatially correlated random effects," Empirical Economics, Springer, vol. 55(1), pages 147-167, August.
    10. G. Bertarelli & R. Chambers & N. Salvati, 2021. "Outlier robust small domain estimation via bias correction and robust bootstrapping," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 30(1), pages 331-357, March.
    11. Andreea L. Erciulescu & Wayne A. Fuller, 2016. "Small Area Prediction Under Alternative Model Specifications," Statistics in Transition New Series, Polish Statistical Association, vol. 17(1), pages 9-24, March.
    12. Baldermann, Claudia & Salvati, Nicola & Schmid, Timo, 2016. "Robust small area estimation under spatial non-stationarity," Discussion Papers 2016/5, Free University Berlin, School of Business & Economics.
    13. Angelo Moretti, 2023. "Estimation of small area proportions under a bivariate logistic mixed model," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(4), pages 3663-3684, August.
    14. Flores-Agreda, Daniel & Cantoni, Eva, 2019. "Bootstrap estimation of uncertainty in prediction for generalized linear mixed models," Computational Statistics & Data Analysis, Elsevier, vol. 130(C), pages 1-17.
    15. Ralf Münnich & Jan Burgard & Martin Vogt, 2013. "Small Area-Statistik: Methoden und Anwendungen," AStA Wirtschafts- und Sozialstatistisches Archiv, Springer;Deutsche Statistische Gesellschaft - German Statistical Society, vol. 6(3), pages 149-191, March.
    16. Chandra, Hukum & Salvati, Nicola & Chambers, Ray, 2018. "Small area estimation under a spatially non-linear model," Computational Statistics & Data Analysis, Elsevier, vol. 126(C), pages 19-38.
    17. Marhuenda, Yolanda & Molina, Isabel & Morales, Domingo, 2013. "Small area estimation with spatio-temporal Fay–Herriot models," Computational Statistics & Data Analysis, Elsevier, vol. 58(C), pages 308-325.
    18. Isabel Molina & Nicola Salvati & Monica Pratesi, 2009. "Bootstrap for estimating the MSE of the Spatial EBLUP," Computational Statistics, Springer, vol. 24(3), pages 441-458, August.
    19. Masaki,Takaaki & Newhouse,David Locke & Silwal,Ani Rudra & Bedada,Adane & Engstrom,Ryan, 2020. "Small Area Estimation of Non-Monetary Poverty with Geospatial Data," Policy Research Working Paper Series 9383, The World Bank.
    20. Caterina Giusti & Lucio Masserini & Monica Pratesi, 2017. "Local Comparisons of Small Area Estimates of Poverty: An Application Within the Tuscany Region in Italy," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 131(1), pages 235-254, March.
    21. Shonosuke Sugasawa & Tatsuya Kubokawa & J. N. K. Rao, 2018. "Small area estimation via unmatched sampling and linking models," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 27(2), pages 407-427, June.
