Printed from https://ideas.repec.org/a/gam/jmathe/v9y2021i23p2991-d685449.html

On the Use of Gradient Boosting Methods to Improve the Estimation with Data Obtained with Self-Selection Procedures

Author

Listed:
  • Luis Castro-Martín

    (Department of Statistics and Operational Research, University of Granada, 18011 Granada, Spain)

  • María del Mar Rueda

    (Department of Statistics and Operational Research, University of Granada, 18011 Granada, Spain)

  • Ramón Ferri-García

    (Department of Statistics and Operational Research, University of Granada, 18011 Granada, Spain)

  • César Hernando-Tamayo

    (Department of Statistics and Operational Research, University of Granada, 18011 Granada, Spain)

Abstract

In recent years, web surveys have established themselves as one of the main methods in empirical research. However, coverage and selection bias in such surveys have undercut their utility for statistical inference in finite populations. To compensate for these biases, researchers have employed a variety of statistical techniques to adjust nonprobability samples so that they more closely match the population. In this study, we test the potential of the XGBoost algorithm in the most important estimation methods that integrate data from a probability survey and a nonprobability survey. We also compare the effectiveness of these methods at eliminating bias. The results show that the four proposed estimators based on gradient boosting frameworks can improve survey representativity with respect to other classic prediction methods. The proposed methodology is also used to analyze a real nonprobability survey sample on the social effects of COVID-19.
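
The abstract names no specific estimator, so the following is only an illustrative sketch of one common way to combine a nonprobability sample with a probability (reference) sample: propensity score adjustment, with the propensity model fitted by gradient boosting. It is not the authors' implementation. A tiny hand-rolled booster over decision stumps stands in for XGBoost so that the example needs no third-party packages; the function name `fit_boosted_stumps`, the toy data, and the assumption that the reference sample is self-weighting (equal design weights) are all hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def fit_boosted_stumps(x, z, rounds=200, lr=0.3, lam=1.0):
    """Toy gradient boosting for logistic loss with depth-1 trees (stumps).

    x: one covariate per unit; z: 1 = volunteer sample, 0 = reference sample.
    Leaf values are regularised Newton steps, in the spirit of (but far
    simpler than) what XGBoost does.
    """
    base = math.log(sum(z) / (len(z) - sum(z)))      # initial log-odds
    score = [base] * len(x)
    stumps = []
    thresholds = sorted(set(x))[:-1]                 # candidate split points
    for _ in range(rounds):
        p = [sigmoid(s) for s in score]
        g = [zi - pi for zi, pi in zip(z, p)]        # negative gradient
        h = [pi * (1.0 - pi) for pi in p]            # second derivative
        best = None
        for t in thresholds:
            gl = sum(gi for gi, xi in zip(g, x) if xi <= t)
            hl = sum(hi for hi, xi in zip(h, x) if xi <= t)
            gr, hr = sum(g) - gl, sum(h) - hl
            gain = gl * gl / (hl + lam) + gr * gr / (hr + lam)
            if best is None or gain > best[0]:
                best = (gain, t, gl / (hl + lam), gr / (hr + lam))
        _, t, left, right = best
        stumps.append((t, lr * left, lr * right))
        score = [s + (lr * left if xi <= t else lr * right)
                 for s, xi in zip(score, x)]

    def predict(xi):
        total = base + sum((l if xi <= t else r) for t, l, r in stumps)
        return sigmoid(total)
    return predict

# Toy data (invented for illustration): four age groups 0..3.
# The volunteer sample over-represents young groups; the reference
# sample is balanced, mirroring a population with equal group sizes.
x_vol = [0] * 40 + [1] * 30 + [2] * 20 + [3] * 10
x_ref = [0] * 10 + [1] * 10 + [2] * 10 + [3] * 10
y_vol = [10.0 + 5.0 * xi for xi in x_vol]            # outcome seen only for volunteers
true_mean = 17.5                                     # mean of 10 + 5x over equal-sized groups

# Fit the propensity of belonging to the volunteer sample on the pooled data.
predict = fit_boosted_stumps(x_vol + x_ref, [1] * len(x_vol) + [0] * len(x_ref))

# Weight each volunteer by (1 - p) / p and take a weighted (Hajek-type) mean.
w = [(1.0 - predict(xi)) / predict(xi) for xi in x_vol]
naive = sum(y_vol) / len(y_vol)
psa = sum(wi * yi for wi, yi in zip(w, y_vol)) / sum(w)
print("naive:", naive, "adjusted:", round(psa, 3), "target:", true_mean)
```

In this toy setup the unweighted volunteer mean is pulled toward the over-represented young groups, while the weighted mean should move toward the population value. With a real survey one would more likely use an established library, include the reference sample's design weights in the fit, and use many covariates.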

Suggested Citation

  • Luis Castro-Martín & María del Mar Rueda & Ramón Ferri-García & César Hernando-Tamayo, 2021. "On the Use of Gradient Boosting Methods to Improve the Estimation with Data Obtained with Self-Selection Procedures," Mathematics, MDPI, vol. 9(23), pages 1-23, November.
  • Handle: RePEc:gam:jmathe:v:9:y:2021:i:23:p:2991-:d:685449

    Download full text from publisher

    File URL: https://www.mdpi.com/2227-7390/9/23/2991/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2227-7390/9/23/2991/
    Download Restriction: no

    References listed on IDEAS

    1. Lingxiao Wang & Barry I. Graubard & Hormuzd A. Katki & Yan Li, 2020. "Improving external validity of epidemiologic cohort analyses: a kernel weighting approach," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 183(3), pages 1293-1311, June.
    2. Yilin Chen & Pengfei Li & Changbao Wu, 2020. "Doubly Robust Inference With Nonprobability Survey Samples," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 115(532), pages 2011-2021, December.
    3. Wu C. & Sitter R. R, 2001. "A Model-Calibration Approach to Using Complete Auxiliary Information From Survey Data," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 185-193, March.
    4. Montanari, Giorgio E. & Ranalli, M. Giovanna, 2005. "Nonparametric Model Calibration Estimation in Survey Sampling," Journal of the American Statistical Association, American Statistical Association, vol. 100, pages 1429-1442, December.
    5. Jiang, Depeng & Zhao, Puying & Tang, Niansheng, 2016. "A propensity score adjustment method for regression models with nonignorable missing covariates," Computational Statistics & Data Analysis, Elsevier, vol. 94(C), pages 98-119.
    6. Brian K Lee & Justin Lessler & Elizabeth A Stuart, 2011. "Weight Trimming and Propensity Score Weighting," PLOS ONE, Public Library of Science, vol. 6(3), pages 1-6, March.
    7. Yue, Mu & Li, Jialiang & Cheng, Ming-Yen, 2019. "Two-step sparse boosting for high-dimensional longitudinal data with varying coefficients," Computational Statistics & Data Analysis, Elsevier, vol. 131(C), pages 222-234.
    8. Bart Buelens & Joep Burger & Jan A. van den Brakel, 2018. "Comparing Inference Methods for Non‐probability Samples," International Statistical Review, International Statistical Institute, vol. 86(2), pages 322-343, August.
    9. Hsu, Hsiang-Ling & Chang, Yuan-chin Ivan & Chen, Ray-Bing, 2019. "Greedy active learning algorithm for logistic regression models," Computational Statistics & Data Analysis, Elsevier, vol. 129(C), pages 119-134.

    Citations

    Cited by:

    1. María del Mar Rueda & Sergio Martínez-Puertas & Luis Castro-Martín, 2022. "Methods to Counter Self-Selection Bias in Estimations of the Distribution Function and Quantiles," Mathematics, MDPI, vol. 10(24), pages 1-19, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Maria del Mar Rueda, 2019. "Comments on: Deville and Särndal’s calibration: revisiting a 25 years old successful optimization problem," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 28(4), pages 1077-1081, December.
    2. Domingo Morales & María del Mar Rueda & Dolores Esteban, 2018. "Model-Assisted Estimation of Small Area Poverty Measures: An Application within the Valencia Region in Spain," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 138(3), pages 873-900, August.
    3. Ieva Burakauskaitė & Andrius Čiginas, 2023. "An Approach to Integrating a Non-Probability Sample in the Population Census," Mathematics, MDPI, vol. 11(8), pages 1-14, April.
    4. M. Rueda & I. Sánchez-Borrego & A. Arcos & S. Martínez, 2010. "Model-calibration estimation of the distribution function using nonparametric regression," Metrika: International Journal for Theoretical and Applied Statistics, Springer, vol. 71(1), pages 33-44, January.
    5. Barranco-Chamorro, I. & Jiménez-Gamero, M.D. & Moreno-Rebollo, J.L. & Muñoz-Pichardo, J.M., 2012. "Case-deletion type diagnostics for calibration estimators in survey sampling," Computational Statistics & Data Analysis, Elsevier, vol. 56(7), pages 2219-2236.
    6. Jan Pablo Burgard & Ralf Münnich & Martin Rupp, 2019. "A Generalized Calibration Approach Ensuring Coherent Estimates with Small Area Constraints," Research Papers in Economics 2019-10, University of Trier, Department of Economics.
    7. Changbao Wu & Shixiao Zhang, 2019. "Comments on: Deville and Särndal’s calibration: revisiting a 25 years old successful optimization problem," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 28(4), pages 1082-1086, December.
    8. Zhan Liu & Chaofeng Tu & Yingli Pan, 2022. "Model-assisted calibration with SCAD to estimated control for non-probability samples," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 31(4), pages 849-879, October.
    9. Sumanta Adhya & Tathagata Banerjee & Gaurangadeb Chattopadhyay, 2012. "Inference on finite population categorical response: nonparametric regression-based predictive approach," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 96(1), pages 69-98, January.
    10. Maciej Beręsewicz & Greta Białkowska & Krzysztof Marcinkowski & Magdalena Maślak & Piotr Opiela & Robert Pater & Katarzyna Zadroga, 2019. "Enhancing the Demand for Labour survey by including skills from online job advertisements using model-assisted calibration," Papers 1908.06731, arXiv.org.
    11. María del Mar Rueda & Sergio Martínez-Puertas & Luis Castro-Martín, 2022. "Methods to Counter Self-Selection Bias in Estimations of the Distribution Function and Quantiles," Mathematics, MDPI, vol. 10(24), pages 1-19, December.
    12. Denis Devaud & Yves Tillé, 2019. "Rejoinder on: Deville and Särndal’s calibration: revisiting a 25-year-old successful optimization problem," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 28(4), pages 1087-1091, December.
    13. Singh, Sarjinder & Kim, Jong-Min, 2011. "A pseudo-empirical log-likelihood estimator using scrambled responses," Statistics & Probability Letters, Elsevier, vol. 81(3), pages 345-351, March.
    14. Debashis Ghosh & Michael S. Sabel, 2022. "A Weighted Sample Framework to Incorporate External Calculators for Risk Modeling," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 14(3), pages 363-379, December.
    15. Glover, Steven & Jones, Sam, 2019. "Can commercial farming promote rural dynamism in sub-Saharan Africa? Evidence from Mozambique," World Development, Elsevier, vol. 114(C), pages 110-121.
    16. Ramón Ferri-García & María del Mar Rueda, 2022. "Variable selection in Propensity Score Adjustment to mitigate selection bias in online surveys," Statistical Papers, Springer, vol. 63(6), pages 1829-1881, December.
    17. Nazmul Islam & Natalie E. Sheils & Megan S. Jarvis & Kenneth Cohen, 2022. "Comparative effectiveness over time of the mRNA-1273 (Moderna) vaccine and the BNT162b2 (Pfizer-BioNTech) vaccine," Nature Communications, Nature, vol. 13(1), pages 1-7, December.
    18. Wendy Chan, 2018. "Applications of Small Area Estimation to Generalization With Subclassification by Propensity Scores," Journal of Educational and Behavioral Statistics, vol. 43(2), pages 182-224, April.
    19. Bonnie E. Shook‐Sa & Michael G. Hudgens, 2022. "Power and sample size for observational studies of point exposure effects," Biometrics, The International Biometric Society, vol. 78(1), pages 388-398, March.
    20. Carl-Erik Särndal & Imbi Traat & Kaur Lumiste, 2018. "Interaction Between Data Collection And Estimation Phases In Surveys With Nonresponse," Statistics in Transition New Series, Polish Statistical Association, vol. 19(2), pages 183-200, June.
