
A two-stage optimal subsampling estimation for missing data problems with large-scale data

Author

Listed:
  • Su, Miaomiao
  • Wang, Ruoyu
  • Wang, Qihua

Abstract

Subsampling reduces data volume and speeds up computation for large-scale data, and it is well studied for completely observed data. With missing data, computation is more challenging, and subsampling becomes both more crucial and more complex. However, subsampling for missing data problems has received little study. This paper fills the gap by studying subsampling for a widely used missing data estimator, the augmented inverse probability weighting (AIPW) estimator. Estimation of the response mean with missing responses is discussed for illustration. A two-stage subsampling method is proposed within a Poisson sampling framework. A small subsample of expected size $n_1$ is used in the first stage to estimate the parameters of the propensity score and outcome regression models, while a larger subsample of expected size $n_2$ is used in the computationally simple second stage to calculate the final estimator. An attractive property of the resulting estimator is that its convergence rate is $n_2^{-1/2}$ rather than $n_1^{-1/2}$ when both the propensity score and the outcome regression functions are correctly specified. The rate $n_2^{-1/2}$ is still attainable in some important cases if only one of the two functions is correctly specified. This indicates that using a small subsample in the computationally complex first stage can reduce the computational burden with little impact on statistical accuracy. Asymptotic normality of the resulting estimator is established, and the optimal subsampling probability is derived by minimizing the asymptotic variance of the resulting estimator. Simulations and a real data analysis are conducted to demonstrate the empirical performance of the resulting estimator.
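
The two-stage procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes a logistic propensity score model, a linear outcome regression model, and uniform Poisson sampling probabilities n1/N and n2/N in place of the optimal probabilities derived in the paper. The function names `fit_logistic` and `two_stage_aipw_mean` and the simulated data are hypothetical and serve only as an illustration.

```python
import numpy as np

def fit_logistic(X, d, iters=25):
    """Newton-Raphson fit of a logistic model P(d = 1 | X); X has an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        # Small ridge term keeps the Hessian invertible (a numerical convenience).
        H = X.T @ (X * w[:, None]) + 1e-8 * np.eye(X.shape[1])
        beta = beta + np.linalg.solve(H, X.T @ (d - p))
    return beta

def two_stage_aipw_mean(X, y, delta, n1, n2, rng):
    """Estimate E[y] when y is missing wherever delta == 0.

    Assumption: uniform Poisson sampling in both stages, not the paper's
    optimal subsampling probabilities.
    """
    N = X.shape[0]
    Z = np.column_stack([np.ones(N), X])

    # Stage 1: small Poisson subsample (expected size n1) to fit both working models.
    s1 = rng.random(N) < n1 / N
    alpha = fit_logistic(Z[s1], delta[s1])             # propensity score parameters
    obs = s1 & (delta == 1)
    beta, *_ = np.linalg.lstsq(Z[obs], y[obs], rcond=None)  # outcome regression

    # Stage 2: larger Poisson subsample (expected size n2), cheap plug-in step.
    pi2 = n2 / N
    s2 = rng.random(N) < pi2
    Z2, d2 = Z[s2], delta[s2]
    y2 = np.where(d2 == 1, y[s2], 0.0)                 # missing y never enters the score
    ps = 1.0 / (1.0 + np.exp(-Z2 @ alpha))
    m = Z2 @ beta
    score = m + d2 * (y2 - m) / ps                     # AIPW score for each sampled unit
    return score.sum() / (N * pi2)                     # inverse-probability-weighted mean

# Simulated example (hypothetical data; the true response mean is 1.0).
rng = np.random.default_rng(0)
N = 200_000
X = rng.normal(size=(N, 2))
y_full = 1.0 + X @ np.array([1.0, -1.0]) + rng.normal(size=N)
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + 0.5 * X[:, 0])))
delta = (rng.random(N) < p_obs).astype(float)
y = np.where(delta == 1, y_full, np.nan)
estimate = two_stage_aipw_mean(X, y, delta, n1=2_000, n2=20_000, rng=rng)
print(estimate)
```

In this sketch the expensive model fitting touches only about n1 rows, while the second stage needs just one pass of matrix-vector products over about n2 rows, which mirrors the division of labor described in the abstract.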

Suggested Citation

  • Su, Miaomiao & Wang, Ruoyu & Wang, Qihua, 2022. "A two-stage optimal subsampling estimation for missing data problems with large-scale data," Computational Statistics & Data Analysis, Elsevier, vol. 173(C).
  • Handle: RePEc:eee:csdana:v:173:y:2022:i:c:s0167947322000858
    DOI: 10.1016/j.csda.2022.107505

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947322000858
    Download Restriction: Full text for ScienceDirect subscribers only.


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Sant’Anna, Pedro H.C. & Zhao, Jun, 2020. "Doubly robust difference-in-differences estimators," Journal of Econometrics, Elsevier, vol. 219(1), pages 101-122.
    2. Victor Chernozhukov & Juan Carlos Escanciano & Hidehiko Ichimura & Whitney K. Newey & James M. Robins, 2022. "Locally Robust Semiparametric Estimation," Econometrica, Econometric Society, vol. 90(4), pages 1501-1535, July.
    3. Michael C. Knaus, 2021. "A double machine learning approach to estimate the effects of musical practice on student’s skills," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(1), pages 282-300, January.
    4. Kyle Colangelo & Ying-Ying Lee, 2019. "Double debiased machine learning nonparametric inference with continuous treatments," CeMMAP working papers CWP54/19, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    5. Su, Miaomiao & Wang, Qihua, 2022. "A convex programming solution based debiased estimator for quantile with missing response and high-dimensional covariables," Computational Statistics & Data Analysis, Elsevier, vol. 168(C).
    6. Iván Díaz & Elizabeth Colantuoni & Daniel F. Hanley & Michael Rosenblum, 2019. "Improved precision in the analysis of randomized trials with survival outcomes, without assuming proportional hazards," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 25(3), pages 439-468, July.
    7. Agboola, Oluwagbenga David & Yu, Han, 2023. "Neighborhood-based cross fitting approach to treatment effects with high-dimensional data," Computational Statistics & Data Analysis, Elsevier, vol. 186(C).
    8. Kyle Colangelo & Ying-Ying Lee, 2020. "Double Debiased Machine Learning Nonparametric Inference with Continuous Treatments," Papers 2004.03036, arXiv.org, revised Sep 2023.
    9. Neng-Chieh Chang, 2020. "The Mode Treatment Effect," Papers 2007.11606, arXiv.org.
    10. Achim Ahrens & Christian B. Hansen & Mark E. Schaffer & Thomas Wiemann, 2024. "ddml: Double/debiased machine learning in Stata," Stata Journal, StataCorp LP, vol. 24(1), pages 3-45, March.
    11. Matias D Cattaneo & Michael Jansson & Xinwei Ma, 2019. "Two-Step Estimation and Inference with Possibly Many Included Covariates," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 86(3), pages 1095-1122.
    12. Chen, Xiaohong & Liu, Ying & Ma, Shujie & Zhang, Zheng, 2024. "Causal inference of general treatment effects using neural networks with a diverging number of confounders," Journal of Econometrics, Elsevier, vol. 238(1).
    13. Haitian Xie, 2020. "Efficient and Robust Estimation of the Generalized LATE Model," Papers 2001.06746, arXiv.org, revised Feb 2022.
    14. Kyle Colangelo & Ying-Ying Lee, 2019. "Double debiased machine learning nonparametric inference with continuous treatments," CeMMAP working papers CWP72/19, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
    15. Wang, Qihua & Su, Miaomiao & Wang, Ruoyu, 2021. "A beyond multiple robust approach for missing response problem," Computational Statistics & Data Analysis, Elsevier, vol. 155(C).
    16. Huber, Martin, 2019. "An introduction to flexible methods for policy evaluation," FSES Working Papers 504, Faculty of Economics and Social Sciences, University of Freiburg/Fribourg Switzerland.
    17. AmirEmad Ghassami & Andrew Ying & Ilya Shpitser & Eric Tchetgen Tchetgen, 2021. "Minimax Kernel Machine Learning for a Class of Doubly Robust Functionals with Application to Proximal Causal Inference," Papers 2104.02929, arXiv.org, revised Mar 2022.
    18. Jianxuan Liu & Yanyuan Ma & Lan Wang, 2018. "An alternative robust estimator of average treatment effect in causal inference," Biometrics, The International Biometric Society, vol. 74(3), pages 910-923, September.
    19. Michael Lechner & Jana Mareckova, 2024. "Comprehensive Causal Machine Learning," Papers 2405.10198, arXiv.org.
    20. Martin Wiegand, 2019. "Do early-ending conditional cash transfer programs crowd out school enrollment?," Tinbergen Institute Discussion Papers 19-053/V, Tinbergen Institute.
