IDEAS home Printed from https://ideas.repec.org/a/bla/jorssa/v184y2021i4p1368-1389.html

Two‐phase sampling designs for data validation in settings with covariate measurement error and continuous outcome

Author

Listed:
  • Gustavo Amorim
  • Ran Tao
  • Sarah Lotspeich
  • Pamela A. Shaw
  • Thomas Lumley
  • Bryan E. Shepherd

Abstract

Measurement errors are present in many data collection procedures and can harm analyses by biasing estimates. To correct for measurement error, researchers often validate a subsample of records and then incorporate the information learned from this validation sample into estimation. In practice, the validation sample is often selected using simple random sampling (SRS). However, SRS leads to inefficient estimates because it ignores information on the error‐prone variables, which can be highly correlated with the unknown truth. Applying and extending ideas from the two‐phase sampling literature, we propose optimal and nearly optimal designs for selecting the validation sample in the classical measurement‐error framework. We target designs to improve the efficiency of model‐based and design‐based estimators, and show how the resulting designs compare to each other. Our results suggest that sampling schemes that extract more information from the error‐prone data are substantially more efficient than SRS, for both design‐ and model‐based estimators. The optimal procedure, however, depends on the analysis method, and the optimal designs for different methods can differ substantially. This is supported by theory and simulations. We illustrate the various designs using data from an HIV cohort study.
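
To make the contrast with SRS concrete, the following is a minimal sketch of Neyman allocation, a textbook idea from the stratified and two‐phase sampling literature. It is not the designs proposed in the article. It assumes, purely for illustration, that records have already been grouped into strata using the error‐prone variable and that a within‐stratum standard deviation is available for each stratum. The function names, stratum sizes, and standard deviations are all hypothetical.

```python
def neyman_allocation(stratum_sizes, stratum_sds, n_validate):
    """Split n_validate across strata in proportion to size * SD.

    Illustrative only: does not cap an allocation at the stratum size.
    """
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one stratum needs a positive size and SD")
    raw = [n_validate * w / total for w in weights]
    alloc = [int(r) for r in raw]  # round down first
    leftover = n_validate - sum(alloc)
    # Hand the remaining units to the strata with the largest remainders.
    order = sorted(range(len(raw)), key=lambda h: raw[h] - alloc[h], reverse=True)
    for h in order[:leftover]:
        alloc[h] += 1
    return alloc


def stratified_variance(stratum_sizes, stratum_sds, alloc):
    """Variance of the stratified mean, ignoring finite-population corrections."""
    n_total = sum(stratum_sizes)
    return sum(
        (size / n_total) ** 2 * sd ** 2 / n_h
        for size, sd, n_h in zip(stratum_sizes, stratum_sds, alloc)
    )


# Hypothetical strata formed from the error-prone variable.
sizes = [600, 300, 100]
sds = [1.0, 2.0, 5.0]

# Proportional allocation mimics SRS on average: equal SDs give size-only weights.
proportional = neyman_allocation(sizes, [1.0] * len(sizes), 100)
neyman = neyman_allocation(sizes, sds, 100)

print("proportional:", proportional, stratified_variance(sizes, sds, proportional))
print("neyman:      ", neyman, stratified_variance(sizes, sds, neyman))
```

In this toy setting the allocation that oversamples the small, highly variable stratum yields a lower variance for the stratified mean than proportional allocation, which is the general intuition behind using error‐prone data to choose the validation sample.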

Suggested Citation

  • Gustavo Amorim & Ran Tao & Sarah Lotspeich & Pamela A. Shaw & Thomas Lumley & Bryan E. Shepherd, 2021. "Two‐phase sampling designs for data validation in settings with covariate measurement error and continuous outcome," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(4), pages 1368-1389, October.
  • Handle: RePEc:bla:jorssa:v:184:y:2021:i:4:p:1368-1389
    DOI: 10.1111/rssa.12689

    Download full text from publisher

    File URL: https://doi.org/10.1111/rssa.12689
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Brady Ryan & Ananthika Nirmalkanna & Candemir Cigsar & Yildiz E. Yilmaz, 2023. "Evaluation of Designs and Estimation Methods Under Response-Dependent Two-Phase Sampling for Genetic Association Studies," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 15(2), pages 510-539, July.
    2. Qingning Zhou & Jianwen Cai & Haibo Zhou, 2018. "Outcome-dependent sampling with interval-censored failure time data," Biometrics, The International Biometric Society, vol. 74(1), pages 58-67, March.
    3. Jieli Ding & Tsui-Shan Lu & Jianwen Cai & Haibo Zhou, 2017. "Recent progresses in outcome-dependent sampling with failure time data," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 23(1), pages 57-82, January.
    4. Jonathan S. Schildcrout & Shawn P. Garbett & Patrick J. Heagerty, 2013. "Outcome Vector Dependent Sampling with Longitudinal Continuous Response Data: Stratified Sampling Based on Summary Statistics," Biometrics, The International Biometric Society, vol. 69(2), pages 405-416, June.
    5. Grant Miller & Aureo de Paula & Christine Valente, 2020. "Subjective Expectations and Demand for Contraception," Bristol Economics Discussion Papers 20/724, School of Economics, University of Bristol, UK.
    6. Haibo Zhou & Rui Song & Yuanshan Wu & Jing Qin, 2011. "Statistical Inference for a Two-Stage Outcome-Dependent Sampling Design with a Continuous Outcome," Biometrics, The International Biometric Society, vol. 67(1), pages 194-202, March.
    7. Xiaofei Wang & Haibo Zhou, 2006. "A Semiparametric Empirical Likelihood Method for Biased Sampling Schemes with Auxiliary Covariates," Biometrics, The International Biometric Society, vol. 62(4), pages 1149-1160, December.
    8. Christopher Vahl & Qing Kang, 2015. "Analysis of an outcome-dependent enriched sample: hypothesis tests," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 24(3), pages 387-409, September.
    9. M. Niaz Asadullah & Elisabetta De Cao & Fathema Zhura Khatoon & Zahra Siddique, 2021. "Measuring gender attitudes using list experiments," Journal of Population Economics, Springer;European Society for Population Economics, vol. 34(2), pages 367-400, April.
    10. Ben Gillen & Erik Snowberg & Leeat Yariv, 2015. "Experimenting with Measurement Error: Techniques with Applications to the Caltech Cohort Study," NBER Working Papers 21517, National Bureau of Economic Research, Inc.
    11. Jason P. Estes & Bhramar Mukherjee & Jeremy M. G. Taylor, 2018. "Empirical Bayes Estimation and Prediction Using Summary-Level Information From External Big Data Sources Adjusting for Violations of Transportability," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 10(3), pages 568-586, December.
    12. Sarah C. Lotspeich & Bryan E. Shepherd & Gustavo G. C. Amorim & Pamela A. Shaw & Ran Tao, 2022. "Efficient odds ratio estimation under two‐phase sampling using error‐prone data from a multi‐national HIV research cohort," Biometrics, The International Biometric Society, vol. 78(4), pages 1674-1685, December.
    13. Qingning Zhou & Jianwen Cai & Haibo Zhou, 2020. "Semiparametric inference for a two-stage outcome-dependent sampling design with interval-censored failure time data," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 26(1), pages 85-108, January.
    14. Takumi Saegusa, 2015. "Variance Estimation under Two-Phase Sampling," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 42(4), pages 1078-1091, December.
    15. Chiara Di Gravio & Ran Tao & Jonathan S. Schildcrout, 2023. "Design and analysis of two‐phase studies with multivariate longitudinal data," Biometrics, The International Biometric Society, vol. 79(2), pages 1420-1432, June.
    16. Yonghong An & Pengfei Liu, 2020. "Eliciting Information from Sensitive Survey Questions," Papers 2009.01430, arXiv.org.
    17. John Abowd & Martha Stinson, 2011. "Estimating Measurement Error in SIPP Annual Job Earnings: A Comparison of Census Bureau Survey and SSA Administrative Data," Working Papers 11-20, Center for Economic Studies, U.S. Census Bureau.
    18. Kaspar Wüthrich, 2013. "Set Identification of Generalized Linear Predictors in the Presence of Non-Classical Measurement Errors," Diskussionsschriften dp1304, Universitaet Bern, Departement Volkswirtschaft.
    19. Liran Einav & Ephraim Leibtag & Aviv Nevo, 2010. "Recording discrepancies in Nielsen Homescan data: Are they present and do they matter?," Quantitative Marketing and Economics (QME), Springer, vol. 8(2), pages 207-239, June.
    20. G. Miller & Yuriy Pylypchuk, 2014. "Marital Status, Spousal Characteristics, and the Use of Preventive Care," Journal of Family and Economic Issues, Springer, vol. 35(3), pages 323-338, September.
