Empirical Bayes Estimation and Prediction Using Summary-Level Information From External Big Data Sources Adjusting for Violations of Transportability

Author

Listed:
  • Jason P. Estes

    (University of Michigan)

  • Bhramar Mukherjee

    (University of Michigan)

  • Jeremy M. G. Taylor

    (University of Michigan)

Abstract

Large external data sources may be available to augment studies that collect data to address a specific research objective. In this article, we consider the problem of building regression models for prediction based on individual-level data from an “internal” study while incorporating summary information from an “external” big data source. We extend the work of Chatterjee et al. (J Am Stat Assoc 111(513):107–117, 2016) by introducing an adaptive empirical Bayes shrinkage estimator that uses the external summary-level information and the internal data to trade bias against variance. This protects against differences between the two populations in the conditional probability distribution of the outcome given a set of covariates. We use simulation studies and a real data application, with external summary information from the Prostate Cancer Prevention Trial, to compare the performance of the proposed methods with maximum likelihood estimation and the constrained maximum likelihood (CML) method developed by Chatterjee et al. (J Am Stat Assoc 111(513):107–117, 2016). Our simulation studies show that the CML method can be biased and inefficient when the assumption of a transportable covariate distribution between the external and internal populations is violated, whereas our empirical Bayes estimator protects against bias and loss of efficiency.

Suggested Citation

  • Jason P. Estes & Bhramar Mukherjee & Jeremy M. G. Taylor, 2018. "Empirical Bayes Estimation and Prediction Using Summary-Level Information From External Big Data Sources Adjusting for Violations of Transportability," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 10(3), pages 568-586, December.
  • Handle: RePEc:spr:stabio:v:10:y:2018:i:3:d:10.1007_s12561-018-9217-4
    DOI: 10.1007/s12561-018-9217-4

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s12561-018-9217-4
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    References listed on IDEAS

    1. C. Wu & R. R. Sitter, 2001. "A Model-Calibration Approach to Using Complete Auxiliary Information From Survey Data," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 185-193, March.
    2. Joel A. Mefford & Noah A. Zaitlen & John S. Witte, 2016. "Comment: A Human Genetics Perspective," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(513), pages 124-127, March.
    3. Yi‐Hau Chen & Hung Chen, 2000. "A unified approach to regression analysis under double‐sampling designs," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 62(3), pages 449-460.
    4. Thomas Lumley & Pamela A. Shaw & James Y. Dai, 2011. "Connections between Survey Calibration Estimators and Semiparametric Models for Incomplete Data," International Statistical Review, International Statistical Institute, vol. 79(2), pages 200-220, August.
    5. Changbao Wu, 2003. "Optimal calibration estimators in survey sampling," Biometrika, Biometrika Trust, vol. 90(4), pages 937-951, December.
    6. J. F. Lawless & J. D. Kalbfleisch & C. J. Wild, 1999. "Semiparametric methods for response‐selective and missing data problems in regression," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 61(2), pages 413-438, April.
    7. Chirag J. Patel & Francesca Dominici, 2016. "Comment: Addressing the Need for Portability in Big Data Model Building and Calibration," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(513), pages 127-129, March.
    8. Bhramar Mukherjee & Nilanjan Chatterjee, 2008. "Exploiting Gene‐Environment Independence for Analysis of Case–Control Studies: An Empirical Bayes‐Type Shrinkage Estimator to Trade‐Off between Bias and Efficiency," Biometrics, The International Biometric Society, vol. 64(3), pages 685-694, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Barranco-Chamorro, I. & Jiménez-Gamero, M.D. & Moreno-Rebollo, J.L. & Muñoz-Pichardo, J.M., 2012. "Case-deletion type diagnostics for calibration estimators in survey sampling," Computational Statistics & Data Analysis, Elsevier, vol. 56(7), pages 2219-2236.
    2. Brady Ryan & Ananthika Nirmalkanna & Candemir Cigsar & Yildiz E. Yilmaz, 2023. "Evaluation of Designs and Estimation Methods Under Response-Dependent Two-Phase Sampling for Genetic Association Studies," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 15(2), pages 510-539, July.
    3. Yei Eun Shin & Ruth M. Pfeiffer & Barry I. Graubard & Mitchell H. Gail, 2022. "Weight calibration to improve efficiency for estimating pure risks from the additive hazards model with the nested case‐control design," Biometrics, The International Biometric Society, vol. 78(1), pages 179-191, March.
    4. Changbao Wu & Shixiao Zhang, 2019. "Comments on: Deville and Särndal’s calibration: revisiting a 25 years old successful optimization problem," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 28(4), pages 1082-1086, December.
    5. Yang Zhao & Meng Liu, 2021. "Unified approach for regression models with nonmonotone missing at random data," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 105(1), pages 87-101, March.
    6. Stearns, Matthew & Singh, Sarjinder, 2008. "On the estimation of the general parameter," Computational Statistics & Data Analysis, Elsevier, vol. 52(9), pages 4253-4271, May.
    7. Shixiao Zhang & Peisong Han & Changbao Wu, 2023. "Calibration Techniques Encompassing Survey Sampling, Missing Data Analysis and Causal Inference," International Statistical Review, International Statistical Institute, vol. 91(2), pages 165-192, August.
    8. Aylin Alkaya & H. Öztaş Ayhan & Alptekin Esin, 2017. "Sequential Data Weighting Procedures For Combined Ratio Estimators In Complex Sample Surveys," Statistics in Transition New Series, Polish Statistical Association, vol. 18(2), pages 247-270, June.
    9. Yei Eun Shin & Ruth M. Pfeiffer & Barry I. Graubard & Mitchell H. Gail, 2020. "Weight calibration to improve the efficiency of pure risk estimates from case‐control samples nested in a cohort," Biometrics, The International Biometric Society, vol. 76(4), pages 1087-1097, December.
    10. Zhan Liu & Chaofeng Tu & Yingli Pan, 2022. "Model-assisted calibration with SCAD to estimated control for non-probability samples," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 31(4), pages 849-879, October.
    11. Gustavo Amorim & Ran Tao & Sarah Lotspeich & Pamela A. Shaw & Thomas Lumley & Bryan E. Shepherd, 2021. "Two‐phase sampling designs for data validation in settings with covariate measurement error and continuous outcome," Journal of the Royal Statistical Society Series A, Royal Statistical Society, vol. 184(4), pages 1368-1389, October.
    12. Changbao Wu & Wilson W. Lu, 2016. "Calibration Weighting Methods for Complex Surveys," International Statistical Review, International Statistical Institute, vol. 84(1), pages 79-98, April.
    13. Tan, Zhiqiang, 2014. "Second-order asymptotic theory for calibration estimators in sampling and missing-data problems," Journal of Multivariate Analysis, Elsevier, vol. 131(C), pages 240-253.
    14. Debashis Ghosh & Michael S. Sabel, 2022. "A Weighted Sample Framework to Incorporate External Calculators for Risk Modeling," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 14(3), pages 363-379, December.
    15. Yuan Zhang & Shili Lin & Swati Biswas, 2017. "Detecting rare and common haplotype–environment interaction under uncertainty of gene–environment independence assumption," Biometrics, The International Biometric Society, vol. 73(1), pages 344-355, March.
    16. Esmerelda A. Ramalho & Richard Smith, 2003. "Discrete choice non-response," CeMMAP working papers 07/03, Institute for Fiscal Studies.
    17. Jinbo Chen & Dongyu Lin & Hagit Hochner, 2012. "Semiparametric Maximum Likelihood Methods for Analyzing Genetic and Environmental Effects with Case-Control Mother–Child Pair Data," Biometrics, The International Biometric Society, vol. 68(3), pages 869-877, September.
    18. Ryo Kato & Takahiro Hoshino, 2020. "Semiparametric Bayesian multiple imputation for regression models with missing mixed continuous–discrete covariates," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 72(3), pages 803-825, June.
    19. Brisa N. Sánchez & Shan Kang & Bhramar Mukherjee, 2012. "A Latent Variable Approach to Study Gene–Environment Interactions in the Presence of Multiple Correlated Exposures," Biometrics, The International Biometric Society, vol. 68(2), pages 466-476, June.
