A synthetic data integration framework to leverage external summary‐level information from heterogeneous populations

Authors
  • Tian Gu
  • Jeremy Michael George Taylor
  • Bhramar Mukherjee

Abstract

There is a growing need for flexible general frameworks that integrate individual‐level data with external summary information for improved statistical inference. External information relevant for a risk prediction model may come in multiple forms, through regression coefficient estimates or predicted values of the outcome variable. Different external models may use different sets of predictors, and the algorithms they used to predict the outcome Y from these predictors may or may not be known. The underlying populations corresponding to each external model may differ from each other and from the internal study population. Motivated by a prostate cancer risk prediction problem where novel biomarkers are measured only in the internal study, this paper proposes an imputation‐based methodology, where the goal is to fit a target regression model with all available predictors in the internal study while utilizing summary information from external models that may have used only a subset of the predictors. The method allows for heterogeneity of covariate effects across the external populations. The proposed approach generates synthetic outcome data in each external population, then uses stacked multiple imputation to create a long dataset with complete covariate information. The final analysis of the stacked imputed data is conducted by weighted regression. This flexible and unified approach can improve statistical efficiency of the estimated coefficients in the internal study, improve predictions by utilizing even partial information available from models that use a subset of the full set of covariates used in the internal study, and provide statistical inference for the external population with potentially different covariate effects from the internal population.
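
The workflow the abstract describes (synthetic outcomes, stacked imputation, weighted regression) can be sketched in a few lines. What follows is a minimal toy illustration, not the authors' implementation. It assumes a continuous outcome with linear models throughout, a single external model that used only one predictor `x`, one internal-only biomarker `b`, and a simple regression-based imputation. All function names, parameters (`beta_ext`, `sigma_ext`, `n_copies`, `n_imputations`), and the simulated data are hypothetical.

```python
import numpy as np


def weighted_least_squares(design, y, weights):
    """Solve the weighted normal equations (D'WD) beta = D'Wy."""
    weighted_design = design * weights[:, None]
    return np.linalg.solve(design.T @ weighted_design, weighted_design.T @ y)


def synthetic_integration(x, b, y, beta_ext, sigma_ext,
                          n_copies=2, n_imputations=5, seed=0):
    """Toy sketch: fit Y ~ 1 + X + B using internal data plus an external
    model Y ~ 1 + X that is known only through (beta_ext, sigma_ext).
    Every modelling choice here is an illustrative assumption."""
    rng = np.random.default_rng(seed)

    # Step 1: copy the internal X values and draw synthetic outcomes from
    # the external model; B is unobserved in these synthetic rows.
    x_syn = np.tile(x, n_copies)
    y_syn = (beta_ext[0] + beta_ext[1] * x_syn
             + rng.normal(0.0, sigma_ext, x_syn.size))

    # Step 2: fit an imputation model for B given (X, Y) on the internal
    # rows, where B is observed (assumed linear with normal errors).
    imp_design = np.column_stack([np.ones(x.size), x, y])
    gamma = np.linalg.lstsq(imp_design, b, rcond=None)[0]
    resid_sd = np.std(b - imp_design @ gamma, ddof=imp_design.shape[1])

    # Step 3: impute B in the synthetic rows several times and stack the
    # completed datasets into one long dataset.
    syn_design = np.column_stack([np.ones(x_syn.size), x_syn, y_syn])
    blocks = []
    for _ in range(n_imputations):
        b_syn = syn_design @ gamma + rng.normal(0.0, resid_sd, x_syn.size)
        blocks.append(np.column_stack([
            np.concatenate([x, x_syn]),
            np.concatenate([b, b_syn]),
            np.concatenate([y, y_syn]),
        ]))
    stacked = np.vstack(blocks)

    # Step 4: weighted regression on the stacked data; each row gets
    # weight 1 / n_imputations so the repeated copies are not overcounted.
    weights = np.full(stacked.shape[0], 1.0 / n_imputations)
    design = np.column_stack(
        [np.ones(stacked.shape[0]), stacked[:, 0], stacked[:, 1]])
    return weighted_least_squares(design, stacked[:, 2], weights)


# Simulated internal study (hypothetical numbers, for illustration only).
rng = np.random.default_rng(1)
x = rng.normal(size=200)
b = 0.5 * x + rng.normal(size=200)
y = 1.0 + 2.0 * x + 1.5 * b + rng.normal(size=200)

# The external model omits B, so its slope for X absorbs part of B's effect.
beta_ext = np.array([1.0, 2.75])
sigma_ext = np.sqrt(1.5 ** 2 + 1.0)

coef = synthetic_integration(x, b, y, beta_ext, sigma_ext)
print("intercept, x, b coefficients:", coef)
```

Because the weights are uniform in this toy version, the point estimates coincide with an unweighted fit on the stacked data. The sketch omits variance estimation, multiple external models, and heterogeneous covariate effects across populations, all of which the abstract indicates the full method addresses.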

Suggested Citation

  • Tian Gu & Jeremy Michael George Taylor & Bhramar Mukherjee, 2023. "A synthetic data integration framework to leverage external summary‐level information from heterogeneous populations," Biometrics, The International Biometric Society, vol. 79(4), pages 3831-3845, December.
  • Handle: RePEc:bla:biomet:v:79:y:2023:i:4:p:3831-3845
    DOI: 10.1111/biom.13852

    Download full text from publisher

    File URL: https://doi.org/10.1111/biom.13852
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Yongxiu Cao & Jichang Yu, 2023. "Adjusting for unmeasured confounding in survival causal effect using validation data," Computational Statistics & Data Analysis, Elsevier, vol. 180(C).
    2. Ruoyu Wang & Qihua Wang & Wang Miao, 2023. "A robust fusion-extraction procedure with summary statistics in the presence of biased sources," Biometrika, Biometrika Trust, vol. 110(4), pages 1023-1040.
    3. Yu‐Jen Cheng & Yen‐Chun Liu & Chang‐Yu Tsai & Chiung‐Yu Huang, 2023. "Semiparametric estimation of the transformation model by leveraging external aggregate data in the presence of population heterogeneity," Biometrics, The International Biometric Society, vol. 79(3), pages 1996-2009, September.
    4. Chixiang Chen & Ming Wang & Shuo Chen, 2023. "An efficient data integration scheme for synthesizing information from multiple secondary datasets for the parameter inference of the main analysis," Biometrics, The International Biometric Society, vol. 79(4), pages 2947-2960, December.
    5. Fei Gao & K. C. G. Chan, 2023. "Noniterative adjustment to regression estimators with population‐based auxiliary information for semiparametric models," Biometrics, The International Biometric Society, vol. 79(1), pages 140-150, March.
    6. Han Zhang & Lu Deng & William Wheeler & Jing Qin & Kai Yu, 2022. "Integrative analysis of multiple case‐control studies," Biometrics, The International Biometric Society, vol. 78(3), pages 1080-1091, September.
    7. Debashis Ghosh & Michael S. Sabel, 2022. "A Weighted Sample Framework to Incorporate External Calculators for Risk Modeling," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 14(3), pages 363-379, December.
    8. Albert S. Berahas & Jiahao Shi & Zihong Yi & Baoyu Zhou, 2023. "Accelerating stochastic sequential quadratic programming for equality constrained optimization using predictive variance reduction," Computational Optimization and Applications, Springer, vol. 86(1), pages 79-116, September.
    9. Jie He & Hui Li & Shumei Zhang & Xiaogang Duan, 2019. "Additive hazards model with auxiliary subgroup survival information," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 25(1), pages 128-149, January.
    10. Fei Wang & Lu Wang & Peter X.‐K. Song, 2016. "Fused lasso with the adaptation of parameter ordering in combining multiple studies with repeated measurements," Biometrics, The International Biometric Society, vol. 72(4), pages 1184-1193, December.
    11. Bo Han & Ingrid Van Keilegom & Xiaoguang Wang, 2022. "Semiparametric estimation of the nonmixture cure model with auxiliary survival information," Biometrics, The International Biometric Society, vol. 78(2), pages 448-459, June.
    12. Sahar Z. Zangeneh & Roderick J. Little, 2022. "Likelihood‐Based Inference for the Finite Population Mean with Post‐Stratification Information Under Non‐Ignorable Non‐Response," International Statistical Review, International Statistical Institute, vol. 90(S1), pages 17-36, December.
    13. Corwin M. Zigler & Krista Watts & Robert W. Yeh & Yun Wang & Brent A. Coull & Francesca Dominici, 2013. "Model Feedback in Bayesian Propensity Score Estimation," Biometrics, The International Biometric Society, vol. 69(1), pages 263-273, March.
    14. Jan Pablo Burgard & Joscha Krause & Simon Schmaus, 2019. "Estimation of Regional Transition Probabilities for Spatial Dynamic Microsimulations from Survey Data Lacking in Regional Detail," Research Papers in Economics 2019-12, University of Trier, Department of Economics.
    15. Ying Sheng & Yifei Sun & Chiung‐Yu Huang & Mi‐Ok Kim, 2022. "Synthesizing external aggregated information in the presence of population heterogeneity: A penalized empirical likelihood approach," Biometrics, The International Biometric Society, vol. 78(2), pages 679-690, June.
    16. Emily C. Hector & Lan Luo & Peter X.-K. Song, 2023. "Parallel-and-stream accelerator for computationally fast supervised learning," Computational Statistics & Data Analysis, Elsevier, vol. 177(C).
    17. Takumi Saegusa, 2020. "Confidence bands for a distribution function with merged data from multiple sources," Statistics in Transition New Series, Polish Statistical Association, vol. 21(4), pages 144-158, August.
    18. Ying Sheng & Yifei Sun & Detian Deng & Chiung‐Yu Huang, 2020. "Censored linear regression in the presence or absence of auxiliary survival information," Biometrics, The International Biometric Society, vol. 76(3), pages 734-745, September.
    19. Ziqi Chen & Jing Ning & Yu Shen & Jing Qin, 2021. "Combining primary cohort data with external aggregate information without assuming comparability," Biometrics, The International Biometric Society, vol. 77(3), pages 1024-1036, September.
    20. Prosenjit Kundu & Nilanjan Chatterjee, 2023. "Logistic regression analysis of two‐phase studies using generalized method of moments," Biometrics, The International Biometric Society, vol. 79(1), pages 241-252, March.
