
Risk prediction with imperfect survival outcome information from electronic health records

Author

Listed:
  • Jue Hou
  • Stephanie F. Chan
  • Xuan Wang
  • Tianxi Cai

Abstract

Readily available proxies for the time of disease onset, such as the time of the first diagnostic code, can lead to substantial risk prediction error when analyses rely on poor proxies. Due to the lack of detailed documentation and the labor intensiveness of manual annotation, it is often feasible to ascertain only the current status of the disease at a follow‐up time, rather than the exact onset time, and only for a small subset. In this paper, we aim to develop risk prediction models for the onset time that efficiently leverage both a small number of labels on the current status and a large number of unlabeled observations on imperfect proxies. Under a semiparametric transformation model for onset and a highly flexible measurement error model for proxy onset time, we propose a semisupervised risk prediction method that efficiently combines information from proxies and limited labels. Starting from an initial estimator based solely on the labeled subset, we perform a one‐step correction with the full data, augmenting against a mean zero rank correlation score derived from the proxies. We establish the consistency and asymptotic normality of the proposed semisupervised estimator and provide a resampling procedure for interval estimation. Simulation studies demonstrate that the proposed estimator performs well in finite samples. We illustrate the proposed estimator by developing a genetic risk prediction model for obesity using data from the Mass General Brigham Healthcare Biobank.
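
The augmentation idea in the abstract can be illustrated with a much simpler toy problem. The sketch below is not the authors' estimator: it swaps the transformation model and rank correlation score for plain sample means, and estimates E[Y] from a few labels plus a noisy proxy observed on everyone. The function name `augmented_mean`, its arguments, and the simulated data are all hypothetical, and the labeled units are assumed to be a random subset of the full data.

```python
import numpy as np


def augmented_mean(y_lab, proxy_lab, proxy_all):
    """Toy semisupervised estimate of E[Y] (hypothetical helper, not from the paper).

    y_lab     : gold-standard labels on the small labeled subset
    proxy_lab : proxy values for the same labeled units
    proxy_all : proxy values for all units (labeled and unlabeled)
    """
    y_lab = np.asarray(y_lab, dtype=float)
    proxy_lab = np.asarray(proxy_lab, dtype=float)
    proxy_all = np.asarray(proxy_all, dtype=float)

    # Step 1: initial estimator that uses the labeled subset only.
    init = y_lab.mean()

    # Step 2: augmentation term. Its expectation is zero when the labeled
    # units are a random subset, because both means target E[proxy].
    aug = proxy_lab.mean() - proxy_all.mean()

    # Step 3: weight that minimizes the variance of init - w * aug in this
    # toy setting, i.e. the regression slope of Y on the proxy.
    w = np.cov(y_lab, proxy_lab)[0, 1] / np.var(proxy_lab, ddof=1)

    return init - w * aug


if __name__ == "__main__":
    # Simulated data (an assumption for illustration only).
    rng = np.random.default_rng(2023)
    N, n = 5000, 100
    y = rng.normal(loc=1.0, size=N)
    proxy = y + rng.normal(scale=0.5, size=N)
    print("labeled only  :", y[:n].mean())
    print("semisupervised:", augmented_mean(y[:n], proxy[:n], proxy))
```

Because the augmentation term has mean zero, subtracting it does not shift the target of the initial estimator; it only removes the part of the labeled-sample noise that the proxy can explain. The better the proxy tracks the label, the larger the variance reduction in this toy setting.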

Suggested Citation

  • Jue Hou & Stephanie F. Chan & Xuan Wang & Tianxi Cai, 2023. "Risk prediction with imperfect survival outcome information from electronic health records," Biometrics, The International Biometric Society, vol. 79(1), pages 190-202, March.
  • Handle: RePEc:bla:biomet:v:79:y:2023:i:1:p:190-202
    DOI: 10.1111/biom.13599

    Download full text from publisher

    File URL: https://doi.org/10.1111/biom.13599
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Choi, Taehwa & Kim, Arlene K.H. & Choi, Sangbum, 2021. "Semiparametric least-squares regression with doubly-censored data," Computational Statistics & Data Analysis, Elsevier, vol. 164(C).
    2. Patrick Bajari & Jeremy Fox & Stephen Ryan, 2008. "Evaluating wireless carrier consolidation using semiparametric demand estimation," Quantitative Marketing and Economics (QME), Springer, vol. 6(4), pages 299-338, December.
    3. Jiannan Lu & Peng Ding & Tirthankar Dasgupta, 2018. "Treatment Effects on Ordinal Outcomes: Causal Estimands and Sharp Bounds," Journal of Educational and Behavioral Statistics, vol. 43(5), pages 540-567, October.
    4. Qingning Zhou & Jianwen Cai & Haibo Zhou, 2018. "Outcome‐dependent sampling with interval‐censored failure time data," Biometrics, The International Biometric Society, vol. 74(1), pages 58-67, March.
    5. Ming-Yueh Huang & Chin-Tsang Chiang, 2017. "Estimation and Inference Procedures for Semiparametric Distribution Models with Varying Linear-Index," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 44(2), pages 396-424, June.
    6. repec:hal:wpspec:info:hdl:2441/3vl5fe4i569nbr005tctlc8ll5 is not listed on IDEAS
    7. Shakeeb Khan & Arnaud Maurel & Yichong Zhang, 2023. "Informational Content of Factor Structures in Simultaneous Binary Response Models," Advances in Econometrics, in: Essays in Honor of Joon Y. Park: Econometric Methodology in Empirical Applications, volume 45, pages 385-410, Emerald Group Publishing Limited.
    8. Chin-Tsang Chiang & Shr-Yan Huang, 2009. "Estimation for the Optimal Combination of Markers without Modeling the Censoring Distribution," Biometrics, The International Biometric Society, vol. 65(1), pages 152-158, March.
    9. Sokbae Lee & Myung Hwan Seo & Youngki Shin, 2017. "Correction," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 112(518), pages 883-883, April.
    10. Margaret Sullivan Pepe & Tianxi Cai & Gary Longton, 2006. "Combining Predictors for Classification Using the Area under the Receiver Operating Characteristic Curve," Biometrics, The International Biometric Society, vol. 62(1), pages 221-229, March.
    11. Beilin Jia & Donglin Zeng & Jason J. Z. Liao & Guanghan F. Liu & Xianming Tan & Guoqing Diao & Joseph G. Ibrahim, 2022. "Mixture survival trees for cancer risk classification," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 28(3), pages 356-379, July.
    12. Youngki Shin & Zvezdomir Todorov, 2021. "Exact computation of maximum rank correlation estimator," The Econometrics Journal, Royal Economic Society, vol. 24(3), pages 589-607.
    13. Xin Qiu & Donglin Zeng & Yuanjia Wang, 2018. "Estimation and evaluation of linear individualized treatment rules to guarantee performance," Biometrics, The International Biometric Society, vol. 74(2), pages 517-528, June.
    14. Gorgens, Tue & Horowitz, Joel L., 1999. "Semiparametric estimation of a censored regression model with an unknown transformation of the dependent variable," Journal of Econometrics, Elsevier, vol. 90(2), pages 155-191, June.
    15. Lewbel, Arthur & McFadden, Daniel & Linton, Oliver, 2011. "Estimating features of a distribution from binomial data," Journal of Econometrics, Elsevier, vol. 162(2), pages 170-188, June.
    16. Khan, Shakeeb, 2001. "Two-stage rank estimation of quantile index models," Journal of Econometrics, Elsevier, vol. 100(2), pages 319-355, February.
    17. Isaiah Andrews & Toru Kitagawa & Adam McCloskey, 2024. "Inference on Winners," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 139(1), pages 305-358.
    18. Ian W. McKeague & Min Qian, 2015. "An Adaptive Resampling Test for Detecting the Presence of Significant Predictors," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(512), pages 1422-1433, December.
    19. Christoph Breunig & Stephan Martin, 2020. "Nonclassical Measurement Error in the Outcome Variable," Papers 2009.12665, arXiv.org, revised May 2021.
    20. Coppejans, Mark, 2001. "Estimation of the binary response model using a mixture of distributions estimator (MOD)," Journal of Econometrics, Elsevier, vol. 102(2), pages 231-269, June.
    21. Hausman, Jerry A. & Woutersen, Tiemen, 2014. "Estimating a semi-parametric duration model without specifying heterogeneity," Journal of Econometrics, Elsevier, vol. 178(P1), pages 114-131.
