Printed from https://ideas.repec.org/a/bla/biomet/v79y2023i4p2974-2986.html

Optimal sampling for positive only electronic health record data

Author

Listed:
  • Seong‐H. Lee
  • Yanyuan Ma
  • Ying Wei
  • Jinbo Chen

Abstract

Identifying a patient's disease or health status from electronic medical records is a frequent task in electronic health record (EHR) research, and estimating a classification model often requires benchmark training data in which patients' phenotype statuses are known. However, assessing a patient's phenotype is costly and labor intensive, so careful selection of the EHR records used as a training set is desirable. We propose a procedure that tailors the best training subsample of limited size for a classification model by minimizing the model's mean‐squared phenotyping/classification error (MSE). Our approach incorporates “positive only” information, an approximation of the true disease status that produces no false alarms, when it is available. In addition, our sampling procedure is applicable to training a chosen classification model that may be misspecified. We provide theoretical justification of its optimality in terms of MSE. The performance gain from our method is illustrated through simulation and a real‐data example, and is often found satisfactory under criteria beyond MSE.
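
As a reading aid, the "positive only" notion in the abstract can be illustrated with a small simulation. This is a minimal sketch and not the authors' sampling procedure: the function name simulate_positive_only and the prevalence, sensitivity, and sample-size values are hypothetical choices made only for illustration. It shows a surrogate label that never raises a false alarm but may miss true cases.

```python
import random


def simulate_positive_only(n, prevalence=0.3, sensitivity=0.6, seed=0):
    """Simulate true statuses y and a positive-only surrogate s.

    s can equal 1 only when y equals 1 (no false alarms), but it may
    miss true cases. All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    y = [1 if rng.random() < prevalence else 0 for _ in range(n)]
    s = [1 if (yi == 1 and rng.random() < sensitivity) else 0 for yi in y]
    return y, s


y, s = simulate_positive_only(10_000)

# Split the true statuses by the surrogate flag.
flagged = [yi for yi, si in zip(y, s) if si == 1]
unflagged = [yi for yi, si in zip(y, s) if si == 0]

# Every flagged record is a true case by construction.
false_alarms = sum(1 for yi in flagged if yi == 0)
# Unflagged records mix missed cases with true non-cases.
missed_cases = sum(unflagged)

print("false alarms among flagged:", false_alarms)
print("missed cases among unflagged:", missed_cases)
```

In this toy setting the status of flagged records is already known, while unflagged records remain ambiguous. That is one way to see why the choice of which records to send for costly phenotype assessment matters.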

Suggested Citation

  • Seong‐H. Lee & Yanyuan Ma & Ying Wei & Jinbo Chen, 2023. "Optimal sampling for positive only electronic health record data," Biometrics, The International Biometric Society, vol. 79(4), pages 2974-2986, December.
  • Handle: RePEc:bla:biomet:v:79:y:2023:i:4:p:2974-2986
    DOI: 10.1111/biom.13824

    Download full text from publisher

    File URL: https://doi.org/10.1111/biom.13824
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Jun Yu & Jiaqi Liu & HaiYing Wang, 2023. "Information-based optimal subdata selection for non-linear models," Statistical Papers, Springer, vol. 64(4), pages 1069-1093, August.
    2. Tianzhen Wang & Haixiang Zhang, 2022. "Optimal subsampling for multiplicative regression with massive data," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 76(4), pages 418-449, November.
    3. Ziyang Wang & HaiYing Wang & Nalini Ravishanker, 2023. "Subsampling in Longitudinal Models," Methodology and Computing in Applied Probability, Springer, vol. 25(1), pages 1-29, March.
    4. Deng, Jiayi & Huang, Danyang & Ding, Yi & Zhu, Yingqiu & Jing, Bingyi & Zhang, Bo, 2024. "Subsampling spectral clustering for stochastic block models in large-scale networks," Computational Statistics & Data Analysis, Elsevier, vol. 189(C).
    5. Yujing Shao & Lei Wang, 2022. "Optimal subsampling for composite quantile regression model in massive data," Statistical Papers, Springer, vol. 63(4), pages 1139-1161, August.
    6. Xiaohui Yuan & Yong Li & Xiaogang Dong & Tianqing Liu, 2022. "Optimal subsampling for composite quantile regression in big data," Statistical Papers, Springer, vol. 63(5), pages 1649-1676, October.
    7. Hao Cheng & Ying Wei, 2018. "A fast imputation algorithm in quantile regression," Computational Statistics, Springer, vol. 33(4), pages 1589-1603, December.
    8. Feifei Wang & Danyang Huang & Tianchen Gao & Shuyuan Wu & Hansheng Wang, 2022. "Sequential one‐step estimator by sub‐sampling for customer churn analysis with massive data sets," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 71(5), pages 1753-1786, November.
    9. Lee, JooChul & Schifano, Elizabeth D. & Wang, HaiYing, 2024. "Fast Optimal Subsampling Probability Approximation for Generalized Linear Models," Econometrics and Statistics, Elsevier, vol. 29(C), pages 224-237.
    10. Su, Miaomiao & Wang, Ruoyu & Wang, Qihua, 2022. "A two-stage optimal subsampling estimation for missing data problems with large-scale data," Computational Statistics & Data Analysis, Elsevier, vol. 173(C).
    11. Xiaojun Mao & Zhonglei Wang & Shu Yang, 2023. "Matrix completion under complex survey sampling," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 75(3), pages 463-492, June.
    12. Jun Yu & HaiYing Wang, 2022. "Subdata selection algorithm for linear model discrimination," Statistical Papers, Springer, vol. 63(6), pages 1883-1906, December.
    13. Jiadi Yang & Jinjin Wang, 2022. "TV program innovation and teaching under big data background in all media era," International Journal of System Assurance Engineering and Management, Springer;The Society for Reliability, Engineering Quality and Operations Management (SREQOM),India, and Division of Operation and Maintenance, Lulea University of Technology, Sweden, vol. 13(3), pages 1031-1041, December.
    14. Duarte, Belmiro P.M. & Atkinson, Anthony C. & Oliveira, Nuno M.C., 2024. "Using hierarchical information-theoretic criteria to optimize subsampling of extensive datasets," LSE Research Online Documents on Economics 121641, London School of Economics and Political Science, LSE Library.
    15. Ping Wang & Lu Lin, 2023. "Conditional characteristic feature screening for massive imbalanced data," Statistical Papers, Springer, vol. 64(3), pages 807-834, June.
    16. He, Xin & Mao, Xiaojun & Wang, Zhonglei, 2024. "Nonparametric augmented probability weighting with sparsity," Computational Statistics & Data Analysis, Elsevier, vol. 191(C).
    17. J. Lars Kirkby & Dang H. Nguyen & Duy Nguyen & Nhu N. Nguyen, 2022. "Inversion-free subsampling Newton’s method for large sample logistic regression," Statistical Papers, Springer, vol. 63(3), pages 943-963, June.
    18. Sun Hao & Ertefaie Ashkan & Lu Xin & Johnson Brent A., 2020. "Improved Doubly Robust Estimation in Marginal Mean Models for Dynamic Regimes," Journal of Causal Inference, De Gruyter, vol. 8(1), pages 300-314, January.
    19. Long, Wenjin & Pang, Xiaopeng & Dong, Xiao-yuan & Zeng, Junxia, 2020. "Is rented accommodation a good choice for primary school students' academic performance? – Evidence from rural China," China Economic Review, Elsevier, vol. 62(C).
    20. Amalan Mahendran & Helen Thompson & James M. McGree, 2023. "A model robust subsampling approach for Generalised Linear Models in big data settings," Statistical Papers, Springer, vol. 64(4), pages 1137-1157, August.
