
Efficient evaluation of prediction rules in semi‐supervised settings under stratified sampling

Author

Listed:
  • Jessica Gronsbell
  • Molei Liu
  • Lu Tian
  • Tianxi Cai

Abstract

In many contemporary applications, large amounts of unlabelled data are readily available while labelled examples are limited. There has been substantial interest in semi‐supervised learning (SSL) which aims to leverage unlabelled data to improve estimation or prediction. However, current SSL literature focuses primarily on settings where labelled data are selected uniformly at random from the population of interest. Stratified sampling, while posing additional analytical challenges, is highly applicable to many real‐world problems. Moreover, no SSL methods currently exist for estimating the prediction performance of a fitted model when the labelled data are not selected uniformly at random. In this paper, we propose a two‐step SSL procedure for evaluating a prediction rule derived from a working binary regression model based on the Brier score and overall misclassification rate under stratified sampling. In step I, we impute the missing labels via weighted regression with nonlinear basis functions to account for stratified sampling and to improve efficiency. In step II, we augment the initial imputations to ensure the consistency of the resulting estimators regardless of the specification of the prediction model or the imputation model. The final estimator is then obtained with the augmented imputations. We provide asymptotic theory and numerical studies illustrating that our proposals outperform their supervised counterparts in terms of efficiency gain. Our methods are motivated by electronic health record (EHR) research and validated with a real data analysis of an EHR‐based study of diabetic neuropathy.
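
As a rough illustration of the two accuracy measures named in the abstract, the sketch below computes a Brier score and an overall misclassification rate from a stratified labelled sample, using inverse sampling-probability weights. It covers only a weighted supervised estimator of the kind the abstract calls a supervised counterpart; it is not the authors' two-step semi-supervised procedure. The function names, the stratum labels, the 0.5 cutoff and the toy numbers are all assumptions made for the example.

```python
from collections import Counter


def stratum_weights(strata, stratum_sizes):
    """Inverse sampling-probability weight N_s / n_s for each labelled unit.

    strata        -- stratum label of each labelled observation
    stratum_sizes -- assumed known size N_s of each stratum in the full data
    """
    n_labelled = Counter(strata)
    return [stratum_sizes[s] / n_labelled[s] for s in strata]


def weighted_mean(values, weights):
    """Normalised weighted mean: sum(w * v) / sum(w)."""
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def brier_score(y, p, weights=None):
    """Weighted mean squared difference between labels y and probabilities p."""
    if weights is None:
        weights = [1.0] * len(y)
    return weighted_mean([(yi - pi) ** 2 for yi, pi in zip(y, p)], weights)


def misclassification_rate(y, p, weights=None, cutoff=0.5):
    """Weighted share of units whose thresholded prediction differs from y."""
    if weights is None:
        weights = [1.0] * len(y)
    errors = [float(yi != int(pi > cutoff)) for yi, pi in zip(y, p)]
    return weighted_mean(errors, weights)


# Toy data (hypothetical): stratum A is large but sparsely labelled.
stratum_sizes = {"A": 900, "B": 100}
strata = ["A", "A", "B", "B"]
y = [0, 1, 1, 1]
p = [0.2, 0.4, 0.9, 0.6]

w = stratum_weights(strata, stratum_sizes)
print("weights:", w)
print("unweighted Brier:", brier_score(y, p))
print("weighted Brier:", brier_score(y, p, w))
print("unweighted misclassification:", misclassification_rate(y, p))
print("weighted misclassification:", misclassification_rate(y, p, w))
```

In this toy sample the single misclassified unit sits in the under-sampled stratum A, so the weighted estimates are larger than the unweighted ones. This is the sort of discrepancy that ignoring the stratified design can produce.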

Suggested Citation

  • Jessica Gronsbell & Molei Liu & Lu Tian & Tianxi Cai, 2022. "Efficient evaluation of prediction rules in semi‐supervised settings under stratified sampling," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 84(4), pages 1353-1391, September.
  • Handle: RePEc:bla:jorssb:v:84:y:2022:i:4:p:1353-1391
    DOI: 10.1111/rssb.12502

    Download full text from publisher

    File URL: https://doi.org/10.1111/rssb.12502
    Download Restriction: no


    References listed on IDEAS

    1. Lu Tian & Tianxi Cai & Els Goetghebeur & L. J. Wei, 2007. "Model evaluation based on the sampling distribution of estimated absolute prediction error," Biometrika, Biometrika Trust, vol. 94(2), pages 297-311.
    2. Yingye Zheng & Tianxi Cai & Yuying Jin & Ziding Feng, 2012. "Evaluating Prognostic Accuracy of Biomarkers under Competing Risk," Biometrics, The International Biometric Society, vol. 68(2), pages 388-396, June.
    3. Zhiqiang Tan, 2010. "Bounded, efficient and doubly robust estimation with inverse weighting," Biometrika, Biometrika Trust, vol. 97(3), pages 661-682.
    4. Gneiting, Tilmann & Raftery, Adrian E., 2007. "Strictly Proper Scoring Rules, Prediction, and Estimation," Journal of the American Statistical Association, American Statistical Association, vol. 102, pages 359-378, March.
    5. D. J. Hand, 2001. "Measuring Diagnostic Accuracy of Statistical Prediction Rules," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 55(1), pages 3-16, March.
    6. S. M. Mirakhmedov & S. Rao Jammalamadaka & Ibrahim B. Mohamed, 2014. "On Edgeworth Expansions in Generalized Urn Models," Journal of Theoretical Probability, Springer, vol. 27(3), pages 725-753, September.
    7. Desislava Nedyalkova & Yves Tillé, 2008. "Optimal sampling and estimation strategies under the linear model," Biometrika, Biometrika Trust, vol. 95(3), pages 521-537.
    8. Dandan Liu & Tianxi Cai & Yingye Zheng, 2012. "Evaluating the Predictive Value of Biomarkers with Stratified Case-Cohort Design," Biometrics, The International Biometric Society, vol. 68(4), pages 1219-1227, December.
    9. Zongqi Xia & Elizabeth Secor & Lori B Chibnik & Riley M Bove & Suchun Cheng & Tanuja Chitnis & Andrew Cagan & Vivian S Gainer & Pei J Chen & Katherine P Liao & Stanley Y Shaw & Ashwin N Ananthakrishna, 2013. "Modeling Disease Severity in Multiple Sclerosis Using Electronic Health Records," PLOS ONE, Public Library of Science, vol. 8(11), pages 1-9, November.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Glenn Heller, 2021. "The added value of new covariates to the brier score in cox survival models," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 27(1), pages 1-14, January.
    2. Cuihong Zhang & Jing Ning & Steven H. Belle & Robert H. Squires & Jianwen Cai & Ruosha Li, 2022. "Assessing predictive discrimination performance of biomarkers in the presence of treatment‐induced dependent censoring," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 71(5), pages 1137-1157, November.
    3. Saeed Hayati & Kenji Fukumizu & Afshin Parvardeh, 2024. "Kernel mean embedding of probability measures and its applications to functional data analysis," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 51(2), pages 447-484, June.
    4. Azar, Pablo D. & Micali, Silvio, 2018. "Computational principal agent problems," Theoretical Economics, Econometric Society, vol. 13(2), May.
    5. Angelica Gianfreda & Francesco Ravazzolo & Luca Rossini, 2023. "Large Time‐Varying Volatility Models for Hourly Electricity Prices," Oxford Bulletin of Economics and Statistics, Department of Economics, University of Oxford, vol. 85(3), pages 545-573, June.
    6. Tobias Fissler & Yannick Hoga, 2024. "How to Compare Copula Forecasts?," Papers 2410.04165, arXiv.org.
    7. Davide Pettenuzzo & Francesco Ravazzolo, 2016. "Optimal Portfolio Choice Under Decision‐Based Model Combinations," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 31(7), pages 1312-1332, November.
    8. Rubio, F.J. & Steel, M.F.J., 2011. "Inference for grouped data with a truncated skew-Laplace distribution," Computational Statistics & Data Analysis, Elsevier, vol. 55(12), pages 3218-3231, December.
    9. Susan Athey & Guido W. Imbens & Stefan Wager, 2018. "Approximate residual balancing: debiased inference of average treatment effects in high dimensions," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 80(4), pages 597-623, September.
    10. Hwang, Eunju, 2022. "Prediction intervals of the COVID-19 cases by HAR models with growth rates and vaccination rates in top eight affected countries: Bootstrap improvement," Chaos, Solitons & Fractals, Elsevier, vol. 155(C).
    11. R de Fondeville & A C Davison, 2018. "High-dimensional peaks-over-threshold inference," Biometrika, Biometrika Trust, vol. 105(3), pages 575-592.
    12. Armantier, Olivier & Treich, Nicolas, 2013. "Eliciting beliefs: Proper scoring rules, incentives, stakes and hedging," European Economic Review, Elsevier, vol. 62(C), pages 17-40.
    13. Domenico Piccolo & Rosaria Simone, 2019. "The class of cub models: statistical foundations, inferential issues and empirical evidence," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 28(3), pages 389-435, September.
    14. Finn Lindgren, 2015. "Comments on: Comparing and selecting spatial predictors using local criteria," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 24(1), pages 35-44, March.
    15. Chuliá, Helena & Garrón, Ignacio & Uribe, Jorge M., 2024. "Daily growth at risk: Financial or real drivers? The answer is not always the same," International Journal of Forecasting, Elsevier, vol. 40(2), pages 762-776.
    16. Kelly Trinh & Bo Zhang & Chenghan Hou, 2025. "Macroeconomic real‐time forecasts of univariate models with flexible error structures," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 44(1), pages 59-78, January.
    17. Laura Liu & Hyungsik Roger Moon & Frank Schorfheide, 2023. "Forecasting with a panel Tobit model," Quantitative Economics, Econometric Society, vol. 14(1), pages 117-159, January.
    18. Warne, Anders, 2023. "DSGE model forecasting: rational expectations vs. adaptive learning," Working Paper Series 2768, European Central Bank.
    19. James Mitchell & Aubrey Poon & Dan Zhu, 2024. "Constructing density forecasts from quantile regressions: Multimodality in macrofinancial dynamics," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 39(5), pages 790-812, August.
    20. Rafael Frongillo, 2022. "Quantum Information Elicitation," Papers 2203.07469, arXiv.org.
