Authors
- Angelike, Tim
- Papenberg, Martin
Abstract
Dataset shift occurs when the distribution of the samples used to train a model differs from the distribution of the samples used to test it, which can reduce predictive performance when the model's predictions are generalized. Dataset shift can affect 1) the criterion variable (prior probability shift), 2) the predictor variables (covariate shift), and 3) the relationship between criterion and predictors (concept shift). The present paper investigated the implications of avoiding dataset shift during k-fold cross-validation. To circumvent the various forms of dataset shift during cross-validation, a range of anticlustering algorithms was employed to ensure equal means, variances, and covariance structure across folds. A comparative analysis assessed the bias in the validation error obtained through anticlustering-based cross-validation and through standard cross-validation, using the true test error on unseen test data as the reference. With linear regression models, analyses of simulated and empirical datasets demonstrated that avoiding prior probability shift in conjunction with covariate shift, but not concept shift, can effectively curtail the bias in the prediction error during cross-validation. The advantage of anticlustering-based cross-validation over standard cross-validation diminished, however, as R² increased. The study underscores the merits of leveraging anticlustering methodologies within the framework of k-fold cross-validation to mitigate adverse effects of prior probability and covariate shift by equating means and variances in predictor and criterion variables across folds.
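The comparison described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not one of the anticlustering algorithms used in the paper: the hypothetical helper `balanced_folds` only equates fold means of the criterion (a crude guard against prior probability shift) by sorting cases and dealing neighbours to different folds, and the simulated data, regression weights, and all function names are made up for the example. Bias is taken as the cross-validation error minus the error of the full-sample model on a large unseen test set.

```python
import numpy as np

def random_folds(n, k, rng):
    """Standard k-fold assignment: shuffle labels 0..k-1 across n cases."""
    return rng.permutation(np.arange(n) % k)

def balanced_folds(y, k, rng):
    """Hypothetical helper (not anticlustering): sort cases by y, then deal
    each run of k neighbouring cases to the k folds in random order, so
    that the fold means of y end up close together."""
    order = np.argsort(y)
    folds = np.empty(len(y), dtype=int)
    for start in range(0, len(y), k):
        block = order[start:start + k]
        folds[block] = rng.permutation(k)[:len(block)]
    return folds

def fit_ols(X, y):
    """Least-squares coefficients of a linear model with intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta

def mse(beta, X, y):
    """Mean squared prediction error of the fitted model on (X, y)."""
    design = np.column_stack([np.ones(len(y)), X])
    return float(np.mean((y - design @ beta) ** 2))

def cv_error(X, y, folds, k):
    """Average validation MSE over the k folds."""
    return float(np.mean([
        mse(fit_ols(X[folds != f], y[folds != f]), X[folds == f], y[folds == f])
        for f in range(k)
    ]))

def fold_mean_spread(y, folds, k):
    """Largest minus smallest fold mean of the criterion."""
    means = [y[folds == f].mean() for f in range(k)]
    return float(max(means) - min(means))

rng = np.random.default_rng(0)
n, k = 200, 5
weights = np.array([0.5, 0.3, 0.0])  # assumed regression weights

def simulate(size):
    """Draw predictors and criterion from the assumed linear model."""
    X = rng.normal(size=(size, 3))
    return X, X @ weights + rng.normal(size=size)

X, y = simulate(n)
X_test, y_test = simulate(20_000)  # stand-in for unseen test data

# Reference: error of the model trained on all n cases, on the test set.
test_error = mse(fit_ols(X, y), X_test, y_test)

folds_random = random_folds(n, k, rng)
folds_balanced = balanced_folds(y, k, rng)

bias_random = cv_error(X, y, folds_random, k) - test_error
bias_balanced = cv_error(X, y, folds_balanced, k) - test_error
spread_random = fold_mean_spread(y, folds_random, k)
spread_balanced = fold_mean_spread(y, folds_balanced, k)

print(f"fold-mean spread: random={spread_random:.3f}, balanced={spread_balanced:.3f}")
print(f"bias in CV error: random={bias_random:.3f}, balanced={bias_balanced:.3f}")
```

A single run like this says little about bias; the comparison only becomes meaningful when repeated over many simulated datasets and averaged. Balancing variances and predictor distributions as well would require a proper anticlustering routine.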
Suggested Citation
Angelike, Tim & Papenberg, Martin, 2025. "Preventing Dataset Shift During Cross-Validation: Is it worth it?," OSF Preprints b5fus_v1, Center for Open Science.
Handle: RePEc:osf:osfxxx:b5fus_v1
DOI: 10.31219/osf.io/b5fus_v1