Printed from https://ideas.repec.org/p/umc/wpaper/2310.html

Addressing Sample Selection Bias for Machine Learning Methods

Authors

Dylan Brewer and Alyssa Carlson

Abstract

We study approaches for adjusting machine learning methods when the training sample differs from the prediction sample on unobserved dimensions. The machine learning literature predominantly assumes selection only on observed dimensions, and its common remedies are to weight observations or to include variables that influence selection. Simulation results show that selection on unobservables increases mean squared prediction error for popular machine learning algorithms. Common machine learning practices, such as weighting or including variables that influence selection into the training or prediction sample, often worsen sample selection bias. We propose two control-function approaches that remove the effects of selection bias before training and find that they reduce mean squared prediction error in simulations. We apply these approaches to predicting the incumbent's vote share in gubernatorial elections using previously observed re-election bids. We find that ignoring selection on unobservables leads to substantially higher predicted vote shares for the incumbent than when the control-function approach is used.
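The mechanism described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it is a textbook Heckman-style two-step control function, with ordinary least squares standing in for a machine learning algorithm. The data-generating process, the excluded variable `z`, the correlation `rho`, and the function names `simulate`, `fit_probit`, and `compare` are all assumptions made for illustration.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(t):
    """Standard normal CDF, built from math.erf to avoid extra dependencies."""
    return 0.5 * (1.0 + _erf(t / math.sqrt(2.0)))

def norm_pdf(t):
    """Standard normal density."""
    return np.exp(-0.5 * t**2) / math.sqrt(2.0 * math.pi)

def fit_probit(W, s, iters=25):
    """Probit coefficients by Fisher scoring (hypothetical helper)."""
    beta = np.zeros(W.shape[1])
    for _ in range(iters):
        idx = W @ beta
        p = np.clip(norm_cdf(idx), 1e-10, 1 - 1e-10)
        d = norm_pdf(idx)
        score = W.T @ (d * (s - p) / (p * (1 - p)))
        info = (W * (d**2 / (p * (1 - p)))[:, None]).T @ W
        beta = beta + np.linalg.solve(info, score)
    return beta

def ols(X, y):
    """Least-squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def simulate(n=20000, rho=0.8, seed=0):
    """Assumed design: outcome error e is correlated with selection error u."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    z = rng.standard_normal(n)          # affects selection only (exclusion)
    u = rng.standard_normal(n)
    e = rho * u + math.sqrt(1 - rho**2) * rng.standard_normal(n)
    y = 1.0 + 2.0 * x + e
    s = (x + z + u > 0).astype(float)   # y is "observed" only when s == 1
    return x, z, y, s

def compare(n=20000, rho=0.8, seed=0):
    """Return (naive MSPE, control-function MSPE) on the full population."""
    x, z, y, s = simulate(n, rho, seed)
    sel = s == 1
    ones = np.ones(n)

    # Naive: train on the selected sample and ignore selection.
    b_naive = ols(np.column_stack([ones, x])[sel], y[sel])

    # Step 1: probit for selection; Step 2: add the inverse Mills ratio.
    W = np.column_stack([ones, x, z])
    idx = W @ fit_probit(W, s)
    lam = norm_pdf(idx) / np.clip(norm_cdf(idx), 1e-10, None)
    b_cf = ols(np.column_stack([ones, x, lam])[sel], y[sel])

    # Predict for everyone, dropping the selection term. In the simulation
    # y is known for s == 0 too, which is what lets us score both models.
    X_pop = np.column_stack([ones, x])
    mspe_naive = float(np.mean((y - X_pop @ b_naive) ** 2))
    mspe_cf = float(np.mean((y - X_pop @ b_cf[:2]) ** 2))
    return mspe_naive, mspe_cf

naive, cf = compare()
print(f"naive MSPE: {naive:.3f}  control-function MSPE: {cf:.3f}")
```

Under this assumed design, the naive model absorbs the selection term into its intercept and slope, so its predictions for the full population are shifted. The control-function fit estimates that term separately and leaves it out at prediction time.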

Suggested Citation

  • Dylan Brewer & Alyssa Carlson, 2023. "Addressing Sample Selection Bias for Machine Learning Methods," Working Papers 2310, Department of Economics, University of Missouri.
  • Handle: RePEc:umc:wpaper:2310

    Download full text from publisher

    File URL: https://drive.google.com/file/d/1n8EZlC89OnB6BC8AEwxk22yF_ny_ZLqS/view?usp=sharing
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Martin Huber, 2014. "Treatment Evaluation in the Presence of Sample Selection," Econometric Reviews, Taylor & Francis Journals, vol. 33(8), pages 869-905, November.
    2. D’Haultfœuille, Xavier & Maurel, Arnaud & Zhang, Yichong, 2018. "Extremal quantile regressions for selection models and the black–white wage gap," Journal of Econometrics, Elsevier, vol. 203(1), pages 129-142.
    3. Martin Huber & Anna Solovyeva, 2020. "Direct and Indirect Effects under Sample Selection and Outcome Attrition," Econometrics, MDPI, vol. 8(4), pages 1-25, December.
    4. Mikhail Zhelonkin & Marc G. Genton & Elvezio Ronchetti, 2016. "Robust inference in sample selection models," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 78(4), pages 805-827, September.
    5. Martin Huber, 2012. "Identification of Average Treatment Effects in Social Experiments Under Alternative Forms of Attrition," Journal of Educational and Behavioral Statistics, vol. 37(3), pages 443-474, June.
    6. Emmanuel O. Ogundimu & Jane L. Hutton, 2016. "A Sample Selection Model with Skew-normal Distribution," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 43(1), pages 172-190, March.
    7. Claudia PIGINI, 2012. "Of Butterflies and Caterpillars: Bivariate Normality in the Sample Selection Model," Working Papers 377, Universita' Politecnica delle Marche (I), Dipartimento di Scienze Economiche e Sociali.
    8. Breunig, Christoph & Mammen, Enno & Simoni, Anna, 2018. "Nonparametric estimation in case of endogenous selection," Journal of Econometrics, Elsevier, vol. 202(2), pages 268-285.
    9. Bryan T. Kelly & Asaf Manela & Alan Moreira, 2019. "Text Selection," NBER Working Papers 26517, National Bureau of Economic Research, Inc.
    10. McGovern, Mark E. & Canning, David & Bärnighausen, Till, 2018. "Accounting for non-response bias using participation incentives and survey design: An application using gift vouchers," Economics Letters, Elsevier, vol. 171(C), pages 239-244.
    11. Fan Wu & Yi Xin, 2024. "Estimating Nonseparable Selection Models: A Functional Contraction Approach," Papers 2411.01799, arXiv.org.
    12. Lewbel, Arthur, 2007. "Endogenous selection or treatment model estimation," Journal of Econometrics, Elsevier, vol. 141(2), pages 777-806, December.
    13. Seonho Shin, 2022. "To work or not? Wages or subsidies?: Copula-based evidence of subsidized refugees’ negative selection into employment," Empirical Economics, Springer, vol. 63(4), pages 2209-2252, October.
    14. Maarten Goos & Anna Salomons, 2017. "Measuring teaching quality in higher education: assessing selection bias in course evaluations," Research in Higher Education, Springer;Association for Institutional Research, vol. 58(4), pages 341-364, June.
    15. Nicoletti, Cheti, 2008. "Multiple sample selection in the estimation of intergenerational occupational mobility," ISER Working Paper Series 2008-20, Institute for Social and Economic Research.
    16. Marra, Giampiero & Radice, Rosalba, 2013. "Estimation of a regression spline sample selection model," Computational Statistics & Data Analysis, Elsevier, vol. 61(C), pages 158-173.
    17. Xavier D’Haultfoeuille & Arnaud Maurel & Xiaoyun Qiu & Yichong Zhang, 2020. "Estimating selection models without an instrument with Stata," Stata Journal, StataCorp LP, vol. 20(2), pages 297-308, June.
    18. Arulampalam, Wiji & Corradi, Valentina & Gutknecht, Daniel, 2021. "Intercept Estimation in Nonlinear Selection Models," IZA Discussion Papers 14364, Institute of Labor Economics (IZA).
    19. Mark McGovern & David Canning & Till Bärnighausen, 2018. "Accounting for Non-Response Bias using Participation Incentives and Survey Design," CHaRMS Working Papers 18-02, Centre for HeAlth Research at the Management School (CHaRMS).
    20. Michela Bia & Martin Huber & Lukáš Lafférs, 2024. "Double Machine Learning for Sample Selection Models," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 42(3), pages 958-969, July.

    More about this item

    Keywords

    sample selection; machine learning; control function; inverse probability weighting

    JEL classification:

    • C13 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Estimation: General
    • C31 - Mathematical and Quantitative Methods - - Multiple or Simultaneous Equation Models; Multiple Variables - - - Cross-Sectional Models; Spatial Models; Treatment Effect Models; Quantile Regressions; Social Interaction Models
    • C55 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Large Data Sets: Modeling and Analysis
    • D72 - Microeconomics - - Analysis of Collective Decision-Making - - - Political Processes: Rent-seeking, Lobbying, Elections, Legislatures, and Voting Behavior
