IDEAS home Printed from https://ideas.repec.org/p/cwl/cwldpp/2421.html

Inference for Regression with Variables Generated by AI or Machine Learning

Author

Listed:
  • Laura Battaglia

    (Oxford University)

  • Timothy Christensen

    (Yale University)

  • Stephen Hansen

    (UCL, IFS, and CEPR)

  • Szymon Sacher

    (Meta)

Abstract

It has become common practice for researchers to use AI-powered information retrieval algorithms or other machine learning methods to estimate variables of economic interest, then use these estimates as covariates in a regression model. We show both theoretically and empirically that naively treating AI- and ML-generated variables as “data” leads to biased estimates and invalid inference. We propose two methods to correct bias and perform valid inference: (i) an explicit bias correction with bias-corrected confidence intervals, and (ii) joint maximum likelihood estimation of the regression model and the variables of interest. Through several applications, we demonstrate that the common approach generates substantial bias, while both corrections perform well.
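The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the paper's estimator or either of its two corrections. It assumes a textbook setting in which the generated variable equals the true variable plus independent noise of known variance, and it applies the classical reliability-ratio adjustment to the naive slope. The function name `simulate`, the parameter values, and the noise model are all hypothetical choices made for illustration.

```python
import numpy as np

def simulate(n=20000, beta=2.0, noise_sd=0.5, seed=0):
    """Regress y on a noisy stand-in for an ML-generated covariate.

    Assumption (not from the paper): x_hat = x + independent Gaussian noise
    with known standard deviation noise_sd.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                           # true variable, variance 1
    x_hat = x + rng.normal(scale=noise_sd, size=n)   # generated (noisy) variable
    y = beta * x + rng.normal(size=n)                # outcome depends on the true x

    # Naive OLS slope: treat x_hat as if it were observed data.
    xc = x_hat - x_hat.mean()
    naive = float(xc @ (y - y.mean()) / (xc @ xc))

    # Classical correction: divide by the reliability ratio var(x) / var(x_hat),
    # estimated here using the (assumed known) noise variance.
    reliability = 1.0 - noise_sd**2 / x_hat.var()
    corrected = naive / reliability
    return naive, corrected

naive, corrected = simulate()
print(f"true beta = 2.0, naive = {naive:.3f}, corrected = {corrected:.3f}")
```

Under these assumptions the naive slope is attenuated toward zero by roughly the factor 1 / (1 + noise_sd**2), while the adjusted slope lies close to the true coefficient. Real AI- or ML-generated variables need not satisfy this simple noise model, which is part of why the paper develops its own corrections.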

Suggested Citation

  • Laura Battaglia & Timothy Christensen & Stephen Hansen & Szymon Sacher, 2025. "Inference for Regression with Variables Generated by AI or Machine Learning," Cowles Foundation Discussion Papers 2421, Cowles Foundation for Research in Economics, Yale University.
  • Handle: RePEc:cwl:cwldpp:2421

    Download full text from publisher

    File URL: https://cowles.yale.edu/sites/default/files/2025-01/d2421.pdf
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Laura Battaglia & Timothy M. Christensen & Stephen Hansen & Szymon Sacher, 2024. "Inference for regression with variables generated from unstructured data," CeMMAP working papers 10/24, Institute for Fiscal Studies.
    2. Laura Battaglia & Timothy Christensen & Stephen Hansen & Szymon Sacher, 2024. "Inference for Regression with Variables Generated by AI or Machine Learning," Papers 2402.15585, arXiv.org, revised Dec 2024.
    3. Szymon Sacher & Laura Battaglia & Stephen Hansen, 2021. "Hamiltonian Monte Carlo for Regression with High-Dimensional Categorical Data," Papers 2107.08112, arXiv.org, revised Feb 2024.
    4. Masud Alam, 2024. "Output, employment, and price effects of U.S. narrative tax changes: a factor-augmented vector autoregression approach," Empirical Economics, Springer, vol. 67(4), pages 1421-1471, October.
    5. Cheng, Xu & Hansen, Bruce E., 2015. "Forecasting with factor-augmented regression: A frequentist model averaging approach," Journal of Econometrics, Elsevier, vol. 186(2), pages 280-293.
    6. Zhang, Han, 2021. "How Using Machine Learning Classification as a Variable in Regression Leads to Attenuation Bias and What to Do About It," SocArXiv 453jk, Center for Open Science.
    7. Sium Bodha Hannadige & Jiti Gao & Mervyn J Silvapulle & Param Silvapulle, 2021. "Time Series Forecasting Using a Mixture of Stationary and Nonstationary Predictors," Monash Econometrics and Business Statistics Working Papers 6/21, Monash University, Department of Econometrics and Business Statistics.
    8. Stock, J.H. & Watson, M.W., 2016. "Dynamic Factor Models, Factor-Augmented Vector Autoregressions, and Structural Vector Autoregressions in Macroeconomics," Handbook of Macroeconomics, in: J. B. Taylor & Harald Uhlig (ed.), Handbook of Macroeconomics, edition 1, volume 2, chapter 0, pages 415-525, Elsevier.
    9. Istrefi, Klodiana & Odendahl, Florens & Sestieri, Giulia, 2023. "Fed communication on financial stability concerns and monetary policy decisions: Revelations from speeches," Journal of Banking & Finance, Elsevier, vol. 151(C).
    10. Asongu, Simplice A. & Andrés, Antonio R., 2020. "Trajectories of knowledge economy in SSA and MENA countries," Technology in Society, Elsevier, vol. 63(C).
    11. Shapiro, Adam Hale & Sudhof, Moritz & Wilson, Daniel J., 2022. "Measuring news sentiment," Journal of Econometrics, Elsevier, vol. 228(2), pages 221-243.
    12. Vegard Høghaug Larsen & Leif Anders Thorsrud, 2018. "Business cycle narratives," Working Papers No 6/2018, Centre for Applied Macro- and Petroleum economics (CAMP), BI Norwegian Business School.
    13. Moon, Hyungsik Roger & Weidner, Martin, 2017. "Dynamic Linear Panel Regression Models With Interactive Fixed Effects," Econometric Theory, Cambridge University Press, vol. 33(1), pages 158-195, February.
    14. Hubert, Paul & Labondance, Fabien, 2021. "The signaling effects of central bank tone," European Economic Review, Elsevier, vol. 133(C).
    15. Jianhao Lin & Jiacheng Fan & Yifan Zhang & Liangyuan Chen, 2023. "Real‐time macroeconomic projection using narrative central bank communication," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 38(2), pages 202-221, March.
    16. Kapetanios, George & Marcellino, Massimiliano, 2010. "Factor-GMM estimation with large sets of possibly weak instruments," Computational Statistics & Data Analysis, Elsevier, vol. 54(11), pages 2655-2675, November.
    17. Sium Bodha Hannadige & Jiti Gao & Mervyn J. Silvapulle & Param Silvapulle, 2020. "Forecasting a Nonstationary Time Series with a Mixture of Stationary and Nonstationary Factors as Predictors," Monash Econometrics and Business Statistics Working Papers 19/20, Monash University, Department of Econometrics and Business Statistics.
    18. Cormun, Vito & Ristolainen, Kim, 2024. "Exchange rate narratives," Bank of Finland Research Discussion Papers 11/2024, Bank of Finland.
    19. Hyungsik Roger Moon & Martin Weidner, 2013. "Dynamic linear panel regression models with interactive fixed effects," CeMMAP working papers 63/13, Institute for Fiscal Studies.
    20. Hyungsik Roger Moon & Martin Weidner, 2014. "Dynamic linear panel regression models with interactive fixed effects," CeMMAP working papers 47/14, Institute for Fiscal Studies.
