Printed from https://ideas.repec.org/p/arx/papers/2402.15585.html

Inference for Regression with Variables Generated from Unstructured Data

Authors
  • Laura Battaglia
  • Timothy Christensen
  • Stephen Hansen
  • Szymon Sacher

Abstract

The leading strategy for analyzing unstructured data uses two steps. First, latent variables of economic interest are estimated with an upstream information retrieval model. Second, the estimates are treated as "data" in a downstream econometric model. We establish theoretical arguments for why this two-step strategy leads to biased inference in empirically plausible settings. More constructively, we propose a one-step strategy for valid inference that uses the upstream and downstream models jointly. The one-step strategy (i) substantially reduces bias in simulations; (ii) has quantitatively important effects in a leading application using CEO time-use data; and (iii) can be readily adapted by applied researchers.
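The bias in the two-step strategy can be illustrated with a small simulation. The sketch below is not the authors' model or their one-step estimator; it is a stylized example under assumptions introduced here for illustration only: a scalar latent share `theta` drawn uniformly on [0, 1], a small number of categorical observations per unit (`count`), an upstream estimate equal to the observed frequency, and a linear downstream regression with true slope `beta`. The function name `simulate_two_step` and all parameter values are hypothetical.

```python
import numpy as np


def simulate_two_step(n=50_000, count=5, beta=1.0, noise_sd=0.5, seed=0):
    """Compare an infeasible regression on the true latent variable
    with a two-step regression on its upstream estimate.

    All modelling choices here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)

    # Latent variable of interest (e.g. a share of time spent on an activity).
    theta = rng.uniform(0.0, 1.0, size=n)

    # Unstructured data reduced to counts: `count` draws per unit.
    x = rng.binomial(count, theta)

    # Step 1 (upstream): estimate the latent variable by the observed frequency.
    theta_hat = x / count

    # Outcome generated from the TRUE latent variable.
    y = beta * theta + rng.normal(0.0, noise_sd, size=n)

    def ols_slope(regressor):
        # OLS with an intercept; return the slope coefficient.
        design = np.column_stack([np.ones(n), regressor])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        return coef[1]

    # Step 2 (downstream): treat theta_hat as if it were data.
    return ols_slope(theta), ols_slope(theta_hat)


oracle_slope, two_step_slope = simulate_two_step()
print(f"slope using true theta:      {oracle_slope:.3f}")
print(f"slope using estimated theta: {two_step_slope:.3f}")
```

Under these assumptions the upstream estimation error acts like measurement error in the regressor, so the two-step slope is attenuated toward zero. With `theta` uniform and `count = 5`, the large-sample attenuation factor is Var(theta) / (Var(theta) + E[theta(1 - theta)] / count) = (1/12) / (1/12 + 1/30) = 5/7. Increasing `count` shrinks the bias, which is consistent with the abstract's point that the problem matters when the unstructured data per unit are limited.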

Suggested Citation

  • Laura Battaglia & Timothy Christensen & Stephen Hansen & Szymon Sacher, 2024. "Inference for Regression with Variables Generated from Unstructured Data," Papers 2402.15585, arXiv.org, revised May 2024.
  • Handle: RePEc:arx:papers:2402.15585

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2402.15585
    File Function: Latest version
    Download Restriction: no




    More about this item

    JEL classification:

    • C11 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General - - - Bayesian Analysis: General
    • C51 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Model Construction and Estimation
    • C55 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Large Data Sets: Modeling and Analysis
