Printed from https://ideas.repec.org/p/nbr/nberwo/23276.html

Text as Data

Authors

  • Matthew Gentzkow
  • Bryan T. Kelly
  • Matt Taddy

Abstract

An ever increasing share of human interaction, communication, and culture is recorded as digital text. We provide an introduction to the use of text as an input to economic research. We discuss the features that make text different from other forms of data, offer a practical overview of relevant statistical methods, and survey a variety of applications.

Suggested Citation

  • Matthew Gentzkow & Bryan T. Kelly & Matt Taddy, 2017. "Text as Data," NBER Working Papers 23276, National Bureau of Economic Research, Inc.
  • Handle: RePEc:nbr:nberwo:23276
    Note: AP CF IO POL

    Download full text from publisher

    File URL: http://www.nber.org/papers/w23276.pdf
    Download Restriction: no

    References listed on IDEAS

    1. Joshua D. Angrist & Alan B. Krueger, 2001. "Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments," Journal of Economic Perspectives, American Economic Association, vol. 15(4), pages 69-85, Fall.
    2. Sanjiv R. Das & Mike Y. Chen, 2007. "Yahoo! for Amazon: Sentiment Extraction from Small Talk on the Web," Management Science, INFORMS, vol. 53(9), pages 1375-1388, September.
    3. Chris Hans, 2009. "Bayesian lasso regression," Biometrika, Biometrika Trust, vol. 96(4), pages 835-845.
    4. Park, Trevor & Casella, George, 2008. "The Bayesian Lasso," Journal of the American Statistical Association, American Statistical Association, vol. 103, pages 681-686, June.
    5. Matt Taddy, 2013. "Rejoinder: Efficiency and Structure in MNIR," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(503), pages 772-774, September.
    6. Matthias M M Buehlmaier & Toni M Whited, 2018. "Are Financial Constraints Priced? Evidence from Textual Analysis," The Review of Financial Studies, Society for Financial Studies, vol. 31(7), pages 2693-2728.
    7. Alexandre Belloni & Victor Chernozhukov & Christian Hansen, 2011. "Inference for High-Dimensional Sparse Econometric Models," Papers 1201.0220, arXiv.org.
    8. Matthew Gentzkow & Jesse M. Shapiro, 2010. "What Drives Media Slant? Evidence From U.S. Daily Newspapers," Econometrica, Econometric Society, vol. 78(1), pages 35-71, January.
    9. Matthew Gentzkow & Jesse M. Shapiro & Matt Taddy, 2019. "Measuring Group Differences in High‐Dimensional Choices: Method and Application to Congressional Speech," Econometrica, Econometric Society, vol. 87(4), pages 1307-1340, July.
    10. Matt Taddy, 2013. "Multinomial Inverse Regression for Text Analysis," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(503), pages 755-770, September.
    11. Cheryl J. Flynn & Clifford M. Hurvich & Jeffrey S. Simonoff, 2013. "Efficiency for Regularization Parameter Selection in Penalized Likelihood Estimation of Misspecified Models," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(503), pages 1031-1043, September.
    12. Scott R. Baker & Nicholas Bloom & Steven J. Davis, 2016. "Measuring Economic Policy Uncertainty," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 131(4), pages 1593-1636.
    13. Albert Saiz & Uri Simonsohn, 2013. "Proxying For Unobservable Variables With Internet Document-Frequency," Journal of the European Economic Association, European Economic Association, vol. 11(1), pages 137-165, February.
    14. Grimmer, Justin & Stewart, Brandon M., 2013. "Text as Data: The Promise and Pitfalls of Automatic Content Analysis Methods for Political Texts," Political Analysis, Cambridge University Press, vol. 21(3), pages 267-297, July.
    15. Margaret Roberts & Brandon Stewart & Dustin Tingley & Edoardo Airoldi, 2013. "The structural topic model and applied social science," Working Paper 132666, Harvard University OpenScholar.
    16. Wisniewski, Tomasz Piotr & Lambe, Brendan, 2013. "The role of media in the credit crunch: The case of the banking sector," Journal of Economic Behavior & Organization, Elsevier, vol. 85(C), pages 163-175.
    17. Bańbura, Marta & Giannone, Domenico & Modugno, Michele & Reichlin, Lucrezia, 2013. "Now-Casting and the Real-Time Data Flow," Handbook of Economic Forecasting, in: G. Elliott & C. Granger & A. Timmermann (ed.), Handbook of Economic Forecasting, edition 1, volume 2, chapter 0, pages 195-237, Elsevier.
    18. Stephen Hansen & Michael McMahon & Andrea Prat, 2018. "Transparency and Deliberation Within the FOMC: A Computational Linguistics Approach," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 133(2), pages 801-870.
    19. Hyunyoung Choi & Hal Varian, 2012. "Predicting the Present with Google Trends," The Economic Record, The Economic Society of Australia, vol. 88(s1), pages 2-9, June.
    20. repec:fth:prinin:455 is not listed on IDEAS
    21. Benjamin Born & Michael Ehrmann & Marcel Fratzscher, 2014. "Central Bank Communication on Financial Stability," Economic Journal, Royal Economic Society, vol. 124(577), pages 701-734, June.
    22. Matt Taddy & Matt Gardner & Liyun Chen & David Draper, 2016. "A Nonparametric Bayesian Analysis of Heterogenous Treatment Effects in Digital Experimentation," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 34(4), pages 661-672, October.
    23. Tim Loughran & Bill McDonald, 2011. "When Is a Liability Not a Liability? Textual Analysis, Dictionaries, and 10‐Ks," Journal of Finance, American Finance Association, vol. 66(1), pages 35-65, February.
    24. Grimmer, Justin, 2010. "A Bayesian Hierarchical Topic Model for Political Texts: Measuring Expressed Agendas in Senate Press Releases," Political Analysis, Cambridge University Press, vol. 18(1), pages 1-35, January.
    25. Stephens-Davidowitz, Seth, 2014. "The cost of racial animus on a black candidate: Evidence using Google search data," Journal of Public Economics, Elsevier, vol. 118(C), pages 26-40.
    26. Carlos M. Carvalho & Nicholas G. Polson & James G. Scott, 2010. "The horseshoe estimator for sparse signals," Biometrika, Biometrika Trust, vol. 97(2), pages 465-480.
    27. James H. Stock & Francesco Trebbi, 2003. "Retrospectives: Who Invented Instrumental Variable Regression?," Journal of Economic Perspectives, American Economic Association, vol. 17(3), pages 177-194, Summer.
    28. Jegadeesh, Narasimhan & Wu, Di, 2013. "Word power: A new approach for content analysis," Journal of Financial Economics, Elsevier, vol. 110(3), pages 712-729.
    29. Fan J. & Li R., 2001. "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 1348-1360, December.
    30. Bradley Efron, 2004. "The Estimation of Prediction Error: Covariance Penalties and Cross-Validation," Journal of the American Statistical Association, American Statistical Association, vol. 99, pages 619-632, January.
    31. Tim Groseclose & Jeffrey Milyo, 2005. "A Measure of Media Bias," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 120(4), pages 1191-1237.
    32. Steven L. Scott & Hal R. Varian, 2015. "Bayesian Variable Selection for Nowcasting Economic Time Series," NBER Chapters, in: Economic Analysis of the Digital Economy, pages 119-135, National Bureau of Economic Research, Inc.
    33. Feng Li, 2010. "The Information Content of Forward‐Looking Statements in Corporate Filings—A Naïve Bayesian Machine Learning Approach," Journal of Accounting Research, Wiley Blackwell, vol. 48(5), pages 1049-1102, December.
    34. Scott Deerwester & Susan T. Dumais & George W. Furnas & Thomas K. Landauer & Richard Harshman, 1990. "Indexing by latent semantic analysis," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 41(6), pages 391-407, September.
    35. Kevin M. Quinn & Burt L. Monroe & Michael Colaresi & Michael H. Crespin & Dragomir R. Radev, 2010. "How to Analyze Political Attention with Minimal Assumptions and Costs," American Journal of Political Science, John Wiley & Sons, vol. 54(1), pages 209-228, January.
    36. Zou, Hui, 2006. "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 1418-1429, December.
    37. Teh, Yee Whye & Jordan, Michael I. & Beal, Matthew J. & Blei, David M., 2006. "Hierarchical Dirichlet Processes," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 1566-1581, December.
    38. Friedman, Jerome H. & Hastie, Trevor & Tibshirani, Rob, 2010. "Regularization Paths for Generalized Linear Models via Coordinate Descent," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 33(i01).
    39. Joseph E. Engelberg & Christopher A. Parsons, 2011. "The Causal Impact of Media in Financial Markets," Journal of Finance, American Finance Association, vol. 66(1), pages 67-97, February.
    40. Matthew Gentzkow & Jesse Shapiro & Matt Taddy, 2016. "Measuring Polarization in High-Dimensional Data: Method and Application to Congressional Speech," Working Papers id:11114, eSocialSciences.
    41. Paul C. Tetlock, 2007. "Giving Content to Investor Sentiment: The Role of Media in the Stock Market," Journal of Finance, American Finance Association, vol. 62(3), pages 1139-1168, June.
    42. repec:bla:jfinan:v:59:y:2004:i:3:p:1259-1294 is not listed on IDEAS
    43. Joshua Angrist & Alan Krueger, 2001. "Instrumental Variables and the Search for Identification: From Supply and Demand to Natural Experiments," Working Papers 834, Princeton University, Department of Economics, Industrial Relations Section.
    44. Hui Zou & Trevor Hastie, 2005. "Addendum: Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(5), pages 768-768, November.
    45. Hui Zou & Trevor Hastie, 2005. "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(2), pages 301-320, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Andres Algaba & David Ardia & Keven Bluteau & Samuel Borms & Kris Boudt, 2020. "Econometrics Meets Sentiment: An Overview Of Methodology And Applications," Journal of Economic Surveys, Wiley Blackwell, vol. 34(3), pages 512-547, July.
    2. van Erp, Sara & Oberski, Daniel L. & Mulder, Joris, 2018. "Shrinkage priors for Bayesian penalized regression," OSF Preprints cg8fq, Center for Open Science.
    3. García, Diego & Hu, Xiaowen & Rohrer, Maximilian, 2023. "The colour of finance words," Journal of Financial Economics, Elsevier, vol. 147(3), pages 525-549.
    4. Vegard H. Larsen & Leif Anders Thorsrud, 2018. "Business cycle narratives," Working Paper 2018/3, Norges Bank.
    5. Posch, Konstantin & Arbeiter, Maximilian & Pilz, Juergen, 2020. "A novel Bayesian approach for variable selection in linear regression models," Computational Statistics & Data Analysis, Elsevier, vol. 144(C).
    6. Christina Bannier & Thomas Pauls & Andreas Walter, 2019. "Content analysis of business communication: introducing a German dictionary," Journal of Business Economics, Springer, vol. 89(1), pages 79-123, February.
    7. Buehlmaier, Matthias M. M. & Zechner, Josef, 2016. "Financial media, price discovery, and merger arbitrage," CFS Working Paper Series 551, Center for Financial Studies (CFS).
    8. Matthew Gentzkow & Jesse M. Shapiro & Matt Taddy, 2019. "Measuring Group Differences in High‐Dimensional Choices: Method and Application to Congressional Speech," Econometrica, Econometric Society, vol. 87(4), pages 1307-1340, July.
    9. Youngjoon Lee & Soohyon Kim & Ki Young Park, 2018. "Deciphering Monetary Policy Committee Minutes with Text Mining Approach: A Case of South Korea," Working papers 2018rwp-132, Yonsei University, Yonsei Economics Research Institute.
    10. Aprigliano, Valentina & Emiliozzi, Simone & Guaitoli, Gabriele & Luciani, Andrea & Marcucci, Juri & Monteforte, Libero, 2023. "The power of text-based indicators in forecasting Italian economic activity," International Journal of Forecasting, Elsevier, vol. 39(2), pages 791-808.
    11. Stelios Michalopoulos & Melanie Meng Xue, 2021. "Folklore," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 136(4), pages 1993-2046.
    12. Ricardo P. Masini & Marcelo C. Medeiros & Eduardo F. Mendes, 2023. "Machine learning advances for time series forecasting," Journal of Economic Surveys, Wiley Blackwell, vol. 37(1), pages 76-111, February.
    13. Philip Kostov & Thankom Arun & Samuel Annim, 2014. "Financial Services to the Unbanked: the case of the Mzansi intervention in South Africa," Contemporary Economics, University of Economics and Human Sciences in Warsaw., vol. 8(2), June.
    14. Ruggieri, Eric & Lawrence, Charles E., 2012. "On efficient calculations for Bayesian variable selection," Computational Statistics & Data Analysis, Elsevier, vol. 56(6), pages 1319-1332.
    15. Hansen, Stephen & Davis, Steven & Seminario-Amez, Cristhian, 2020. "Firm-level Risk Exposures and Stock Returns in the Wake of COVID-19," CEPR Discussion Papers 15314, C.E.P.R. Discussion Papers.
    16. Tanin Sirimongkolkasem & Reza Drikvandi, 2019. "On Regularisation Methods for Analysis of High Dimensional Data," Annals of Data Science, Springer, vol. 6(4), pages 737-763, December.
    17. Adam D. Nowak & Bradley S. Price & Patrick S. Smith, 2021. "Real Estate Dictionaries Across Space and Time," The Journal of Real Estate Finance and Economics, Springer, vol. 62(1), pages 139-163, January.
    18. Ching Hsu & Tina Yu & Shu-Heng Chen, 2021. "Narrative economics using textual analysis of newspaper data: new insights into the U.S. Silver Purchase Act and Chinese price level in 1928–1936," Journal of Computational Social Science, Springer, vol. 4(2), pages 761-785, November.
    19. Simon Fritzsch & Philipp Scharner & Gregor Weiß, 2021. "Estimating the relation between digitalization and the market value of insurers," Journal of Risk & Insurance, The American Risk and Insurance Association, vol. 88(3), pages 529-567, September.
    20. Hirose, Kei & Tateishi, Shohei & Konishi, Sadanori, 2013. "Tuning parameter selection in sparse regression modeling," Computational Statistics & Data Analysis, Elsevier, vol. 59(C), pages 28-40.

    More about this item

    JEL classification:

    • C1 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General
