Printed from https://ideas.repec.org/p/osf/socarx/htnej.html

Three Families of Automated Text Analysis

Author

  • van Loon, Austin

Abstract

Since the beginning of this millennium, human-generated text in machine-readable form has become increasingly available to social scientists, offering a unique window into social life. However, harnessing vast quantities of this highly unstructured data systematically presents a distinctive combination of analytical and methodological challenges. Fortunately, our understanding of how to overcome these challenges has also developed greatly over the same period. In this article, I present a novel typology of the methods social scientists have used to analyze text data at scale in the interest of testing and developing social theory. I describe three “families” of methods: analyses of (1) term frequency, (2) document structure, and (3) semantic similarity. For each family, I discuss its logical and statistical foundations, its analytical strengths and weaknesses, and its prominent variants and applications.

Suggested Citation

  • van Loon, Austin, 2022. "Three Families of Automated Text Analysis," SocArXiv htnej, Center for Open Science.
  • Handle: RePEc:osf:socarx:htnej
    DOI: 10.31219/osf.io/htnej

    Download full text from publisher

    File URL: https://osf.io/download/6274752dc622401fd41bfa08/
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Anna Calissano & Simone Vantini & Marika Arena, 2020. "Monitoring rare categories in sentiment and opinion analysis: a Milan mega event on Twitter platform," Statistical Methods & Applications, Springer;Società Italiana di Statistica, vol. 29(4), pages 787-812, December.
    2. Mohamed M. Mostafa, 2023. "A one-hundred-year structural topic modeling analysis of the knowledge structure of international management research," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(4), pages 3905-3935, August.
    3. Simon Fritzsch & Philipp Scharner & Gregor Weiß, 2021. "Estimating the relation between digitalization and the market value of insurers," Journal of Risk & Insurance, The American Risk and Insurance Association, vol. 88(3), pages 529-567, September.
    4. Wen Shi & Diyi Liu & Jing Yang & Jing Zhang & Sanmei Wen & Jing Su, 2020. "Social Bots’ Sentiment Engagement in Health Emergencies: A Topic-Based Analysis of the COVID-19 Pandemic Discussions on Twitter," IJERPH, MDPI, vol. 17(22), pages 1-18, November.
    5. Dehler-Holland, Joris & Schumacher, Kira & Fichtner, Wolf, 2021. "Topic Modeling Uncovers Shifts in Media Framing of the German Renewable Energy Act," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 2(1).
    6. Ben Cormier & Mark S. Manger, 2022. "Power, ideas, and World Bank conditionality," The Review of International Organizations, Springer, vol. 17(3), pages 397-425, July.
    7. Peter Grajzl & Peter Murrell, 2021. "Characterizing a legal–intellectual culture: Bacon, Coke, and seventeenth-century England," Cliometrica, Journal of Historical Economics and Econometric History, Association Française de Cliométrie (AFC), vol. 15(1), pages 43-88, January.
    8. Matthew Gentzkow & Bryan T. Kelly & Matt Taddy, 2017. "Text as Data," NBER Working Papers 23276, National Bureau of Economic Research, Inc.
    9. Giovanna Maria Dora Dore, 2023. "A Natural Language Processing Analysis of Newspapers Coverage of Hong Kong Protests Between 1998 and 2020," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 169(1), pages 143-166, September.
    10. Arthur Dyevre & Nicolas Lampach, 2021. "Issue attention on international courts: Evidence from the European Court of Justice," The Review of International Organizations, Springer, vol. 16(4), pages 793-815, October.
    11. Maksym Polyakov & Morteza Chalak & Md. Sayed Iftekhar & Ram Pandit & Sorada Tapsuwan & Fan Zhang & Chunbo Ma, 2018. "Authorship, Collaboration, Topics, and Research Gaps in Environmental and Resource Economics 1991–2015," Environmental & Resource Economics, Springer;European Association of Environmental and Resource Economists, vol. 71(1), pages 217-239, September.
    12. Camilla Salvatore & Silvia Biffignandi & Annamaria Bianchi, 2022. "Corporate Social Responsibility Activities Through Twitter: From Topic Model Analysis to Indexes Measuring Communication Characteristics," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 164(3), pages 1217-1248, December.
    13. Lüdering Jochen & Winker Peter, 2016. "Forward or Backward Looking? The Economic Discourse and the Observed Reality," Journal of Economics and Statistics (Jahrbuecher fuer Nationaloekonomie und Statistik), De Gruyter, vol. 236(4), pages 483-515, August.
    14. Andreas Rehs, 2020. "A structural topic model approach to scientific reorientation of economics and chemistry after German reunification," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(2), pages 1229-1251, November.
    15. Dehler-Holland, Joris & Okoh, Marvin & Keles, Dogan, 2022. "Assessing technology legitimacy with topic models and sentiment analysis – The case of wind power in Germany," Technological Forecasting and Social Change, Elsevier, vol. 175(C).
    16. Gadat, Sébastien & Villeneuve, Stéphane, 2023. "Parsimonious Wasserstein Text-mining," TSE Working Papers 23-1471, Toulouse School of Economics (TSE).
    17. Beatrice Ferrario & Stefanie Stantcheva, 2022. "Eliciting People's First-Order Concerns: Text Analysis of Open-Ended Survey Questions," AEA Papers and Proceedings, American Economic Association, vol. 112, pages 163-169, May.
    18. D. Thorleuchter & D. Van Den Poel, 2013. "Weak Signal Identification with Semantic Web Mining," Working Papers of Faculty of Economics and Business Administration, Ghent University, Belgium 13/860, Ghent University, Faculty of Economics and Business Administration.
    19. Szymon Sacher & Laura Battaglia & Stephen Hansen, 2021. "Hamiltonian Monte Carlo for Regression with High-Dimensional Categorical Data," Papers 2107.08112, arXiv.org, revised Feb 2024.
    20. Peter Grajzl & Cindy Irby, 2019. "Reflections on study abroad: a computational linguistics approach," Journal of Computational Social Science, Springer, vol. 2(2), pages 151-181, July.
