
The (real) need for a human touch: testing a human–machine hybrid topic classification workflow on a New York Times corpus

Author

Listed:
  • Miklós Sebők

    (Centre for Social Sciences)

  • Zoltán Kacsuk

    (Centre for Social Sciences; Hochschule der Medien)

  • Ákos Máté

    (Centre for Social Sciences)

Abstract

The classification of the items of ever-increasing textual databases has become an important goal for a number of research groups active in the field of computational social science. Due to the increased amount of text data, there is a growing number of use cases where the initial effort of human classifiers was successfully augmented using supervised machine learning (SML). In this paper, we investigate such a hybrid workflow solution, classifying the lead paragraphs of New York Times front-page articles from 1996 to 2006 according to the policy topic categories (such as education or defense) of the Comparative Agendas Project (CAP). The SML classification is conducted in multiple rounds and, within each round, we run the SML algorithm on n samples and n times if the given algorithm is non-deterministic (e.g., SVM). If all the SML predictions point towards a single label for a document, then it is classified as such (this approach is also called a “voting ensemble”). In the second step, we explore several scenarios, ranging from using the SML ensemble without human validation to incorporating active learning. Using these scenarios, we can quantify the gains from the various workflow versions. We find that using human coding and validation combined with an ensemble SML hybrid approach can reduce the need for human coding while maintaining very high precision rates and offering a modest to good level of recall. The modularity of this hybrid workflow allows for various setups to address the idiosyncratic resource bottlenecks that a large-scale text classification project might face.
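
The unanimity rule described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `unanimous_label` and `route_documents`, the document IDs, and the toy predictions are all hypothetical, and sending non-unanimous documents to human coders is an assumption about how such a hybrid workflow might proceed.

```python
from typing import Dict, List, Optional, Tuple


def unanimous_label(predictions: List[str]) -> Optional[str]:
    """Return the shared label if every prediction agrees, otherwise None."""
    if not predictions:
        return None
    first = predictions[0]
    return first if all(p == first for p in predictions) else None


def route_documents(
    runs: Dict[str, List[str]]
) -> Tuple[Dict[str, str], List[str]]:
    """Split documents into machine-labeled and human-review sets.

    `runs` maps a document ID to the labels predicted for it, one label
    per model run (hypothetical input format).
    """
    auto_labeled: Dict[str, str] = {}
    needs_human: List[str] = []
    for doc_id, predictions in runs.items():
        label = unanimous_label(predictions)
        if label is None:
            # Assumption: disagreement means the document goes to a human coder.
            needs_human.append(doc_id)
        else:
            auto_labeled[doc_id] = label
    return auto_labeled, needs_human


# Hypothetical toy predictions from three runs of a classifier.
runs = {
    "doc1": ["education", "education", "education"],
    "doc2": ["defense", "education", "defense"],
    "doc3": ["defense", "defense", "defense"],
}
auto_labeled, needs_human = route_documents(runs)
```

Requiring full agreement rather than a simple majority trades recall for precision: fewer documents are labeled automatically, but those that are carry the support of every run, which is in line with the high precision and modest-to-good recall reported in the abstract.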

Suggested Citation

  • Miklos Sebők & Zoltán Kacsuk & Ákos Máté, 2022. "The (real) need for a human touch: testing a human–machine hybrid topic classification workflow on a New York Times corpus," Quality & Quantity: International Journal of Methodology, Springer, vol. 56(5), pages 3621-3643, October.
  • Handle: RePEc:spr:qualqt:v:56:y:2022:i:5:d:10.1007_s11135-021-01287-4
    DOI: 10.1007/s11135-021-01287-4

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11135-021-01287-4
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Mohamed M. Mostafa, 2023. "A one-hundred-year structural topic modeling analysis of the knowledge structure of international management research," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(4), pages 3905-3935, August.
    2. Martin Haselmayer & Marcelo Jenny, 2017. "Sentiment analysis of political communication: combining a dictionary approach with crowdcoding," Quality & Quantity: International Journal of Methodology, Springer, vol. 51(6), pages 2623-2646, November.
    3. Seraphine F. Maerz & Carsten Q. Schneider, 2020. "Comparing public communication in democracies and autocracies: automated text analyses of speeches by heads of government," Quality & Quantity: International Journal of Methodology, Springer, vol. 54(2), pages 517-545, April.
    4. Karell, Daniel & Freedman, Michael Raphael, 2019. "Rhetorics of Radicalism," SocArXiv yfzsh, Center for Open Science.
    5. LIM Jaehwan & ITO Asei & ZHANG Hongyong, 2023. "Policy Agenda and Trajectory of the Xi Jinping Administration: Textual Evidence from 2012 to 2022," Policy Discussion Papers 23008, Research Institute of Economy, Trade and Industry (RIETI).
    6. Michal Ovádek & Nicolas Lampach & Arthur Dyevre, 2020. "What’s the talk in Brussels? Leveraging daily news coverage to measure issue attention in the European Union," European Union Politics, vol. 21(2), pages 204-232, June.
    7. Dehler-Holland, Joris & Schumacher, Kira & Fichtner, Wolf, 2021. "Topic Modeling Uncovers Shifts in Media Framing of the German Renewable Energy Act," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 2(1).
    8. Maschke, Andreas, 2024. "Talking exports: The representation of Germany's current account in newspaper media," MPIfG Discussion Paper 24/1, Max Planck Institute for the Study of Societies.
    9. Latifi, Albina & Naboka-Krell, Viktoriia & Tillmann, Peter & Winker, Peter, 2024. "Fiscal policy in the Bundestag: Textual analysis and macroeconomic effects," European Economic Review, Elsevier, vol. 168(C).
    10. Purwoko Haryadi Santoso & Edi Istiyono & Haryanto & Wahyu Hidayatulloh, 2022. "Thematic Analysis of Indonesian Physics Education Research Literature Using Machine Learning," Data, MDPI, vol. 7(11), pages 1-41, October.
    11. Agrawal, Shiv Ratan & Mittal, Divya, 2022. "Optimizing customer engagement content strategy in retail and E-tail: Available on online product review videos," Journal of Retailing and Consumer Services, Elsevier, vol. 67(C).
    12. Ferrara, Federico M. & Masciandaro, Donato & Moschella, Manuela & Romelli, Davide, 2022. "Political voice on monetary policy: Evidence from the parliamentary hearings of the European Central Bank," European Journal of Political Economy, Elsevier, vol. 74(C).
    13. Camilla Salvatore & Silvia Biffignandi & Annamaria Bianchi, 2022. "Corporate Social Responsibility Activities Through Twitter: From Topic Model Analysis to Indexes Measuring Communication Characteristics," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 164(3), pages 1217-1248, December.
    14. Jason Anastasopoulos & George J. Borjas & Gavin G. Cook & Michael Lachanski, 2018. "Job Vacancies, the Beveridge Curve, and Supply Shocks: The Frequency and Content of Help-Wanted Ads in Pre- and Post-Mariel Miami," NBER Working Papers 24580, National Bureau of Economic Research, Inc.
    15. Young Bin Kim & Sang Hyeok Lee & Shin Jin Kang & Myung Jin Choi & Jung Lee & Chang Hun Kim, 2015. "Virtual World Currency Value Fluctuation Prediction System Based on User Sentiment Analysis," PLOS ONE, Public Library of Science, vol. 10(8), pages 1-18, August.
    16. Singh, Amit & Jenamani, Mamata & Thakkar, Jitesh J. & Rana, Nripendra P., 2022. "Quantifying the effect of eWOM embedded consumer perceptions on sales: An integrated aspect-level sentiment analysis and panel data modeling approach," Journal of Business Research, Elsevier, vol. 138(C), pages 52-64.
    17. Ping-Yu Hsu & Hong-Tsuen Lei & Shih-Hsiang Huang & Teng Hao Liao & Yao-Chung Lo & Chin-Chun Lo, 2019. "Effects of sentiment on recommendations in social network," Electronic Markets, Springer;IIM University of St. Gallen, vol. 29(2), pages 253-262, June.
    18. Fatma Najar & Nizar Bouguila, 2023. "On smoothing and scaling language model for sentiment based information retrieval," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 17(3), pages 725-744, September.
    19. Anselm Hager & Hanno Hilbig, 2020. "Does Public Opinion Affect Political Speech?," American Journal of Political Science, John Wiley & Sons, vol. 64(4), pages 921-937, October.
    20. Dehler-Holland, Joris & Okoh, Marvin & Keles, Dogan, 2022. "Assessing technology legitimacy with topic models and sentiment analysis – The case of wind power in Germany," Technological Forecasting and Social Change, Elsevier, vol. 175(C).
