
Topic modeling, long texts and the best number of topics. Some Problems and solutions

Author

Listed:
  • Stefano Sbalchiero

    (University of Padova)

  • Maciej Eder

    (Polish Academy of Sciences and Pedagogical University of Kraków)

Abstract

The main aim of this article is to present the results of different experiments focused on the model fitting process in topic modeling and its accuracy when applied to long texts. The digital era has made available both enormous quantities of textual data and technological advances that have facilitated the development of techniques to automate data coding and analysis. Within topic modeling, different procedures have emerged to analyze larger and larger collections of texts, namely corpora, but this has posed, and continues to pose, a series of methodological questions that urgently need to be resolved. Through a series of experiments, this article starts from the following consideration: for Latent Dirichlet Allocation (LDA), a generative probabilistic model (Blei et al. in J Mach Learn Res 3:993–1022, 2003; Blei and Lafferty in: Srivastava, Sahami (eds) Text mining: classification, clustering, and applications, Chapman & Hall/CRC Press, Cambridge, 2009; Griffiths and Steyvers in Proc Natl Acad Sci USA (PNAS), 101(Supplement 1):5228–5235, 2004), model fitting is crucial because the LDA algorithm demands that the number of topics be specified a priori. Needless to say, the number of topics to detect in a corpus is a parameter that affects the results of the analysis. Since there is a lack of experiments applied to long texts, our article tries to shed new light on the complex relationship between text length and the optimal number of topics. In the conclusions, we present a clear-cut power-law relation between the optimal number of topics and the analyzed sample size, and we formulate it in the form of a mathematical model.
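
The abstract does not state the fitted coefficients of the power-law relation, so the following is only a minimal sketch of how such a relation could be estimated, assuming it has the generic form k = a * n^b, where n is the sample size and k the optimal number of topics. The function names, the symbols a and b, and the data points are hypothetical illustrations and are not taken from the article.

```python
import math


def fit_power_law(sizes, topic_counts):
    """Estimate a and b in k = a * n**b by least squares on log-log values.

    This is a generic fitting recipe, not necessarily the authors' procedure.
    """
    if len(sizes) != len(topic_counts) or len(sizes) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log(n) for n in sizes]
    ys = [math.log(k) for k in topic_counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("sample sizes must not all be equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                        # slope in log-log space = exponent
    a = math.exp(mean_y - b * mean_x)    # intercept in log-log space = log(a)
    return a, b


def predict_topics(a, b, size):
    """Number of topics suggested by the fitted relation for a given size."""
    return a * size ** b


# Synthetic observations generated from made-up coefficients (a=2.0, b=0.5),
# used only to check that the fit recovers them; they are not results from
# the article.
sizes = [1_000, 10_000, 100_000, 1_000_000]
topic_counts = [2.0 * n ** 0.5 for n in sizes]

a, b = fit_power_law(sizes, topic_counts)
print(f"a = {a:.3f}, b = {b:.3f}")
```

In practice the paired observations would come from fitting LDA at several candidate topic numbers for each sample size and recording the best-scoring one; that model-selection step is outside the scope of this sketch.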

Suggested Citation

  • Stefano Sbalchiero & Maciej Eder, 2020. "Topic modeling, long texts and the best number of topics. Some Problems and solutions," Quality & Quantity: International Journal of Methodology, Springer, vol. 54(4), pages 1095-1108, August.
  • Handle: RePEc:spr:qualqt:v:54:y:2020:i:4:d:10.1007_s11135-020-00976-w
    DOI: 10.1007/s11135-020-00976-w

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11135-020-00976-w
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Grün, Bettina & Hornik, Kurt, 2011. "topicmodels: An R Package for Fitting Topic Models," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 40(i13).
    2. Feinerer, Ingo & Hornik, Kurt & Meyer, David, 2008. "Text Mining Infrastructure in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 25(i05).

    Citations


    Cited by:

    1. Arina Wischnewsky & David‐Jan Jansen & Matthias Neuenkirch, 2021. "Financial stability and the Fed: Evidence from congressional hearings," Economic Inquiry, Western Economic Association International, vol. 59(3), pages 1192-1214, July.
    2. Maria Stella Righettini & Elisa Bordin, 2023. "Exploring food security as a multidimensional topic: twenty years of scientific publications and recent developments," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(3), pages 2739-2758, June.
    3. Weiss, Daniel & Nemeczek, Fabian, 2021. "A text-based monitoring tool for the legitimacy and guidance of technological innovation systems," Technology in Society, Elsevier, vol. 66(C).
    4. Jessica Birkholz & Jutta Günther & Mariia Shkolnykova, 2021. "Using Topic Modeling in Innovation Studies: The Case of a Small Innovation System under Conditions of Pandemic Related Change," Bremen Papers on Economics & Innovation 2101, University of Bremen, Faculty of Business Studies and Economics.
    5. Javier De la Hoz-M & Mª José Fernández-Gómez & Susana Mendes, 2021. "LDAShiny: An R Package for Exploratory Review of Scientific Literature Based on a Bayesian Probabilistic Model and Machine Learning Tools," Mathematics, MDPI, vol. 9(14), pages 1-21, July.
    6. Oleg Sobchuk & Artjoms Šeļa, 2024. "Computational thematics: comparing algorithms for clustering the genres of literary fiction," Palgrave Communications, Palgrave Macmillan, vol. 11(1), pages 1-12, December.
    7. Mohamed M. Mostafa, 2023. "A one-hundred-year structural topic modeling analysis of the knowledge structure of international management research," Quality & Quantity: International Journal of Methodology, Springer, vol. 57(4), pages 3905-3935, August.
    8. Liangchao Huang & Zhengmeng Hou & Yanli Fang & Jianhua Liu & Tianle Shi, 2023. "Evolution of CCUS Technologies Using LDA Topic Model and Derwent Patent Data," Energies, MDPI, vol. 16(6), pages 1-14, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Daoud, Adel & Kohl, Sebastian, 2016. "How much do sociologists write about economic topics? Using big data to test some conventional views in economic sociology, 1890 to 2014," MPIfG Discussion Paper 16/7, Max Planck Institute for the Study of Societies.
    2. Holand, Øystein & Contiero, Barbara & Næss, Marius W. & Cozzi, Giulio, 2024. "“The Times They Are A-Changin' “ – research trends and perspectives of reindeer pastoralism – A review using text mining and topic modelling," Land Use Policy, Elsevier, vol. 136(C).
    3. Cho, Yung-Jan & Fu, Pei-Wen & Wu, Chi-Cheng, 2017. "Popular Research Topics in Marketing Journals, 1995–2014," Journal of Interactive Marketing, Elsevier, vol. 40(C), pages 52-72.
    4. Motta Queiroz, Mariza & Roque, Carlos & Moura, Filipe & Marôco, João, 2024. "Understanding the expectations of parents regarding their children's school commuting by public transport using latent Dirichlet Allocation," Transportation Research Part A: Policy and Practice, Elsevier, vol. 181(C).
    5. João Guerreiro & Paulo Rita & Duarte Trigueiros, 2016. "A Text Mining-Based Review of Cause-Related Marketing Literature," Journal of Business Ethics, Springer, vol. 139(1), pages 111-128, November.
    6. Abhinav Khare & Qing He & Rajan Batta, 2020. "Predicting gasoline shortage during disasters using social media," OR Spectrum: Quantitative Approaches in Management, Springer;Gesellschaft für Operations Research e.V., vol. 42(3), pages 693-726, September.
    7. Lehotský, Lukáš & Černoch, Filip & Osička, Jan & Ocelík, Petr, 2019. "When climate change is missing: Media discourse on coal mining in the Czech Republic," Energy Policy, Elsevier, vol. 129(C), pages 774-786.
    8. Doblinger, Claudia & Surana, Kavita & Li, Deyu & Hultman, Nathan & Anadón, Laura Díaz, 2022. "How do global manufacturing shifts affect long-term clean energy innovation? A study of wind energy suppliers," Research Policy, Elsevier, vol. 51(7).
    9. Andres, Maximilian & Bruttel, Lisa & Friedrichsen, Jana, 2023. "How communication makes the difference between a cartel and tacit collusion: A machine learning approach," European Economic Review, Elsevier, vol. 152(C).
    10. Hudson Golino & Alexander P. Christensen & Robert Moulder & Seohyun Kim & Steven M. Boker, 2022. "Modeling Latent Topics in Social Media using Dynamic Exploratory Graph Analysis: The Case of the Right-wing and Left-wing Trolls in the 2016 US Elections," Psychometrika, Springer;The Psychometric Society, vol. 87(1), pages 156-187, March.
    11. Sun, Katherine Qianwen & Slepian, Michael L., 2020. "The conversations we seek to avoid," Organizational Behavior and Human Decision Processes, Elsevier, vol. 160(C), pages 87-105.
    12. Rieger, Jonas & von Nordheim, Gerret, 2021. "corona100d: German-language Twitter dataset of the first 100 days after Chancellor Merkel addressed the coronavirus outbreak on TV," DoCMA Working Papers 4, TU Dortmund University, Dortmund Center for Data-based Media Analysis (DoCMA).
    13. Garner, Benjamin & Thornton, Corliss & Luo Pawluk, Anita & Mora Cortez, Roberto & Johnston, Wesley & Ayala, Cesar, 2022. "Utilizing text-mining to explore consumer happiness within tourism destinations," Journal of Business Research, Elsevier, vol. 139(C), pages 1366-1377.
    14. Anke Piepenbrink & Elkin Nurmammadov, 2015. "Topics in the literature of transition economies and emerging markets," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2107-2130, March.
    15. Christian WEISMAYER, 2022. "Applied Research in Quality of Life: A Computational Literature Review," Applied Research in Quality of Life, Springer;International Society for Quality-of-Life Studies, vol. 17(3), pages 1433-1458, June.
    16. Arenas Gaitán, Jorge & Ramírez-Correa, Patricio E., 2023. "COVID-19 and telemedicine: A netnography approach," Technological Forecasting and Social Change, Elsevier, vol. 190(C).
    17. Polyzos, Efstathios & Wang, Fang, 2022. "Twitter and market efficiency in energy markets: Evidence using LDA clustered topic extraction," Energy Economics, Elsevier, vol. 114(C).
    18. Jiang, Hanchen & Qiang, Maoshan & Lin, Peng, 2016. "A topic modeling based bibliometric exploration of hydropower research," Renewable and Sustainable Energy Reviews, Elsevier, vol. 57(C), pages 226-237.
    19. Cecilia Elizabeth Bayas Aldaz & Jesus Rodriguez-Pomeda & Leyla Angélica Sandoval Hamón & Fernando Casani, 2020. "Understanding the University-Sustainability Link through Media: A Spanish Perspective," Sustainability, MDPI, vol. 12(12), pages 1-15, June.
    20. Jonas Rieger, 2019. "Mónica Bécue-Bertaut (2019): Textual Data Science with R," Statistical Papers, Springer, vol. 60(5), pages 1797-1798, October.
