
Clustering scientific documents with topic modeling

Author

Listed:
  • Chyi-Kwei Yau

    (Georgia Tech)

  • Alan Porter

    (Georgia Tech; Search Technology, Inc.)

  • Nils Newman

    (IISC; University of Maastricht)

  • Arho Suominen

    (VTT Technical Research Centre of Finland, Innovations, Economy, and Policy)

Abstract

Topic modeling is a family of statistical machine-learning methods for discovering the latent “topics” that occur in a collection of documents. Latent Dirichlet allocation (LDA) is currently a popular modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we assemble a collection of academic papers drawn from several different fields and check whether papers from the same field are clustered together. We also explore potential scientometric applications of such text analysis capabilities.
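
The evaluation idea in the abstract, checking whether papers from the same field end up in the same cluster, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the abstract does not say how clusters are derived from topic proportions or which agreement measure is used, so the dominant-topic assignment, the purity score, the helper names `assign_clusters` and `purity`, and the toy numbers below are all hypothetical.

```python
from collections import Counter, defaultdict


def assign_clusters(doc_topic):
    """Hard-assign each document to its highest-weight topic (an assumption)."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


def purity(clusters, fields):
    """Share of documents that belong to the majority field of their cluster."""
    by_cluster = defaultdict(list)
    for cluster, field in zip(clusters, fields):
        by_cluster[cluster].append(field)
    # For each cluster, count how many members carry its most common field.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(fields)


# Toy document-topic proportions, made up for illustration (not from the paper).
# Each row is one document; each column is one topic from a fitted topic model.
doc_topic = [
    [0.80, 0.15, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.85, 0.05],
    [0.20, 0.60, 0.20],
    [0.05, 0.15, 0.80],
    [0.55, 0.05, 0.40],  # a physics paper whose dominant topic is topic 0
]
# Hypothetical known field of each document, in the same order as the rows.
fields = ["biology", "biology", "economics", "economics", "physics", "physics"]

clusters = assign_clusters(doc_topic)
score = purity(clusters, fields)
print(clusters, round(score, 3))
```

A purity of 1.0 would mean every cluster contains papers from a single field. In practice the document-topic rows would come from a fitted model such as LDA rather than being typed in by hand, and other agreement measures could be substituted for purity.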

Suggested Citation

  • Chyi-Kwei Yau & Alan Porter & Nils Newman & Arho Suominen, 2014. "Clustering scientific documents with topic modeling," Scientometrics, Springer;Akadémiai Kiadó, vol. 100(3), pages 767-786, September.
  • Handle: RePEc:spr:scient:v:100:y:2014:i:3:d:10.1007_s11192-014-1321-8
    DOI: 10.1007/s11192-014-1321-8

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s11192-014-1321-8
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Kun Sun & Haitao Liu & Wenxin Xiong, 2021. "The evolutionary pattern of language in scientific writings: A case study of Philosophical Transactions of Royal Society (1665–1869)," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1695-1724, February.
    2. Wang, Jason & Weiss, Robert E., 2022. "Local and global topics in text modeling of web pages nested in web sites," Computational Statistics & Data Analysis, Elsevier, vol. 173(C).
    3. Yoshi Fujiwara & Rubaiyat Islam, 2021. "Bitcoin's Crypto Flow Network," Papers 2106.11446, arXiv.org, revised Jul 2021.
    4. Francesca De Battisti & Alfio Ferrara & Silvia Salini, 2015. "A decade of research in statistics: a topic model approach," Scientometrics, Springer;Akadémiai Kiadó, vol. 103(2), pages 413-433, May.
    5. Jiang, Hanchen & Qiang, Maoshan & Lin, Peng, 2016. "A topic modeling based bibliometric exploration of hydropower research," Renewable and Sustainable Energy Reviews, Elsevier, vol. 57(C), pages 226-237.
    6. Martin Reisenbichler & Thomas Reutterer, 2019. "Topic modeling in marketing: recent advances and research opportunities," Journal of Business Economics, Springer, vol. 89(3), pages 327-356, April.
    7. Yanto Chandra, 2018. "Mapping the evolution of entrepreneurship as a field of research (1990–2013): A scientometric analysis," PLOS ONE, Public Library of Science, vol. 13(1), pages 1-24, January.
    8. Jeong, Yujin & Park, Inchae & Yoon, Byungun, 2019. "Identifying emerging Research and Business Development (R&BD) areas based on topic modeling and visualization with intellectual property right data," Technological Forecasting and Social Change, Elsevier, vol. 146(C), pages 655-672.
    9. Sandra Wankmüller, 2023. "A comparison of approaches for imbalanced classification problems in the context of retrieving relevant documents for an analysis," Journal of Computational Social Science, Springer, vol. 6(1), pages 91-163, April.
    10. Michelle Dietzen & Haoran Zhai & Olivia Lucas & Oriol Pich & Christopher Barrington & Wei-Ting Lu & Sophia Ward & Yanping Guo & Robert E. Hynds & Simone Zaccaria & Charles Swanton & Nicholas McGranaha, 2024. "Replication timing alterations are associated with mutation acquisition during breast and lung cancer evolution," Nature Communications, Nature, vol. 15(1), pages 1-23, December.
    11. Redivo, Edoardo & Nguyen, Hien D. & Gupta, Mayetri, 2020. "Bayesian clustering of skewed and multimodal data using geometric skewed normal distributions," Computational Statistics & Data Analysis, Elsevier, vol. 152(C).
    12. María Pinto & Rosaura Fernández-Pascual & David Caballero-Mariscal & Dora Sales, 2020. "Information literacy trends in higher education (2006–2019): visualizing the emerging field of mobile information literacy," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(2), pages 1479-1510, August.
    13. Arsenyan, Jbid & Mirowska, Agata & Piepenbrink, Anke, 2023. "Close encounters with the virtual kind: Defining a human-virtual agent coexistence framework," Technological Forecasting and Social Change, Elsevier, vol. 193(C).
    14. Jin, Xin & Maheu, John M., 2016. "Bayesian semiparametric modeling of realized covariance matrices," Journal of Econometrics, Elsevier, vol. 192(1), pages 19-39.
    15. Hong Joo Lee & Hoyeon Oh, 2020. "A Study on the Deduction and Diffusion of Promising Artificial Intelligence Technology for Sustainable Industrial Development," Sustainability, MDPI, vol. 12(14), pages 1-15, July.
    16. Maksym Polyakov & Morteza Chalak & Md. Sayed Iftekhar & Ram Pandit & Sorada Tapsuwan & Fan Zhang & Chunbo Ma, 2018. "Authorship, Collaboration, Topics, and Research Gaps in Environmental and Resource Economics 1991–2015," Environmental & Resource Economics, Springer;European Association of Environmental and Resource Economists, vol. 71(1), pages 217-239, September.
    17. Stefano Sbalchiero & Maciej Eder, 2020. "Topic modeling, long texts and the best number of topics. Some Problems and solutions," Quality & Quantity: International Journal of Methodology, Springer, vol. 54(4), pages 1095-1108, August.
    18. Martin Baumgaertner & Johannes Zahner, 2021. "Whatever it takes to understand a central banker - Embedding their words using neural networks," MAGKS Papers on Economics 202130, Philipps-Universität Marburg, Faculty of Business Administration and Economics, Department of Economics (Volkswirtschaftliche Abteilung).
    19. Parvin Ahmadi & Iman Gholampour & Mahmoud Tabandeh, 2018. "Cluster-based sparse topical coding for topic mining and document clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(3), pages 537-558, September.
    20. Ignacio Rodríguez-Rodríguez & José-Víctor Rodríguez & Niloofar Shirvanizadeh & Andrés Ortiz & Domingo-Javier Pardo-Quiles, 2021. "Applications of Artificial Intelligence, Machine Learning, Big Data and the Internet of Things to the COVID-19 Pandemic: A Scientometric Review Using Text Mining," IJERPH, MDPI, vol. 18(16), pages 1-29, August.
