Printed from https://ideas.repec.org/a/eee/infome/v14y2020i3s1751157719305127.html

Monolingual and multilingual topic analysis using LDA and BERT embeddings

Author

Listed:
  • Xie, Qing
  • Zhang, Xinyuan
  • Ding, Ying
  • Song, Min

Abstract

Analyzing research topics offers potential insights into the direction of scientific development. In particular, analyzing multilingual research topics can help researchers grasp the evolution of topics globally, revealing topic similarity among scientific publications written in different languages. Most studies to date on topic analysis have been based on English-language publications and have relied heavily on citation-based topic evolution analysis. However, since it can be challenging for English publications to cite non-English sources and since many languages do not offer English translations of abstracts, citation-based methodologies are not suitable for analyzing multilingual research topic relations. Since multilingual sentence embeddings can effectively preserve word semantics in multilingual translation tasks, a topic model based on multilingual sentence embeddings could potentially generate topic–word distributions for publications in multilingual analysis. In this paper, which is situated in the field of library and information science, we use multilingual pretrained Bidirectional Encoder Representations from Transformers (BERT) embeddings and the Latent Dirichlet Allocation (LDA) topic model to analyze topic evolution in monolingual and multilingual topic similarity settings. For each topic, we multiply its LDA probability value by the averaged tensor similarity of BERT embeddings to explore the evolution of the topic in scientific publications. As our proposed method does not rely on a machine translator or the author's subjective translation, it avoids the confusion and misuse caused by either machine error or the author's subjectively chosen English keywords. Our results show that the proposed approach is well suited to analyzing topic evolution in scientific publications in both monolingual and multilingual topic similarity settings.
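The scoring step described in the abstract (an LDA topic probability multiplied by an averaged embedding similarity) can be illustrated with a minimal sketch. The abstract does not say how the "averaged tensor similarity" is computed, so the sketch below assumes the mean pairwise cosine similarity between two sets of embedding vectors. The function names and the toy two-dimensional vectors are hypothetical stand-ins for real multilingual BERT embeddings, not the authors' implementation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def averaged_similarity(source: Sequence[Vector], target: Sequence[Vector]) -> float:
    """Mean cosine similarity over all (source, target) embedding pairs.

    Assumption: this mean stands in for the abstract's "averaged tensor similarity".
    """
    pairs = [(u, v) for u in source for v in target]
    if not pairs:
        return 0.0
    return sum(cosine_similarity(u, v) for u, v in pairs) / len(pairs)


def weighted_topic_score(
    topic_probability: float,
    source: Sequence[Vector],
    target: Sequence[Vector],
) -> float:
    """Multiply a topic's LDA probability by the averaged embedding similarity."""
    return topic_probability * averaged_similarity(source, target)


# Toy example: two embeddings for a topic in one language (or period),
# one embedding for the candidate matching topic in another.
source_embeddings = [[1.0, 0.0], [0.0, 1.0]]
target_embeddings = [[1.0, 0.0]]
score = weighted_topic_score(0.4, source_embeddings, target_embeddings)
print(score)
```

In practice the vectors would come from a multilingual pretrained BERT model and the probability from a fitted LDA model; both are left out here to keep the sketch self-contained.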

Suggested Citation

  • Xie, Qing & Zhang, Xinyuan & Ding, Ying & Song, Min, 2020. "Monolingual and multilingual topic analysis using LDA and BERT embeddings," Journal of Informetrics, Elsevier, vol. 14(3).
  • Handle: RePEc:eee:infome:v:14:y:2020:i:3:s1751157719305127
    DOI: 10.1016/j.joi.2020.101055

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S1751157719305127
    Download Restriction: Full text for ScienceDirect subscribers only



    Citations

    Cited by:

    1. Yuetong Chen & Hao Wang & Baolong Zhang & Wei Zhang, 2022. "A method of measuring the article discriminative capacity and its distribution," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(6), pages 3317-3341, June.
    2. Xiaoguang Wang & Hongyu Wang & Han Huang, 2021. "Evolutionary exploration and comparative analysis of the research topic networks in information disciplines," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(6), pages 4991-5017, June.
    3. Yating Li & Ye Chen & Qiyu Wang, 2021. "Evolution and diffusion of information literacy topics," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(5), pages 4195-4224, May.
    4. Byungun Yoon & Songhee Kim & Sunhye Kim & Hyeonju Seol, 2022. "Doc2vec-based link prediction approach using SAO structures: application to patent network," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(9), pages 5385-5414, September.
    5. Wang, Xiaoguang & He, Jing & Huang, Han & Wang, Hongyu, 2022. "MatrixSim: A new method for detecting the evolution paths of research topics," Journal of Informetrics, Elsevier, vol. 16(4).
    6. Wang, Changlin, 2024. "Social media platform-oriented topic mining and information security analysis by big data and deep convolutional neural network," Technological Forecasting and Social Change, Elsevier, vol. 199(C).
    7. Jingda Ding & Yifan Chen & Chao Liu, 2023. "Exploring the research features of Nobel laureates in Physics based on the semantic similarity measurement," Scientometrics, Springer;Akadémiai Kiadó, vol. 128(9), pages 5247-5275, September.
    8. Qiang Gao & Man Jiang, 2024. "Exploring technology fusion by combining latent Dirichlet allocation with Doc2vec: a case of digital medicine and machine learning," Scientometrics, Springer;Akadémiai Kiadó, vol. 129(7), pages 4043-4070, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Zhichao Ba & Yujie Cao & Jin Mao & Gang Li, 2019. "A hierarchical approach to analyzing knowledge integration between two fields—a case study on medical informatics and computer science," Scientometrics, Springer;Akadémiai Kiadó, vol. 119(3), pages 1455-1486, June.
    2. Beibei Hu & Xianlei Dong & Chenwei Zhang & Timothy D. Bowman & Ying Ding & Staša Milojević & Chaoqun Ni & Erjia Yan & Vincent Larivière, 2015. "A lead-lag analysis of the topic evolution patterns for preprints and publications," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 66(12), pages 2643-2656, December.
    3. Lu Huang & Xiang Chen & Yi Zhang & Changtian Wang & Xiaoli Cao & Jiarun Liu, 2022. "Identification of topic evolution: network analytics with piecewise linear representation and word embedding," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(9), pages 5353-5383, September.
    4. Marie Katsurai & Shunsuke Ono, 2019. "TrendNets: mapping emerging research trends from dynamic co-word networks via sparse representation," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(3), pages 1583-1598, December.
    5. Zongshui Wang & Hong Zhao & Yan Wang, 2015. "Social networks in marketing research 2001–2014: a co-word analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 105(1), pages 65-82, October.
    6. Anke Piepenbrink & Elkin Nurmammadov, 2015. "Topics in the literature of transition economies and emerging markets," Scientometrics, Springer;Akadémiai Kiadó, vol. 102(3), pages 2107-2130, March.
    7. Sjögårde, Peter & Ahlgren, Per, 2018. "Granularity of algorithmically constructed publication-level classifications of research publications: Identification of topics," Journal of Informetrics, Elsevier, vol. 12(1), pages 133-152.
    8. Curci, Ylenia & Mongeau Ospina, Christian A., 2016. "Investigating biofuels through network analysis," Energy Policy, Elsevier, vol. 97(C), pages 60-72.
    9. Minchul Lee & Min Song, 2020. "Incorporating citation impact into analysis of research trends," Scientometrics, Springer;Akadémiai Kiadó, vol. 124(2), pages 1191-1224, August.
    10. Parvin Ahmadi & Iman Gholampour & Mahmoud Tabandeh, 2018. "Cluster-based sparse topical coding for topic mining and document clustering," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(3), pages 537-558, September.
    11. Yalcin, Haydar & Daim, Tugrul & Moughari, Mahdieh Mokhtari & Mermoud, Alain, 2024. "Supercomputers and quantum computing on the axis of cyber security," Technology in Society, Elsevier, vol. 77(C).
    12. Marie-Violaine Tatry & Dominique Fournier & Benoît Jeannequin & Françoise Dosba, 2014. "EU27 and USA leadership in fruit and vegetable research: a bibliometric study from 2000 to 2009," Scientometrics, Springer;Akadémiai Kiadó, vol. 98(3), pages 2207-2222, March.
    13. Elizabeth C. Teixeira & Victor E. L. Silva & Nidia N. Fabré & Vandick S. Batista, 2020. "Marine shrimp fisheries research—a mismatch on spatial and thematic needs," Scientometrics, Springer;Akadémiai Kiadó, vol. 122(1), pages 591-606, January.
    14. Hao Wang & Sanhong Deng & Xinning Su, 2016. "A study on construction and analysis of discipline knowledge structure of Chinese LIS based on CSSCI," Scientometrics, Springer;Akadémiai Kiadó, vol. 109(3), pages 1725-1759, December.
    15. Jiancheng Guan & Lanxin Pang, 2018. "Bidirectional relationship between network position and knowledge creation in Scientometrics," Scientometrics, Springer;Akadémiai Kiadó, vol. 115(1), pages 201-222, April.
    16. Aliakbar Pourhatami & Mohammad Kaviyani-Charati & Bahareh Kargar & Hamed Baziyad & Maryam Kargar & Carlos Olmeda-Gómez, 2021. "Mapping the intellectual structure of the coronavirus field (2000–2020): a co-word analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(8), pages 6625-6657, August.
    17. Kim, Hyeyoung & Park, Hyelin & Song, Min, 2022. "Developing a topic-driven method for interdisciplinarity analysis," Journal of Informetrics, Elsevier, vol. 16(2).
    18. Khadijah Nabilah Mohd Zahri & Azham Zulkharnain & Suriana Sabri & Claudio Gomez-Fuentes & Siti Aqlima Ahmad, 2021. "Research Trends of Biodegradation of Cooking Oil in Antarctica from 2001 to 2021: A Bibliometric Analysis Based on the Scopus Database," IJERPH, MDPI, vol. 18(4), pages 1-15, February.
    19. Erjia Yan & Ying Ding & Elin K. Jacob, 2012. "Overlaying communities and topics: an analysis on publication networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 90(2), pages 499-513, February.
    20. He, Bing & Ding, Ying & Tang, Jie & Reguramalingam, Vignesh & Bollen, Johan, 2013. "Mining diversity subgraph in multidisciplinary scientific collaboration networks: A meso perspective," Journal of Informetrics, Elsevier, vol. 7(1), pages 117-128.
