SocialTERM-Extractor: Identifying and Predicting Social-Problem-Specific Key Noun Terms from a Large Number of Online News Articles Using Text Mining and Machine Learning Techniques

Author

Listed:
  • Jong Hwan Suh

    (Department of Management Information Systems, BERI, Gyeongsang National University, 501 Jinjudae-ro Jinju-si, Gyeongsangnam-do 52828, Korea)

Abstract

In the digital age, the abundant unstructured data on the Internet, particularly online news articles, provide opportunities for identifying social problems and understanding social systems for sustainability. However, previous works have paid little attention to the social-problem-specific perspectives of such big data, and it remains unclear how information technologies can use these data to identify and manage ongoing social problems. In this context, this paper introduces social-problem-specific key noun terms, namely SocialTERMs, which can be used not only to search the Internet for social-problem-related data but also to monitor ongoing and future events related to social problems. Moreover, to reduce the time-consuming human effort needed to identify SocialTERMs, this paper designs and examines the SocialTERM-Extractor, an automatic approach that identifies the key noun terms of social-problem-related topics, namely SPRTs, in a large number of online news articles and predicts the SocialTERMs among the identified key noun terms. The paper is novel as the first attempt to identify and predict SocialTERMs from a large number of online news articles. It contributes to the literature by proposing three types of text-mining-based features, namely temporal weight, sentiment, and complex network structural features, and by comparing the performance of these features across various machine learning techniques, including deep learning. In particular, when the approach was applied to a large number of online news articles published in South Korea over a 12-month period and mostly written in Korean, the experimental results showed that Boosting Decision Tree gave the best performance with the full feature sets. The results also showed that SocialTERMs can be predicted with high performance by the proposed SocialTERM-Extractor. Ultimately, this paper can benefit individuals and organizations who want to explore and use social-problem-related data systematically to understand and manage social problems, even if they are unfamiliar with ongoing social problems.
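
As a rough illustration of what a "complex network structural feature" for a noun term might look like, the sketch below builds a term co-occurrence network from tokenized articles and computes each term's degree centrality. This is only an assumption-based example, not the paper's actual feature definition: the abstract does not specify how the network is built or which structural measures are used, and the function name `cooccurrence_degree_centrality` and the toy input are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_degree_centrality(documents):
    """Return degree centrality for each term in a term co-occurrence network.

    documents: iterable of lists of noun terms (one list per article).
    Assumption: two terms are linked if they appear together in at least
    one article. The paper may define its network differently.
    """
    neighbors = defaultdict(set)
    for terms in documents:
        unique_terms = sorted(set(terms))
        for term in unique_terms:
            neighbors[term]  # register the term even if it has no neighbors
        for a, b in combinations(unique_terms, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {term: 0.0 for term in neighbors}
    # Degree centrality: share of the other terms this term is linked to.
    return {term: len(adj) / (n - 1) for term, adj in neighbors.items()}


# Toy, made-up input: each inner list stands for the noun terms of one article.
articles = [
    ["unemployment", "youth", "policy"],
    ["unemployment", "housing"],
    ["housing", "policy"],
]
centrality = cooccurrence_degree_centrality(articles)
```

A score like this could serve as one column of a feature table, alongside temporal and sentiment features, that is then passed to a classifier; how the paper actually combines them is not stated in the abstract.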

Suggested Citation

  • Jong Hwan Suh, 2019. "SocialTERM-Extractor: Identifying and Predicting Social-Problem-Specific Key Noun Terms from a Large Number of Online News Articles Using Text Mining and Machine Learning Techniques," Sustainability, MDPI, vol. 11(1), pages 1-44, January.
  • Handle: RePEc:gam:jsusta:v:11:y:2019:i:1:p:196-:d:194504

    Download full text from publisher

    File URL: https://www.mdpi.com/2071-1050/11/1/196/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2071-1050/11/1/196/
    Download Restriction: no

    Citations


    Cited by:

    1. Jong Hwan Suh, 2022. "Machine-Learning-Based Gender Distribution Prediction from Anonymous News Comments: The Case of Korean News Portal," Sustainability, MDPI, vol. 14(16), pages 1-17, August.
    2. Boram Choi & Jong Hwan Suh, 2020. "Forecasting Spare Parts Demand of Military Aircraft: Comparisons of Data Mining Techniques and Managerial Features from the Case of South Korea," Sustainability, MDPI, vol. 12(15), pages 1-20, July.
    3. Samuel Zanferdini Oliva & Livia Oliveira-Ciabati & Denise Gazotto Dezembro & Mário Sérgio Adolfi Júnior & Maísa Carvalho Silva & Hugo Cesar Pessotti & Juliana Tarossi Pollettini, 2021. "Text structuring methods based on complex network: a systematic review," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(2), pages 1471-1493, February.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Xiao Zhou & Lu Huang & Yi Zhang & Miaomiao Yu, 2019. "A hybrid approach to detecting technological recombination based on text mining and patent network analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 121(2), pages 699-737, November.
    2. Zhang, Yi & Wu, Mengjia & Miao, Wen & Huang, Lu & Lu, Jie, 2021. "Bi-layer network analytics: A methodology for characterizing emerging general-purpose technologies," Journal of Informetrics, Elsevier, vol. 15(4).
    3. Zhang, Yi & Lu, Jie & Liu, Feng & Liu, Qian & Porter, Alan & Chen, Hongshu & Zhang, Guangquan, 2018. "Does deep learning help topic extraction? A kernel k-means clustering method with word embedding," Journal of Informetrics, Elsevier, vol. 12(4), pages 1099-1117.
    4. Song, Kisik & Kim, Karp Soo & Lee, Sungjoo, 2017. "Discovering new technology opportunities based on patents: Text-mining and F-term analysis," Technovation, Elsevier, vol. 60, pages 1-14.
    5. Raf Guns & Yu Xian Liu & Dilruba Mahbuba, 2011. "Q-measures and betweenness centrality in a collaboration network: a case study of the field of informetrics," Scientometrics, Springer;Akadémiai Kiadó, vol. 87(1), pages 133-147, April.
    6. Zhang, Yi & Huang, Ying & Porter, Alan L. & Zhang, Guangquan & Lu, Jie, 2019. "Discovering and forecasting interactions in big data research: A learning-enhanced bibliometric study," Technological Forecasting and Social Change, Elsevier, vol. 146(C), pages 795-807.
    7. Vinayak & Raghuvanshi, Adarsh & Kshitij, Avinash, 2023. "Signatures of capacity development through research collaborations in artificial intelligence and machine learning," Journal of Informetrics, Elsevier, vol. 17(1).
    8. Dejing Kong & Jianzhong Yang & Lingfeng Li, 2020. "Early identification of technological convergence in numerical control machine tool: a deep learning approach," Scientometrics, Springer;Akadémiai Kiadó, vol. 125(3), pages 1983-2009, December.
    9. Kai Hu & Huayi Wu & Kunlun Qi & Jingmin Yu & Siluo Yang & Tianxing Yu & Jie Zheng & Bo Liu, 2018. "A domain keyword analysis approach extending Term Frequency-Keyword Active Index with Google Word2Vec model," Scientometrics, Springer;Akadémiai Kiadó, vol. 114(3), pages 1031-1068, March.
    10. Zhichao Wang & Valentin Zelenyuk, 2021. "Performance Analysis of Hospitals in Australia and its Peers: A Systematic Review," CEPA Working Papers Series WP012021, School of Economics, University of Queensland, Australia.
    11. Kyuwoong Kim & Kyeongmin Park & Sungjoo Lee, 2019. "Investigating technology opportunities: the use of SAOx analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 118(1), pages 45-70, January.
    12. Yi Bu & Binglu Wang & Win-bin Huang & Shangkun Che & Yong Huang, 2018. "Using the appearance of citations in full text on author co-citation analysis," Scientometrics, Springer;Akadémiai Kiadó, vol. 116(1), pages 275-289, July.
    13. Yongjun Zhu & Erjia Yan, 2015. "Dynamic subfield analysis of disciplines: an examination of the trading impact and knowledge diffusion patterns of computer science," Scientometrics, Springer;Akadémiai Kiadó, vol. 104(1), pages 335-359, July.
    14. Wang, Xiaoguang & He, Jing & Huang, Han & Wang, Hongyu, 2022. "MatrixSim: A new method for detecting the evolution paths of research topics," Journal of Informetrics, Elsevier, vol. 16(4).
    15. Zhichao Wang & Bao Hoang Nguyen & Valentin Zelenyuk, 2024. "Performance analysis of hospitals in Australia and its peers: a systematic and critical review," Journal of Productivity Analysis, Springer, vol. 62(2), pages 139-173, October.
    16. Alison M. J. Buchan & Eva Jurczyk & Ruth Isserlin & Gary D. Bader, 2016. "Global neuroscience and mental health research: a bibliometrics case study," Scientometrics, Springer;Akadémiai Kiadó, vol. 109(1), pages 515-531, October.
    17. Xuefeng Wang & Shuo Zhang & Yuqin Liu, 2022. "ITGInsight–discovering and visualizing research fronts in the scientific literature," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(11), pages 6509-6531, November.
    18. Arnauld Bessagnet & Joan Crespo & Jerome Vicente, 2023. "How is the literature on Digital Entrepreneurial Ecosystems structured? A socio-semantic network approach," Papers in Evolutionary Economic Geography (PEEG) 2320, Utrecht University, Department of Human Geography and Spatial Planning, Group Economic Geography, revised Oct 2023.
    19. Lu Huang & Xiang Chen & Yi Zhang & Changtian Wang & Xiaoli Cao & Jiarun Liu, 2022. "Identification of topic evolution: network analytics with piecewise linear representation and word embedding," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(9), pages 5353-5383, September.
    20. Zhai, Li & Yan, Xiangbin, 2022. "A directed collaboration network for exploring the order of scientific collaboration," Journal of Informetrics, Elsevier, vol. 16(4).
