Printed from https://ideas.repec.org/a/gam/jijerp/v17y2020i24p9467-d463852.html

Automated Classification of Online Sources for Infectious Disease Occurrences Using Machine-Learning-Based Natural Language Processing Approaches

Author

Listed:
  • Mira Kim

    (Department of Preventive Medicine, College of Medicine, The Catholic University of Korea, Seoul 06591, Korea)

  • Kyunghee Chae

    (Department of Preventive Medicine, College of Medicine, The Catholic University of Korea, Seoul 06591, Korea)

  • Seungwoo Lee

    (Department of Data and HPC Science, University of Science and Technology, Daejeon 34113, Korea;
    Research Data Sharing Center, Korea Institute of Science and Technology Information, Daejeon 34141, Korea)

  • Hong-Jun Jang

    (Research Data Sharing Center, Korea Institute of Science and Technology Information, Daejeon 34141, Korea)

  • Sukil Kim

    (Department of Preventive Medicine, College of Medicine, The Catholic University of Korea, Seoul 06591, Korea)

Abstract

Collecting valid information from electronic sources to detect potential infectious disease outbreaks is time-consuming and labor-intensive. Automated identification of relevant information using machine learning is therefore needed to respond to a potential outbreak. A total of 2864 documents were collected from various websites and then manually categorized and labeled by two reviewers; the labels used for the training and test data were based on reviewer consensus. Two machine learning algorithms (ConvNet and bidirectional long short-term memory, BiLSTM) and two classification methods (DocClass and SenClass) were used to classify the documents. Precision, recall, F1, accuracy, and area under the curve were measured to evaluate each model. With DocClass, ConvNet yielded higher average, minimum, and maximum accuracies (87.6%, 85.2%, and 91.1%, respectively) than BiLSTM. With SenClass, BiLSTM outperformed ConvNet, with average, minimum, and maximum accuracies of 92.8%, 92.6%, and 93.3%, respectively. BiLSTM with SenClass yielded an overall accuracy of 92.9% in classifying infectious disease occurrences. Machine learning performed comparably to a human expert given a particular text extraction system. This study suggests that machine learning analysis of website information can achieve high accuracy when abundant articles/documents are available.
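
The evaluation measures named in the abstract (precision, recall, F1, accuracy, and area under the curve) follow standard definitions for binary classification. The Python sketch below is only an illustration of those definitions, not the authors' evaluation code: the function names `binary_metrics` and `roc_auc` and the toy labels are assumptions introduced here, and label 1 is assumed to mean "document reports an infectious disease occurrence".

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Metrics:
    precision: float
    recall: float
    f1: float
    accuracy: float


def _ratio(num: float, den: float) -> float:
    # Return 0.0 instead of raising when a denominator is zero.
    return num / den if den else 0.0


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Metrics:
    """Standard confusion-matrix metrics for 0/1 labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    f1 = _ratio(2 * precision * recall, precision + recall)
    accuracy = _ratio(tp + tn, len(pairs))
    return Metrics(precision, recall, f1, accuracy)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs in which the positive example has the higher score; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return _ratio(wins, len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and scores, invented for illustration only.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    print(binary_metrics(y_true, y_pred))
    print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))
```

The pairwise AUC computation is quadratic in the number of documents, which is acceptable for a sketch; a library routine would normally be used for a corpus of several thousand documents.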

Suggested Citation

  • Mira Kim & Kyunghee Chae & Seungwoo Lee & Hong-Jun Jang & Sukil Kim, 2020. "Automated Classification of Online Sources for Infectious Disease Occurrences Using Machine-Learning-Based Natural Language Processing Approaches," IJERPH, MDPI, vol. 17(24), pages 1-13, December.
  • Handle: RePEc:gam:jijerp:v:17:y:2020:i:24:p:9467-:d:463852

    Download full text from publisher

    File URL: https://www.mdpi.com/1660-4601/17/24/9467/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1660-4601/17/24/9467/
    Download Restriction: no

    References listed on IDEAS

    1. Ahmed M Alaa & Thomas Bolton & Emanuele Di Angelantonio & James H F Rudd & Mihaela van der Schaar, 2019. "Cardiovascular disease risk prediction using automated machine learning: A prospective study of 423,604 UK Biobank participants," PLOS ONE, Public Library of Science, vol. 14(5), pages 1-17, May.

    Citations

    Cited by:

    1. Senqi Yang & Xuliang Duan & Zeyan Xiao & Zhiyao Li & Yuhai Liu & Zhihao Jie & Dezhao Tang & Hui Du, 2022. "Sentiment Classification of Chinese Tourism Reviews Based on ERNIE-Gram+GCN," IJERPH, MDPI, vol. 19(20), pages 1-20, October.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Menteş, Nurettin & Çakmak, Mehmet Aziz & Kurt, Mehmet Emin, 2023. "Estimation of service length with the machine learning algorithms and neural networks for patients who receiving home health care," Evaluation and Program Planning, Elsevier, vol. 100(C).
    2. Victor Olsavszky & Mihnea Dosius & Cristian Vladescu & Johannes Benecke, 2020. "Time Series Analysis and Forecasting with Automated Machine Learning on a National ICD-10 Database," IJERPH, MDPI, vol. 17(14), pages 1-17, July.
    3. Shelda Sajeev & Stephanie Champion & Alline Beleigoli & Derek Chew & Richard L. Reed & Dianna J. Magliano & Jonathan E. Shaw & Roger L. Milne & Sarah Appleton & Tiffany K. Gill & Anthony Maeder, 2021. "Predicting Australian Adults at High Risk of Cardiovascular Disease Mortality Using Standard Risk Factors and Machine Learning," IJERPH, MDPI, vol. 18(6), pages 1-14, March.
    4. Ervasti, Jenni & Pentti, Jaana & Seppälä, Piia & Ropponen, Annina & Virtanen, Marianna & Elovainio, Marko & Chandola, Tarani & Kivimäki, Mika & Airaksinen, Jaakko, 2023. "Prediction of bullying at work: A data-driven analysis of the Finnish public sector cohort study," Social Science & Medicine, Elsevier, vol. 317(C).
