IDEAS home Printed from https://ideas.repec.org/a/plo/pcbi00/1008277.html
   My bibliography  Save this article

EventEpi—A natural language processing framework for event-based surveillance

Author

Listed:
  • Auss Abbood
  • Alexander Ullrich
  • Rüdiger Busche
  • Stéphane Ghozzi

Abstract

According to the World Health Organization (WHO), around 60% of all outbreaks are detected using informal sources. In many public health institutes, including the WHO and the Robert Koch Institute (RKI), dedicated groups of public health agents sift through numerous articles and newsletters to detect relevant events. This media screening is one important part of event-based surveillance (EBS). Reading the articles, discussing their relevance, and putting key information into a database is a time-consuming process. To support EBS, but also to gain insights into what makes an article and the event it describes relevant, we developed a natural language processing framework for automated information extraction and relevance scoring. First, we scraped relevant sources for EBS as done at the RKI (WHO Disease Outbreak News and ProMED) and automatically extracted the articles’ key data: disease, country, date, and confirmed-case count. For this, we performed named entity recognition in two steps: EpiTator, an open-source epidemiological annotation tool, suggested many different possibilities for each. We extracted the key country and disease using a heuristic with good results. We trained a naive Bayes classifier to find the key date and confirmed-case count, using the RKI’s EBS database as labels which performed modestly. Then, for relevance scoring, we defined two classes to which any article might belong: The article is relevant if it is in the EBS database and irrelevant otherwise. We compared the performance of different classifiers, using bag-of-words, document and word embeddings. The best classifier, a logistic regression, achieved a sensitivity of 0.82 and an index balanced accuracy of 0.61. Finally, we integrated these functionalities into a web application called EventEpi where relevant sources are automatically analyzed and put into a database. The user can also provide any URL or text, that will be analyzed in the same way and added to the database. Each of these steps could be improved, in particular with larger labeled datasets and fine-tuning of the learning algorithms. The overall framework, however, works already well and can be used in production, promising improvements in EBS. The source code and data are publicly available under open licenses.Author summary: Public health surveillance that uses official sources to detect important disease outbreaks suffers from a time delay. Using unofficial sources, like websites, to detect rumors of disease outbreaks can offer a decisive temporal advantage. Due to the vast amount of information on the web, public health agents are only capable to process a fraction of the available information. Recent advances in natural language processing and deep learning offer new opportunities to process large amounts of text with human-like understanding. However, to the best of our knowledge, no open-source solutions using natural language processing for public health surveillance exist. We extracted expert labels from a public health unit that screens online resources every day to train various machine learning models and perform key information extraction as well as relevance scoring on epidemiological texts. With the help of those expert labels, we scraped and annotated news articles to create inputs for the machine learning models. The scraped texts were transformed into word embeddings that were trained on 61,320 epidemiological articles and the Wikipedia corpus (May 2020). We were able to extract key information from epidemiological texts such as disease, outbreak country, cases counts, and the date of these counts. While disease and country could be extracted with high accuracy, date and count could still be extracted with medium accuracy with the help of machine learning models. Furthermore, our model could detect 82% of all relevant articles in an unseen test dataset. Both of these functionalities were embedded into a web application. We present an open-source framework that public health agents can use to include online sources into their screening routine. This can be of great help to existing and emerging public health institutions. Although parts of the information extraction function robustly and the relevance scoring could already save public health agent’s time, methods to explain deep and machine learning models showed that the learned patterns are sometimes implausible. This could be improved with more labeled data and optimization of the learning algorithms.

Suggested Citation

  • Auss Abbood & Alexander Ullrich & Rüdiger Busche & Stéphane Ghozzi, 2020. "EventEpi—A natural language processing framework for event-based surveillance," PLOS Computational Biology, Public Library of Science, vol. 16(11), pages 1-16, November.
  • Handle: RePEc:plo:pcbi00:1008277
    DOI: 10.1371/journal.pcbi.1008277
    as

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1008277
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1008277&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pcbi.1008277?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item
    ---><---

    More about this item

    Statistics

    Access and download statistics

    Corrections

    All material on this site has been provided by the respective publishers and authors. You can help correct errors and omissions. When requesting a correction, please mention this item's handle: RePEc:plo:pcbi00:1008277. See general information about how to correct material in RePEc.

    If you have authored this item and are not yet registered with RePEc, we encourage you to do it here. This allows to link your profile to this item. It also allows you to accept potential citations to this item that we are uncertain about.

    We have no bibliographic references for this item. You can help adding them by using this form .

    If you know of missing items citing this one, you can help us creating those links by adding the relevant references in the same way as above, for each refering item. If you are a registered author of this item, you may also want to check the "citations" tab in your RePEc Author Service profile, as there may be some citations waiting for confirmation.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: ploscompbiol (email available below). General contact details of provider: https://journals.plos.org/ploscompbiol/ .

    Please note that corrections may take a couple of weeks to filter through the various RePEc services.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.