
Global Disease Monitoring and Forecasting with Wikipedia

Author

Listed:
  • Nicholas Generous
  • Geoffrey Fairchild
  • Alina Deshpande
  • Sara Y Del Valle
  • Reid Priedhorsky

Abstract

Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.

Author Summary: Even in developed countries, infectious disease has significant impact; for example, flu seasons in the United States take between 3,000 and 49,000 lives. Disease surveillance, traditionally based on patient visits to health providers and laboratory tests, can reduce these impacts. Motivated by cost and timeliness, surveillance methods based on internet data have recently emerged, but are not yet reliable for several reasons, including weak scientific peer review, breadth of diseases and countries covered, and underdeveloped forecasting capabilities. We argue that these challenges can be overcome by using a freely available data source: aggregated access logs from the online encyclopedia Wikipedia. Using simple statistical techniques, our proof-of-concept experiments suggest that these data are effective for predicting the present, as well as forecasting up to the 28-day limit of our tests. Our results also suggest that these models can be used even in places with no official data upon which to build models. In short, this paper establishes the utility of Wikipedia as a broadly effective data source for disease information, and we outline a path to a reliable, scientifically sound, operational, and global disease surveillance system that overcomes key gaps in existing traditional and internet-based techniques.
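
The general idea summarized above (a linear model relating article access counts to official case counts, evaluated at several forecast horizons) can be illustrated with a small sketch. This is not the authors' code or their article selection procedure: it assumes a single predictor, a lag measured in time steps, and synthetic numbers made up for the example. The names `fit_line`, `r_squared`, `lagged_pairs`, `best_lag`, `VIEWS`, and `CASES` are all hypothetical.

```python
from statistics import mean

# Synthetic, made-up series (NOT data from the paper).
# CASES[t] was constructed as 0.5 * VIEWS[t - 1] + 3 for t >= 1.
VIEWS = [10, 12, 15, 20, 30, 45, 40, 28, 18, 12]
CASES = [8, 8, 9, 10.5, 13, 18, 25.5, 23, 17, 12]


def fit_line(x, y):
    """Ordinary least squares fit of y ~ slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def r_squared(x, y, slope, intercept):
    """Coefficient of determination of the fitted line on (x, y)."""
    my = mean(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def lagged_pairs(views, cases, lag):
    """Pair views at step t with cases at step t + lag (lag >= 0)."""
    if lag == 0:
        return list(views), list(cases)
    return list(views[:-lag]), list(cases[lag:])


def best_lag(views, cases, max_lag):
    """Return (lag, r2) for the lag in 0..max_lag with the highest r2."""
    scores = []
    for lag in range(max_lag + 1):
        x, y = lagged_pairs(views, cases, lag)
        slope, intercept = fit_line(x, y)
        scores.append((lag, r_squared(x, y, slope, intercept)))
    return max(scores, key=lambda item: item[1])


if __name__ == "__main__":
    lag, r2 = best_lag(VIEWS, CASES, max_lag=3)
    print(f"best lag: {lag} step(s), r2 = {r2:.3f}")
```

A positive best lag means that access counts lead case counts by that many steps, which is the sense in which such a model has forecasting value. A realistic model would use several articles as predictors and be evaluated on held-out data rather than on the series it was fitted to.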

Suggested Citation

  • Nicholas Generous & Geoffrey Fairchild & Alina Deshpande & Sara Y Del Valle & Reid Priedhorsky, 2014. "Global Disease Monitoring and Forecasting with Wikipedia," PLOS Computational Biology, Public Library of Science, vol. 10(11), pages 1-16, November.
  • Handle: RePEc:plo:pcbi00:1003892
    DOI: 10.1371/journal.pcbi.1003892

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003892
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003892&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Jim Giles, 2005. "Internet encyclopaedias go head to head," Nature, Nature, vol. 438(7070), pages 900-901, December.
    2. Alan D. Lopez & Colin D. Mathers & Majid Ezzati & Dean T. Jamison & Christopher J. L. Murray, 2006. "Global Burden of Disease and Risk Factors," World Bank Publications - Books, The World Bank Group, number 7039.

    Citations

    Cited by:

    1. Kuchler, Theresa & Russel, Dominic & Stroebel, Johannes, 2022. "JUE Insight: The geographic spread of COVID-19 correlates with the structure of social networks as measured by Facebook," Journal of Urban Economics, Elsevier, vol. 127(C).
    2. Svitlana Volkova & Ellyn Ayton & Katherine Porterfield & Courtney D Corley, 2017. "Forecasting influenza-like illness dynamics for military populations using neural networks and social media," PLOS ONE, Public Library of Science, vol. 12(12), pages 1-22, December.
    3. Meyer, Christian & Hamer, Martin & Terlau, Wiltrud & Raithel, Johannes & Pongratz, Patrick, 2015. "Web Data Mining and Social Media Analysis for better Communication in Food Safety Crises," International Journal on Food System Dynamics, International Center for Management, Communication, and Research, vol. 6(3), pages 1-10, July.
    4. Julissa Alexandra Galarza-Villamar & Mariette McCampbell & Andres Galarza-Villamar & Cees Leeuwis & Francesco Cecchi & John Galarza-Rodrigo, 2021. "A Public Bad Game Method to Study Dynamics in Socio-Ecological Systems (Part II): Results of Testing Musa-Game in Rwanda and Adding Emergence and Spatiality to the Analysis," Sustainability, MDPI, vol. 13(16), pages 1-27, August.
    5. Samuel V Scarpino & James G Scott & Rosalind M Eggo & Bruce Clements & Nedialko B Dimitrov & Lauren Ancel Meyers, 2020. "Socioeconomic bias in influenza surveillance," PLOS Computational Biology, Public Library of Science, vol. 16(7), pages 1-19, July.
    6. Logan C Brooks & David C Farrow & Sangwon Hyun & Ryan J Tibshirani & Roni Rosenfeld, 2018. "Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions," PLOS Computational Biology, Public Library of Science, vol. 14(6), pages 1-29, June.
    7. Zeynep Ertem & Dorrie Raymond & Lauren Ancel Meyers, 2018. "Optimal multi-source forecasting of seasonal influenza," PLOS Computational Biology, Public Library of Science, vol. 14(9), pages 1-16, September.
    8. Ibrahim Musa & Hyun Woo Park & Lkhagvadorj Munkhdalai & Keun Ho Ryu, 2018. "Global Research on Syndromic Surveillance from 1993 to 2017: Bibliometric Analysis and Visualization," Sustainability, MDPI, vol. 10(10), pages 1-20, September.
    9. Dave Osthus & Ashlynn R Daughton & Reid Priedhorsky, 2019. "Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited," PLOS Computational Biology, Public Library of Science, vol. 15(2), pages 1-19, February.
    10. Kyle S Hickmann & Geoffrey Fairchild & Reid Priedhorsky & Nicholas Generous & James M Hyman & Alina Deshpande & Sara Y Del Valle, 2015. "Forecasting the 2013–2014 Influenza Season Using Wikipedia," PLOS Computational Biology, Public Library of Science, vol. 11(5), pages 1-29, May.
    11. Meyer, Christian & Hamer, Martin & Terlau, Wiltrud & Raithel, Johannes & Pongratz, Patrick, 2015. "Web Data Mining and Social Media Analysis for better Communication in Food Safety Crises," 2015 International European Forum (144th EAAE Seminar), February 9-13, 2015, Innsbruck-Igls, Austria 206212, International European Forum on System Dynamics and Innovation in Food Networks.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Tsung-Ming Tsao & Jing-Shiang Hwang & Sung-Tsun Lin & Charlene Wu & Ming-Jer Tsai & Ta-Chen Su, 2022. "Forest Bathing Is Better than Walking in Urban Park: Comparison of Cardiac and Vascular Function between Urban and Forest Parks," IJERPH, MDPI, vol. 19(6), pages 1-15, March.
    2. Anderson, Soren T. & Laxminarayan, Ramanan & Salant, Stephen W., 2012. "Diversify or focus? Spending to combat infectious diseases when budgets are tight," Journal of Health Economics, Elsevier, vol. 31(4), pages 658-675.
    3. Michael Grimm & Carole Treibich, 2013. "Why Do Some Bikers Wear a Helmet and Others Don't? Evidence from Delhi, India," AMSE Working Papers 1348, Aix-Marseille School of Economics, France, revised 10 Oct 2013.
    4. Christopher Fitzpatrick & Katherine Floyd, 2012. "A Systematic Review of the Cost and Cost Effectiveness of Treatment for Multidrug-Resistant Tuberculosis," PharmacoEconomics, Springer, vol. 30(1), pages 63-80, January.
    5. Wei Luo & Julia Adams & Hannah Brueckner, 2018. "The Ladies Vanish? American Sociology and the Genealogy of its Missing Women on Wikipedia," Working Papers 20180012, New York University Abu Dhabi, Department of Social Science, revised Jan 2018.
    6. repec:hrv:hksfac:5341873 is not listed on IDEAS
    7. Aaltonen, Aleksi Ville & Seiler, Stephan, 2014. "Quantifying spillovers in open source content production: evidence from Wikipedia," LSE Research Online Documents on Economics 60284, London School of Economics and Political Science, LSE Library.
    8. Falk, Armin & Menrath, Ingo & Verde, Pablo Emilio & Siegrist, Johannes, 2011. "Cardiovascular Consequences of Unfair Pay," IZA Discussion Papers 5720, Institute of Labor Economics (IZA).
    9. John Gibson & Steven Stillman & David McKenzie & Halahingano Rohorua, 2013. "Natural Experiment Evidence On The Effect Of Migration On Blood Pressure And Hypertension," Health Economics, John Wiley & Sons, Ltd., vol. 22(6), pages 655-672, June.
    10. Eva Deuchert, 2011. "The Virgin HIV Puzzle: Can Misreporting Account for the High Proportion of HIV Cases in Self-reported Virgins?," Journal of African Economies, Centre for the Study of African Economies, vol. 20(1), pages 60-89, January.
    11. Charles Ayoubi & Boris Thurm, 2023. "Knowledge diffusion and morality: Why do we freely share valuable information with Strangers?," Journal of Economics & Management Strategy, Wiley Blackwell, vol. 32(1), pages 75-99, January.
    12. Peter J. Rothe & Linda J. Carroll, 2009. "Hazards Faced by Young Designated Drivers: In-Car Risks of Driving Drunken Passengers," IJERPH, MDPI, vol. 6(6), pages 1-18, June.
    13. Fernando Abad-Franch & Gonçalo Ferraz & Ciro Campos & Francisco S Palomeque & Mario J Grijalva & H Marcelo Aguilar & Michael A Miles, 2010. "Modeling Disease Vector Occurrence when Detection Is Imperfect: Infestation of Amazonian Palm Trees by Triatomine Bugs at Three Spatial Scales," PLOS Neglected Tropical Diseases, Public Library of Science, vol. 4(3), pages 1-11, March.
    14. Demidov, Denis & Frahm, Klaus M. & Shepelyansky, Dima L., 2020. "What is the central bank of Wikipedia?," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 542(C).
    15. Elizabeth Kristjansson & Damian K Francis & Selma Liberato & Marik Benkhalti Jandu & Vivian Welch & Malek Batal & Trish Greenhalgh & Tamara Rader & Eamonn Noonan & Beverley Shea & Laura Janzen & Georg, 2013. "PROTOCOL: Feeding Interventions for Improving the Physical and Psychosocial Health of Disadvantaged Children Aged Three Months to Five Years: Protocol for a Systematic Review," Campbell Systematic Reviews, John Wiley & Sons, vol. 9(1), pages 1-41.
    16. Suddaby, Roy & Ganzin, Max & Minkus, Alison, 2017. "Craft, magic and the re-enchantment of the world," European Management Journal, Elsevier, vol. 35(3), pages 285-296.
    17. Stadnik SM & Saiko OV, 2020. "Neuron-Specific Enolaza as a Marker of Lesion Cerebral Tissue in Patients with Ischemic Stroke," Biomedical Journal of Scientific & Technical Research, Biomedical Research Network+, LLC, vol. 31(1), pages 23816-23820, October.
    18. Hervé, Fabrice & Zouaoui, Mohamed & Belvaux, Bertrand, 2019. "Noise traders and smart money: Evidence from online searches," Economic Modelling, Elsevier, vol. 83(C), pages 141-149.
    19. Pinna Pintor, Matteo & Fumagalli, Elena & Suhrcke, Marc, 2024. "The impact of health on labour market outcomes: A rapid systematic review," Health Policy, Elsevier, vol. 143(C).
    20. Feyza G. Sahinyazan & Marie‐Ève Rancourt & Vedat Verter, 2021. "Food Aid Modality Selection Problem," Production and Operations Management, Production and Operations Management Society, vol. 30(4), pages 965-983, April.
    21. La Torre, Davide & Liuzzi, Danilo & Marsiglio, Simone, 2021. "Epidemics and macroeconomic outcomes: Social distancing intensity and duration," Journal of Mathematical Economics, Elsevier, vol. 93(C).
