IDEAS home Printed from https://ideas.repec.org/a/plo/pcbi00/1003892.html

Global Disease Monitoring and Forecasting with Wikipedia

Author

Listed:
  • Nicholas Generous
  • Geoffrey Fairchild
  • Alina Deshpande
  • Sara Y Del Valle
  • Reid Priedhorsky

Abstract

Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically-focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrate that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.

Author Summary: Even in developed countries, infectious disease has significant impact; for example, flu seasons in the United States take between 3,000 and 49,000 lives. Disease surveillance, traditionally based on patient visits to health providers and laboratory tests, can reduce these impacts. Motivated by cost and timeliness, surveillance methods based on internet data have recently emerged, but are not yet reliable for several reasons, including weak scientific peer review, limited breadth of diseases and countries covered, and underdeveloped forecasting capabilities. We argue that these challenges can be overcome by using a freely available data source: aggregated access logs from the online encyclopedia Wikipedia. Using simple statistical techniques, our proof-of-concept experiments suggest that these data are effective for predicting the present, as well as forecasting up to the 28-day limit of our tests. Our results also suggest that these models can be used even in places with no official data upon which to build models. In short, this paper establishes the utility of Wikipedia as a broadly effective data source for disease information, and we outline a path to a reliable, scientifically sound, operational, and global disease surveillance system that overcomes key gaps in existing traditional and internet-based techniques.
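
As a rough illustration of the kind of linear model the abstract describes, the sketch below regresses a disease incidence series on daily article access counts, optionally shifted by a forecast horizon. It is not the authors' implementation: the function name fit_linear_model, the two-article setup, and all of the data are hypothetical and synthetic, chosen only to show the mechanics of fitting and scoring such a model with r².

```python
import numpy as np

def fit_linear_model(article_views, incidence, horizon_days=0):
    """Least-squares fit of incidence[t + horizon_days] on article_views[t].

    article_views: array of shape (n_days, n_articles), daily access counts.
    incidence:     array of shape (n_days,), official case counts.
    Returns (coefficients with the intercept first, in-sample r squared).
    Hypothetical helper for illustration; not taken from the paper.
    """
    views = np.asarray(article_views, dtype=float)
    cases = np.asarray(incidence, dtype=float)
    if horizon_days > 0:
        # Pair today's views with incidence horizon_days later.
        views = views[:-horizon_days]
        cases = cases[horizon_days:]
    # Design matrix with a leading column of ones for the intercept.
    X = np.column_stack([np.ones(len(views)), views])
    coef, *_ = np.linalg.lstsq(X, cases, rcond=None)
    residuals = cases - X @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((cases - cases.mean()) ** 2).sum())
    return coef, 1.0 - ss_res / ss_tot

# Synthetic data (an assumption): two article view series that track a
# seasonal signal, plus a noisy "official" incidence series.
rng = np.random.default_rng(0)
days = np.arange(200)
signal = 100 + 50 * np.sin(2 * np.pi * days / 100)
views = np.column_stack([
    10 * signal + rng.normal(0, 20, size=days.size),
    3 * signal + rng.normal(0, 20, size=days.size),
])
incidence = signal + rng.normal(0, 2, size=days.size)

coef_now, r2_now = fit_linear_model(views, incidence)                  # nowcast
coef_28, r2_28 = fit_linear_model(views, incidence, horizon_days=28)   # 28-day forecast
print("nowcast r^2:", round(r2_now, 3))
print("28-day forecast r^2:", round(r2_28, 3))
```

In this toy setup the view counts move in step with incidence, so the nowcast fits well while the 28-day shift fits poorly; real forecasting value would depend on access counts actually leading the official data, which is what the paper's experiments test.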

Suggested Citation

  • Nicholas Generous & Geoffrey Fairchild & Alina Deshpande & Sara Y Del Valle & Reid Priedhorsky, 2014. "Global Disease Monitoring and Forecasting with Wikipedia," PLOS Computational Biology, Public Library of Science, vol. 10(11), pages 1-16, November.
  • Handle: RePEc:plo:pcbi00:1003892
    DOI: 10.1371/journal.pcbi.1003892

    Download full text from publisher

    File URL: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003892
    Download Restriction: no

    File URL: https://journals.plos.org/ploscompbiol/article/file?id=10.1371/journal.pcbi.1003892&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Jim Giles, 2005. "Internet encyclopaedias go head to head," Nature, Nature, vol. 438(7070), pages 900-901, December.
    2. Alan D. Lopez & Colin D. Mathers & Majid Ezzati & Dean T. Jamison & Christopher J. L. Murray, 2006. "Global Burden of Disease and Risk Factors," World Bank Publications - Books, The World Bank Group, number 7039.

    Citations


    Cited by:

    1. Samuel V Scarpino & James G Scott & Rosalind M Eggo & Bruce Clements & Nedialko B Dimitrov & Lauren Ancel Meyers, 2020. "Socioeconomic bias in influenza surveillance," PLOS Computational Biology, Public Library of Science, vol. 16(7), pages 1-19, July.
    2. Logan C Brooks & David C Farrow & Sangwon Hyun & Ryan J Tibshirani & Roni Rosenfeld, 2018. "Nonmechanistic forecasts of seasonal influenza with iterative one-week-ahead distributions," PLOS Computational Biology, Public Library of Science, vol. 14(6), pages 1-29, June.
    3. Kuchler, Theresa & Russel, Dominic & Stroebel, Johannes, 2022. "JUE Insight: The geographic spread of COVID-19 correlates with the structure of social networks as measured by Facebook," Journal of Urban Economics, Elsevier, vol. 127(C).
    4. Zeynep Ertem & Dorrie Raymond & Lauren Ancel Meyers, 2018. "Optimal multi-source forecasting of seasonal influenza," PLOS Computational Biology, Public Library of Science, vol. 14(9), pages 1-16, September.
    5. Ibrahim Musa & Hyun Woo Park & Lkhagvadorj Munkhdalai & Keun Ho Ryu, 2018. "Global Research on Syndromic Surveillance from 1993 to 2017: Bibliometric Analysis and Visualization," Sustainability, MDPI, vol. 10(10), pages 1-20, September.
    6. Dave Osthus & Ashlynn R Daughton & Reid Priedhorsky, 2019. "Even a good influenza forecasting model can benefit from internet-based nowcasts, but those benefits are limited," PLOS Computational Biology, Public Library of Science, vol. 15(2), pages 1-19, February.
    7. Svitlana Volkova & Ellyn Ayton & Katherine Porterfield & Courtney D Corley, 2017. "Forecasting influenza-like illness dynamics for military populations using neural networks and social media," PLOS ONE, Public Library of Science, vol. 12(12), pages 1-22, December.
    8. Kyle S Hickmann & Geoffrey Fairchild & Reid Priedhorsky & Nicholas Generous & James M Hyman & Alina Deshpande & Sara Y Del Valle, 2015. "Forecasting the 2013–2014 Influenza Season Using Wikipedia," PLOS Computational Biology, Public Library of Science, vol. 11(5), pages 1-29, May.
    9. Wenceslao Arroyo‐Machado & Adrián A. Díaz‐Faes & Enrique Herrera‐Viedma & Rodrigo Costas, 2024. "From academic to media capital: To what extent does the scientific reputation of universities translate into Wikipedia attention?," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 75(4), pages 423-437, April.
    10. Meyer, Christian & Hamer, Martin & Terlau, Wiltrud & Raithel, Johannes & Pongratz, Patrick, 2015. "Web Data Mining and Social Media Analysis for better Communication in Food Safety Crises," International Journal on Food System Dynamics, International Center for Management, Communication, and Research, vol. 6(3), pages 1-10, July.
    11. Julissa Alexandra Galarza-Villamar & Mariette McCampbell & Andres Galarza-Villamar & Cees Leeuwis & Francesco Cecchi & John Galarza-Rodrigo, 2021. "A Public Bad Game Method to Study Dynamics in Socio-Ecological Systems (Part II): Results of Testing Musa-Game in Rwanda and Adding Emergence and Spatiality to the Analysis," Sustainability, MDPI, vol. 13(16), pages 1-27, August.
    12. Meyer, Christian & Hamer, Martin & Terlau, Wiltrud & Raithel, Johannes & Pongratz, Patrick, 2015. "Web Data Mining and Social Media Analysis for better Communication in Food Safety Crises," 2015 International European Forum (144th EAAE Seminar), February 9-13, 2015, Innsbruck-Igls, Austria 206212, International European Forum on System Dynamics and Innovation in Food Networks.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Anderson, Soren T. & Laxminarayan, Ramanan & Salant, Stephen W., 2012. "Diversify or focus? Spending to combat infectious diseases when budgets are tight," Journal of Health Economics, Elsevier, vol. 31(4), pages 658-675.
    2. Wei Luo & Julia Adams & Hannah Brueckner, 2018. "The Ladies Vanish? American Sociology and the Genealogy of its Missing Women on Wikipedia," Working Papers 20180012, New York University Abu Dhabi, Department of Social Science, revised Jan 2018.
    3. Aaltonen, Aleksi Ville & Seiler, Stephan, 2014. "Quantifying spillovers in open source content production: evidence from Wikipedia," LSE Research Online Documents on Economics 60284, London School of Economics and Political Science, LSE Library.
    4. John Gibson & Steven Stillman & David McKenzie & Halahingano Rohorua, 2013. "Natural Experiment Evidence On The Effect Of Migration On Blood Pressure And Hypertension," Health Economics, John Wiley & Sons, Ltd., vol. 22(6), pages 655-672, June.
    5. Eva Deuchert, 2011. "The Virgin HIV Puzzle: Can Misreporting Account for the High Proportion of HIV Cases in Self-reported Virgins?," Journal of African Economies, Centre for the Study of African Economies, vol. 20(1), pages 60-89, January.
    6. Charles Ayoubi & Boris Thurm, 2023. "Knowledge diffusion and morality: Why do we freely share valuable information with Strangers?," Journal of Economics & Management Strategy, Wiley Blackwell, vol. 32(1), pages 75-99, January.
    7. Peter J. Rothe & Linda J. Carroll, 2009. "Hazards Faced by Young Designated Drivers: In-Car Risks of Driving Drunken Passengers," IJERPH, MDPI, vol. 6(6), pages 1-18, June.
    8. Fernando Abad-Franch & Gonçalo Ferraz & Ciro Campos & Francisco S Palomeque & Mario J Grijalva & H Marcelo Aguilar & Michael A Miles, 2010. "Modeling Disease Vector Occurrence when Detection Is Imperfect: Infestation of Amazonian Palm Trees by Triatomine Bugs at Three Spatial Scales," PLOS Neglected Tropical Diseases, Public Library of Science, vol. 4(3), pages 1-11, March.
    9. Demidov, Denis & Frahm, Klaus M. & Shepelyansky, Dima L., 2020. "What is the central bank of Wikipedia?," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 542(C).
    10. Elizabeth Kristjansson & Damian K Francis & Selma Liberato & Marik Benkhalti Jandu & Vivian Welch & Malek Batal & Trish Greenhalgh & Tamara Rader & Eamonn Noonan & Beverley Shea & Laura Janzen & Georg, 2013. "PROTOCOL: Feeding Interventions for Improving the Physical and Psychosocial Health of Disadvantaged Children Aged Three Months to Five Years: Protocol for a Systematic Review," Campbell Systematic Reviews, John Wiley & Sons, vol. 9(1), pages 1-41.
    11. Hervé, Fabrice & Zouaoui, Mohamed & Belvaux, Bertrand, 2019. "Noise traders and smart money: Evidence from online searches," Economic Modelling, Elsevier, vol. 83(C), pages 141-149.
    12. Pinna Pintor, Matteo & Fumagalli, Elena & Suhrcke, Marc, 2024. "The impact of health on labour market outcomes: A rapid systematic review," Health Policy, Elsevier, vol. 143(C).
    13. La Torre, Davide & Liuzzi, Danilo & Marsiglio, Simone, 2021. "Epidemics and macroeconomic outcomes: Social distancing intensity and duration," Journal of Mathematical Economics, Elsevier, vol. 93(C).
    14. Nicolas Jullien, 2012. "What We Know About Wikipedia: A Review of the Literature Analyzing the Project(s)," Post-Print hal-00857208, HAL.
    15. George Ploubidis & Wanjiku Mathenge & Bianca Stavola & Emily Grundy & Allen Foster & Hannah Kuper, 2013. "Socioeconomic position and later life prevalence of hypertension, diabetes and visual impairment in Nakuru, Kenya," International Journal of Public Health, Springer;Swiss School of Public Health (SSPH+), vol. 58(1), pages 133-141, February.
    16. Moana S. Simas & Laura Golsteijn & Mark A. J. Huijbregts & Richard Wood & Edgar G. Hertwich, 2014. "The “Bad Labor” Footprint: Quantifying the Social Impacts of Globalization," Sustainability, MDPI, vol. 6(11), pages 1-27, October.
    17. Brajer, Victor & Mead, Robert W. & Xiao, Feng, 2008. "Health benefits of tunneling through the Chinese environmental Kuznets curve (EKC)," Ecological Economics, Elsevier, vol. 66(4), pages 674-686, July.
    18. Burton, Suzan & Clark, Lindie & Heuler, Stefanie & Bollerup, Jette & Jackson, Kristina, 2011. "Retail tobacco distribution in Australia: Evidence for policy development," Australasian marketing journal, Elsevier, vol. 19(3), pages 168-173.
    19. Shuxia Guo & Hongrui Pang & Heng Guo & Mei Zhang & Jia He & Yizhong Yan & Qiang Niu & Muratbek & Dongsheng Rui & Shugang Li & Rulin Ma & Jingyu Zhang & Jiaming Liu & Yusong Ding, 2015. "Ethnic Differences in the Prevalence of High Homocysteine Levels Among Low-Income Rural Kazakh and Uyghur Adults in Far Western China and Its Implications for Preventive Public Health," IJERPH, MDPI, vol. 12(5), pages 1-13, May.
    20. Céline Azémar & Rodolphe Desbordes, 2009. "Public Governance, Health and Foreign Direct Investment in Sub-Saharan Africa," Journal of African Economies, Centre for the Study of African Economies, vol. 18(4), pages 667-709, August.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.