
Statistical inference for natural language processing algorithms with a demonstration using type 2 diabetes prediction from electronic health record notes

Author

Listed:
  • Brian L. Egleston
  • Tian Bai
  • Richard J. Bleicher
  • Stanford J. Taylor
  • Michael H. Lutz
  • Slobodan Vucetic

Abstract

The pointwise mutual information statistic (PMI), which measures how often two words occur together in a document corpus, is a cornerstone of recently proposed popular natural language processing algorithms such as word2vec. PMI and word2vec reveal semantic relationships between words and can be helpful in a range of applications such as document indexing, topic analysis, or document categorization. We use probability theory to demonstrate the relationship between PMI and word2vec. We use the theoretical results to demonstrate how the PMI can be modeled and estimated in a simple and straightforward manner. We further describe how one can obtain standard error estimates that account for within‐patient clustering that arises from patterns of repeated words within a patient's health record due to a unique health history. We then demonstrate the usefulness of PMI on the problem of predictive identification of disease from free text notes of electronic health records. Specifically, we use our methods to distinguish those with and without type 2 diabetes mellitus in electronic health record free text data using over 400 000 clinical notes from an academic medical center.
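
As a rough illustration of the statistic the abstract refers to, the sketch below computes a plug-in PMI estimate, log( p(a, b) / (p(a) p(b)) ), using the common textbook definition. It is not the authors' estimator: treating each note as one co-occurrence window, splitting on whitespace, and the function name `pmi` and the toy `notes` list are all assumptions made here for illustration. It also ignores the within-patient clustering that the paper's standard errors are designed to handle.

```python
import math


def pmi(documents, word_a, word_b):
    """Plug-in PMI estimate: log( p(a, b) / (p(a) * p(b)) ).

    Assumption: probabilities are document-level frequencies, i.e. the
    share of documents that contain a word (or both words) at least once.
    """
    n = len(documents)
    # Naive tokenization (lowercase, whitespace split); illustrative only.
    token_sets = [set(doc.lower().split()) for doc in documents]
    n_a = sum(word_a in tokens for tokens in token_sets)
    n_b = sum(word_b in tokens for tokens in token_sets)
    n_ab = sum(word_a in tokens and word_b in tokens for tokens in token_sets)
    if n_ab == 0:
        # The words never co-occur, so the log of the ratio is undefined;
        # return -inf by convention.
        return float("-inf")
    return math.log((n_ab / n) / ((n_a / n) * (n_b / n)))


# Hypothetical toy notes, not data from the paper.
notes = [
    "patient reports elevated glucose and started insulin",
    "insulin dose adjusted glucose stable",
    "follow up for knee pain",
    "knee pain improved after therapy",
]

print(pmi(notes, "glucose", "insulin"))  # positive: the words co-occur more often than independence predicts
print(pmi(notes, "glucose", "knee"))     # -inf: the words never share a note
```

A positive value indicates that two words appear together more often than they would if they occurred independently; a value near zero indicates approximate independence.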

Suggested Citation

  • Brian L. Egleston & Tian Bai & Richard J. Bleicher & Stanford J. Taylor & Michael H. Lutz & Slobodan Vucetic, 2021. "Statistical inference for natural language processing algorithms with a demonstration using type 2 diabetes prediction from electronic health record notes," Biometrics, The International Biometric Society, vol. 77(3), pages 1089-1100, September.
  • Handle: RePEc:bla:biomet:v:77:y:2021:i:3:p:1089-1100
    DOI: 10.1111/biom.13338

    Download full text from publisher

    File URL: https://doi.org/10.1111/biom.13338
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. David H Chae & Sean Clouston & Mark L Hatzenbuehler & Michael R Kramer & Hannah L F Cooper & Sacoby M Wilson & Seth I Stephens-Davidowitz & Robert S Gold & Bruce G Link, 2015. "Association between an Internet-Based Measure of Area Racism and Black Mortality," PLOS ONE, Public Library of Science, vol. 10(4), pages 1-12, April.
    2. Krause, Werner & Giebler, Heiko, 2020. "Shifting Welfare Policy Positions: The Impact of Radical Right Populist Party Success Beyond Migration Politics," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 56(3), pages 331-348.
    3. Elijah Brewer & Julapa Jagtiani, 2013. "How Much Did Banks Pay to Become Too-Big-To-Fail and to Become Systemically Important?," Journal of Financial Services Research, Springer;Western Finance Association, vol. 43(1), pages 1-35, February.
    4. Dixon, Jenna & Luginaah, Isaac & Mkandawire, Paul, 2014. "The National Health Insurance Scheme in Ghana's Upper West Region: A gendered perspective of insurance acquisition in a resource-poor setting," Social Science & Medicine, Elsevier, vol. 122(C), pages 103-112.
    5. Gerben ter Riet & Paula Chesley & Alan G Gross & Lara Siebeling & Patrick Muggensturm & Nadine Heller & Martin Umbehr & Daniela Vollenweider & Tsung Yu & Elie A Akl & Lizzy Brewster & Olaf M Dekkers &, 2013. "All That Glitters Isn't Gold: A Survey on Acknowledgment of Limitations in Biomedical Studies," PLOS ONE, Public Library of Science, vol. 8(11), pages 1-6, November.
    6. Doidge, Craig & Andrew Karolyi, G. & Stulz, Rene M., 2007. "Why do countries matter so much for corporate governance?," Journal of Financial Economics, Elsevier, vol. 86(1), pages 1-39, October.
    7. Grinis, Inna, 2017. "The STEM requirements of "non-STEM" jobs: evidence from UK online vacancy postings and implications for skills & knowledge shortages," LSE Research Online Documents on Economics 85123, London School of Economics and Political Science, LSE Library.
    8. Sjoerd Halem & Eeske Roekel & Jaap Denissen, 2024. "Understanding the Dynamics of Hedonic and Eudaimonic Motives on Daily Well-Being: Insights from Experience Sampling Data," Journal of Happiness Studies, Springer, vol. 25(7), pages 1-25, October.
    9. Dave, Dhaval, 2008. "Illicit drug use among arrestees, prices and policy," Journal of Urban Economics, Elsevier, vol. 63(2), pages 694-714, March.
    10. Andres, Maximilian & Bruttel, Lisa & Friedrichsen, Jana, 2021. "The leniency rule revisited: Experiments on cartel formation with open communication," International Journal of Industrial Organization, Elsevier, vol. 76(C).
    11. Miozzo, Marcela & Desyllas, Panos & Lee, Hsing-fen & Miles, Ian, 2016. "Innovation collaboration and appropriability by knowledge-intensive business services firms," Research Policy, Elsevier, vol. 45(7), pages 1337-1351.
    12. Gagnon, Louis & Karolyi, G. Andrew, 2009. "Information, Trading Volume, and International Stock Return Comovements: Evidence from Cross-Listed Stocks," Journal of Financial and Quantitative Analysis, Cambridge University Press, vol. 44(4), pages 953-986, August.
    13. Julia Bachtrögler & Christoph Hammer & Wolf Heinrich Reuter & Florian Schwendinger, 2019. "Guide to the galaxy of EU regional funds recipients: evidence from new data," Empirica, Springer;Austrian Institute for Economic Research;Austrian Economic Association, vol. 46(1), pages 103-150, February.
    14. Carracedo, Patricia & Puertas, Rosa & Marti, Luisa, 2021. "Research lines on the impact of the COVID-19 pandemic on business. A text mining analysis," Journal of Business Research, Elsevier, vol. 132(C), pages 586-593.
    15. Nafeesa N Dhalwani & Laila J Tata & Tim Coleman & Kate M Fleming & Lisa Szatkowski, 2013. "Completeness of Maternal Smoking Status Recording during Pregnancy in United Kingdom Primary Care Data," PLOS ONE, Public Library of Science, vol. 8(9), pages 1-7, September.
    16. Christian WEISMAYER, 2022. "Applied Research in Quality of Life: A Computational Literature Review," Applied Research in Quality of Life, Springer;International Society for Quality-of-Life Studies, vol. 17(3), pages 1433-1458, June.
    17. Nicolas Jacquemet & Adam Zylbersztejn, 2014. "What drives failure to maximize payoffs in the lab? A test of the inequality aversion hypothesis," Review of Economic Design, Springer;Society for Economic Design, vol. 18(4), pages 243-264, December.
    18. Debnath, R. & Bardhan, R. & Mohaddes, K. & Shah, D. U. & Ramage, M. H. & Alvarez, R. M., 2022. "People-centric Emission Reduction in Buildings: A Data-driven and Network Topology-based Investigation," Cambridge Working Papers in Economics 2202, Faculty of Economics, University of Cambridge.
    19. George Van Houtven & John Powers & Amber Jessup & Jui‐Chen Yang, 2006. "Valuing avoided morbidity using meta‐regression analysis: what can health status measures and QALYs tell us about WTP?," Health Economics, John Wiley & Sons, Ltd., vol. 15(8), pages 775-795, August.
    20. Jannik Gerwanski & Othar Kordsachia & Patrick Velte, 2019. "Determinants of materiality disclosure quality in integrated reporting: Empirical evidence from an international setting," Business Strategy and the Environment, Wiley Blackwell, vol. 28(5), pages 750-770, July.
