
Shareable artificial intelligence to extract cancer outcomes from electronic health records for precision oncology research

Authors
  • Kenneth L. Kehl

    (Dana-Farber Cancer Institute)

  • Justin Jee

    (Memorial Sloan Kettering Cancer Center)

  • Karl Pichotta

    (Memorial Sloan Kettering Cancer Center)

  • Morgan A. Paul

    (Dana-Farber Cancer Institute)

  • Pavel Trukhanov

    (Dana-Farber Cancer Institute)

  • Christopher Fong

    (Memorial Sloan Kettering Cancer Center)

  • Michele Waters

    (Memorial Sloan Kettering Cancer Center)

  • Ziad Bakouny

    (Memorial Sloan Kettering Cancer Center)

  • Wenxin Xu

    (Dana-Farber Cancer Institute)

  • Toni K. Choueiri

    (Dana-Farber Cancer Institute)

  • Chelsea Nichols

    (Memorial Sloan Kettering Cancer Center)

  • Deborah Schrag

    (Memorial Sloan Kettering Cancer Center)

  • Nikolaus Schultz

    (Memorial Sloan Kettering Cancer Center)

Abstract

Databases that link molecular data to clinical outcomes can inform precision cancer research into novel prognostic and predictive biomarkers. However, outside of clinical trials, cancer outcomes are typically recorded only in text form within electronic health records (EHRs). Artificial intelligence (AI) models have been trained to extract outcomes from individual EHRs. However, patient privacy restrictions have historically precluded dissemination of these models beyond the centers at which they were trained. In this study, the vulnerability of text classification models trained directly on protected health information to membership inference attacks is confirmed. A teacher-student distillation approach is applied to develop shareable models for annotating outcomes from imaging reports and medical oncologist notes. ‘Teacher’ models trained on EHR data from Dana-Farber Cancer Institute (DFCI) are used to label imaging reports and discharge summaries from the Medical Information Mart for Intensive Care (MIMIC)-IV dataset. ‘Student’ models are trained to use these MIMIC documents to predict the labels assigned by teacher models and sent to Memorial Sloan Kettering (MSK) for evaluation. The student models exhibit high discrimination across outcomes in both the DFCI and MSK test sets. Leveraging private labeling of public datasets to distill publishable clinical AI models from academic centers could facilitate deployment of machine learning to accelerate precision oncology research.
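The teacher-student idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the fixed word weights standing in for a teacher trained on private notes, the six-sentence public corpus, the bag-of-words logistic student, and every function name here are assumptions made purely for illustration. The sketch shows only the data flow, in which the teacher assigns soft labels to public text and the student learns from those public documents and labels alone.

```python
import math

# Hypothetical stand-in for a teacher trained on private notes: fixed word weights.
TEACHER_WEIGHTS = {"progression": 2.0, "enlarging": 1.5, "new": 1.0,
                   "stable": -2.0, "decreased": -1.5, "resolved": -1.0}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tokens(text):
    return text.lower().split()

def teacher_predict(text):
    """Soft label: probability that a report describes worsening disease."""
    return sigmoid(sum(TEACHER_WEIGHTS.get(t, 0.0) for t in tokens(text)))

def student_predict(model, text):
    weights, bias = model
    return sigmoid(bias + sum(weights.get(t, 0.0) for t in tokens(text)))

def train_student(public_docs, epochs=500, lr=0.1):
    """Fit a bag-of-words logistic model to the teacher's soft labels."""
    # The teacher labels public text; the student never sees private data.
    soft_labels = [teacher_predict(d) for d in public_docs]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for doc, target in zip(public_docs, soft_labels):
            pred = student_predict((weights, bias), doc)
            grad = pred - target  # gradient of cross-entropy w.r.t. the logit
            bias -= lr * grad
            for t in tokens(doc):
                weights[t] = weights.get(t, 0.0) - lr * grad
    return weights, bias

# Invented sentences standing in for a public corpus such as MIMIC-IV reports.
PUBLIC_DOCS = [
    "new enlarging nodule consistent with progression",
    "stable appearance of known lesion",
    "decreased size of mass",
    "progression of hepatic metastases",
    "effusion resolved lungs stable",
    "new lesion in liver",
]

student = train_student(PUBLIC_DOCS)
```

Calling `student_predict(student, "enlarging nodule progression")` then returns a probability above 0.5, while `student_predict(student, "stable decreased")` returns one below 0.5. Only the student's weights, learned from public documents, would need to leave the originating center under this arrangement.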

Suggested Citation

  • Kenneth L. Kehl & Justin Jee & Karl Pichotta & Morgan A. Paul & Pavel Trukhanov & Christopher Fong & Michele Waters & Ziad Bakouny & Wenxin Xu & Toni K. Choueiri & Chelsea Nichols & Deborah Schrag & N, 2024. "Shareable artificial intelligence to extract cancer outcomes from electronic health records for precision oncology research," Nature Communications, Nature, vol. 15(1), pages 1-11, December.
  • Handle: RePEc:nat:natcom:v:15:y:2024:i:1:d:10.1038_s41467-024-54071-x
    DOI: 10.1038/s41467-024-54071-x

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-024-54071-x
    File Function: Abstract
    Download Restriction: no


