Printed from https://ideas.repec.org/a/nat/nature/v636y2024i8043d10.1038_s41586-024-08167-5.html

Automated real-world data integration improves cancer outcome prediction

Author

Listed:
  • Justin Jee

    (Memorial Sloan Kettering Cancer Center)

  • Christopher Fong

    (Memorial Sloan Kettering Cancer Center)

  • Karl Pichotta

    (Memorial Sloan Kettering Cancer Center)

  • Thinh Ngoc Tran

    (Memorial Sloan Kettering Cancer Center)

  • Anisha Luthra

    (Memorial Sloan Kettering Cancer Center)

  • Michele Waters

    (Memorial Sloan Kettering Cancer Center)

  • Chenlian Fu

    (Memorial Sloan Kettering Cancer Center)

  • Mirella Altoe

    (Memorial Sloan Kettering Cancer Center)

  • Si-Yang Liu

    (Memorial Sloan Kettering Cancer Center)

  • Steven B. Maron

    (Memorial Sloan Kettering Cancer Center
    Dana Farber Cancer Institute)

  • Mehnaj Ahmed

    (Memorial Sloan Kettering Cancer Center)

  • Susie Kim

    (Memorial Sloan Kettering Cancer Center)

  • Mono Pirun

    (Memorial Sloan Kettering Cancer Center)

  • Walid K. Chatila

    (Memorial Sloan Kettering Cancer Center)

  • Ino Bruijn

    (Memorial Sloan Kettering Cancer Center)

  • Arfath Pasha

    (Memorial Sloan Kettering Cancer Center)

  • Ritika Kundra

    (Memorial Sloan Kettering Cancer Center)

  • Benjamin Gross

    (Memorial Sloan Kettering Cancer Center)

  • Brooke Mastrogiacomo

    (Memorial Sloan Kettering Cancer Center)

  • Tyler J. Aprati

    (Dana Farber Cancer Institute)

  • David Liu

    (Dana Farber Cancer Institute)

  • JianJiong Gao

    (Caris Life Sciences)

  • Marzia Capelletti

    (Caris Life Sciences)

  • Kelly Pekala

    (Memorial Sloan Kettering Cancer Center)

  • Lisa Loudon

    (Memorial Sloan Kettering Cancer Center)

  • Maria Perry

    (Memorial Sloan Kettering Cancer Center)

  • Chaitanya Bandlamudi

    (Memorial Sloan Kettering Cancer Center)

  • Mark Donoghue

    (Memorial Sloan Kettering Cancer Center)

  • Baby Anusha Satravada

    (Memorial Sloan Kettering Cancer Center)

  • Axel Martin

    (Memorial Sloan Kettering Cancer Center)

  • Ronglai Shen

    (Memorial Sloan Kettering Cancer Center)

  • Yuan Chen

    (Memorial Sloan Kettering Cancer Center)

  • A. Rose Brannon

    (Memorial Sloan Kettering Cancer Center)

  • Jason Chang

    (Memorial Sloan Kettering Cancer Center)

  • Lior Braunstein

    (Memorial Sloan Kettering Cancer Center
    Dana Farber Cancer Institute)

  • Anyi Li

    (Memorial Sloan Kettering Cancer Center)

  • Anton Safonov

    (Memorial Sloan Kettering Cancer Center)

  • Aaron Stonestrom

    (Memorial Sloan Kettering Cancer Center)

  • Pablo Sanchez-Vela

    (Memorial Sloan Kettering Cancer Center)

  • Clare Wilhelm

    (Memorial Sloan Kettering Cancer Center)

  • Mark Robson

    (Memorial Sloan Kettering Cancer Center
    Dana Farber Cancer Institute)

  • Howard Scher

    (Memorial Sloan Kettering Cancer Center
    Dana Farber Cancer Institute)

  • Marc Ladanyi

    (Memorial Sloan Kettering Cancer Center)

  • Jorge S. Reis-Filho

    (Memorial Sloan Kettering Cancer Center)

  • David B. Solit

    (Memorial Sloan Kettering Cancer Center)

  • David R. Jones

    (Memorial Sloan Kettering Cancer Center)

  • Daniel Gomez

    (Memorial Sloan Kettering Cancer Center)

  • Helena Yu

    (Memorial Sloan Kettering Cancer Center)

  • Debyani Chakravarty

    (Memorial Sloan Kettering Cancer Center)

  • Rona Yaeger

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Wassim Abida

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Wungki Park

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Eileen M. O’Reilly

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Julio Garcia-Aguilar

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Nicholas Socci

    (Memorial Sloan Kettering Cancer Center)

  • Francisco Sanchez-Vega

    (Memorial Sloan Kettering Cancer Center)

  • Jian Carrot-Zhang

    (Memorial Sloan Kettering Cancer Center)

  • Peter D. Stetson

    (Memorial Sloan Kettering Cancer Center)

  • Ross Levine

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Charles M. Rudin

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Michael F. Berger

    (Memorial Sloan Kettering Cancer Center)

  • Sohrab P. Shah

    (Memorial Sloan Kettering Cancer Center)

  • Deborah Schrag

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Pedram Razavi

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Kenneth L. Kehl

    (Dana Farber Cancer Institute)

  • Bob T. Li

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Gregory J. Riely

    (Memorial Sloan Kettering Cancer Center
    Cornell University)

  • Nikolaus Schultz

    (Memorial Sloan Kettering Cancer Center)

Abstract

The digitization of health records and growing availability of tumour DNA sequencing provide an opportunity to study the determinants of cancer outcomes with unprecedented richness. Patient data are often stored in unstructured text and siloed datasets. Here we combine natural language processing annotations (refs. 1,2) with structured medication, patient-reported demographic, tumour registry and tumour genomic data from 24,950 patients at Memorial Sloan Kettering Cancer Center to generate a clinicogenomic, harmonized oncologic real-world dataset (MSK-CHORD). MSK-CHORD includes data for non-small-cell lung (n = 7,809), breast (n = 5,368), colorectal (n = 5,543), prostate (n = 3,211) and pancreatic (n = 3,109) cancers and enables discovery of clinicogenomic relationships not apparent in smaller datasets. Leveraging MSK-CHORD to train machine learning models to predict overall survival, we find that models including features derived from natural language processing, such as sites of disease, outperform those based on genomic data or stage alone as tested by cross-validation and an external, multi-institution dataset. By annotating 705,241 radiology reports, MSK-CHORD also uncovers predictors of metastasis to specific organ sites, including a relationship between SETD2 mutation and lower metastatic potential in immunotherapy-treated lung adenocarcinoma corroborated in independent datasets. We demonstrate the feasibility of automated annotation from unstructured notes and its utility in predicting patient outcomes. The resulting data are provided as a public resource for real-world oncologic research.

Suggested Citation

  • Justin Jee & Christopher Fong & Karl Pichotta & Thinh Ngoc Tran & Anisha Luthra & Michele Waters & Chenlian Fu & Mirella Altoe & Si-Yang Liu & Steven B. Maron & Mehnaj Ahmed & Susie Kim & Mono Pirun et al., 2024. "Automated real-world data integration improves cancer outcome prediction," Nature, Nature, vol. 636(8043), pages 728-736, December.
  • Handle: RePEc:nat:nature:v:636:y:2024:i:8043:d:10.1038_s41586-024-08167-5
    DOI: 10.1038/s41586-024-08167-5

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41586-024-08167-5
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

