Printed from https://ideas.repec.org/a/plo/pone00/0007431.html

Can Survival Prediction Be Improved By Merging Gene Expression Data Sets?

Author

Listed:
  • Haleh Yasrebi
  • Peter Sperisen
  • Viviane Praz
  • Philipp Bucher

Abstract

Background: High-throughput gene expression profiling technologies, which generate a wealth of data, are increasingly used to characterize tumor biopsies in clinical trials. By applying machine learning algorithms to such clinically documented data sets, one hopes to improve tumor diagnosis, prognosis, and prediction of treatment response. However, the limited number of patients enrolled in a single trial limits the power of machine learning approaches because of over-fitting. One could partially overcome this limitation by merging data from different studies. Nevertheless, such data sets differ from each other with regard to technical biases, patient selection criteria and follow-up treatment. It is therefore not at all clear whether the advantage of increased sample size outweighs the disadvantage of higher heterogeneity in merged data sets. Here, we present a systematic study to answer this question specifically for breast cancer data sets. We use survival prediction based on Cox regression as an assay to measure the added value of merged data sets.

Results: Using time-dependent Receiver Operating Characteristic-Area Under the Curve (ROC-AUC) and hazard ratio as performance measures, we see overall no significant improvement or deterioration of survival prediction with merged data sets as compared to individual data sets. This apparently was due to the fact that a few genes with strong prognostic power were not available on all microarray platforms and thus were not retained in the merged data sets. Surprisingly, we found that the overall best performance was achieved with a single-gene predictor consisting of CYB5D1.

Conclusions: Merging did not deteriorate performance on average despite (a) the diversity of microarray platforms used, (b) the heterogeneity of patient cohorts, (c) the heterogeneity of breast cancer disease, (d) substantial variation of time to death or relapse, and (e) the reduced number of genes in the merged data sets. Predictors derived from the merged data sets were more robust, consistent and reproducible across microarray platforms. Moreover, merging data sets from different studies helps to better understand the biases of individual studies and can lead to the identification of strong survival factors like CYB5D1 expression.
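
The abstract notes that genes missing from some microarray platforms were not retained in the merged data sets. The following is a minimal sketch of why that happens, assuming a merge that keeps only the genes shared by all data sets and standardizes each gene within its own data set before pooling samples. The per-data-set z-scoring, the function names and the toy gene names are illustrative assumptions, not the authors' published procedure.

```python
from statistics import mean, pstdev


def zscore_per_gene(dataset):
    """Standardize each gene within one data set (gene -> list of sample values)."""
    scaled = {}
    for gene, values in dataset.items():
        mu, sd = mean(values), pstdev(values)
        # A constant gene carries no information; map it to zeros instead of dividing by 0.
        scaled[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return scaled


def merge_on_common_genes(datasets):
    """Pool samples across data sets, keeping only genes measured in every data set."""
    common = set.intersection(*(set(d) for d in datasets))
    scaled = [zscore_per_gene(d) for d in datasets]
    return {g: [v for d in scaled for v in d[g]] for g in sorted(common)}


# Hypothetical toy data: two platforms measuring partly different (placeholder) genes.
platform_1 = {"GENE_A": [1.0, 2.0, 3.0], "GENE_B": [5.0, 5.5, 6.0]}
platform_2 = {"GENE_B": [10.0, 12.0], "GENE_C": [0.1, 0.4]}

merged = merge_on_common_genes([platform_1, platform_2])
dropped = (set(platform_1) | set(platform_2)) - set(merged)

print("retained genes:", sorted(merged))
print("dropped genes:", sorted(dropped))
print("pooled samples for GENE_B:", len(merged["GENE_B"]))
```

In this toy case only the shared gene survives the merge, so a strong predictor measured on a single platform would be lost even though the pooled sample size grows.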

Suggested Citation

  • Haleh Yasrebi & Peter Sperisen & Viviane Praz & Philipp Bucher, 2009. "Can Survival Prediction Be Improved By Merging Gene Expression Data Sets?," PLOS ONE, Public Library of Science, vol. 4(10), pages 1-14, October.
  • Handle: RePEc:plo:pone00:0007431
    DOI: 10.1371/journal.pone.0007431

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0007431
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0007431&type=printable
    Download Restriction: no


    References listed on IDEAS

    1. Patrick J. Heagerty & Thomas Lumley & Margaret S. Pepe, 2000. "Time-Dependent ROC Curves for Censored Survival Data and a Diagnostic Marker," Biometrics, The International Biometric Society, vol. 56(2), pages 337-344, June.
    2. Davendra Sohal & Andrew Yeatts & Kenny Ye & Andrea Pellagatti & Li Zhou & Perry Pahanish & Yongkai Mo & Tushar Bhagat & John Mariadason & Jacqueline Boultwood & Ari Melnick & John Greally & Amit Verma, 2008. "Meta-Analysis of Microarray Studies Reveals a Novel Hematopoietic Progenitor Cell Signature and Demonstrates Feasibility of Inter-Platform Data Integration," PLOS ONE, Public Library of Science, vol. 3(8), pages 1-10, August.
    3. Margaret Pepe & Holly Janes & Gary Longton & Wendy Leisenring & Polly Newcomb, 2004. "Limitations of the Odds Ratio in Gauging the Performance of a Diagnostic or Prognostic Marker," UW Biostatistics Working Paper Series 1035, Berkeley Electronic Press.
    4. Andrea H. Bild & Guang Yao & Jeffrey T. Chang & Quanli Wang & Anil Potti & Dawn Chasse & Mary-Beth Joshi & David Harpole & Johnathan M. Lancaster & Andrew Berchuck & John A. Olson & Jeffrey R. Marks &, 2006. "Oncogenic pathway signatures in human cancers as a guide to targeted therapies," Nature, Nature, vol. 439(7074), pages 353-357, January.

    Citations

    Cited by:

    1. Herman M J Sontrop & Wim F J Verhaegh & Marcel J T Reinders & Perry D Moerland, 2011. "An Evaluation Protocol for Subtype-Specific Breast Cancer Event Prediction," PLOS ONE, Public Library of Science, vol. 6(7), pages 1-12, July.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Debashis Ghosh & Michael S. Sabel, 2022. "A Weighted Sample Framework to Incorporate External Calculators for Risk Modeling," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 14(3), pages 363-379, December.
    2. Junjie Su & Byung-Jun Yoon & Edward R Dougherty, 2009. "Accurate and Reliable Cancer Classification Based on Probabilistic Inference of Pathway Activity," PLOS ONE, Public Library of Science, vol. 4(12), pages 1-10, December.
    3. Carey K Anders & Chaitanya R Acharya & David S Hsu & Gloria Broadwater & Katherine Garman & John A Foekens & Yi Zhang & Yixin Wang & Kelly Marcom & Jeffrey R Marks & Sayan Mukherjee & Joseph R Nevins, 2008. "Age-Specific Differences in Oncogenic Pathway Deregulation Seen in Human Breast Tumors," PLOS ONE, Public Library of Science, vol. 3(1), pages 1-8, January.
    4. Chin-Tsang Chiang & Shr-Yan Huang, 2009. "Estimation for the Optimal Combination of Markers without Modeling the Censoring Distribution," Biometrics, The International Biometric Society, vol. 65(1), pages 152-158, March.
    5. Anna-Karin Ivert & Marie Torstensson Levander & Juan Merlo, 2013. "Adolescents' Utilisation of Psychiatric Care, Neighbourhoods and Neighbourhood Socioeconomic Deprivation: A Multilevel Analysis," PLOS ONE, Public Library of Science, vol. 8(11), pages 1-1, November.
    6. Margaret Sullivan Pepe & Tianxi Cai & Gary Longton, 2006. "Combining Predictors for Classification Using the Area under the Receiver Operating Characteristic Curve," Biometrics, The International Biometric Society, vol. 62(1), pages 221-229, March.
    7. Te-Ling Ma & Tsung-Hui Hu & Chao-Hung Hung & Jing-Houng Wang & Sheng-Nan Lu & Chien-Hung Chen, 2019. "Incidence and predictors of retreatment in chronic hepatitis B patients after discontinuation of entecavir or tenofovir treatment," PLOS ONE, Public Library of Science, vol. 14(10), pages 1-16, October.
    8. Yingye Zheng & Patrick Heagerty, 2004. "Semiparametric Estimation of Time-Dependent ROC Curves for Longitudinal Marker Data," UW Biostatistics Working Paper Series 1052, Berkeley Electronic Press.
    9. Holly Janes & Margaret S. Pepe, 2008. "Matching in Studies of Classification Accuracy: Implications for Analysis, Efficiency, and Assessment of Incremental Value," Biometrics, The International Biometric Society, vol. 64(1), pages 1-9, March.
    10. Carlos A Labarrere & John R Woods & James W Hardin & Beate R Jaeger & Marian Zembala & Mario C Deng & Ghassan S Kassab, 2014. "Early Inflammatory Markers Are Independent Predictors of Cardiac Allograft Vasculopathy in Heart-Transplant Recipients," PLOS ONE, Public Library of Science, vol. 9(12), pages 1-18, December.
    11. Diego Tomassi & Liliana Forzani & Efstathia Bura & Ruth Pfeiffer, 2017. "Sufficient dimension reduction for censored predictors," Biometrics, The International Biometric Society, vol. 73(1), pages 220-231, March.
    12. Shannon M Lynch & Elizabeth Handorf & Kristen A Sorice & Elizabeth Blackman & Lisa Bealin & Veda N Giri & Elias Obeid & Camille Ragin & Mary Daly, 2020. "The effect of neighborhood social environment on prostate cancer development in black and white men at high risk for prostate cancer," PLOS ONE, Public Library of Science, vol. 15(8), pages 1-18, August.
    13. Osamu Komori, 2011. "A boosting method for maximization of the area under the ROC curve," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 63(5), pages 961-979, October.
    14. Weining Shen & Jing Ning & Ying Yuan, 2015. "A direct method to evaluate the time-dependent predictive accuracy for biomarkers," Biometrics, The International Biometric Society, vol. 71(2), pages 439-449, June.
    15. Si Cheng & Kathleen F Kerr & Heather Thiessen-Philbrook & Steven G Coca & Chirag R Parikh, 2020. "BioPETsurv: Methodology and open source software to evaluate biomarkers for prognostic enrichment of time-to-event clinical trials," PLOS ONE, Public Library of Science, vol. 15(9), pages 1-11, September.
    16. Lori E. Dodd, 2010. "ROC Curves for Continuous Data by KRZANOWSKI, W. J. and HAND, D. J," Biometrics, The International Biometric Society, vol. 66(2), pages 657-658, June.
    17. David Lindgren & Gottfrid Sjödahl & Martin Lauss & Johan Staaf & Gunilla Chebil & Kristina Lövgren & Sigurdur Gudjonsson & Fredrik Liedberg & Oliver Patschan & Wiking Månsson & Mårten Fernö & Mattias , 2012. "Integrated Genomic and Gene Expression Profiling Identifies Two Major Genomic Circuits in Urothelial Carcinoma," PLOS ONE, Public Library of Science, vol. 7(6), pages 1-11, June.
    18. Kenichi Hayashi & Shinto Eguchi, 2024. "A new integrated discrimination improvement index via odds," Statistical Papers, Springer, vol. 65(8), pages 4971-4990, October.
    19. Yingye Zheng & Tianxi Cai & Ziding Feng, 2006. "Application of the Time-Dependent ROC Curves for Prognostic Accuracy with Multiple Biomarkers," Biometrics, The International Biometric Society, vol. 62(1), pages 279-287, March.
    20. C. Jason Liang & Patrick J. Heagerty, 2017. "Rejoinder to discussions on: A risk-based measure of time-varying prognostic discrimination for survival models," Biometrics, The International Biometric Society, vol. 73(3), pages 745-748, September.
