
Tree-based multivariate regression and density estimation with right-censored data

Author

Listed:
  • Molinaro, Annette M.
  • Dudoit, Sandrine
  • van der Laan, Mark J.

Abstract

We propose a unified strategy for estimator construction, selection, and performance assessment in the presence of censoring. This approach is entirely driven by the choice of a loss function for the full (uncensored) data structure and can be stated in terms of the following three main steps. (1) First, define the parameter of interest as the minimizer of the expected loss, or risk, for a full data loss function chosen to represent the desired measure of performance. Map the full data loss function into an observed (censored) data loss function having the same expected value and leading to an efficient estimator of this risk. (2) Next, construct candidate estimators based on the loss function for the observed data. (3) Then, apply cross-validation to estimate risk based on the observed data loss function and to select an optimal estimator among the candidates. A number of common estimation procedures follow this approach in the full data situation, but depart from it when faced with the obstacle of evaluating the loss function for censored observations. Here, we argue that one can, and should, also adhere to this estimation road map in censored data situations. Tree-based methods, where the candidate estimators in Step 2 are generated by recursive binary partitioning of a suitably defined covariate space, provide a striking example of the chasm between estimation procedures for full data and censored data (e.g., regression trees as in CART for uncensored data and adaptations to censored data). Common approaches for regression trees bypass the risk estimation problem for censored outcomes by altering the node splitting and tree pruning criteria in manners that are specific to right-censored data. This article describes an application of our unified methodology to tree-based estimation with censored data. 
The approach encompasses univariate outcome prediction, multivariate outcome prediction, and density estimation, simply by defining a suitable loss function for each of these problems. The proposed method for tree-based estimation with censoring is evaluated using a simulation study and the analysis of CGH copy number and survival data from breast cancer patients.
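The three steps of the abstract can be illustrated with a minimal sketch. It assumes inverse probability of censoring weighting (IPCW) as the mapping from the full data loss to an observed data loss; the abstract does not name the mapping, so this is an illustrative choice rather than the authors' implementation. Each uncensored squared error is weighted by Delta / G(T-), where G is a Kaplan-Meier estimate of the censoring survival function, and censored observations get weight zero. Further assumptions: censoring is independent of the covariates, there are no tied times, the outcome is log survival time, and the candidate estimators are single-split stumps rather than full trees. All function names are hypothetical.

```python
import numpy as np


def censoring_survival_left_limit(time, delta):
    """Kaplan-Meier estimate of G(t-) = P(C >= t) at each observed time.

    Censoring (delta == 0) plays the role of the event. Assumes no tied
    times and censoring that does not depend on covariates.
    """
    time = np.asarray(time, dtype=float)
    delta = np.asarray(delta, dtype=int)
    cens_times = np.unique(time[delta == 0])
    at_risk = np.array([(time >= u).sum() for u in cens_times], dtype=float)
    n_cens = np.array([((time == u) & (delta == 0)).sum() for u in cens_times],
                      dtype=float)
    factors = 1.0 - n_cens / at_risk
    # G(t-) multiplies the factors of censoring times strictly before t.
    return np.array([factors[cens_times < t].prod() for t in time])


def ipcw_weights(time, delta):
    """Step 1: weight Delta / G(T-), zero for censored observations."""
    return np.asarray(delta, dtype=float) / censoring_survival_left_limit(time, delta)


def ipcw_risk(pred, outcome, weights):
    """Empirical mean of the observed data loss: weight * squared error."""
    return float(np.mean(weights * (outcome - pred) ** 2))


def fit_stump(x, outcome, weights, cut):
    """Step 2: candidate estimator, a weighted mean on each side of `cut`."""
    def weighted_mean(mask):
        w = weights[mask]
        # Fall back to 0.0 if a node holds no uncensored observation.
        return float(np.sum(w * outcome[mask]) / np.sum(w)) if np.sum(w) > 0 else 0.0

    left = x <= cut
    mu_left, mu_right = weighted_mean(left), weighted_mean(~left)
    return lambda x_new: np.where(x_new <= cut, mu_left, mu_right)


def select_cut_by_cv(x, time, delta, cuts, n_folds=5, seed=0):
    """Step 3: pick the cut with the smallest cross-validated IPCW risk."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % n_folds
    outcome = np.log(time)  # censored values are never used: their weight is 0
    cv_risk = np.zeros(len(cuts))
    for k in range(n_folds):
        train, valid = folds != k, folds == k
        w_train = ipcw_weights(time[train], delta[train])
        # Simplification: G is re-estimated within the validation fold.
        w_valid = ipcw_weights(time[valid], delta[valid])
        for j, cut in enumerate(cuts):
            predictor = fit_stump(x[train], outcome[train], w_train, cut)
            cv_risk[j] += ipcw_risk(predictor(x[valid]), outcome[valid], w_valid) / n_folds
    return cuts[int(np.argmin(cv_risk))], cv_risk


# Simulated example: survival depends on whether x is at most 0.5.
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(size=n)
true_time = rng.exponential(scale=np.where(x <= 0.5, 1.0, 5.0))
cens_time = rng.exponential(scale=6.0, size=n)
time = np.minimum(true_time, cens_time)
delta = (true_time <= cens_time).astype(int)
best_cut, cv_risk = select_cut_by_cv(x, time, delta, cuts=np.array([0.2, 0.5, 0.8]))
print(best_cut, cv_risk)
```

With no censoring every weight equals 1 and `ipcw_risk` reduces to the ordinary mean squared error, which is the sense in which the observed data loss keeps the same target as the full data loss.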

Suggested Citation

  • Molinaro, Annette M. & Dudoit, Sandrine & van der Laan, Mark J., 2004. "Tree-based multivariate regression and density estimation with right-censored data," Journal of Multivariate Analysis, Elsevier, vol. 90(1), pages 154-177, July.
  • Handle: RePEc:eee:jmvana:v:90:y:2004:i:1:p:154-177

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0047-259X(04)00029-6
    Download Restriction: Full text for ScienceDirect subscribers only


    Citations


    Cited by:

    1. Sinisi Sandra E. & Neugebauer Romain & van der Laan Mark J., 2006. "Cross-Validated Bagged Prediction of Survival," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 5(1), pages 1-26, May.
    2. Karen Lostritto & Robert L. Strawderman & Annette M. Molinaro, 2012. "A Partitioning Deletion/Substitution/Addition Algorithm for Creating Survival Risk Groups," Biometrics, The International Biometric Society, vol. 68(4), pages 1146-1156, December.
    3. Olivier Lopez & Xavier Milhaud & Pierre-Emmanuel Thérond, 2015. "Tree-based censored regression with applications to insurance," Working Papers hal-01141228, HAL.
    4. van der Laan Mark J. & Dudoit Sandrine & van der Vaart Aad W., 2006. "The cross-validated adaptive epsilon-net estimator," Statistics & Risk Modeling, De Gruyter, vol. 24(3), pages 1-23, December.
    5. Yan Zhou & John McArdle, 2015. "Rationale and Applications of Survival Tree and Survival Ensemble Methods," Psychometrika, Springer;The Psychometric Society, vol. 80(3), pages 811-833, September.
    6. Mark van der Laan & Sandrine Dudoit & Aad van der Vaart, 2004. "The Cross-Validated Adaptive Epsilon-Net Estimator," U.C. Berkeley Division of Biostatistics Working Paper Series 1141, Berkeley Electronic Press.
    7. Wei-Yin Loh, 2014. "Fifty Years of Classification and Regression Trees," International Statistical Review, International Statistical Institute, vol. 82(3), pages 329-348, December.
    8. Olivier Lopez & Xavier Milhaud & Pierre-Emmanuel Thérond, 2016. "Tree-based censored regression with applications in insurance," Post-Print hal-01141228, HAL.
    9. Susan Athey & Julie Tibshirani & Stefan Wager, 2016. "Generalized Random Forests," Papers 1610.01271, arXiv.org, revised Apr 2018.
    10. Yifei Sun & Sy Han Chiou & Mei‐Cheng Wang, 2020. "ROC‐guided survival trees and ensembles," Biometrics, The International Biometric Society, vol. 76(4), pages 1177-1189, December.
    11. Alexander Hanbo Li & Jelena Bradic, 2019. "Censored Quantile Regression Forests," Papers 1902.03327, arXiv.org.
    12. Pablo Gonzalez Ginestet & Ales Kotalik & David M. Vock & Julian Wolfson & Erin E. Gabriel, 2021. "Stacked inverse probability of censoring weighted bagging: A case study in the InfCareHIV Register," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 70(1), pages 51-65, January.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Paul Hewson & Keming Yu, 2008. "Quantile regression for binary performance indicators," Applied Stochastic Models in Business and Industry, John Wiley & Sons, vol. 24(5), pages 401-418, September.
    2. Roberto Rocci & Stefano Antonio Gattone & Roberto Di Mari, 2018. "A data driven equivariant approach to constrained Gaussian mixture modeling," Advances in Data Analysis and Classification, Springer;German Classification Society - Gesellschaft für Klassifikation (GfKl);Japanese Classification Society (JCS);Classification and Data Analysis Group of the Italian Statistical Society (CLADAG);International Federation of Classification Societies (IFCS), vol. 12(2), pages 235-260, June.
    3. Jewson Stephen & Penzer Jeremy, 2006. "Estimating Trends in Weather Series: Consequences for Pricing Derivatives," Studies in Nonlinear Dynamics & Econometrics, De Gruyter, vol. 10(3), pages 1-17, September.
    4. Luebke, Karsten & Czogiel, Irina & Weihs, Claus, 2004. "Latent Factor Prediction Pursuit for Rank Deficient Regressors," Technical Reports 2004,75, Technische Universität Dortmund, Sonderforschungsbereich 475: Komplexitätsreduktion in multivariaten Datenstrukturen.
    5. Bruce Desmarais, 2012. "Lessons in disguise: multivariate predictive mistakes in collective choice models," Public Choice, Springer, vol. 151(3), pages 719-737, June.
    6. Sandrine Dudoit & Mark van der Laan & Sunduz Keles & Annette Molinaro & Sandra Sinisi & Siew Leng Teng, 2004. "Loss-Based Estimation with Cross-Validation: Applications to Microarray Data Analysis and Motif Finding," U.C. Berkeley Division of Biostatistics Working Paper Series 1136, Berkeley Electronic Press.
    7. Stitelman Ori M & van der Laan Mark J., 2010. "Collaborative Targeted Maximum Likelihood for Time to Event Data," The International Journal of Biostatistics, De Gruyter, vol. 6(1), pages 1-46, June.
    8. Wang, Yihe & Zhao, Sihai Dave, 2021. "A nonparametric empirical Bayes approach to large-scale multivariate regression," Computational Statistics & Data Analysis, Elsevier, vol. 156(C).
    9. Seokhyun Chung & Raed Al Kontar & Zhenke Wu, 2022. "Weakly Supervised Multi-output Regression via Correlated Gaussian Processes," INFORMS Journal on Data Science, INFORMS, vol. 1(2), pages 115-137, October.
    10. Olivier Lopez & Xavier Milhaud & Pierre-Emmanuel Thérond, 2015. "Tree-based censored regression with applications to insurance," Working Papers hal-01141228, HAL.
    11. Arafat Tayeb & Aurélie Labbe & Alexandre Bureau & Chantal Mérette, 2011. "Solving genetic heterogeneity in extended families by identifying sub-types of complex diseases," Computational Statistics, Springer, vol. 26(3), pages 539-560, September.
    12. Qiang Sun & Hongtu Zhu & Yufeng Liu & Joseph G. Ibrahim, 2015. "SPReM: Sparse Projection Regression Model For High-Dimensional Linear Regression," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 110(509), pages 289-302, March.
    13. Joyce de Souza Zanirato Maia & Ana Paula Arantes Bueno & João Ricardo Sato, 2021. "Assessing the educational performance of different Brazilian school cycles using data science methods," PLOS ONE, Public Library of Science, vol. 16(3), pages 1-14, March.
    14. van der Laan Mark J. & Dudoit Sandrine & van der Vaart Aad W., 2006. "The cross-validated adaptive epsilon-net estimator," Statistics & Risk Modeling, De Gruyter, vol. 24(3), pages 1-23, December.
    15. Hu, Yingyao & Schennach, Susanne & Shiu, Ji-Liang, 2022. "Identification of nonparametric monotonic regression models with continuous nonclassical measurement errors," Journal of Econometrics, Elsevier, vol. 226(2), pages 269-294.
    16. Jhun, Myoungshic & Choi, Inkyung, 2009. "Bootstrapping least distance estimator in the multivariate regression model," Computational Statistics & Data Analysis, Elsevier, vol. 53(12), pages 4221-4227, October.
    17. Mahmood Zafar & Khan Salahuddin, 2009. "On the Use of K-Fold Cross-Validation to Choose Cutoff Values and Assess the Performance of Predictive Models in Stepwise Regression," The International Journal of Biostatistics, De Gruyter, vol. 5(1), pages 1-21, July.
    18. Haight, Thaddeus J. & Wang, Yue & van der Laan, Mark J. & Tager, Ira B., 2010. "A cross-validation deletion-substitution-addition model selection algorithm: Application to marginal structural models," Computational Statistics & Data Analysis, Elsevier, vol. 54(12), pages 3080-3094, December.
    19. Olivier Lopez & Xavier Milhaud & Pierre-Emmanuel Thérond, 2016. "Tree-based censored regression with applications in insurance," Post-Print hal-01141228, HAL.
    20. Simila, Timo & Tikka, Jarkko, 2007. "Input selection and shrinkage in multiresponse linear regression," Computational Statistics & Data Analysis, Elsevier, vol. 52(1), pages 406-422, September.
