Printed from https://ideas.repec.org/a/bpj/ijbist/v10y2014i2p19n5.html

Estimation of a Predictor’s Importance by Random Forests When There Is Missing Data: Risk Prediction in Liver Surgery using Laboratory Data

Author

Listed:
  • Hapfelmeier Alexander

    (Institute of Medical Statistics and Epidemiology, Technische Universität, München, Germany)

  • Hothorn Torsten

    (Division of Biostatistics, Universität Zürich, Zürich, Switzerland)

  • Riediger Carina

    (Department of Surgery, Technische Universität Dresden, Dresden, Germany)

  • Ulm Kurt

    (Institute of Medical Statistics and Epidemiology, Technische Universität, München, Germany)

Abstract

In the last few decades, new developments in liver surgery have expanded its applicability and improved its safety. However, liver surgery is still associated with postoperative morbidity and mortality, especially in extended resections. We analyzed a large liver surgery database to investigate whether laboratory parameters such as haemoglobin, leucocytes, bilirubin, haematocrit and lactate might be relevant preoperative predictors. Missing values are not uncommon in such data, and the same holds for many other data sources and research fields. For analysis, one can use imputation methods or approaches that are able to deal with missing values in the predictor variables. Random Forests are a representative of the latter; they also provide variable importance measures to assess a variable’s relevance for prediction. Applied to the liver surgery data, we observed divergent results for the laboratory parameters, depending on the method used to cope with missing values. We therefore performed an extensive simulation study to investigate the properties of each approach. Findings and recommendations: Complete case analysis should not be used, as it distorts the relevance of completely observed variables in an undesirable way. Estimating a variable’s importance by a self-contained measure that can deal with missing values appropriately reflects the decreased relevance of variables with missing values. It can therefore be used to gain insight into Random Forests, which are commonly fit without preprocessing of missing values in the data. By contrast, multiple imputation allows for the assessment of the relevance a variable would potentially show in complete-data situations, provided that imputation performs well. For the laboratory data, lactate and bilirubin seem to be associated with the risk of liver failure and postoperative complications. These relations should be investigated in more detail by future studies. However, it is important to carefully consider the method used for analysis when there are missing values in the predictor variables.
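The two ingredients the abstract contrasts, dropping incomplete rows and measuring importance by permutation, can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not the authors' implementation: the data are simulated, the names `complete_cases`, `permutation_importance` and `predict` are hypothetical, and a fixed threshold rule stands in for a fitted Random Forest.

```python
import random

def complete_cases(rows):
    """Keep only rows without any missing (None) predictor value."""
    return [r for r in rows if None not in r[0]]

def accuracy(predict, rows):
    """Share of rows whose predicted label equals the observed label."""
    return sum(predict(x) == y for x, y in rows) / len(rows)

def permutation_importance(predict, rows, col, rng):
    """Drop in accuracy after shuffling predictor `col` across rows."""
    shuffled = [x[col] for x, _ in rows]
    rng.shuffle(shuffled)
    permuted = [(x[:col] + (v,) + x[col + 1:], y)
                for (x, y), v in zip(rows, shuffled)]
    return accuracy(predict, rows) - accuracy(predict, permuted)

# Hypothetical toy data: x0 drives the outcome, x1 is pure noise.
rng = random.Random(0)
rows = []
for _ in range(500):
    x0, x1 = rng.random(), rng.random()
    rows.append(((x0, x1), int(x0 > 0.5)))

# Make roughly 30% of the x1 values missing, completely at random.
with_gaps = [((x0, None if rng.random() < 0.3 else x1), y)
             for (x0, x1), y in rows]

def predict(x):
    # Stand-in for a fitted model (an assumption, not a Random Forest).
    return int(x[0] > 0.5)

# Complete case analysis: rows are discarded even though x0 is observed.
cc = complete_cases(with_gaps)
imp_x0 = permutation_importance(predict, cc, 0, rng)
imp_x1 = permutation_importance(predict, cc, 1, rng)
print(len(rows), len(cc), round(imp_x0, 3), imp_x1)
```

In this toy setting, complete case analysis estimates the importance of the fully observed `x0` from fewer rows solely because `x1` has gaps. The sketch does not reproduce the distortion reported in the paper, nor the self-contained importance measure or multiple imputation; it only makes the mechanics of listwise deletion and permutation importance concrete.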

Suggested Citation

  • Hapfelmeier Alexander & Hothorn Torsten & Riediger Carina & Ulm Kurt, 2014. "Estimation of a Predictor’s Importance by Random Forests When There Is Missing Data: Risk Prediction in Liver Surgery using Laboratory Data," The International Journal of Biostatistics, De Gruyter, vol. 10(2), pages 165-183, November.
  • Handle: RePEc:bpj:ijbist:v:10:y:2014:i:2:p:19:n:5
    DOI: 10.1515/ijb-2013-0038

    Download full text from publisher

    File URL: https://doi.org/10.1515/ijb-2013-0038
    Download Restriction: For access to full text, subscription to the journal or payment for the individual article is required.



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Sauerbrei, W. & Meier-Hirmer, C. & Benner, A. & Royston, P., 2006. "Multivariable regression model building by using fractional polynomials: Description of SAS, STATA and R programs," Computational Statistics & Data Analysis, Elsevier, vol. 50(12), pages 3464-3485, August.
    2. Hapfelmeier, A. & Ulm, K., 2013. "A new variable selection approach using Random Forests," Computational Statistics & Data Analysis, Elsevier, vol. 60(C), pages 50-69.
    3. Toshiki Doi & Suguru Yamamoto & Takatoshi Morinaga & Ken-ei Sada & Noriaki Kurita & Yoshihiro Onishi, 2015. "Risk Score to Predict 1-Year Mortality after Haemodialysis Initiation in Patients with Stage 5 Chronic Kidney Disease under Predialysis Nephrology Care," PLOS ONE, Public Library of Science, vol. 10(6), pages 1-14, June.
    4. Patrick Royston & Willi Sauerbrei, 2009. "Bootstrap assessment of the stability of multivariable models," Stata Journal, StataCorp LP, vol. 9(4), pages 547-570, December.
    5. Patrick Royston & Willi Sauerbrei, 2007. "Multivariable modeling with cubic regression splines: A principled approach," Stata Journal, StataCorp LP, vol. 7(1), pages 45-70, February.
    6. Dunkler, Daniela & Sauerbrei, Willi & Heinze, Georg, 2016. "Global, Parameterwise and Joint Shrinkage Factor Estimation," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 69(i08).
    7. Harald Binder & Willi Sauerbrei, 2009. "Stability analysis of an additive spline model for respiratory health data by using knot removal," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 58(5), pages 577-600, December.
    8. Simone P Rauh & Martijn W Heymans & Anitra D M Koopman & Giel Nijpels & Coen D Stehouwer & Barbara Thorand & Wolfgang Rathmann & Christa Meisinger & Annette Peters & Tonia de las Heras Gala & Charlott, 2017. "Predicting glycated hemoglobin levels in the non-diabetic general population: Development and validation of the DIRECT-DETECT prediction model - a DIRECT study," PLOS ONE, Public Library of Science, vol. 12(2), pages 1-13, February.
