The effect of missing data on covariates in survival analysis

Author

Listed:
  • Irit Aitkin

    (Department of Psychology, University of Melbourne)

Abstract

We address the problem of missing covariate data in survival analysis. Specifically, we examine the factors affecting the duration of breastfeeding in Western Australia. Duration was studied in 556 women delivering at two maternity hospitals in Perth, Australia, between September 1992 and April 1993; 466 of them were breastfeeding when they left the hospital. In a previous analysis, a Cox proportional hazards model was fitted to determine the factors affecting duration of breastfeeding. However, smoking, a covariate known to be important, could not be used because of missing data: including it would have lost almost 50% of the available sample. In this analysis, we incorporate the incomplete smoking data omitted from the previous analysis. We handle the missing covariate data in two ways: by maximum likelihood (ML) and by multiple imputation. Direct maximization of the likelihood with missing data is complicated, and most methods for maximum likelihood estimation (for example, the EM algorithm) use some form of data augmentation, which augments the observed data with latent (unobserved) data so that very complicated calculations are replaced by much simpler ones given the "complete data". For cases with smoking missing, the distribution of response time is no longer a single Cox model but a mixture of two such models, in proportions given by the population proportions of smokers and non-smokers. The likelihood contribution therefore differs between complete and incomplete cases, and maximizing the likelihood must allow for this difference. We carried out the ML analysis in Stata using the GLLAMM (Generalized Linear Latent And Mixed Models) routines (Rabe-Hesketh, Pickles, and Skrondal 2001).
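The mixture structure described above can be written out as a sketch. The notation is illustrative and not taken from the paper: t_i is the observed duration, \delta_i the event indicator (1 if breastfeeding stopped, 0 if censored), x_i the fully observed covariates, s_i the binary smoking indicator, and \pi the population proportion of smokers. The sketch further assumes that \pi does not depend on x_i and that missingness and censoring are ignorable; the original analysis may differ on any of these points.

```latex
% Illustrative notation only; these symbols are assumptions, not the paper's.
\begin{align*}
h(t \mid x_i, s) &= h_0(t)\,\exp(x_i^\top \beta + \gamma s), \\
S(t \mid x_i, s) &= \exp\{-H_0(t)\,\exp(x_i^\top \beta + \gamma s)\},
  \qquad H_0(t) = \int_0^t h_0(u)\,du, \\
% Case with smoking observed: one proportional hazards term.
L_i^{\text{obs}} &= \pi^{s_i}(1-\pi)^{1-s_i}\,
  h(t_i \mid x_i, s_i)^{\delta_i}\, S(t_i \mid x_i, s_i), \\
% Case with smoking missing: two-component mixture over s = 1 and s = 0.
L_i^{\text{mis}} &= \pi\, h(t_i \mid x_i, 1)^{\delta_i}\, S(t_i \mid x_i, 1)
  + (1-\pi)\, h(t_i \mid x_i, 0)^{\delta_i}\, S(t_i \mid x_i, 0).
\end{align*}
```

The full likelihood is the product of the L_i^{\text{obs}} terms over complete cases and the L_i^{\text{mis}} terms over cases with smoking missing, which is why complete and incomplete cases need separate treatment.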
In the GLLAMM procedure, a latent smoking variable is defined for the cases with smoking missing, and the breastfeeding durations are regressed on the explanatory variables and smoking, using the observed covariate where it is available and the latent variable where it is not. The model for the smoking covariate is a "measurement model" when the covariate is observed and a "structural model" when it is not. We compared ML using GLLAMM with multiple imputation using the program written by J. L. Schafer, mainly for S-Plus/R, which is based on the data augmentation algorithm (Tanner and Wong 1987).
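The idea of treating missing smoking status as a latent variable can be illustrated with a small EM sketch in Python. This is not the author's analysis: it swaps the Cox model fitted in GLLAMM for constant (exponential) hazards so that the M-step has a closed form, it uses smoking as the only covariate, and it runs on simulated data. All function names, rates, and proportions below are hypothetical.

```python
import math
import random


def simulate(n, seed):
    """Simulate (time, event, smoker) triples; smoker is None when missing."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        s = 1 if rng.random() < 0.4 else 0   # assumed 40% smokers
        rate = 0.25 if s == 1 else 0.10      # assumed weekly stopping rates
        t = rng.expovariate(rate)
        d = 1 if t <= 26.0 else 0            # assumed censoring at 26 weeks
        t = min(t, 26.0)
        if rng.random() < 0.5:               # about half the smoking values missing
            s = None
        data.append((t, d, s))
    return data


def _joint(t, d, pi, lam0, lam1):
    """Joint likelihood of (t, d) with s = 0 and with s = 1."""
    f0 = (1.0 - pi) * lam0 ** d * math.exp(-lam0 * t)
    f1 = pi * lam1 ** d * math.exp(-lam1 * t)
    return f0, f1


def mixture_loglik(data, pi, lam0, lam1):
    """Observed-data log-likelihood: one term if s is known, a mixture if not."""
    total = 0.0
    for t, d, s in data:
        f0, f1 = _joint(t, d, pi, lam0, lam1)
        if s is None:
            total += math.log(f0 + f1)
        else:
            total += math.log(f1 if s == 1 else f0)
    return total


def em_fit(data, n_iter=200, tol=1e-10):
    """EM for the smoker proportion and the two exponential hazard rates."""
    pi, lam0, lam1 = 0.5, 0.5, 1.0           # arbitrary starting values
    history = [mixture_loglik(data, pi, lam0, lam1)]
    for _ in range(n_iter):
        # E-step: posterior probability of being a smoker for each case.
        w = []
        for t, d, s in data:
            if s is None:
                f0, f1 = _joint(t, d, pi, lam0, lam1)
                w.append(f1 / (f0 + f1))
            else:
                w.append(float(s))
        # M-step: weighted complete-data maximum likelihood estimates.
        pi = sum(w) / len(w)
        lam1 = (sum(wi * d for wi, (t, d, s) in zip(w, data))
                / sum(wi * t for wi, (t, d, s) in zip(w, data)))
        lam0 = (sum((1 - wi) * d for wi, (t, d, s) in zip(w, data))
                / sum((1 - wi) * t for wi, (t, d, s) in zip(w, data)))
        history.append(mixture_loglik(data, pi, lam0, lam1))
        if history[-1] - history[-2] < tol:
            break
    return pi, lam0, lam1, history


if __name__ == "__main__":
    pi_hat, lam0_hat, lam1_hat, trace = em_fit(simulate(400, seed=1))
    print(pi_hat, lam0_hat, lam1_hat, len(trace) - 1)
```

Cases with smoking observed enter the E-step with weight 0 or 1, while cases with smoking missing receive a fractional weight. This mirrors the abstract's distinction between complete and incomplete cases, although the semiparametric Cox baseline is not reproduced here.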

Suggested Citation

  • Irit Aitkin, "undated". "The effect of missing data on covariates in survival analysis," Australasian Stata Users' Group Meetings 2004 6, Stata Users Group.
  • Handle: RePEc:boc:osug04:6

    References listed on IDEAS

    1. Murray Aitkin & David Clayton, 1980. "The Fitting of Exponential, Weibull and Extreme Value Distributions to Complex Censored Survival Data Using Glim," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 29(2), pages 156-163, June.
    2. J. F. Lawless & J. D. Kalbfleisch & C. J. Wild, 1999. "Semiparametric methods for response‐selective and missing data problems in regression," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 61(2), pages 413-438, April.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Esmeralda A. Ramalho & Richard Smith, 2003. "Discrete choice non-response," CeMMAP working papers 07/03, Institute for Fiscal Studies.
    2. Powers, Daniel A. & Yun, Myeong-Su, 2009. "Multivariate Decomposition for Hazard Rate Models," IZA Discussion Papers 3971, Institute of Labor Economics (IZA).
    3. Ryo Kato & Takahiro Hoshino, 2020. "Semiparametric Bayesian multiple imputation for regression models with missing mixed continuous–discrete covariates," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 72(3), pages 803-825, June.
    4. Aubry, Philippe & Francesiaz, Charlotte & Guillemain, Matthieu, 2024. "On the impact of preferential sampling on ecological status and trend assessment," Ecological Modelling, Elsevier, vol. 492(C).
    5. Zhiwei Zhang & Howard Rockette, 2006. "Semiparametric Maximum Likelihood for Missing Covariates in Parametric Regression," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 58(4), pages 687-706, December.
    6. Jonathan S. Schildcrout & Shawn P. Garbett & Patrick J. Heagerty, 2013. "Outcome Vector Dependent Sampling with Longitudinal Continuous Response Data: Stratified Sampling Based on Summary Statistics," Biometrics, The International Biometric Society, vol. 69(2), pages 405-416, June.
    7. J. F. Lawless, 2018. "Two-phase outcome-dependent studies for failure times and testing for effects of expensive covariates," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 24(1), pages 28-44, January.
    8. Hoshino, Takahiro, 2008. "A Bayesian propensity score adjustment for latent variable modeling and MCMC algorithm," Computational Statistics & Data Analysis, Elsevier, vol. 52(3), pages 1413-1429, January.
    9. Orbe, Jesus & Nunez-Anton, Vicente, 2006. "Alternative approaches to study lifetime data under different scenarios: from the PH to the modified semiparametric AFT model," Computational Statistics & Data Analysis, Elsevier, vol. 50(6), pages 1565-1582, March.
    10. Brady Ryan & Ananthika Nirmalkanna & Candemir Cigsar & Yildiz E. Yilmaz, 2023. "Evaluation of Designs and Estimation Methods Under Response-Dependent Two-Phase Sampling for Genetic Association Studies," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 15(2), pages 510-539, July.
    11. Haibo Zhou & Rui Song & Yuanshan Wu & Jing Qin, 2011. "Statistical Inference for a Two-Stage Outcome-Dependent Sampling Design with a Continuous Outcome," Biometrics, The International Biometric Society, vol. 67(1), pages 194-202, March.
    12. Takahiro Hoshino & Hiroshi Kurata & Kazuo Shigemasu, 2006. "A Propensity Score Adjustment for Multiple Group Structural Equation Modeling," Psychometrika, Springer;The Psychometric Society, vol. 71(4), pages 691-712, December.
    13. Sasaki, Yuya & Ura, Takuya, 2023. "Estimation and inference for policy relevant treatment effects," Journal of Econometrics, Elsevier, vol. 234(2), pages 394-450.
    14. Xiaofei Wang & Haibo Zhou, 2006. "A Semiparametric Empirical Likelihood Method for Biased Sampling Schemes with Auxiliary Covariates," Biometrics, The International Biometric Society, vol. 62(4), pages 1149-1160, December.
    15. Liang, Hua, 2008. "Generalized partially linear models with missing covariates," Journal of Multivariate Analysis, Elsevier, vol. 99(5), pages 880-895, May.
    16. Yang Zhao & Meng Liu, 2021. "Unified approach for regression models with nonmonotone missing at random data," AStA Advances in Statistical Analysis, Springer;German Statistical Society, vol. 105(1), pages 87-101, March.
    17. Fatema Shafie Khorassani & Jeremy M. G. Taylor & Niko Kaciroti & Michael R. Elliott, 2023. "Incorporating Covariates into Measures of Surrogate Paradox Risk," Stats, MDPI, vol. 6(1), pages 1-23, February.
    18. Esmeralda A. Ramalho & Richard J. Smith, 2013. "Discrete Choice Non-Response," The Review of Economic Studies, Review of Economic Studies Ltd, vol. 80(1), pages 343-364.
    19. A. J. Scallan, 1999. "Regression modelling of interval-censored failure time data using the Weibull distribution," Journal of Applied Statistics, Taylor & Francis Journals, vol. 26(5), pages 613-618.
