IDEAS home Printed from https://ideas.repec.org/a/eee/jmvana/v83y2002i1p248-263.html

Bias and Efficiency Loss Due to Categorizing an Explanatory Variable

Author

Listed:
  • Taylor, Jeremy M. G.
  • Yu, Menggang

Abstract

It is a common situation in biomedical research that one or more variables are known to be associated with the outcome of interest. Researchers often discretize some variables and fit a regression model using these discretized variables. Although convenient for illustration purposes, such an approach can be biased and lead to loss of efficiency. In this article, we consider the situation of a regression model with two explanatory variables under an assumption of multivariate normality. We investigate the effect of dichotomizing or categorizing one variable on the estimate of the coefficient of the other continuous variable and on prediction from the models. Algebraic expressions are presented for the asymptotic bias and variance of the coefficient of the continuous explanatory variable and for the residual sum of squares for prediction. Some numerical examples are presented in which we find that the bias of the coefficient of the continuous explanatory variable is always smaller for the categorized model than that for the dichotomized model. The size of the test of a zero coefficient for the continuous variable depends only on the correlations between the response variable, the discretized variable, and the continuous variable. The size of the test for the categorized model is always smaller than for the dichotomized model; however, both can differ substantially from the nominal level if the correlation between the response and the categorical variable or between the two explanatory variables is high. The (predictive) relative efficiency of models also depends only on correlations amongst the three variables. There is a substantial loss of efficiency due to categorization if the correlation between the categorized and response variable is high. The predictive relative efficiency is always higher for the categorized model. The relative predictive efficiency due to dichotomization depends on the choice of cut points, with the least loss of efficiency being achieved at the median.
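
The setting described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code and does not reproduce their algebraic expressions. It assumes two standard normal explanatory variables with correlation `rho`, a linear response with unit-variance noise, a median split for the dichotomized model, and quartile cut points for the categorized model. The function name `simulate` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def simulate(n=200_000, rho=0.5, beta1=1.0, beta2=1.0, seed=0):
    """Compare the estimated coefficient of x1 when x2 is kept continuous,
    dichotomized at its median, or categorized at its quartiles.
    All settings here are illustrative assumptions, not values from the paper."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    x1, x2 = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    y = beta1 * x1 + beta2 * x2 + rng.standard_normal(n)

    def fit(x2_columns):
        # Ordinary least squares with an intercept, x1, and the chosen coding of x2.
        design = np.column_stack([np.ones(n), x1, x2_columns])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        return coef[1], resid @ resid / n  # coefficient of x1, mean squared residual

    # Dichotomized coding: one indicator for x2 above its sample median.
    dich = (x2 > np.median(x2)).astype(float)

    # Categorized coding: four quartile groups, encoded as three dummy columns
    # (the lowest group is the baseline).
    cuts = np.quantile(x2, [0.25, 0.5, 0.75])
    group = np.digitize(x2, cuts)  # integer labels 0..3
    cat = (group[:, None] == np.arange(1, 4)).astype(float)

    results = {}
    codings = {"continuous": x2, "dichotomized": dich, "categorized": cat}
    for name, cols in codings.items():
        b1_hat, mse = fit(cols)
        results[name] = {"bias": b1_hat - beta1, "mse": mse}
    return results


if __name__ == "__main__":
    for name, r in simulate().items():
        print(f"{name:>13}: bias of x1 coefficient = {r['bias']:+.3f}, "
              f"mean squared residual = {r['mse']:.3f}")
```

Under these assumptions, the coefficient of `x1` should be close to unbiased in the continuous model, noticeably biased in the dichotomized model, and less biased in the categorized model, with the mean squared residual ordered the same way. This is consistent with the qualitative findings stated in the abstract, though the magnitudes depend on the assumed correlation and cut points.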

Suggested Citation

  • Taylor, Jeremy M. G. & Yu, Menggang, 2002. "Bias and Efficiency Loss Due to Categorizing an Explanatory Variable," Journal of Multivariate Analysis, Elsevier, vol. 83(1), pages 248-263, October.
  • Handle: RePEc:eee:jmvana:v:83:y:2002:i:1:p:248-263

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0047-259X(01)92045-7
    Download Restriction: Full text for ScienceDirect subscribers only

    References listed on IDEAS

    1. Lausen, Berthold & Schumacher, Martin, 1996. "Evaluating the effect of optimized cutoff values in the assessment of prognostic factors," Computational Statistics & Data Analysis, Elsevier, vol. 21(3), pages 307-326, March.

    Citations

    Cited by:

    1. Nnaelue Godfrey Ojijieme & Xinzhu Qi & Chin-Man Chui, 2022. "Do Remittances Enhance Elderly Adults’ Healthy Social and Physical Functioning? A Cross-Sectional Study in Nigeria," IJERPH, MDPI, vol. 19(4), pages 1-17, February.
    2. Anning Hu, 2017. "Using a discretized measure of academic performance to approximate primary and secondary effects in inequality of educational opportunity," Quality & Quantity: International Journal of Methodology, Springer, vol. 51(4), pages 1627-1643, July.
    3. Felix Chan & Ágoston Reguly & László Mátyás, 2019. "Modelling with Discretized Ordered Choice Covariates," CEU Working Papers 2019_2, Department of Economics, Central European University.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Sauerbrei, W. & Meier-Hirmer, C. & Benner, A. & Royston, P., 2006. "Multivariable regression model building by using fractional polynomials: Description of SAS, STATA and R programs," Computational Statistics & Data Analysis, Elsevier, vol. 50(12), pages 3464-3485, August.
    2. Contal, Cecile & O'Quigley, John, 1999. "An application of changepoint methods in studying the effect of age on survival in breast cancer," Computational Statistics & Data Analysis, Elsevier, vol. 30(3), pages 253-270, May.
    3. Torsten Hothorn & Achim Zeileis, 2008. "Generalized Maximally Selected Statistics," Biometrics, The International Biometric Society, vol. 64(4), pages 1263-1269, December.
    4. Anning Hu, 2017. "Using a discretized measure of academic performance to approximate primary and secondary effects in inequality of educational opportunity," Quality & Quantity: International Journal of Methodology, Springer, vol. 51(4), pages 1627-1643, July.
    5. Saptarshi Chatterjee & Shrabanti Chowdhury & Sanjib Basu, 2021. "A model‐free approach for testing association," Journal of the Royal Statistical Society Series C, Royal Statistical Society, vol. 70(3), pages 511-531, June.
    6. Tunes-da-Silva, Gisela & Klein, John P., 2011. "Cutpoint selection for discretizing a continuous covariate for generalized estimating equations," Computational Statistics & Data Analysis, Elsevier, vol. 55(1), pages 226-235, January.
    7. Yu-Min Huang, 2019. "Binary surrogates with stratified samples when weights are unknown," Computational Statistics, Springer, vol. 34(2), pages 653-682, June.
    8. Hothorn, Torsten & Lausen, Berthold, 2003. "On the exact distribution of maximally selected rank statistics," Computational Statistics & Data Analysis, Elsevier, vol. 43(2), pages 121-137, June.
    9. Heinzl, Harald & Tempfer, Clemens, 2001. "A cautionary note on segmenting a cyclical covariate by minimum P-value search," Computational Statistics & Data Analysis, Elsevier, vol. 35(4), pages 451-461, February.
    10. Qiu, Zhiping & Peng, Limin & Manatunga, Amita & Guo, Ying, 2019. "A smooth nonparametric approach to determining cut-points of a continuous scale," Computational Statistics & Data Analysis, Elsevier, vol. 134(C), pages 186-210.
    11. Boulesteix, Anne-Laure & Strobl, Carolin, 2007. "Maximally selected Chi-squared statistics and non-monotonic associations: An exact approach based on two cutpoints," Computational Statistics & Data Analysis, Elsevier, vol. 51(12), pages 6295-6306, August.
    12. John O'Quigley & Loki Natarajan, 2004. "Erosion of Regression Effect in a Survival Study," Biometrics, The International Biometric Society, vol. 60(2), pages 344-351, June.
    13. López-Ratón, Mónica & Rodríguez-Álvarez, María Xosé & Cadarso-Suárez, Carmen & Gude-Sampedro, Francisco, 2014. "OptimalCutpoints: An R Package for Selecting Optimal Cutpoints in Diagnostic Tests," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 61(i08).
    14. Hollander, Norbert & Schumacher, Martin, 2006. "Estimating the functional form of a continuous covariate's effect on survival time," Computational Statistics & Data Analysis, Elsevier, vol. 50(4), pages 1131-1151, February.

    More about this item

    Keywords

    cutpoints; discretization; regression
