Printed from https://ideas.repec.org/a/eee/csdana/v192y2024ics0167947323001883.html

Variable selection for high-dimensional incomplete data

Author

Listed:
  • Liang, Lixing
  • Zhuang, Yipeng
  • Yu, Philip L.H.

Abstract

Regression analysis is often affected by high dimensionality, severe multicollinearity, and a large proportion of missing data. These problems may mask important relationships and even lead to biased conclusions. This paper proposes a novel computationally efficient method that integrates data imputation and variable selection to address these issues. More specifically, the proposed method incorporates a new multiple imputation algorithm based on matrix completion (Multiple Accelerated Inexact Soft-Impute), a more stable and accurate new randomized lasso method (Hybrid Random Lasso), and a consistent method to integrate a variable selection method with multiple imputation. Compared to existing methodologies, the proposed approach offers greater accuracy and consistency through mechanisms that enhance robustness against different missing data patterns and sampling variations. The method is applied to analyze the Asian American minority subgroup in the 2017 National Youth Risk Behavior Survey, where key risk factors related to the intention for suicide among Asian Americans are studied. Through simulations and real data analyses on various regression and classification settings, the proposed method demonstrates enhanced accuracy, consistency, and efficiency in both variable selection and prediction.
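
The abstract mentions integrating a variable selection method with multiple imputation but does not state the rule used. As a rough illustration only, the sketch below shows one common heuristic: run a selector on each imputed dataset (or resample), then keep the variables chosen in at least a given fraction of the fits. The function names, the variable labels, and the 0.5 threshold are all hypothetical; this is not the paper's Hybrid Random Lasso or its integration procedure.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def selection_frequencies(selected_sets: Sequence[Set[str]]) -> Dict[str, float]:
    """Fraction of fits in which each variable was selected.

    Each element of `selected_sets` is the set of variable names chosen by
    some selector (for example a lasso fit) on one imputed dataset.
    """
    n_fits = len(selected_sets)
    counts = Counter(name for chosen in selected_sets for name in chosen)
    return {name: count / n_fits for name, count in counts.items()}


def pool_selection(selected_sets: Sequence[Set[str]], threshold: float = 0.5) -> List[str]:
    """Keep variables whose selection frequency is at least `threshold`.

    The threshold is an assumed tuning choice, not a value from the paper.
    """
    freqs = selection_frequencies(selected_sets)
    return sorted(name for name, freq in freqs.items() if freq >= threshold)


# Hypothetical selections from four imputed copies of the same dataset.
per_imputation = [{"x1", "x2"}, {"x1", "x3"}, {"x1", "x2"}, {"x1"}]
print(pool_selection(per_imputation, threshold=0.5))  # ['x1', 'x2']
```

Raising the threshold trades recall for stability: with `threshold=0.6` only `x1`, which is selected in every fit, would be kept.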

Suggested Citation

  • Liang, Lixing & Zhuang, Yipeng & Yu, Philip L.H., 2024. "Variable selection for high-dimensional incomplete data," Computational Statistics & Data Analysis, Elsevier, vol. 192(C).
  • Handle: RePEc:eee:csdana:v:192:y:2024:i:c:s0167947323001883
    DOI: 10.1016/j.csda.2023.107877

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947323001883
    Download Restriction: Full text for ScienceDirect subscribers only.



    References listed on IDEAS

    1. Zou, Hui, 2006. "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 1418-1429, December.
    2. van Buuren, Stef & Groothuis-Oudshoorn, Karin, 2011. "mice: Multivariate Imputation by Chained Equations in R," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 45(i03).
    3. Johnson, Brent A. & Lin, D.Y. & Zeng, Donglin, 2008. "Penalized Estimating Functions and Variable Selection in Semiparametric Regression Models," Journal of the American Statistical Association, American Statistical Association, vol. 103, pages 672-680, June.
    4. Wolfson, Julian, 2011. "EEBoost: A General Method for Prediction and Variable Selection Based on Estimating Equations," Journal of the American Statistical Association, American Statistical Association, vol. 106(493), pages 296-305.
    5. Nicolai Meinshausen & Peter Bühlmann, 2010. "Stability selection," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 72(4), pages 417-473, September.
    6. Hui Zou & Trevor Hastie, 2005. "Addendum: Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(5), pages 768-768, November.
    7. Sasaki, P.Y. & Kameoka, V.A., 2009. "Ethnic variations in prevalence of high-risk sexual behaviors among Asian and Pacific Islander adolescents in Hawaii," American Journal of Public Health, American Public Health Association, vol. 99(10), pages 1886-1892.
    8. Hui Zou & Trevor Hastie, 2005. "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(2), pages 301-320, April.
    9. Pinhey, T.K. & Millman, S.R., 2004. "Asian/Pacific Islander adolescent sexual orientation and suicide risk in Guam," American Journal of Public Health, American Public Health Association, vol. 94(7), pages 1204-1206.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Christopher J Greenwood & George J Youssef & Primrose Letcher & Jacqui A Macdonald & Lauryn J Hagg & Ann Sanson & Jenn Mcintosh & Delyse M Hutchinson & John W Toumbourou & Matthew Fuller-Tyszkiewicz &, 2020. "A comparison of penalised regression methods for informing the selection of predictive markers," PLOS ONE, Public Library of Science, vol. 15(11), pages 1-14, November.
    2. Capanu, Marinela & Giurcanu, Mihai & Begg, Colin B. & Gönen, Mithat, 2023. "Subsampling based variable selection for generalized linear models," Computational Statistics & Data Analysis, Elsevier, vol. 184(C).
    3. Yongjin Li & Qingzhao Zhang & Qihua Wang, 2017. "Penalized estimation equation for an extended single-index model," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 69(1), pages 169-187, February.
    4. Tan, Xin Lu, 2019. "Optimal estimation of slope vector in high-dimensional linear transformation models," Journal of Multivariate Analysis, Elsevier, vol. 169(C), pages 179-204.
    5. Blommaert, A. & Hens, N. & Beutels, Ph., 2014. "Data mining for longitudinal data under multicollinearity and time dependence using penalized generalized estimating equations," Computational Statistics & Data Analysis, Elsevier, vol. 71(C), pages 667-680.
    6. Latouche, Pierre & Mattei, Pierre-Alexandre & Bouveyron, Charles & Chiquet, Julien, 2016. "Combining a relaxed EM algorithm with Occam’s razor for Bayesian variable selection in high-dimensional regression," Journal of Multivariate Analysis, Elsevier, vol. 146(C), pages 177-190.
    7. Diego Vidaurre & Concha Bielza & Pedro Larrañaga, 2013. "A Survey of L1 Regression," International Statistical Review, International Statistical Institute, vol. 81(3), pages 361-387, December.
    8. Wenning Feng & Abdhi Sarkar & Chae Young Lim & Tapabrata Maiti, 2016. "Variable selection for binary spatial regression: Penalized quasi‐likelihood approach," Biometrics, The International Biometric Society, vol. 72(4), pages 1164-1172, December.
    9. Roberts, S. & Nowak, G., 2014. "Stabilizing the lasso against cross-validation variability," Computational Statistics & Data Analysis, Elsevier, vol. 70(C), pages 198-211.
    10. Fan, Yali & Qin, Guoyou & Zhu, Zhongyi, 2012. "Variable selection in robust regression models for longitudinal data," Journal of Multivariate Analysis, Elsevier, vol. 109(C), pages 156-167.
    11. Zhihua Sun & Yi Liu & Kani Chen & Gang Li, 2022. "Broken adaptive ridge regression for right-censored survival data," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 74(1), pages 69-91, February.
    12. Qin, Yichen & Wang, Linna & Li, Yang & Li, Rong, 2023. "Visualization and assessment of model selection uncertainty," Computational Statistics & Data Analysis, Elsevier, vol. 178(C).
    13. Zhixuan Fu & Chirag R. Parikh & Bingqing Zhou, 2017. "Penalized variable selection in competing risks regression," Lifetime Data Analysis: An International Journal Devoted to Statistical Methods and Applications for Time-to-Event Data, Springer, vol. 23(3), pages 353-376, July.
    14. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    15. Margherita Giuzio, 2017. "Genetic algorithm versus classical methods in sparse index tracking," Decisions in Economics and Finance, Springer;Associazione per la Matematica, vol. 40(1), pages 243-256, November.
    16. Mkhadri, Abdallah & Ouhourane, Mohamed, 2013. "An extended variable inclusion and shrinkage algorithm for correlated variables," Computational Statistics & Data Analysis, Elsevier, vol. 57(1), pages 631-644.
    17. Yize Zhao & Matthias Chung & Brent A. Johnson & Carlos S. Moreno & Qi Long, 2016. "Hierarchical Feature Selection Incorporating Known and Novel Biological Information: Identifying Genomic Features Related to Prostate Cancer Recurrence," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1427-1439, October.
    18. Chuliá, Helena & Garrón, Ignacio & Uribe, Jorge M., 2024. "Daily growth at risk: Financial or real drivers? The answer is not always the same," International Journal of Forecasting, Elsevier, vol. 40(2), pages 762-776.
    19. Norman R. Swanson & Weiqi Xiong, 2018. "Big data analytics in economics: What have we learned so far, and where should we go from here?," Canadian Journal of Economics/Revue canadienne d'économique, John Wiley & Sons, vol. 51(3), pages 695-746, August.
    20. Gareth M. James & Peter Radchenko & Jinchi Lv, 2009. "DASSO: connections between the Dantzig selector and lasso," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 71(1), pages 127-142, January.
