IDEAS home Printed from https://ideas.repec.org/a/gam/jecnmx/v6y2018i4p45-d185046.html

A Review on Variable Selection in Regression Analysis

Author

Listed:
  • Loann David Denis Desboulets

    (CNRS, EHESS, Centrale Marseille, AMSE, Aix-Marseille University, 5-9 Boulevard Maurice Bourdet, 13001 Marseille, France)

Abstract

In this paper, we investigate several variable selection procedures to give practitioners an overview of the existing literature. “Let the data speak for themselves” has become the motto of many applied researchers as the amount of available data has grown significantly. Automatic model selection has long been promoted as a way to search for data-driven theories. However, while the theory has been extended considerably, most empirical work still relies on basic procedures such as stepwise regression. Here, we review the main methods and their state-of-the-art extensions, provide a typology of them over a wide range of model structures (linear, grouped, additive, partially linear and non-parametric), and list available software resources for the implemented methods so that practitioners can easily access them. We explain which methods to use for different modelling purposes and how they differ. We also review two methods for improving variable selection in the general sense.

Suggested Citation

  • Loann David Denis Desboulets, 2018. "A Review on Variable Selection in Regression Analysis," Econometrics, MDPI, vol. 6(4), pages 1-27, November.
  • Handle: RePEc:gam:jecnmx:v:6:y:2018:i:4:p:45-:d:185046

    Download full text from publisher

    File URL: https://www.mdpi.com/2225-1146/6/4/45/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2225-1146/6/4/45/
    Download Restriction: no

    References listed on IDEAS

    1. Radchenko, Peter & James, Gareth M., 2010. "Variable Selection Using Adaptive Nonlinear Interaction Structures in High Dimensions," Journal of the American Statistical Association, American Statistical Association, vol. 105(492), pages 1541-1553.
    2. Runze Li & Wei Zhong & Liping Zhu, 2012. "Feature Screening via Distance Correlation Learning," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 107(499), pages 1129-1139, September.
    3. Jian Huang & Shuangge Ma & Huiliang Xie & Cun-Hui Zhang, 2009. "A group bridge approach for variable selection," Biometrika, Biometrika Trust, vol. 96(2), pages 339-355.
    4. Fan, J. & Li, R., 2001. "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 1348-1360, December.
    5. Castle, Jennifer L. & Hendry, David F., 2010. "A low-dimension portmanteau test for non-linearity," Journal of Econometrics, Elsevier, vol. 158(2), pages 231-245, October.
    6. Wang, Hansheng & Xia, Yingcun, 2009. "Shrinkage Estimation of the Varying Coefficient Model," Journal of the American Statistical Association, American Statistical Association, vol. 104(486), pages 747-757.
    7. Camila Epprecht & Dominique Guegan & Álvaro Veiga & Joel Correa da Rosa, 2017. "Variable selection and forecasting via automated methods for linear models: LASSO/adaLASSO and Autometrics," Post-Print halshs-00917797, HAL.
    8. Ni, Xiao & Zhang, Hao Helen & Zhang, Daowen, 2009. "Automatic model selection for partially linear models," Journal of Multivariate Analysis, Elsevier, vol. 100(9), pages 2100-2111, October.
    9. Carlos Santos & David Hendry & Soren Johansen, 2008. "Automatic selection of indicators in a fully saturated regression," Computational Statistics, Springer, vol. 23(2), pages 317-335, April.
    10. Zhang, Jing & Liu, Yanyan & Wu, Yuanshan, 2017. "Correlation rank screening for ultrahigh-dimensional survival data," Computational Statistics & Data Analysis, Elsevier, vol. 108(C), pages 121-132.
    11. Choi, Nam Hee & Li, William & Zhu, Ji, 2010. "Variable Selection With the Strong Heredity Constraint and Its Oracle Property," Journal of the American Statistical Association, American Statistical Association, vol. 105(489), pages 354-364.
    12. Hendry, D.F. & Richard, J.-F., 1987. "Recent developments in the theory of encompassing," LIDAM Discussion Papers CORE 1987022, Université catholique de Louvain, Center for Operations Research and Econometrics (CORE).
    13. Zou, Hui, 2006. "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 1418-1429, December.
    14. Friedman, Jerome H. & Hastie, Trevor & Tibshirani, Rob, 2010. "Regularization Paths for Generalized Linear Models via Coordinate Descent," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 33(i01).
    15. Fan, Jianqing & Feng, Yang & Song, Rui, 2011. "Nonparametric Independence Screening in Sparse Ultra-High-Dimensional Additive Models," Journal of the American Statistical Association, American Statistical Association, vol. 106(494), pages 544-557.
    16. Guang Cheng & Hao Zhang & Zuofeng Shang, 2015. "Sparse and efficient estimation for partial spline models with increasing dimension," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 67(1), pages 93-127, February.
    17. Kim, Yongdai & Choi, Hosik & Oh, Hee-Seok, 2008. "Smoothly Clipped Absolute Deviation on High Dimensions," Journal of the American Statistical Association, American Statistical Association, vol. 103(484), pages 1665-1673.
    18. Robert Tibshirani & Michael Saunders & Saharon Rosset & Ji Zhu & Keith Knight, 2005. "Sparsity and smoothness via the fused lasso," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(1), pages 91-108, February.
    19. Castle, Jennifer L. & Doornik, Jurgen A. & Hendry, David F., 2011. "Evaluating Automatic Model Selection," Journal of Time Series Econometrics, De Gruyter, vol. 3(1), pages 1-33, February.
    20. Byeong U. Park & Enno Mammen & Young K. Lee & Eun Ryung Lee, 2015. "Varying Coefficient Regression Models: A Review and New Developments," International Statistical Review, International Statistical Institute, vol. 83(1), pages 36-64, April.
    21. Camila Epprecht & Dominique Guegan & Álvaro Veiga & Joel Correa da Rosa, 2013. "Variable selection and forecasting via automated methods for linear models: LASSO/adaLASSO and Autometrics," Documents de travail du Centre d'Economie de la Sorbonne 13080r, Université Panthéon-Sorbonne (Paris 1), Centre d'Economie de la Sorbonne, revised Oct 2017.
    22. Rajen D. Shah & Richard J. Samworth, 2013. "Variable selection with error control: another look at stability selection," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 75(1), pages 55-80, January.
    23. Jianqing Fan & Jinchi Lv, 2008. "Sure independence screening for ultrahigh dimensional feature space," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 70(5), pages 849-911, November.
    24. Hui Zou & Trevor Hastie, 2005. "Addendum: Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(5), pages 768-768, November.
    25. McIlhagga, William, 2016. "penalized: A MATLAB Toolbox for Fitting Generalized Linear Models with Penalties," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 72(i06).
    26. Wang, Hansheng, 2009. "Forward Regression for Ultra-High Dimensional Variable Screening," Journal of the American Statistical Association, American Statistical Association, vol. 104(488), pages 1512-1524.
    27. Hui Zou & Trevor Hastie, 2005. "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(2), pages 301-320, April.
    28. Ming Yuan & Yi Lin, 2006. "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 68(1), pages 49-67, February.

    Cited by:

    1. Fakhri J. Hasanov & Elchin Suleymanov & Heyran Aliyeva & Hezi Eynalov & Sa'd Shannak, 2022. "What Drives the Agricultural Growth in Azerbaijan? Insights from Autometrics with Super Saturation," Acta Universitatis Agriculturae et Silviculturae Mendelianae Brunensis, Mendel University Press, vol. 70(3), pages 147-174.
    2. Kimia Keshanian & Daniel Zantedeschi & Kaushik Dutta, 2022. "Features Selection as a Nash-Bargaining Solution: Applications in Online Advertising and Information Systems," INFORMS Journal on Computing, INFORMS, vol. 34(5), pages 2485-2501, September.
    3. Gonzalo García-Donato & María Eugenia Castellanos & Alicia Quirós, 2021. "Bayesian Variable Selection with Applications in Health Sciences," Mathematics, MDPI, vol. 9(3), pages 1-16, January.
    4. Gao Wang & Abhishek Sarkar & Peter Carbonetto & Matthew Stephens, 2020. "A simple new approach to variable selection in regression, with application to genetic fine mapping," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 82(5), pages 1273-1300, December.
    5. Iztok Podbregar & Goran Šimić & Mirjana Radovanović & Sanja Filipović & Polona Šprajc, 2020. "International Energy Security Risk Index—Analysis of the Methodological Settings," Energies, MDPI, vol. 13(12), pages 1-15, June.
    6. Rahi Jain & Wei Xu, 2021. "HDSI: High dimensional selection with interactions algorithm on feature selection and testing," PLOS ONE, Public Library of Science, vol. 16(2), pages 1-17, February.
    7. Fakhri J. Hasanov & Muhammad Javid & Frederick L. Joutz, 2022. "Saudi Non-Oil Exports before and after COVID-19: Historical Impacts of Determinants and Scenario Analysis," Sustainability, MDPI, vol. 14(4), pages 1-38, February.
    8. Robert Giel & Alicja Dąbrowska, 2021. "Estimating Time Spent at the Waste Collection Point by A Garbage Truck with A Multiple Regression Model," Sustainability, MDPI, vol. 13(8), pages 1-14, April.
    9. Eduardo Correia & Rodrigo Calili & José Francisco Pessanha & Maria Fatima Almeida, 2023. "Definition of Regulatory Targets for Electricity Non-Technical Losses: Proposition of an Automatic Model-Selection Technique for Panel Data Regressions," Energies, MDPI, vol. 16(6), pages 1-22, March.
    10. Aneiros, Germán & Novo, Silvia & Vieu, Philippe, 2022. "Variable selection in functional regression models: A review," Journal of Multivariate Analysis, Elsevier, vol. 188(C).
    11. Berndt Jesenko & Christian Schlögl, 2021. "The effect of web of science subject categories on clustering: the case of data-driven methods in business and economic sciences," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(8), pages 6785-6801, August.
    12. Marcin Błażejowski & Jacek Kwiatkowski & Paweł Kufel, 2020. "BACE and BMA Variable Selection and Forecasting for UK Money Demand and Inflation with Gretl," Econometrics, MDPI, vol. 8(2), pages 1-29, May.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Xiangyu Wang & Chenlei Leng, 2016. "High dimensional ordinary least squares projection for screening variables," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 78(3), pages 589-611, June.
    2. Pei Wang & Shunjie Chen & Sijia Yang, 2022. "Recent Advances on Penalized Regression Models for Biological Data," Mathematics, MDPI, vol. 10(19), pages 1-24, October.
    3. Li Yun & O’Connor George T. & Dupuis Josée & Kolaczyk Eric, 2015. "Modeling gene-covariate interactions in sparse regression with group structure for genome-wide association studies," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 14(3), pages 265-277, June.
    4. Min Chen & Yimin Lian & Zhao Chen & Zhengjun Zhang, 2017. "Sure explained variability and independence screening," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 29(4), pages 849-883, October.
    5. Dai, Linlin & Chen, Kani & Sun, Zhihua & Liu, Zhenqiu & Li, Gang, 2018. "Broken adaptive ridge regression and its asymptotic properties," Journal of Multivariate Analysis, Elsevier, vol. 168(C), pages 334-351.
    6. Liming Wang & Xingxiang Li & Xiaoqing Wang & Peng Lai, 2022. "Unified mean-variance feature screening for ultrahigh-dimensional regression," Computational Statistics, Springer, vol. 37(4), pages 1887-1918, September.
    7. Diego Vidaurre & Concha Bielza & Pedro Larrañaga, 2013. "A Survey of L1 Regression," International Statistical Review, International Statistical Institute, vol. 81(3), pages 361-387, December.
    8. Mingqiu Wang & Guo-Liang Tian, 2016. "Robust group non-convex estimations for high-dimensional partially linear models," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 28(1), pages 49-67, March.
    9. Howard D. Bondell & Brian J. Reich, 2012. "Consistent High-Dimensional Bayesian Variable Selection via Penalized Credible Regions," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 107(500), pages 1610-1624, December.
    10. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    11. Wang, Christina Dan & Chen, Zhao & Lian, Yimin & Chen, Min, 2022. "Asset selection based on high frequency Sharpe ratio," Journal of Econometrics, Elsevier, vol. 227(1), pages 168-188.
    12. Capanu, Marinela & Giurcanu, Mihai & Begg, Colin B. & Gönen, Mithat, 2023. "Subsampling based variable selection for generalized linear models," Computational Statistics & Data Analysis, Elsevier, vol. 184(C).
    13. Tomáš Plíhal, 2021. "Scheduled macroeconomic news announcements and Forex volatility forecasting," Journal of Forecasting, John Wiley & Sons, Ltd., vol. 40(8), pages 1379-1397, December.
    14. Jingxuan Luo & Lili Yue & Gaorong Li, 2023. "Overview of High-Dimensional Measurement Error Regression Models," Mathematics, MDPI, vol. 11(14), pages 1-22, July.
    15. Takumi Saegusa & Tianzhou Ma & Gang Li & Ying Qing Chen & Mei-Ling Ting Lee, 2020. "Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 12(3), pages 376-398, December.
    16. Ricardo P. Masini & Marcelo C. Medeiros & Eduardo F. Mendes, 2023. "Machine learning advances for time series forecasting," Journal of Economic Surveys, Wiley Blackwell, vol. 37(1), pages 76-111, February.
    17. Massimiliano Caporin & Francesco Poli, 2017. "Building News Measures from Textual Data and an Application to Volatility Forecasting," Econometrics, MDPI, vol. 5(3), pages 1-46, August.
    18. Justin B. Post & Howard D. Bondell, 2013. "Factor Selection and Structural Identification in the Interaction ANOVA Model," Biometrics, The International Biometric Society, vol. 69(1), pages 70-79, March.
    19. Zhang, Shucong & Zhou, Yong, 2018. "Variable screening for ultrahigh dimensional heterogeneous data via conditional quantile correlations," Journal of Multivariate Analysis, Elsevier, vol. 165(C), pages 1-13.
    20. Jonathan Boss & Alexander Rix & Yin‐Hsiu Chen & Naveen N. Narisetty & Zhenke Wu & Kelly K. Ferguson & Thomas F. McElrath & John D. Meeker & Bhramar Mukherjee, 2021. "A hierarchical integrative group least absolute shrinkage and selection operator for analyzing environmental mixtures," Environmetrics, John Wiley & Sons, Ltd., vol. 32(8), December.

    More about this item

    Keywords

    variable selection; automatic modelling; sparse models;

    JEL classification:

    • B23 - Schools of Economic Thought and Methodology - - History of Economic Thought since 1925 - - - Econometrics; Quantitative and Mathematical Studies
    • C - Mathematical and Quantitative Methods
    • C00 - Mathematical and Quantitative Methods - - General - - - General
    • C01 - Mathematical and Quantitative Methods - - General - - - Econometrics
    • C1 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods and Methodology: General
    • C2 - Mathematical and Quantitative Methods - - Single Equation Models; Single Variables
    • C3 - Mathematical and Quantitative Methods - - Multiple or Simultaneous Equation Models; Multiple Variables
    • C4 - Mathematical and Quantitative Methods - - Econometric and Statistical Methods: Special Topics
    • C5 - Mathematical and Quantitative Methods - - Econometric Modeling
    • C8 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs
