IDEAS home Printed from https://ideas.repec.org/p/arx/papers/1810.09583.html

Model Selection Techniques -- An Overview

Author

Listed:
  • Jie Ding
  • Vahid Tarokh
  • Yuhong Yang

Abstract

In the era of big data, analysts usually explore various statistical models or machine learning methods for observed data in order to facilitate scientific discoveries or gain predictive power. Whatever data and fitting procedures are employed, a crucial step is to select the most appropriate model or method from a set of candidates. Model selection is a key ingredient in data analysis for reliable and reproducible statistical inference or prediction, and thus central to scientific studies in fields such as ecology, economics, engineering, finance, political science, biology, and epidemiology. There has been a long history of model selection techniques that arise from research in statistics, information theory, and signal processing. A considerable number of methods have been proposed, following different philosophies and exhibiting varying performances. The purpose of this article is to provide a comprehensive overview of them, in terms of their motivation, large sample performance, and applicability. We provide integrated and practically relevant discussions on theoretical properties of state-of-the-art model selection approaches. We also share our thoughts on some controversial views on the practice of model selection.

Suggested Citation

  • Jie Ding & Vahid Tarokh & Yuhong Yang, 2018. "Model Selection Techniques -- An Overview," Papers 1810.09583, arXiv.org.
  • Handle: RePEc:arx:papers:1810.09583

    Download full text from publisher

    File URL: http://arxiv.org/pdf/1810.09583
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Hansen M. H & Yu B., 2001. "Model Selection and the Principle of Minimum Description Length," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 746-774, June.
    2. Zou, Hui, 2006. "The Adaptive Lasso and Its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 101, pages 1418-1429, December.
    3. John P A Ioannidis, 2005. "Why Most Published Research Findings Are False," PLOS Medicine, Public Library of Science, vol. 2(8), pages 1-1, August.
    4. Jiahua Chen & Zehua Chen, 2008. "Extended Bayesian information criteria for model selection with large model spaces," Biometrika, Biometrika Trust, vol. 95(3), pages 759-771.
    5. Wei Pan, 2001. "Akaike's Information Criterion in Generalized Estimating Equations," Biometrics, The International Biometric Society, vol. 57(1), pages 120-125, March.
    6. Yuhong Yang, 2005. "Can the strengths of AIC and BIC be shared? A conflict between model indentification and regression estimation," Biometrika, Biometrika Trust, vol. 92(4), pages 937-950, December.
    7. Yang Y., 2001. "Adaptive Regression by Mixing," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 574-588, June.
    8. Leeb, Hannes & Potscher, Benedikt M., 2008. "Sparse estimators and the oracle property, or the return of Hodges' estimator," Journal of Econometrics, Elsevier, vol. 142(1), pages 201-211, January.
    9. Wenjing Yang & Yuhong Yang, 2017. "Toward an objective and reproducible model choice via variable selection deviation," Biometrics, The International Biometric Society, vol. 73(1), pages 20-30, March.
    10. Zhang, Yongli & Yang, Yuhong, 2015. "Cross-validation for selecting a model selection procedure," Journal of Econometrics, Elsevier, vol. 187(1), pages 95-112.
    11. Tim van Erven & Peter Grünwald & Steven de Rooij, 2012. "Catching up faster by switching sooner: a predictive approach to adaptive estimation with an application to the AIC–BIC dilemma," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 74(3), pages 361-417, June.
    12. Kadane, Joseph B. & Lazar, Nicole A., 2004. "Methods and Criteria for Model Selection," Journal of the American Statistical Association, American Statistical Association, vol. 99, pages 279-290, January.
    13. Hui Zou & Trevor Hastie, 2005. "Addendum: Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(5), pages 768-768, November.
    14. Yang, Yuhong, 2007. "Prediction/Estimation With Simple Linear Models: Is It Really That Simple?," Econometric Theory, Cambridge University Press, vol. 23(1), pages 1-36, February.
    15. David J. Spiegelhalter & Nicola G. Best & Bradley P. Carlin & Angelika Van Der Linde, 2002. "Bayesian measures of model complexity and fit," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 64(4), pages 583-639, October.
    16. Hui Zou & Trevor Hastie, 2005. "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 67(2), pages 301-320, April.
    17. Fan J. & Li R., 2001. "Variable Selection via Nonconcave Penalized Likelihood and its Oracle Properties," Journal of the American Statistical Association, American Statistical Association, vol. 96, pages 1348-1360, December.
    18. Ming Yuan & Yi Lin, 2006. "Model selection and estimation in regression with grouped variables," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 68(1), pages 49-67, February.

    Citations

    Cited by:

    1. Kengne, William, 2021. "Strongly consistent model selection for general causal time series," Statistics & Probability Letters, Elsevier, vol. 171(C).
    2. Estey, Clayton, 2024. "Robust Bellman State Prediction with Learning and Model Preferences," OSF Preprints 75fc9, Center for Open Science.
    3. Elhassan Tomader, 2023. "Economic and Environmental Sustainability through Trade Openness and Energy Production," Business Systems Research, Sciendo, vol. 14(2), pages 102-123, December.
    4. Mutele, Litshedzani & Carranza, Emmanuel John M., 2024. "Statistical analysis of gold production in South Africa using ARIMA, VAR and ARNN modelling techniques: Extrapolating future gold production, Resources–Reserves depletion, and Implication on South Afr," Resources Policy, Elsevier, vol. 93(C).
    5. Yonekura, Shouto & Beskos, Alexandros & Singh, Sumeetpal S., 2021. "Asymptotic analysis of model selection criteria for general hidden Markov models," Stochastic Processes and their Applications, Elsevier, vol. 132(C), pages 164-191.
    6. Faguang Wen & Jiming Jiang & Yihui Luan, 2024. "Model Selection Path and Construction of Model Confidence Set under High-Dimensional Variables," Mathematics, MDPI, vol. 12(5), pages 1-21, February.
    7. Qin, Yichen & Wang, Linna & Li, Yang & Li, Rong, 2023. "Visualization and assessment of model selection uncertainty," Computational Statistics & Data Analysis, Elsevier, vol. 178(C).
    8. Pedro Bordalo & Giovanni Burro & Katherine B. Coffman & Nicola Gennaioli & Andrei Shleifer, 2022. "Imagining the Future: Memory, Simulation and Beliefs about Covid," NBER Working Papers 30353, National Bureau of Economic Research, Inc.
    9. Jinwen Sun & Akash Deep & Shiyu Zhou & Dharmaraj Veeramani, 2023. "Industrial system working condition identification using operation-adjusted hidden Markov model," Journal of Intelligent Manufacturing, Springer, vol. 34(6), pages 2611-2624, August.
    10. Peng, Jingfu & Yang, Yuhong, 2022. "On improvability of model selection by model averaging," Journal of Econometrics, Elsevier, vol. 229(2), pages 246-262.
    11. Wenchao Xu & Xinyu Zhang, 2024. "On Asymptotic Optimality of Least Squares Model Averaging When True Model Is Included," Papers 2411.09258, arXiv.org.
    12. William Kengne, 2023. "On consistency for time series model selection," Statistical Inference for Stochastic Processes, Springer, vol. 26(2), pages 437-458, July.
    13. Simon Hirsch & Jonathan Berrisch & Florian Ziel, 2024. "Online Distributional Regression," Papers 2407.08750, arXiv.org, revised Aug 2024.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Xianyi Wu & Xian Zhou, 2019. "On Hodges’ superefficiency and merits of oracle property in model selection," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 71(5), pages 1093-1119, October.
    2. Qingliang Fan & Yaqian Wu, 2020. "Endogenous Treatment Effect Estimation with some Invalid and Irrelevant Instruments," Papers 2006.14998, arXiv.org.
    3. Ricardo P. Masini & Marcelo C. Medeiros & Eduardo F. Mendes, 2023. "Machine learning advances for time series forecasting," Journal of Economic Surveys, Wiley Blackwell, vol. 37(1), pages 76-111, February.
    4. Yue, Mu & Li, Jialiang & Cheng, Ming-Yen, 2019. "Two-step sparse boosting for high-dimensional longitudinal data with varying coefficients," Computational Statistics & Data Analysis, Elsevier, vol. 131(C), pages 222-234.
    5. Kwon, Sunghoon & Oh, Seungyoung & Lee, Youngjo, 2016. "The use of random-effect models for high-dimensional variable selection problems," Computational Statistics & Data Analysis, Elsevier, vol. 103(C), pages 401-412.
    6. Sermpinis, Georgios & Tsoukas, Serafeim & Zhang, Ping, 2018. "Modelling market implied ratings using LASSO variable selection techniques," Journal of Empirical Finance, Elsevier, vol. 48(C), pages 19-35.
    7. Howard D. Bondell & Brian J. Reich, 2009. "Simultaneous Factor Selection and Collapsing Levels in ANOVA," Biometrics, The International Biometric Society, vol. 65(1), pages 169-177, March.
    8. Gerda Claeskens, 2012. "Focused estimation and model averaging with penalization methods: an overview," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 66(3), pages 272-287, August.
    9. Wei Sun & Lexin Li, 2012. "Multiple Loci Mapping via Model-free Variable Selection," Biometrics, The International Biometric Society, vol. 68(1), pages 12-22, March.
    10. Zhihua Sun & Yi Liu & Kani Chen & Gang Li, 2022. "Broken adaptive ridge regression for right-censored survival data," Annals of the Institute of Statistical Mathematics, Springer;The Institute of Statistical Mathematics, vol. 74(1), pages 69-91, February.
    11. Yoonsuh Jung, 2018. "Multiple predicting K-fold cross-validation for model selection," Journal of Nonparametric Statistics, Taylor & Francis Journals, vol. 30(1), pages 197-215, January.
    12. Xiangyu Wang & Chenlei Leng, 2016. "High dimensional ordinary least squares projection for screening variables," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 78(3), pages 589-611, June.
    13. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    14. Yize Zhao & Matthias Chung & Brent A. Johnson & Carlos S. Moreno & Qi Long, 2016. "Hierarchical Feature Selection Incorporating Known and Novel Biological Information: Identifying Genomic Features Related to Prostate Cancer Recurrence," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1427-1439, October.
    15. Capanu, Marinela & Giurcanu, Mihai & Begg, Colin B. & Gönen, Mithat, 2023. "Subsampling based variable selection for generalized linear models," Computational Statistics & Data Analysis, Elsevier, vol. 184(C).
    16. Loann David Denis Desboulets, 2018. "A Review on Variable Selection in Regression Analysis," Econometrics, MDPI, vol. 6(4), pages 1-27, November.
    17. Zhang, Ting & Wang, Lei, 2020. "Smoothed empirical likelihood inference and variable selection for quantile regression with nonignorable missing response," Computational Statistics & Data Analysis, Elsevier, vol. 144(C).
    18. Zhang, Tonglin, 2024. "Variables selection using L0 penalty," Computational Statistics & Data Analysis, Elsevier, vol. 190(C).
    19. Takumi Saegusa & Tianzhou Ma & Gang Li & Ying Qing Chen & Mei-Ling Ting Lee, 2020. "Variable Selection in Threshold Regression Model with Applications to HIV Drug Adherence Data," Statistics in Biosciences, Springer;International Chinese Statistical Association, vol. 12(3), pages 376-398, December.
    20. Li, Gaorong & Lian, Heng & Feng, Sanying & Zhu, Lixing, 2013. "Automatic variable selection for longitudinal generalized linear models," Computational Statistics & Data Analysis, Elsevier, vol. 61(C), pages 174-186.
