
When is the first spurious variable selected by sequential regression procedures?

Author

Listed:
  • Weijie J Su

Abstract

Applied statisticians use sequential regression procedures to rank explanatory variables and, in settings of low correlations between variables and strong true effect sizes, expect that variables at the top of this ranking are truly relevant to the response. In a regime of certain sparsity levels, however, we show that the lasso, forward stepwise regression, and least angle regression include the first spurious variable unexpectedly early. We derive a sharp prediction of the rank of the first spurious variable for these three procedures, demonstrating that it occurs earlier and earlier as the regression coefficients become denser. This phenomenon persists for statistically independent Gaussian random designs and arbitrarily large true effects. We gain insight by identifying the underlying cause and then introduce a simple visualization tool termed the double-ranking diagram to improve on these methods. We obtain the first result establishing the exact equivalence between the lasso and least angle regression in the early stages of solution paths beyond orthogonal designs. This equivalence implies that many important model selection results concerning the lasso can be carried over to least angle regression.
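
The phenomenon described in the abstract can be explored numerically. The following is a minimal simulation sketch, not the paper's code: it runs forward stepwise regression on an independent Gaussian design and records the step at which the first null variable enters. The function name `first_spurious_rank`, the column scaling by 1/sqrt(n), the unit noise variance, the equal effect size on the first k variables, and all parameter values are assumptions made for illustration. It does not implement the paper's sharp prediction formula or the double-ranking diagram.

```python
import numpy as np


def first_spurious_rank(n, p, k, effect, rng):
    """Return the 1-based step at which forward stepwise regression first
    selects a null variable. Assumes variables 0..k-1 are the true signals."""
    # Independent Gaussian design; columns have roughly unit norm (assumed scaling).
    X = rng.standard_normal((n, p)) / np.sqrt(n)
    beta = np.zeros(p)
    beta[:k] = effect
    y = X @ beta + rng.standard_normal(n)  # unit-variance noise (assumption)

    X_res = X.copy()  # columns with the selected variables projected out
    r = y.copy()      # current residual of y
    active = np.zeros(p, dtype=bool)
    for step in range(1, min(n, p) + 1):
        # Forward stepwise picks the variable giving the largest drop in RSS,
        # i.e. the largest |<x_j_res, r>| / ||x_j_res|| among unselected columns.
        norms = np.linalg.norm(X_res, axis=0)
        norms[active] = np.inf  # already-selected columns get score 0
        scores = np.abs(X_res.T @ r) / norms
        j = int(np.argmax(scores))
        if j >= k:
            return step  # first spurious variable enters at this step
        active[j] = True
        # Gram-Schmidt update: project the chosen direction out of X_res and r.
        q = X_res[:, j] / np.linalg.norm(X_res[:, j])
        X_res -= np.outer(q, q @ X_res)
        r -= q * (q @ r)
    return None  # only reachable if there are no null variables (p == k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for k in (5, 20, 80):
        ranks = [first_spurious_rank(200, 400, k, 50.0, rng) for _ in range(20)]
        print(f"k={k}: mean rank of first spurious variable = {np.mean(ranks):.1f}")
```

Running the script prints the average rank for a few sparsity levels k, which can be compared informally with the abstract's claim that the first spurious variable appears earlier as the coefficients become denser. Since every true variable could in principle enter first, the returned rank is at most k + 1 whenever p > k.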

Suggested Citation

  • Weijie J Su, 2018. "When is the first spurious variable selected by sequential regression procedures?," Biometrika, Biometrika Trust, vol. 105(3), pages 517-527.
  • Handle: RePEc:oup:biomet:v:105:y:2018:i:3:p:517-527.

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1093/biomet/asy032
    Download Restriction: Access to full text is restricted to subscribers.

    References listed on IDEAS

    1. Max Grazier G'Sell & Stefan Wager & Alexandra Chouldechova & Robert Tibshirani, 2016. "Sequential selection procedures and false discovery rate control," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 78(2), pages 423-444, March.
    2. Ryan J. Tibshirani & Jonathan Taylor & Richard Lockhart & Robert Tibshirani, 2016. "Exact Post-Selection Inference for Sequential Regression Procedures," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(514), pages 600-620, April.
    3. Xiangyu Wang & Chenlei Leng, 2016. "High dimensional ordinary least squares projection for screening variables," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 78(3), pages 589-611, June.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Markus Pelger & Jiacheng Zou, 2022. "Inference for Large Panel Data with Many Covariates," Papers 2301.00292, arXiv.org, revised Mar 2023.
    2. Gregory Vaughan & Robert Aseltine & Kun Chen & Jun Yan, 2017. "Stagewise generalized estimating equations with grouped variables," Biometrics, The International Biometric Society, vol. 73(4), pages 1332-1342, December.
    3. Liang, Weijuan & Zhang, Qingzhao & Ma, Shuangge, 2024. "Hierarchical false discovery rate control for high-dimensional survival analysis with interactions," Computational Statistics & Data Analysis, Elsevier, vol. 192(C).
    4. Jelle J Goeman & Aldo Solari, 2024. "On selection and conditioning in multiple testing and selective inference," Biometrika, Biometrika Trust, vol. 111(2), pages 393-416.
    5. Michael J. Weir & Thomas W. Sproul, 2019. "Identifying Drivers of Genetically Modified Seafood Demand: Evidence from a Choice Experiment," Sustainability, MDPI, vol. 11(14), pages 1-21, July.
    6. Zhao, Bangxin & Liu, Xin & He, Wenqing & Yi, Grace Y., 2021. "Dynamic tilted current correlation for high dimensional variable screening," Journal of Multivariate Analysis, Elsevier, vol. 182(C).
    7. Damian Kozbur, 2020. "Analysis of Testing‐Based Forward Model Selection," Econometrica, Econometric Society, vol. 88(5), pages 2147-2173, September.
    8. Liu, Jingyuan & Sun, Ao & Ke, Yuan, 2024. "A generalized knockoff procedure for FDR control in structural change detection," Journal of Econometrics, Elsevier, vol. 239(2).
    9. Gong, Siliang & Zhang, Kai & Liu, Yufeng, 2018. "Efficient test-based variable selection for high-dimensional linear models," Journal of Multivariate Analysis, Elsevier, vol. 166(C), pages 17-31.
    10. The Tien Mai, 2023. "Reliable Genetic Correlation Estimation via Multiple Sample Splitting and Smoothing," Mathematics, MDPI, vol. 11(9), pages 1-13, May.
    11. Nazemi, Abdolreza & Fabozzi, Frank J., 2018. "Macroeconomic variable selection for creditor recovery rates," Journal of Banking & Finance, Elsevier, vol. 89(C), pages 14-25.
    12. Ping Wang & Lu Lin, 2023. "Conditional characteristic feature screening for massive imbalanced data," Statistical Papers, Springer, vol. 64(3), pages 807-834, June.
    13. Christis Katsouris, 2023. "High Dimensional Time Series Regression Models: Applications to Statistical Learning Methods," Papers 2308.16192, arXiv.org.
    14. Rand R. Wilcox, 2018. "Robust regression: an inferential method for determining which independent variables are most important," Journal of Applied Statistics, Taylor & Francis Journals, vol. 45(1), pages 100-111, January.
    15. Christian Gross & Pierre L. Siklos, 2020. "Analyzing credit risk transmission to the nonfinancial sector in Europe: A network approach," Journal of Applied Econometrics, John Wiley & Sons, Ltd., vol. 35(1), pages 61-81, January.
    16. Sweata Sen & Damitri Kundu & Kiranmoy Das, 2023. "Variable selection for categorical response: a comparative study," Computational Statistics, Springer, vol. 38(2), pages 809-826, June.
    17. Sonja Greven & Fabian Scheipl, 2020. "Comments on: Inference and computation with Generalized Additive Models and their extensions," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 29(2), pages 343-350, June.
    18. Lu, Jiannan & Deng, Alex, 2016. "Demystifying the bias from selective inference: A revisit to Dawid’s treatment selection problem," Statistics & Probability Letters, Elsevier, vol. 118(C), pages 8-15.
    19. Liao Zhu, 2021. "The Adaptive Multi-Factor Model and the Financial Market," Papers 2107.14410, arXiv.org, revised Aug 2021.
    20. Luigi Biagini & Simone Severini, 2021. "The role of Common Agricultural Policy (CAP) in enhancing and stabilising farm income: an analysis of income transfer efficiency and the Income Stabilisation Tool," Papers 2104.14188, arXiv.org.
