
Estimation of predictive performance in high-dimensional data settings using learning curves

Author

Listed:
  • Goedhart, Jeroen M.
  • Klausch, Thomas
  • van de Wiel, Mark A.

Abstract

In high-dimensional prediction settings, it remains challenging to reliably estimate test performance. To address this challenge, a novel performance estimation framework is presented. This framework, called Learn2Evaluate, is based on learning curves: it fits a smooth monotone curve depicting test performance as a function of the sample size. Learn2Evaluate has several advantages over commonly applied performance estimation methodologies. Firstly, a learning curve offers a graphical overview of a learner. This overview helps assess the potential benefit of adding training samples, and it provides a more complete comparison between learners than performance estimates at a fixed subsample size. Secondly, a learning curve facilitates estimating the performance at the total sample size rather than at a subsample size. Thirdly, Learn2Evaluate allows the computation of a theoretically justified and useful lower confidence bound. Furthermore, this bound may be tightened by performing a bias correction. The benefits of Learn2Evaluate are illustrated by a simulation study and applications to omics data.
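The learning-curve idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the nearest-centroid learner, the inverse power law curve family a - b * n^(-c), and the helper names `subsample_accuracy` and `fit_inverse_power_law` are all assumptions chosen for illustration. The sketch estimates accuracy by repeated hold-out at several subsample sizes, fits a smooth non-decreasing curve to those estimates, and reads off the fitted value at the total sample size N.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical high-dimensional data: N = 100 samples, p = 500 features,
# two balanced classes that differ only in the first 20 features.
N, p = 100, 500
y = np.repeat([0, 1], N // 2)
X = rng.normal(size=(N, p))
X[y == 1, :20] += 0.5


def subsample_accuracy(X, y, n_train, n_repeats, rng):
    """Mean hold-out accuracy of a nearest-centroid classifier
    trained on random subsamples of size n_train."""
    scores = []
    for _ in range(n_repeats):
        perm = rng.permutation(len(y))
        train, test = perm[:n_train], perm[n_train:]
        if np.unique(y[train]).size < 2:
            continue  # skip splits that miss a class in the training part
        Xtr, ytr = X[train], y[train]
        c0 = Xtr[ytr == 0].mean(axis=0)
        c1 = Xtr[ytr == 1].mean(axis=0)
        d0 = ((X[test] - c0) ** 2).sum(axis=1)
        d1 = ((X[test] - c1) ** 2).sum(axis=1)
        pred = (d1 < d0).astype(int)
        scores.append(np.mean(pred == y[test]))
    return float(np.mean(scores))


def fit_inverse_power_law(n, perf):
    """Fit perf(n) ~ a - b * n**(-c) with b >= 0 (non-decreasing in n).
    Grid search over c; least squares for a and b at each c."""
    n = np.asarray(n, dtype=float)
    perf = np.asarray(perf, dtype=float)
    best = None
    for c in np.linspace(0.1, 2.0, 39):
        design = np.column_stack([np.ones_like(n), -n ** (-c)])
        (a, b), *_ = np.linalg.lstsq(design, perf, rcond=None)
        if b < 0:  # would be decreasing: fall back to a flat curve
            a, b = perf.mean(), 0.0
        sse = float(np.sum((a - b * n ** (-c) - perf) ** 2))
        if best is None or sse < best[0]:
            best = (sse, float(a), float(b), float(c))
    return best[1:]


# Performance estimates at a grid of subsample sizes n < N.
sizes = np.arange(20, 90, 10)
mean_acc = [subsample_accuracy(X, y, n, 50, rng) for n in sizes]

# Smooth monotone curve, evaluated at the total sample size N.
a, b, c = fit_inverse_power_law(sizes, mean_acc)
acc_at_N = a - b * N ** (-c)

for n, m in zip(sizes, mean_acc):
    print(f"n = {n:3d}  mean hold-out accuracy = {m:.3f}")
print(f"fitted curve at N = {N}: {acc_at_N:.3f}")
```

The sketch covers only the point estimate at N; the lower confidence bound and the bias correction mentioned in the abstract are not reproduced here, since the listing does not describe how they are constructed.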

Suggested Citation

  • Goedhart, Jeroen M. & Klausch, Thomas & van de Wiel, Mark A., 2023. "Estimation of predictive performance in high-dimensional data settings using learning curves," Computational Statistics & Data Analysis, Elsevier, vol. 180(C).
  • Handle: RePEc:eee:csdana:v:180:y:2023:i:c:s016794732200202x
    DOI: 10.1016/j.csda.2022.107622

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S016794732200202X
    Download Restriction: Full text for ScienceDirect subscribers only.


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Zhengnan Huang & Hongjiu Zhang & Jonathan Boss & Stephen A Goutman & Bhramar Mukherjee & Ivo D Dinov & Yuanfang Guan & for the Pooled Resource Open-Access ALS Clinical Trials Consortium, 2017. "Complete hazard ranking to analyze right-censored data: An ALS survival study," PLOS Computational Biology, Public Library of Science, vol. 13(12), pages 1-21, December.
    2. Aderhold Andrej & Husmeier Dirk & Grzegorczyk Marco, 2014. "Statistical inference of regulatory networks for circadian regulation," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 13(3), pages 227-273, June.
    3. Xiaodong Cai & Juan Andrés Bazerque & Georgios B Giannakis, 2013. "Inference of Gene Regulatory Networks with Sparse Structural Equation Models Exploiting Genetic Perturbations," PLOS Computational Biology, Public Library of Science, vol. 9(5), pages 1-13, May.
    4. Blum Yuna & Houée-Bigot Magalie & Causeur David, 2016. "Sparse factor model for co-expression networks with an application using prior biological knowledge," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 15(3), pages 253-272, June.
    5. Guibert, Quentin & Lopez, Olivier & Piette, Pierrick, 2019. "Forecasting mortality rate improvements with a high-dimensional VAR," Insurance: Mathematics and Economics, Elsevier, vol. 88(C), pages 255-272.
    6. Tutz, Gerhard & Pößnecker, Wolfgang & Uhlmann, Lorenz, 2015. "Variable selection in general multinomial logit models," Computational Statistics & Data Analysis, Elsevier, vol. 82(C), pages 207-222.
    7. Hannart, Alexis & Naveau, Philippe, 2014. "Estimating high dimensional covariance matrices: A new look at the Gaussian conjugate framework," Journal of Multivariate Analysis, Elsevier, vol. 131(C), pages 149-162.
    8. Rui Wang & Naihua Xiu & Kim-Chuan Toh, 2021. "Subspace quadratic regularization method for group sparse multinomial logistic regression," Computational Optimization and Applications, Springer, vol. 79(3), pages 531-559, July.
    9. Mkhadri, Abdallah & Ouhourane, Mohamed, 2013. "An extended variable inclusion and shrinkage algorithm for correlated variables," Computational Statistics & Data Analysis, Elsevier, vol. 57(1), pages 631-644.
    10. Chen, Le-Yu & Lee, Sokbae, 2018. "Best subset binary prediction," Journal of Econometrics, Elsevier, vol. 206(1), pages 39-56.
    11. Chuliá, Helena & Garrón, Ignacio & Uribe, Jorge M., 2024. "Daily growth at risk: Financial or real drivers? The answer is not always the same," International Journal of Forecasting, Elsevier, vol. 40(2), pages 762-776.
    12. Sung Jae Jun & Sokbae Lee, 2024. "Causal Inference Under Outcome-Based Sampling with Monotonicity Assumptions," Journal of Business & Economic Statistics, Taylor & Francis Journals, vol. 42(3), pages 998-1009, July.
    13. Xiangwei Li & Thomas Delerue & Ben Schöttker & Bernd Holleczek & Eva Grill & Annette Peters & Melanie Waldenberger & Barbara Thorand & Hermann Brenner, 2022. "Derivation and validation of an epigenetic frailty risk score in population-based cohorts of older adults," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
    14. Christopher J Greenwood & George J Youssef & Primrose Letcher & Jacqui A Macdonald & Lauryn J Hagg & Ann Sanson & Jenn Mcintosh & Delyse M Hutchinson & John W Toumbourou & Matthew Fuller-Tyszkiewicz &, 2020. "A comparison of penalised regression methods for informing the selection of predictive markers," PLOS ONE, Public Library of Science, vol. 15(11), pages 1-14, November.
    15. Heng Chen & Daniel F. Heitjan, 2022. "Analysis of local sensitivity to nonignorability with missing outcomes and predictors," Biometrics, The International Biometric Society, vol. 78(4), pages 1342-1352, December.
    16. Jianqing Fan & Xu Han, 2017. "Estimation of the false discovery proportion with unknown dependence," Journal of the Royal Statistical Society Series B, Royal Statistical Society, vol. 79(4), pages 1143-1164, September.
    17. Wang Xiaoming & Dinu Irina & Liu Wei & Yasui Yutaka, 2011. "Linear Combination Test for Hierarchical Gene Set Analysis," Statistical Applications in Genetics and Molecular Biology, De Gruyter, vol. 10(1), pages 1-18, March.
    18. S Ariane Christie & Amanda S Conroy & Rachael A Callcut & Alan E Hubbard & Mitchell J Cohen, 2019. "Dynamic multi-outcome prediction after injury: Applying adaptive machine learning for precision medicine in trauma," PLOS ONE, Public Library of Science, vol. 14(4), pages 1-13, April.
    19. Zhu Wang, 2022. "MM for penalized estimation," TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, Springer;Sociedad de Estadística e Investigación Operativa, vol. 31(1), pages 54-75, March.
    20. Ida Kubiszewski & Kenneth Mulder & Diane Jarvis & Robert Costanza, 2022. "Toward better measurement of sustainable development and wellbeing: A small number of SDG indicators reliably predict life satisfaction," Sustainable Development, John Wiley & Sons, Ltd., vol. 30(1), pages 139-148, February.
