Printed from https://ideas.repec.org/a/eee/csdana/v74y2014icp181-197.html

Learning algorithms may perform worse with increasing training set size: Algorithm–data incompatibility

Author

Listed:
  • Yousef, Waleed A.
  • Kundu, Subrata

Abstract

In machine learning problems, a learning algorithm tries to learn the input–output dependency (relationship) of a system from a training dataset. This input–output relationship is usually distorted by random noise. From experience, simulations, and special-case theories, most practitioners believe that increasing the size of the training set improves the performance of the learning algorithm. It is shown that this belief does not hold in general for every pair of learning algorithm and data distribution. In particular, it is proven that for certain distributions and learning algorithms, increasing the training set size may result in worse performance, and increasing it without bound may result in the worst performance, even when there is no model misspecification for the input–output relationship. Simulation results and analysis of real datasets are provided to support the mathematical argument.
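
The abstract does not say which algorithms and distributions the authors analyze, so the sketch below is only a textbook illustration of the general idea, not the paper's construction. It assumes a location model (observation = theta + noise) where the noise is symmetric stable with index 1/2, built as the difference of two independent 1/Z^2 draws with Z standard normal. For such noise the sample mean of n observations has the same distribution as n times a single noise draw, so its error grows with n, while the sample median keeps improving. All function names (levy_draw, symmetric_noise, median_abs_error, learning_curve) and the choice of median absolute error as the performance measure are assumptions made for this example; mean squared error is avoided because it is infinite here.

```python
import random
import statistics


def levy_draw(rng):
    """Draw 1/Z^2 with Z standard normal: a Levy variable (stable, alpha = 1/2)."""
    z = rng.gauss(0.0, 1.0)
    while z == 0.0:  # guard against an exact zero before dividing
        z = rng.gauss(0.0, 1.0)
    return 1.0 / (z * z)


def symmetric_noise(rng):
    """Difference of two independent Levy draws: symmetric stable noise, alpha = 1/2."""
    return levy_draw(rng) - levy_draw(rng)


def median_abs_error(estimator, n, trials, rng, theta=0.0):
    """Median over repeated training sets of |estimate - theta| at sample size n."""
    errors = []
    for _ in range(trials):
        sample = [theta + symmetric_noise(rng) for _ in range(n)]
        errors.append(abs(estimator(sample) - theta))
    return statistics.median(errors)


def learning_curve(sizes=(1, 10, 100), trials=400, seed=0):
    """Return (n, error of sample mean, error of sample median) for each n."""
    rng = random.Random(seed)
    return [
        (n,
         median_abs_error(statistics.fmean, n, trials, rng),
         median_abs_error(statistics.median, n, trials, rng))
        for n in sizes
    ]


if __name__ == "__main__":
    for n, mean_err, median_err in learning_curve():
        print(f"n={n:4d}  mean: {mean_err:10.3f}  median: {median_err:8.3f}")
```

Running the script prints one row per training set size. Under the assumptions above, the error of the sample mean should rise roughly in proportion to n, while the error of the sample median should fall, which shows that the degradation comes from the pairing of estimator and distribution rather than from the data alone.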

Suggested Citation

  • Yousef, Waleed A. & Kundu, Subrata, 2014. "Learning algorithms may perform worse with increasing training set size: Algorithm–data incompatibility," Computational Statistics & Data Analysis, Elsevier, vol. 74(C), pages 181-197.
  • Handle: RePEc:eee:csdana:v:74:y:2014:i:c:p:181-197
    DOI: 10.1016/j.csda.2013.05.021

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0167947313002089
    Download Restriction: Full text for ScienceDirect subscribers only.



    Citations

    Cited by:

    1. Kyrylo Klimenko & Sine A Rosenberg & Marianne Dybdahl & Eva B Wedebye & Nikolai G Nikolov, 2019. "QSAR modelling of a large imbalanced aryl hydrocarbon activation dataset by rational and random sampling and screening of 80,086 REACH pre-registered and/or registered substances," PLOS ONE, Public Library of Science, vol. 14(3), pages 1-21, March.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Nolan, John P., 2018. "Truncated fractional moments of stable laws," Statistics & Probability Letters, Elsevier, vol. 137(C), pages 312-318.
    2. Christoph P. Kustosz & Anne Leucht & Christine H. Müller, 2016. "Tests Based on Simplicial Depth for AR(1) Models With Explosion," Journal of Time Series Analysis, Wiley Blackwell, vol. 37(6), pages 763-784, November.
    3. Greg Hannsgen, 2011. "Infinite-variance, Alpha-stable Shocks in Monetary SVAR: Final Working Paper Version," Economics Working Paper Archive wp_682, Levy Economics Institute.
    4. Battey, Heather & Linton, Oliver, 2014. "Nonparametric estimation of multivariate elliptic densities via finite mixture sieves," Journal of Multivariate Analysis, Elsevier, vol. 123(C), pages 43-67.
    5. John Nolan, 2013. "Multivariate elliptically contoured stable distributions: theory and estimation," Computational Statistics, Springer, vol. 28(5), pages 2067-2089, October.
    6. Tsionas, Mike G., 2016. "Bayesian analysis of multivariate stable distributions using one-dimensional projections," Journal of Multivariate Analysis, Elsevier, vol. 143(C), pages 185-193.
    7. D. M. Mahinda Samarakoon & Keith Knight, 2009. "A Note on Unit Root Tests with Infinite Variance Noise," Econometric Reviews, Taylor & Francis Journals, vol. 28(4), pages 314-334.
    8. Mallick, Madhuja & Ravishanker, Nalini & Kannan, Nandini, 2008. "Bivariate positive stable frailty models," Statistics & Probability Letters, Elsevier, vol. 78(15), pages 2371-2377, October.
    9. Paola Stolfi & Mauro Bernardi & Lea Petrella, 2018. "The sparse method of simulated quantiles: An application to portfolio optimization," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 72(3), pages 375-398, August.
    10. Heather Battey & Oliver Linton, 2013. "Nonparametric estimation of multivariate elliptic densities via finite mixture sieves," CeMMAP working papers 41/13, Institute for Fiscal Studies.
    11. Dai, Xinjie & Niu, Cuizhen & Guo, Xu, 2018. "Testing for central symmetry and inference of the unknown center," Computational Statistics & Data Analysis, Elsevier, vol. 127(C), pages 15-31.
    12. Karling, Maicon J. & Lopes, Sílvia R.C. & de Souza, Roberto M., 2023. "Multivariate α-stable distributions: VAR(1) processes, measures of dependence and their estimations," Journal of Multivariate Analysis, Elsevier, vol. 195(C).
    13. Christoph Kustosz & Christine Müller, 2014. "Analysis of crack growth with robust, distribution-free estimators and tests for non-stationary autoregressive processes," Statistical Papers, Springer, vol. 55(1), pages 125-140, February.
    14. Amit Shelef & Edna Schechtman, 2019. "A Gini-based time series analysis and test for reversibility," Statistical Papers, Springer, vol. 60(3), pages 687-716, June.
    15. Hasan A. Fallahgoul & Young S. Kim & Frank J. Fabozzi & Jiho Park, 2019. "Quanto Option Pricing with Lévy Models," Computational Economics, Springer;Society for Computational Economics, vol. 53(3), pages 1279-1308, March.
    16. Matsui, Muneya & Takemura, Akimichi, 2009. "Integral representations of one-dimensional projections for multivariate stable densities," Journal of Multivariate Analysis, Elsevier, vol. 100(3), pages 334-344, March.
