Printed from https://ideas.repec.org/a/spr/compst/v36y2021i3d10.1007_s00180-020-00999-9.html

What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?

Authors

  • Bruce G. Marcot (Pacific Northwest Research Station)
  • Anca M. Hanea (University of Melbourne)

Abstract

Cross-validation using randomized subsets of data—known as k-fold cross-validation—is a powerful means of testing the success rate of models used for classification. However, few if any studies have explored how values of k (number of subsets) affect validation results in models tested with data of known statistical properties. Here, we explore conditions of sample size, model structure, and variable dependence affecting validation outcomes in discrete Bayesian networks (BNs). We created 6 variants of a BN model with known properties of variance and collinearity, along with data sets of n = 50, 500, and 5000 samples, and then tested classification success and evaluated CPU computation time with seven levels of folds (k = 2, 5, 10, 20, n − 5, n − 2, and n − 1). Classification error declined with increasing n, particularly in BN models with high multivariate dependence, and declined with increasing k, generally levelling out at k = 10, although k = 5 sufficed with large samples (n = 5000). Our work supports the common use of k = 10 in the literature, although in some cases k = 5 would suffice with BN models having independent variable structures.
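The k-fold procedure summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' code or their BN models. It assumes a discrete naive Bayes classifier (the simplest BN structure, with the class node as the sole parent of every feature) as a stand-in, synthetic binary data, and hypothetical helper names (`kfold_indices`, `train_nb`, `predict_nb`, `cv_error`, `make_data`). Only the seven fold levels and n = 50 are taken from the abstract.

```python
import math
import random
from collections import Counter


def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds of near-equal size."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def train_nb(X, y, alpha=1.0):
    """Fit a naive Bayes model on binary features and a binary class (assumed)."""
    class_count = Counter(y)
    # ones[c][j] = number of training rows of class c whose feature j equals 1
    ones = {c: [0] * len(X[0]) for c in (0, 1)}
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            ones[c][j] += v
    return class_count, ones, alpha, len(y)


def predict_nb(model, row):
    """Return the class with the highest Laplace-smoothed log posterior."""
    class_count, ones, alpha, n = model
    best, best_lp = None, -math.inf
    for c in (0, 1):
        lp = math.log((class_count[c] + alpha) / (n + 2 * alpha))
        for j, v in enumerate(row):
            p1 = (ones[c][j] + alpha) / (class_count[c] + 2 * alpha)
            lp += math.log(p1 if v == 1 else 1.0 - p1)
        if lp > best_lp:
            best, best_lp = c, lp
    return best


def cv_error(X, y, k, seed=0):
    """Pooled misclassification rate over k folds; each case is tested once."""
    wrong = 0
    for test in kfold_indices(len(y), k, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        model = train_nb([X[i] for i in train], [y[i] for i in train])
        wrong += sum(predict_nb(model, X[i]) != y[i] for i in test)
    return wrong / len(y)


def make_data(n, seed=1):
    """Synthetic data (assumption): 3 features, each matching the class w.p. 0.8."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        c = rng.randint(0, 1)
        X.append([c if rng.random() < 0.8 else 1 - c for _ in range(3)])
        y.append(c)
    return X, y


if __name__ == "__main__":
    n = 50
    X, y = make_data(n)
    # The seven fold levels named in the abstract.
    for k in (2, 5, 10, 20, n - 5, n - 2, n - 1):
        print(f"k={k:>2}  error={cv_error(X, y, k):.3f}")
```

Each case is held out exactly once, so the reported error is pooled over all n test predictions. Increasing k enlarges each training set but requires k model fits, which is the accuracy versus CPU-time trade-off the study evaluates.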

Suggested Citation

  • Bruce G. Marcot & Anca M. Hanea, 2021. "What is an optimal value of k in k-fold cross-validation in discrete Bayesian network analysis?," Computational Statistics, Springer, vol. 36(3), pages 2009-2031, September.
  • Handle: RePEc:spr:compst:v:36:y:2021:i:3:d:10.1007_s00180-020-00999-9
    DOI: 10.1007/s00180-020-00999-9

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s00180-020-00999-9
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    References listed on IDEAS

    1. Marcot, Bruce G., 2012. "Metrics for evaluating performance and uncertainty of Bayesian network models," Ecological Modelling, Elsevier, vol. 230(C), pages 50-62.
    2. Shcheglovitova, Mariya & Anderson, Robert P., 2013. "Estimating optimal complexity for ecological niche models: A jackknife approach for species with small sample sizes," Ecological Modelling, Elsevier, vol. 269(C), pages 9-17.
    3. Forio, Marie Anne Eurie & Landuyt, Dries & Bennetsen, Elina & Lock, Koen & Nguyen, Thi Hanh Tien & Ambarita, Minar Naomi Damanik & Musonge, Peace Liz Sasha & Boets, Pieter & Everaert, Gert & Dominguez, 2015. "Bayesian belief network models to analyse and predict ecological water quality in rivers," Ecological Modelling, Elsevier, vol. 312(C), pages 222-238.
    4. Scutari, Marco, 2010. "Learning Bayesian Networks with the bnlearn R Package," Journal of Statistical Software, Foundation for Open Access Statistics, vol. 35(i03).
    5. A. M. Hanea & G. F. Nane, 2018. "The asymptotic distribution of the determinant of a random correlation matrix," Statistica Neerlandica, Netherlands Society for Statistics and Operations Research, vol. 72(1), pages 14-33, February.

    Citations

    Cited by:

    1. Qianying Jin & Kristiaan Kerstens & Ignace Van de Woestyne, 2024. "Convex and nonconvex nonparametric frontier-based classification methods for anomaly detection," OR Spectrum: Quantitative Approaches in Management, Springer;Gesellschaft für Operations Research e.V., vol. 46(4), pages 1213-1239, December.
    2. Dao, Uyen & Sajid, Zaman & Khan, Faisal & Zhang, Yahui & Tran, Trung, 2023. "Modeling and analysis of internal corrosion induced failure of oil and gas pipelines," Reliability Engineering and System Safety, Elsevier, vol. 234(C).
    3. Georgios Friligkos, 2023. "A framework for applying the Logistic Regression model to obtain predictive analytics for tennis matches," Technium, Technium Science, vol. 15(1), pages 60-74.
    4. Lu Jiang & Xinyu Kang & Shan Huang & Bo Yang, 2022. "A refinement strategy for identification of scientific software from bioinformatics publications," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(6), pages 3293-3316, June.
    5. André Hartmann & Martin Behnisch & Robert Hecht & Gotthard Meinel, 2024. "Prediction of residential and non-residential building usage in Germany based on a novel nationwide reference data set," Environment and Planning B, vol. 51(1), pages 216-233, January.
    6. Xia, Huosong & Wang, Yuan & Zhang, Justin Zuopeng & Zheng, Leven J. & Kamal, Muhammad Mustafa & Arya, Varsha, 2023. "COVID-19 fake news detection: A hybrid CNN-BiLSTM-AM model," Technological Forecasting and Social Change, Elsevier, vol. 195(C).
    7. Qazi, Abroon, 2023. "Exploring Global Competitiveness Index 4.0 through the lens of country risk," Technological Forecasting and Social Change, Elsevier, vol. 196(C).
    8. Qazi, Abroon & Simsekler, Mecit Can Emre, 2023. "Nexus between drivers of COVID-19 and country risks," Socio-Economic Planning Sciences, Elsevier, vol. 85(C).
    9. Kalahasthi, Lokesh Kumar & Sánchez-Díaz, Iván & Pablo Castrellon, Juan & Gil, Jorge & Browne, Michael & Hayes, Simon & Sentís Ros, Carles, 2022. "Joint modeling of arrivals and parking durations for freight loading zones: Potential applications to improving urban logistics," Transportation Research Part A: Policy and Practice, Elsevier, vol. 166(C), pages 307-329.
    10. Abhinash Jenasamanta & Subrajeet Mohapatra, 2022. "An automated system for the assessment and grading of adolescent delinquency using a machine learning-based soft voting framework," Palgrave Communications, Palgrave Macmillan, vol. 9(1), pages 1-11, December.
    11. Abdelaziz A. Abdelhamid & El-Sayed M. El-Kenawy & Fadwa Alrowais & Abdelhameed Ibrahim & Nima Khodadadi & Wei Hong Lim & Nuha Alruwais & Doaa Sami Khafaga, 2022. "Deep Learning with Dipper Throated Optimization Algorithm for Energy Consumption Forecasting in Smart Households," Energies, MDPI, vol. 15(23), pages 1-25, December.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Marcot, Bruce G., 2017. "Common quandaries and their practical solutions in Bayesian network modeling," Ecological Modelling, Elsevier, vol. 358(C), pages 1-9.
    2. Guo, Kai & Zhang, Xinchang & Kuai, Xi & Wu, Zhifeng & Chen, Yiyun & Liu, Yi, 2020. "A spatial bayesian-network approach as a decision-making tool for ecological-risk prevention in land ecosystems," Ecological Modelling, Elsevier, vol. 419(C).
    3. Prabal Das & D. A. Sachindra & Kironmala Chanda, 2022. "Machine Learning-Based Rainfall Forecasting with Multiple Non-Linear Feature Selection Algorithms," Water Resources Management: An International Journal, Published for the European Water Resources Association (EWRA), Springer;European Water Resources Association (EWRA), vol. 36(15), pages 6043-6071, December.
    4. Vuong, Quan-Hoang & La, Viet-Phuong, 2019. "The bayesvl R package. User guide v0.8.1," OSF Preprints w5dx6, Center for Open Science.
    5. F. Cugnata & G. Perucca & S. Salini, 2017. "Bayesian networks and the assessment of universities' value added," Journal of Applied Statistics, Taylor & Francis Journals, vol. 44(10), pages 1785-1806, July.
    6. Roland R. Ramsahai, 2020. "Connecting actuarial judgment to probabilistic learning techniques with graph theory," Papers 2007.15475, arXiv.org.
    7. Tang, Kayu & Parsons, David J. & Jude, Simon, 2019. "Comparison of automatic and guided learning for Bayesian networks to analyse pipe failures in the water distribution system," Reliability Engineering and System Safety, Elsevier, vol. 186(C), pages 24-36.
    8. Myriam Patricia Cifuentes & Clara Mercedes Suarez & Ricardo Cifuentes & Noel Malod-Dognin & Sam Windels & Jose Fernando Valderrama & Paul D. Juarez & R. Burciaga Valdez & Cynthia Colen & Charles Phill, 2022. "Big Data to Knowledge Analytics Reveals the Zika Virus Epidemic as Only One of Multiple Factors Contributing to a Year-Over-Year 28-Fold Increase in Microcephaly Incidence," IJERPH, MDPI, vol. 19(15), pages 1-21, July.
    9. Moe, S. Jannicke & Haande, Sigrid & Couture, Raoul-Marie, 2016. "Climate change, cyanobacteria blooms and ecological status of lakes: A Bayesian network approach," Ecological Modelling, Elsevier, vol. 337(C), pages 330-347.
    10. Silvia de Juan & Maria Dulce Subida & Andres Ospina-Alvarez & Ainara Aguilar & Miriam Fernandez, 2020. "Disentangling the socio-ecological drivers behind illegal fishing in a small-scale fishery managed by a TURF system," Papers 2012.08970, arXiv.org.
    11. Meineri, Eric & Dahlberg, C. Johan & Hylander, Kristoffer, 2015. "Using Gaussian Bayesian Networks to disentangle direct and indirect associations between landscape physiography, environmental variables and species distribution," Ecological Modelling, Elsevier, vol. 313(C), pages 127-136.
    12. Michail Tsagris, 2021. "A New Scalable Bayesian Network Learning Algorithm with Applications to Economics," Computational Economics, Springer;Society for Computational Economics, vol. 57(1), pages 341-367, January.
    13. Lotte Yanore & Jaap Sok & Alfons Oude Lansink, 2024. "Do Dutch farmers invest in expansion despite increased policy uncertainty? A participatory Bayesian network approach," Agribusiness, John Wiley & Sons, Ltd., vol. 40(1), pages 93-115, January.
    14. Michael J. Brusco & Douglas Steinley & Ashley L. Watts, 2022. "Disentangling relationships in symptom networks using matrix permutation methods," Psychometrika, Springer;The Psychometric Society, vol. 87(1), pages 133-155, March.
    15. Amaro, George & Fidelis, Elisangela Gomes & da Silva, Ricardo Siqueira & Marchioro, Cesar Augusto, 2023. "Effect of study area extent on the potential distribution of Species: A case study with models for Raoiella indica Hirst (Acari: Tenuipalpidae)," Ecological Modelling, Elsevier, vol. 483(C).
    16. Leonel Lara-Estrada & Livia Rasche & L. Enrique Sucar & Uwe A. Schneider, 2018. "Inferring Missing Climate Data for Agricultural Planning Using Bayesian Networks," Land, MDPI, vol. 7(1), pages 1-13, January.
    17. Sangsung Park & Sunghae Jun, 2020. "Patent Keyword Analysis of Disaster Artificial Intelligence Using Bayesian Network Modeling and Factor Analysis," Sustainability, MDPI, vol. 12(2), pages 1-11, January.
    18. Federica Cugnata & Silvia Salini & Elena Siletti, 2021. "Deepening Well-Being Evaluation with Different Data Sources: A Bayesian Networks Approach," IJERPH, MDPI, vol. 18(15), pages 1-10, July.
    19. Valentina Lucia Astrid Laface & Carmelo Maria Musarella & Gianmarco Tavilla & Agostino Sorgonà & Ana Cano-Ortiz & Ricardo Quinto Canas & Giovanni Spampinato, 2023. "Current and Potential Future Distribution of Endemic Salvia ceratophylloides Ard. (Lamiaceae)," Land, MDPI, vol. 12(1), pages 1-21, January.
    20. O'Brien, G. C. & Dickens, Chris & Hines, E. & Wepener, V. & Stassen, R. & Landis, W. G., 2017. "A regional scale ecological risk framework for environmental flow evaluations," Papers published in Journals (Open Access), International Water Management Institute, pages 22(2):957-9.

    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.