Avoiding Overfitting in Variable-Order Markov Models: a Cross-Validation Approach

Author

Listed:
  • Valeria Secchini
  • Javier Garcia-Bernardo
  • Petr Janský

Abstract

Higher-order Markov chain models are widely used to represent agent transitions in dynamic systems, such as passengers in transport networks. They capture transitions in complex systems by considering not only the current state but also the path of previously visited states. For example, the likelihood of train passengers traveling from Paris (current state) to Rome could increase significantly if their journey originated in Italy (prior state). Although this approach provides a more faithful representation of the system than first-order models, we find that commonly used methods, which rely on Kullback-Leibler divergence, frequently overfit the data, mistaking fluctuations for higher-order dependencies and undermining forecasts and resource allocation. Here, we introduce DIVOP (Detection of Informative Variable-Order Paths), an algorithm that employs cross-validation to robustly distinguish meaningful higher-order dependencies from noise. In both synthetic and real-world datasets, DIVOP outperforms two state-of-the-art algorithms by achieving higher precision and recall and sparser representations of the underlying dynamics. When applied to global corporate ownership data, DIVOP reveals that tax havens appear in 82% of all significant higher-order dependencies, underscoring their outsized influence in corporate networks. By mitigating overfitting, DIVOP enables more reliable multi-step predictions and decision-making, paving the way toward deeper insights into the hidden structures that drive modern interconnected systems.
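
The general idea in the abstract, testing a candidate higher-order dependency on held-out data rather than on the data used to fit it, can be illustrated with a minimal Python sketch. This is not the DIVOP algorithm, whose details are not given on this page. The function names (`transition_counts`, `log_prob`, `cv_gain`), the additive smoothing constant `alpha`, the fold assignment, and the toy city paths are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def transition_counts(paths, order):
    """Count next-state occurrences for every context of length `order`."""
    counts = defaultdict(Counter)
    for path in paths:
        for i in range(order, len(path)):
            counts[tuple(path[i - order:i])][path[i]] += 1
    return counts


def log_prob(counts, context, nxt, n_states, alpha=1.0):
    """Additively smoothed log P(nxt | context); alpha is an assumed prior."""
    c = counts.get(context, Counter())
    return math.log((c[nxt] + alpha) / (sum(c.values()) + alpha * n_states))


def cv_gain(paths, context, k=5):
    """Mean held-out log-likelihood gain of the second-order context
    (prev, cur) over the first-order context (cur,), using k folds."""
    n_states = len({s for p in paths for s in p})
    gain, n = 0.0, 0
    for fold in range(k):
        train = [p for i, p in enumerate(paths) if i % k != fold]
        test = [p for i, p in enumerate(paths) if i % k == fold]
        c1 = transition_counts(train, 1)
        c2 = transition_counts(train, 2)
        for path in test:
            for i in range(2, len(path)):
                if tuple(path[i - 2:i]) != context:
                    continue
                nxt = path[i]
                gain += log_prob(c2, context, nxt, n_states)
                gain -= log_prob(c1, context[-1:], nxt, n_states)
                n += 1
    return gain / n if n else 0.0


# Toy data (hypothetical): the origin determines the destination after Paris.
memory_paths = [("Milan", "Paris", "Rome")] * 20 + [("Berlin", "Paris", "London")] * 20
# Toy data (hypothetical): the destination after Paris ignores the origin.
noise_paths = [
    ("Milan", "Paris", "Rome"), ("Milan", "Paris", "London"),
    ("Berlin", "Paris", "Rome"), ("Berlin", "Paris", "London"),
] * 10

gain_memory = cv_gain(memory_paths, ("Milan", "Paris"))
gain_noise = cv_gain(noise_paths, ("Milan", "Paris"))
print(gain_memory, gain_noise)
```

In this sketch a positive gain suggests that the longer context improves held-out prediction, while a gain at or below zero suggests that the extra memory does not help and may only be fitting fluctuations.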

Suggested Citation

  • Valeria Secchini & Javier Garcia-Bernardo & Petr Janský, 2025. "Avoiding Overfitting in Variable-Order Markov Models: a Cross-Validation Approach," Papers 2501.14476, arXiv.org.
  • Handle: RePEc:arx:papers:2501.14476

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2501.14476
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Guttorm Schjelderup, 2016. "Secrecy jurisdictions," International Tax and Public Finance, Springer;International Institute of Public Finance, vol. 23(1), pages 168-189, February.
    2. Tiago P. Peixoto & Martin Rosvall, 2017. "Modelling sequences and temporal networks with dynamic community structures," Nature Communications, Nature, vol. 8(1), pages 1-12, December.
    3. Martin Rosvall & Alcides V. Esquivel & Andrea Lancichinetti & Jevin D. West & Renaud Lambiotte, 2014. "Memory in network flows and its effects on spreading dynamics and community detection," Nature Communications, Nature, vol. 5(1), pages 1-13, December.
    4. Ingo Scholtes & Nicolas Wider & René Pfitzner & Antonios Garas & Claudio J. Tessone & Frank Schweitzer, 2014. "Causality-driven slow-down and speed-up of diffusion in non-Markovian temporal networks," Nature Communications, Nature, vol. 5(1), pages 1-9, December.
    5. Väinö Jääskinen & Jie Xiong & Jukka Corander & Timo Koski, 2014. "Sparse Markov Chains for Sequence Data," Scandinavian Journal of Statistics, Danish Society for Theoretical Statistics;Finnish Statistical Society;Norwegian Statistical Association;Swedish Statistical Association, vol. 41(3), pages 639-655, September.
    6. Armin Shmilovici & Irad Ben-Gal, 2007. "Using a VOM model for reconstructing potential coding regions in EST sequences," Computational Statistics, Springer, vol. 22(1), pages 49-69, April.
    7. Javier Garcia-Bernardo & Jan Fichtner & Eelke M. Heemskerk & Frank W. Takes, 2017. "Uncovering Offshore Financial Centers: Conduits and Sinks in the Global Corporate Ownership Network," Papers 1703.03016, arXiv.org, revised May 2017.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Luca Gallo & Lucas Lacasa & Vito Latora & Federico Battiston, 2024. "Higher-order correlations reveal complex memory in temporal hypergraphs," Nature Communications, Nature, vol. 15(1), pages 1-7, December.
    2. Xie, Fengjie & Ma, Mengdi & Ren, Cuiping, 2022. "Research on multilayer network structure characteristics from a higher-order model: The case of a Chinese high-speed railway system," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 586(C).
    3. Andrew Mellor, 2019. "Event Graphs: Advances And Applications Of Second-Order Time-Unfolded Temporal Network Models," Advances in Complex Systems (ACS), World Scientific Publishing Co. Pte. Ltd., vol. 22(03), pages 1-26, May.
    4. Damgaard, Jannick & Elkjaer, Thomas & Johannesen, Niels, 2024. "What is real and what is not in the global FDI network?," Journal of International Money and Finance, Elsevier, vol. 140(C).
    5. Franch, Fabio & Nocciola, Luca & Vouldis, Angelos, 2024. "Temporal networks and financial contagion," Journal of Financial Stability, Elsevier, vol. 71(C).
    6. Ivar Kolstad, 2017. "Protected tax havens: Cornering the market through international reform?," CMI Working Papers 7, CMI (Chr. Michelsen Institute), Bergen, Norway.
    7. Funel, Agostino, 2022. "A method to compute the communicability of nodes through causal paths in temporal networks," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 593(C).
    8. Xiang Li & Chengli Zhao & Zhaolong Hu & Caixia Yu & Xiaojun Duan, 2022. "Revealing the character of journals in higher-order citation networks," Scientometrics, Springer;Akadémiai Kiadó, vol. 127(11), pages 6315-6338, November.
    9. Menkhoff, Lukas & Miethe, Jakob, 2019. "Tax evasion in new disguise? Examining tax havens' international bank deposits," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 176, pages 53-78.
    10. Chao Min & Qingyu Chen & Erjia Yan & Yi Bu & Jianjun Sun, 2021. "Citation cascade and the evolution of topic relevance," Journal of the Association for Information Science & Technology, Association for Information Science & Technology, vol. 72(1), pages 110-127, January.
    11. I. P. Gurova, 2020. "Offshore Investment in the Russian Economy," Studies on Russian Economic Development, Springer, vol. 31(4), pages 449-456, July.
    12. Panayotis Christidis & Álvaro Gomez Losada, 2019. "Email Based Institutional Network Analysis: Applications and Risks," Social Sciences, MDPI, vol. 8(11), pages 1-14, November.
    13. Franz Reiter & Dominika Langenmayr & Svea Holtmann, 2021. "Avoiding taxes: banks’ use of internal debt," International Tax and Public Finance, Springer;International Institute of Public Finance, vol. 28(3), pages 717-745, June.
    14. Alex C. Michalos & P. Maurine Hatch, 2020. "Good Societies, Financial Inequality and Secrecy, and a Good Life: from Aristotle to Piketty," Applied Research in Quality of Life, Springer;International Society for Quality-of-Life Studies, vol. 15(4), pages 1005-1054, September.
    15. Sébastien Laffitte & Farid Toubal, 2018. "Firms, Trade and Profit Shifting: Evidence from Aggregate Data," CESifo Working Paper Series 7171, CESifo.
    16. Rabbani, Fereshteh & Khraisha, Tamer & Abbasi, Fatemeh & Jafari, Gholam Reza, 2021. "Memory effects on link formation in temporal networks: A fractional calculus approach," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 564(C).
    17. Chakraborty, Abhijit & Krichene, Hazem & Inoue, Hiroyasu & Fujiwara, Yoshi, 2019. "Characterization of the community structure in a large-scale production network in Japan," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 513(C), pages 210-221.
    18. Marco Bardoscia & Fabio Caccioli & Juan Ignacio Perotti & Gianna Vivaldo & Guido Caldarelli, 2016. "Distress Propagation in Complex Networks: The Case of Non-Linear DebtRank," PLOS ONE, Public Library of Science, vol. 11(10), pages 1-12, October.
    19. Pamela Pogliani & Goetz von Peter & Philip Wooldridge, 2022. "The outsize role of cross-border financial centres," BIS Quarterly Review, Bank for International Settlements, June.
    20. Gong, Chang & Li, Jichao & Qian, Liwei & Li, Siwei & Yang, Zhiwei & Yang, Kewei, 2024. "HMSL: Source localization based on higher-order Markov propagation," Chaos, Solitons & Fractals, Elsevier, vol. 182(C).
