Class-Imbalanced-Aware Adaptive Dataset Distillation for Scalable Pretrained Model on Credit Scoring

Author

Listed:
  • Xia Li
  • Hanghang Zheng
  • Xiao Chen
  • Hong Liu
  • Mao Mao

Abstract

The advent of artificial intelligence has significantly enhanced credit scoring technologies. Despite the remarkable efficacy of advanced deep learning models, mainstream adoption continues to favor tree-structured models due to their robust predictive performance on tabular data. Although pretrained models have seen considerable development, their application within the financial realm predominantly revolves around question-answering tasks, and the use of such models for tabular-structured credit scoring datasets remains largely unexplored. Tabular-oriented large models such as TabPFN have made the application of large models in credit scoring feasible, although they can only process limited sample sizes. This paper proposes a novel framework that combines a tabular-tailored dataset distillation technique with the pretrained model, enabling TabPFN to scale to larger datasets. Furthermore, although class imbalance is common in financial datasets, its influence during dataset distillation has not been explored. We therefore integrate imbalance-aware techniques into dataset distillation, which improves performance on financial datasets (e.g., a 2.5% enhancement in AUC). This study presents a novel framework for scaling up the application of large pretrained models on financial tabular datasets and offers a comparative analysis of the influence of class imbalance on the dataset distillation process. We believe this approach can broaden the applications and downstream tasks of large models in the financial domain.
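
To make the idea of imbalance-aware condensation concrete, the following is a minimal sketch only, not the authors' method, which the abstract does not spell out. It assumes a simple stand-in technique: per-class k-means, where every class is summarised by the same number of centroids so that the minority class keeps equal weight in the small set. The function names (`kmeans_centroids`, `condense_balanced`), the toy data, and all parameter values are hypothetical.

```python
import numpy as np

def kmeans_centroids(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns k centroids (k is capped at len(X))."""
    rng = np.random.default_rng(seed)
    k = min(k, len(X))
    # Start from k distinct rows of X.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign every row to its nearest centroid (squared Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def condense_balanced(X, y, per_class, seed=0):
    """Summarise each class with the same number of centroids, so the
    minority class is not crowded out of the small condensed set."""
    parts_X, parts_y = [], []
    for cls in np.unique(y):
        c = kmeans_centroids(X[y == cls], per_class, seed=seed)
        parts_X.append(c)
        parts_y.append(np.full(len(c), cls))
    return np.vstack(parts_X), np.concatenate(parts_y)

# Synthetic, imbalanced toy data (hypothetical; not from the paper).
rng = np.random.default_rng(42)
X_major = rng.normal(0.0, 1.0, size=(950, 4))   # 95% majority class
X_minor = rng.normal(2.0, 1.0, size=(50, 4))    # 5% minority class
X = np.vstack([X_major, X_minor])
y = np.concatenate([np.zeros(950, dtype=int), np.ones(50, dtype=int)])

X_small, y_small = condense_balanced(X, y, per_class=20)
print(X_small.shape, np.bincount(y_small))  # (40, 4) [20 20]
```

Under these assumptions, the 1,000-row dataset with a 95:5 class ratio is reduced to 40 synthetic rows with a 1:1 ratio. A set of that size could then serve as the training input for a model with a sample-size limit, which is the role the abstract assigns to distillation in front of TabPFN.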

Suggested Citation

  • Xia Li & Hanghang Zheng & Xiao Chen & Hong Liu & Mao Mao, 2025. "Class-Imbalanced-Aware Adaptive Dataset Distillation for Scalable Pretrained Model on Credit Scoring," Papers 2501.10677, arXiv.org, revised Jan 2025.
  • Handle: RePEc:arx:papers:2501.10677

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2501.10677
    File Function: Latest version
    Download Restriction: no

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Medina-Olivares, Victor & Calabrese, Raffaella & Dong, Yizhe & Shi, Baofeng, 2022. "Spatial dependence in microfinance credit default," International Journal of Forecasting, Elsevier, vol. 38(3), pages 1071-1085.
    2. Alexandre, Michel & Antônio Silva Brito, Giovani & Cotrim Martins, Theo, 2017. "Default contagion among credit modalities: evidence from Brazilian data," MPRA Paper 76859, University Library of Munich, Germany.
    3. Ching-Nam Hang & Pei-Duo Yu & Roberto Morabito & Chee-Wei Tan, 2024. "Large Language Models Meet Next-Generation Networking Technologies: A Review," Future Internet, MDPI, vol. 16(10), pages 1-29, October.
    4. Serrano-Cinca, Carlos & Gutiérrez-Nieto, Begoña & Bernate-Valbuena, Martha, 2019. "The use of accounting anomalies indicators to predict business failure," European Management Journal, Elsevier, vol. 37(3), pages 353-375.
    5. Boris Ter-Avanesov & Homayoon Beigi, 2024. "MLP, XGBoost, KAN, TDNN, and LSTM-GRU Hybrid RNN with Attention for SPX and NDX European Call Option Pricing," Papers 2409.06724, arXiv.org, revised Oct 2024.
    6. Lezhi Li & Ting-Yu Chang & Hai Wang, 2023. "Multimodal Gen-AI for Fundamental Investment Research," Papers 2401.06164, arXiv.org.
    7. Ismail Tijjani Idris & Sabri Nayan, 2016. "The Moderating Role of Loan Monitoring on the Relationship between Macroeconomic Variables and Non-performing Loans in Association of Southeast Asian Nations Countries," International Journal of Economics and Financial Issues, Econjournals, vol. 6(2), pages 402-408.
    8. Rasa Kanapickiene & Renatas Spicas, 2019. "Credit Risk Assessment Model for Small and Micro-Enterprises: The Case of Lithuania," Risks, MDPI, vol. 7(2), pages 1-23, June.
    9. Dirick, Lore & Claeskens, Gerda & Vasnev, Andrey & Baesens, Bart, 2022. "A hierarchical mixture cure model with unobserved heterogeneity for credit risk," Econometrics and Statistics, Elsevier, vol. 22(C), pages 39-55.
    10. Dirick, Lore & Claeskens, Gerda & Baesens, Bart, 2015. "An Akaike information criterion for multiple event mixture cure models," European Journal of Operational Research, Elsevier, vol. 241(2), pages 449-457.
    11. Thanos Konstantinidis & Giorgos Iacovides & Mingxue Xu & Tony G. Constantinides & Danilo Mandic, 2024. "FinLlama: Financial Sentiment Classification for Algorithmic Trading Applications," Papers 2403.12285, arXiv.org.
    12. Goldmann, Leonie & Crook, Jonathan & Calabrese, Raffaella, 2024. "A new ordinal mixed-data sampling model with an application to corporate credit rating levels," European Journal of Operational Research, Elsevier, vol. 314(3), pages 1111-1126.
    13. Casado Yusta, Silvia & Núñez Letamendía, Laura & Pacheco Bonrostro, Joaquín Antonio, 2018. "Predicting Corporate Failure: The GRASP-LOGIT Model || Predicción de la quiebra empresarial: el modelo GRASP-LOGIT," Revista de Métodos Cuantitativos para la Economía y la Empresa = Journal of Quantitative Methods for Economics and Business Administration, Universidad Pablo de Olavide, Department of Quantitative Methods for Economics and Business Administration, vol. 26(1), pages 294-314, Diciembre.
    14. Maldonado, Sebastián & Pérez, Juan & Bravo, Cristián, 2017. "Cost-based feature selection for Support Vector Machines: An application in credit scoring," European Journal of Operational Research, Elsevier, vol. 261(2), pages 656-665.
    15. Jorge Tejero, 2022. "Unwrapping black box models A case study in credit risk," Revista de Estabilidad Financiera, Banco de España, issue Otoño.
    16. Bellotti, Tony & Crook, Jonathan, 2013. "Forecasting and stress testing credit card default using dynamic models," International Journal of Forecasting, Elsevier, vol. 29(4), pages 563-574.
    17. Frank Xing, 2024. "Designing Heterogeneous LLM Agents for Financial Sentiment Analysis," Papers 2401.05799, arXiv.org.
    18. Nadia Ayed & Khemaies Bougatef, 2024. "Performance Assessment of Logistic Regression (LR), Artificial Neural Network (ANN), Fuzzy Inference System (FIS) and Adaptive Neuro-Fuzzy System (ANFIS) in Predicting Default Probability: The Case of," Computational Economics, Springer;Society for Computational Economics, vol. 64(3), pages 1803-1835, September.
    19. Ruize Gao & Shaoze Cui & Yu Wang & Wei Xu, 2025. "Predicting financial distress in high-dimensional imbalanced datasets: a multi-heterogeneous self-paced ensemble learning framework," Financial Innovation, Springer;Southwestern University of Finance and Economics, vol. 11(1), pages 1-34, December.
    20. Hoyoung Lee & Youngsoo Choi & Yuhee Kwon, 2024. "Quantifying Qualitative Insights: Leveraging LLMs to Market Predict," Papers 2411.08404, arXiv.org.
