Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs

Author

Listed:
  • Alexander Bakumenko

    (Clemson University, USA)

  • Kateřina Hlaváčková-Schindler

    (University of Vienna, Austria)

  • Claudia Plant

    (University of Vienna, Austria)

  • Nina C. Hubig

    (Clemson University, USA)

Abstract

Detecting anomalies in general ledger data is essential to ensuring the trustworthiness of financial records. Financial audits increasingly rely on machine learning (ML) algorithms to identify irregular or potentially fraudulent journal entries, each of which contains a varying number of transactions. In machine learning, such heterogeneity in feature dimensions adds significant complexity to data analysis. In this paper, we introduce a novel approach to anomaly detection in financial data using Large Language Model (LLM) embeddings. To encode non-semantic categorical data from real-world financial records, we tested three pre-trained general-purpose sentence-transformer models. For the downstream classification task, we implemented and evaluated five optimized ML models: Logistic Regression, Random Forest, Gradient Boosting Machines, Support Vector Machines, and Neural Networks. Our experiments demonstrate that LLMs contribute valuable information to anomaly detection: our models outperform the baselines, in selected settings by a large margin. The findings further underscore the effectiveness of LLMs in enhancing anomaly detection in financial journal entries, particularly by tackling feature sparsity. We discuss a promising perspective on using LLM embeddings for non-semantic data in the financial context and beyond.
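
The encoding idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the account codes, the debit/credit flags, the serialization format, and the `toy_embed` function are all assumptions made for illustration. `toy_embed` is a hashed character-trigram stand-in used only so the example runs without downloading a model; in practice a pre-trained sentence-transformer would produce the vector. The point shown is that journal entries with different numbers of line items map to vectors of the same length, which a downstream classifier can consume directly.

```python
import hashlib
import math

DIM = 16  # hypothetical embedding size; real sentence-transformer outputs are much larger


def serialize_entry(lines):
    """Join a journal entry's (account, side) line items into one string.

    Sorting gives a canonical order, so the same set of line items
    always yields the same text. The format itself is an assumption.
    """
    return " ".join(f"{account}:{side}" for account, side in sorted(lines))


def toy_embed(text, dim=DIM):
    """Stand-in for a sentence-transformer (NOT a real language model).

    Counts hashed character trigrams into `dim` buckets and
    L2-normalises the result, giving a fixed-length vector.
    """
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        digest = hashlib.md5(text[i:i + 3].encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


# Two hypothetical journal entries with different numbers of line items.
short_entry = [("1000", "D"), ("4000", "C")]
long_entry = [("1000", "D"), ("2200", "D"), ("4000", "C"), ("4010", "C"), ("2300", "C")]

short_vec = toy_embed(serialize_entry(short_entry))
long_vec = toy_embed(serialize_entry(long_entry))

# Both vectors have the same length regardless of entry size.
print(len(short_vec), len(long_vec))
```

Replacing `toy_embed` with a real encoder and passing the resulting vectors to any of the classifiers named in the abstract would complete the pipeline; which model and classifier work best is an empirical question that this sketch does not address.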

Suggested Citation

  • Alexander Bakumenko & Kateřina Hlaváčková-Schindler & Claudia Plant & Nina C. Hubig, 2024. "Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs," Papers 2406.03614, arXiv.org.
  • Handle: RePEc:arx:papers:2406.03614

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2406.03614
    File Function: Latest version
    Download Restriction: no