
Putting AI agents through their paces on general tasks

Authors

  • Fernando Perez-Cruz
  • Hyun Song Shin

Abstract

Multimodal large language models (LLMs), trained on vast datasets, are becoming increasingly capable in many settings. However, the capabilities of such models are typically evaluated in narrow tasks, much like standard machine learning models trained for specific objectives. We take a different tack by putting the latest LLM agents through their paces in general tasks involved in solving three popular games: Wordle, Face Quiz and Flashback. These games are easily tackled by humans, but they demand a degree of self-awareness and higher-level abilities to experiment, to learn from mistakes and to plan accordingly. We find that the LLM agents display mixed performance in these general tasks. They lack the awareness to learn from mistakes and the capacity for self-correction. LLMs' performance in the most complex cognitive subtasks may not be the limiting factor for their deployment in real-world environments. Instead, it would be important to evaluate the capabilities of AGI-aspiring LLMs through general tests that encompass multiple cognitive tasks, enabling them to solve complete, real-world applications.

Suggested Citation

  • Fernando Perez-Cruz & Hyun Song Shin, 2025. "Putting AI agents through their paces on general tasks," BIS Working Papers 1245, Bank for International Settlements.
  • Handle: RePEc:bis:biswps:1245

    Download full text from publisher

    File URL: https://www.bis.org/publ/work1245.pdf
    File Function: Full PDF document
    Download Restriction: no

    File URL: https://www.bis.org/publ/work1245.htm
    Download Restriction: no

    More about this item

    Keywords

    AI Agents; LLMs evaluation

    JEL classification:

    • C88: Mathematical and Quantitative Methods > Data Collection and Data Estimation Methodology; Computer Programs > Other Computer Software

