Printed from https://ideas.repec.org/a/aza/airwa0/y2024v3i2p142-150.html

Information retrieval from textual data: Harnessing large language models, retrieval augmented generation and prompt engineering

Author

Listed:
  • Hikov, Asen

    (Data Scientist, Amplify Analytix, Bulgaria)

  • Murphy, Laura

    (Amplify Analytix BV, The Netherlands)

Abstract

This paper describes how recent advancements in the field of Generative AI (GenAI), and more specifically large language models (LLMs), are incorporated into a practical application that solves the widespread and relevant business problem of information retrieval from textual data in PDF format: searching through legal texts, financial reports, research articles and so on. Marketing research, for example, often requires reading through hundreds of pages of financial reports to extract key information for research on competitors, partners, markets and prospective clients. It is a manual, error-prone and time-consuming task for marketers, for which until recently there was little scope for automation, optimisation and scaling. The application we have developed combines LLMs with a retrieval augmented generation (RAG) architecture and prompt engineering to make this process more efficient. It is a chatbot that allows the user to upload multiple PDF documents, obtain a summary of predefined key areas, ask specific questions and get answers drawn from the combined content of the documents. The application's architecture begins with the creation of an index for each of the PDF files. Building this index involves embedding the textual content and constructing a vector store. A query engine, employing a small-to-big retrieval method, is then used to accurately respond to a set of predefined questions for each PDF to create the summary. The prompt has been designed in a manner that minimises the risk of hallucination, which is common in this type of model. The user interacts with the model via a chatbot feature. The chatbot uses similar small-to-big retrieval techniques over the indices for straightforward queries, and a more complex sub-questions engine for in-depth analysis, providing a comprehensive and interactive tool for document analysis.
Based on discussions with potential users, we estimate that implementing this tool would reduce the time spent on manual research tasks by around 60 per cent.
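
The retrieval steps named in the abstract (a per-document index, small-to-big retrieval and a prompt that discourages hallucination) can be illustrated with a minimal Python sketch. It is not the authors' implementation, which is not available on this page. Several things in it are assumptions made for illustration: all function and class names are hypothetical, a toy bag-of-words vector stands in for a real embedding model and vector store, the two sample passages are invented, and the prompt wording is an example. The sub-questions engine is not shown.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class SmallChunk:
    text: str        # sentence-sized unit that is embedded and searched
    parent_id: int   # position of the larger passage it was cut from
    vector: Counter  # toy bag-of-words "embedding" (assumption, not a real model)


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lower-cased word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_index(parents: List[str]) -> List[SmallChunk]:
    """Split each parent passage into sentences and embed every sentence."""
    index = []
    for parent_id, parent in enumerate(parents):
        for sentence in re.split(r"(?<=[.!?])\s+", parent.strip()):
            if sentence:
                index.append(SmallChunk(sentence, parent_id, embed(sentence)))
    return index


def retrieve(query: str, index: List[SmallChunk], parents: List[str],
             top_k: int = 1) -> List[str]:
    """Small-to-big: rank the small chunks, then return their parent passages."""
    q = embed(query)
    ranked = sorted(index, key=lambda c: cosine(q, c.vector), reverse=True)
    seen, results = set(), []
    for chunk in ranked:
        if chunk.parent_id not in seen:  # each parent is returned at most once
            seen.add(chunk.parent_id)
            results.append(parents[chunk.parent_id])
        if len(results) == top_k:
            break
    return results


def build_prompt(question: str, contexts: List[str]) -> str:
    """Grounded prompt: tell the model to answer only from the retrieved text."""
    context_block = "\n\n".join(contexts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, "
        "reply 'Not found in the documents.'\n\n"
        f"Context:\n{context_block}\n\nQuestion: {question}\nAnswer:"
    )


# Invented sample passages standing in for text extracted from PDF files.
parents = [
    "Revenue grew in the retail segment. The company opened new stores in two regions.",
    "The board approved a share buyback. Dividend policy remains unchanged.",
]
index = build_index(parents)
question = "What happened to the dividend policy?"
prompt = build_prompt(question, retrieve(question, index, parents))
print(prompt)
```

Matching is done on sentence-sized chunks because short units tend to match a query more precisely, while the prompt receives the whole parent passage so the model sees the surrounding context. The explicit fallback answer in the prompt is one common way to discourage a model from inventing content that is absent from the documents.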

Suggested Citation

  • Hikov, Asen & Murphy, Laura, 2024. "Information retrieval from textual data: Harnessing large language models, retrieval augmented generation and prompt engineering," Journal of AI, Robotics & Workplace Automation, Henry Stewart Publications, vol. 3(2), pages 142-150, March.
  • Handle: RePEc:aza:airwa0:y:2024:v:3:i:2:p:142-150

    Download full text from publisher

    File URL: https://hstalks.com/article/8575/download/
    Download Restriction: Requires a paid subscription for full access.

    File URL: https://hstalks.com/article/8575/
    Download Restriction: Requires a paid subscription for full access.

    More about this item

    Keywords

    RAG architecture; LLM; PDF parsing; query engine

    JEL classification:

    • M15 - Business Administration and Business Economics; Marketing; Accounting; Personnel Economics - - Business Administration - - - IT Management
    • G2 - Financial Economics - - Financial Institutions and Services
