Printed from https://ideas.repec.org/p/arx/papers/2412.02065.html

Leveraging Large Language Models to Democratize Access to Costly Financial Datasets for Academic Research

Author

Listed:
  • Julian Junyan Wang
  • Victor Xiaoqi Wang

Abstract

Unequal access to costly datasets essential for empirical research has long hindered researchers from disadvantaged institutions, limiting their ability to contribute to their fields and advance their careers. Recent breakthroughs in Large Language Models (LLMs) have the potential to democratize data access by automating data collection from unstructured sources. We develop and evaluate a novel methodology using GPT-4o-mini within a Retrieval-Augmented Generation (RAG) framework to collect data from corporate disclosures. Our approach achieves human-level accuracy in collecting CEO pay ratios from approximately 10,000 proxy statements and Critical Audit Matters (CAMs) from more than 12,000 10-K filings, with LLM processing times of 9 and 40 minutes respectively, each at a cost under $10. This stands in stark contrast to the hundreds of hours needed for manual collection or the thousands of dollars required for commercial database subscriptions. To foster a more inclusive research community by empowering researchers with limited resources to explore new avenues of inquiry, we share our methodology and the resulting datasets.
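The retrieval step of a RAG pipeline like the one the abstract describes might be sketched as follows. This is a minimal illustration, not the authors' implementation: the keyword-overlap scoring is a simple stand-in for whatever retrieval method the paper uses, the function names (`split_into_chunks`, `retrieve`, `build_prompt`) and the sample filing text are hypothetical, and the call to the language model is deliberately left out.

```python
import re
from collections import Counter

def split_into_chunks(text, max_words=50):
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def score_chunk(chunk, query_terms):
    """Count occurrences of the query terms in the chunk (case-insensitive)."""
    tokens = Counter(re.findall(r"[a-z0-9]+", chunk.lower()))
    return sum(tokens[t] for t in query_terms)

def retrieve(text, query, k=2, max_words=50):
    """Return the k chunks with the highest keyword-overlap score.

    Assumption: keyword overlap stands in for the real retriever here.
    """
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    chunks = split_into_chunks(text, max_words)
    ranked = sorted(chunks, key=lambda c: score_chunk(c, query_terms), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Assemble the retrieved passages and the question into one prompt string."""
    context = "\n\n".join(f"[Passage {i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the passages below.\n\n{context}\n\nQuestion: {question}"

# Hypothetical filing text: filler around one relevant sentence.
filing = (
    "filler " * 60
    + "The ratio of CEO pay to median employee pay was 150 to 1. "
    + "other " * 60
)
top = retrieve(filing, "CEO pay ratio", k=1)
prompt = build_prompt("What is the CEO pay ratio?", top)
```

Only the retrieved passages, rather than the full filing, would then be sent to the model. Sending less text per filing is one plausible reason per-document cost and processing time can stay low.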

Suggested Citation

  • Julian Junyan Wang & Victor Xiaoqi Wang, 2024. "Leveraging Large Language Models to Democratize Access to Costly Financial Datasets for Academic Research," Papers 2412.02065, arXiv.org.
  • Handle: RePEc:arx:papers:2412.02065

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2412.02065
    File Function: Latest version
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Rui Dai & Lawrence Donohue & Qingyi (Freda) Drechsler & Wei Jiang, 2023. "Dissemination, Publication, and Impact of Finance Research: When Novelty Meets Conventionality," Review of Finance, European Finance Association, vol. 27(1), pages 79-141.
    2. Dong, Mengming Michael & Stratopoulos, Theophanis C. & Wang, Victor Xiaoqi, 2024. "A scoping review of ChatGPT research in accounting and finance," International Journal of Accounting Information Systems, Elsevier, vol. 55(C).
    3. Everett, Jeff & Shiraz Rahaman, Abu & Neu, Dean & Saxton, Gregory, 2024. "Letters to the editor, institutional experimentation, and the public accounting professional," CRITICAL PERSPECTIVES ON ACCOUNTING, Elsevier, vol. 99(C).
    4. Allen H. Huang & Jianghua Shen & Amy Y. Zang, 2022. "The unintended benefit of the risk factor mandate of 2005," Review of Accounting Studies, Springer, vol. 27(4), pages 1319-1355, December.
    5. Sara Mota Cardoso & Aurora A. C. Teixeira, 2020. "The Focus on Poverty in the Most Influential Journals in Economics: A Bibliometric Analysis of the “Blue Ribbon” Journals," Poverty & Public Policy, John Wiley & Sons, vol. 12(1), pages 10-42, March.
    6. Matthias Aistleitner & Jakob Kapeller & Stefan Steinerberger, 2018. "Citation Patterns in Economics and Beyond," Working Papers Series 85, Institute for New Economic Thinking.
    7. James P. Ryans, 2021. "Textual classification of SEC comment letters," Review of Accounting Studies, Springer, vol. 26(1), pages 37-80, March.
    8. Syed Hasan & Robert Breunig, 2021. "Article length and citation outcomes," Scientometrics, Springer;Akadémiai Kiadó, vol. 126(9), pages 7583-7608, September.
    9. Alexander L. Brown & Taisuke Imai & Ferdinand M. Vieider & Colin F. Camerer, 2024. "Meta-analysis of Empirical Estimates of Loss Aversion," Journal of Economic Literature, American Economic Association, vol. 62(2), pages 485-516, June.
    10. Meyer, Matthias & Waldkirch, Rüdiger W. & Duscher, Irina & Just, Alexander, 2018. "Drivers of citations: An analysis of publications in “top” accounting journals," CRITICAL PERSPECTIVES ON ACCOUNTING, Elsevier, vol. 51(C), pages 24-46.
    11. Claude Diebolt & Michael Haupert, 2021. "The Role of Cliometrics in History and Economics," Working Papers of BETA 2021-26, Bureau d'Economie Théorique et Appliquée, UDS, Strasbourg.
    12. Berkin, Anil & Aerts, Walter & Van Caneghem, Tom, 2023. "Feasibility analysis of machine learning for performance-related attributional statements," International Journal of Accounting Information Systems, Elsevier, vol. 48(C).
    13. Erich Battistin & Marco Ovidi, 2022. "Rising Stars: Expert Reviews and Reputational Yardsticks in the Research Excellence Framework," Economica, London School of Economics and Political Science, vol. 89(356), pages 830-848, October.
    14. Roberto Di Pietra & Stefano Zambon, 2022. "Book Review. Lorenzo Simoni, Business Models and Corporate Reporting: Defining the Platform to Illustrate Value Creation, Routledge, 2022 by Sam Rawsthorne," FINANCIAL REPORTING, FrancoAngeli Editore, vol. 2022(1), pages 167-172.
    15. Carlo D'Ippoliti, 2021. "“Many‐Citedness”: Citations Measure More Than Just Scientific Quality," Journal of Economic Surveys, Wiley Blackwell, vol. 35(5), pages 1271-1301, December.
    16. Martina Cioni & Giovanni Federico & Michelangelo Vasta, 2022. "Persistence studies: a new kind of economic history?," Review of Regional Research: Jahrbuch für Regionalwissenschaft, Springer;Gesellschaft für Regionalforschung (GfR), vol. 42(3), pages 227-248, December.
    17. Damien Besancenot & Abdelghani Maddi, 2019. "Should citations be weighted to assess the influence of an academic article?," Economics Bulletin, AccessEcon, vol. 435(1), pages 435-445.
    18. Merigó, José M. & Gil-Lafuente, Anna M. & Kydland, Finn & Amiguet, Lluis & Vivoda, Vlado & Campbell, Gary & Lei, Yalin & Fleming-Muñoz, David, 2024. "50 years of Resources Policy: A bibliometric analysis," Resources Policy, Elsevier, vol. 96(C).
    19. Martina Cioni & Giovanni Federico & Michelangelo Vasta, 2021. "The State of the Art of Economic History: The Uneasy Relation with Economics," Working Papers 20210067, New York University Abu Dhabi, Department of Social Science, revised Jun 2021.
    20. Jelnov, Pavel & Weiss, Yoram, 2022. "Influence in economics and aging," Labour Economics, Elsevier, vol. 77(C).
