IDEAS home Printed from https://ideas.repec.org/p/arx/papers/2412.02605.html

Interpretable Company Similarity with Sparse Autoencoders

Author

Listed:
  • Marco Molinari
  • Victor Shao
  • Vladimir Tregubiak
  • Abhimanyu Pandey
  • Mateusz Mikolajczak
  • Sebastian Kuznetsov Ryder Torres Pereira

Abstract

Determining company similarity is a vital task in finance, underpinning hedging, risk management, portfolio diversification, and more. Practitioners often gauge similarity using sector and industry classifications such as SIC codes, which are used by the U.S. Securities and Exchange Commission (SEC), and GICS codes, which are widely used by the investment community. Because these classifications can lack granularity and often need to be updated, clustering embeddings of company descriptions has been proposed as an alternative, but the lack of interpretability in token embeddings poses a significant barrier to adoption in high-stakes contexts. Sparse Autoencoders (SAEs) have shown promise in enhancing the interpretability of Large Language Models (LLMs) by decomposing LLM activations into interpretable features. We apply SAEs to company descriptions, obtaining meaningful clusters of equities in the process. We benchmark SAE features against SIC codes, Major Group codes, and embeddings. Our results demonstrate that SAE features not only replicate but often surpass sector classifications and embeddings in capturing fundamental company characteristics. This is evidenced by their superior performance in correlating monthly returns (a proxy for similarity) and in generating co-integration strategies with higher Sharpe ratios, which points to deeper fundamental similarities among companies.
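
The abstract treats the correlation of monthly returns as a proxy for similarity when comparing groupings of companies. A minimal sketch of one way such a comparison could be scored is given below. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tickers, the return figures, and the two groupings are all hypothetical, and the score used here (mean pairwise Pearson correlation among companies placed in the same group) is one plausible reading of the benchmark rather than the paper's confirmed metric.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length return series.

    Assumes both series have non-zero variance.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def mean_within_group_correlation(returns, groups):
    """Average pairwise return correlation over all same-group company pairs."""
    scores = [
        pearson(returns[a], returns[b])
        for a, b in combinations(sorted(returns), 2)
        if groups[a] == groups[b]
    ]
    return sum(scores) / len(scores) if scores else float("nan")


# Toy monthly returns, made up for illustration (not data from the paper).
returns = {
    "AAA": [0.01, 0.03, -0.02, 0.04],
    "BBB": [0.02, 0.06, -0.04, 0.08],    # 2x AAA, so perfectly correlated
    "CCC": [0.04, -0.02, 0.03, 0.01],
    "DDD": [-0.04, 0.02, -0.03, -0.01],  # -1x CCC, so perfectly anti-correlated
}

# Two hypothetical groupings of the same companies, e.g. one taken from a
# sector classification and one from clusters of learned features.
grouping_a = {"AAA": 0, "BBB": 0, "CCC": 1, "DDD": 1}
grouping_b = {"AAA": 0, "BBB": 0, "CCC": 1, "DDD": 2}

score_a = mean_within_group_correlation(returns, grouping_a)
score_b = mean_within_group_correlation(returns, grouping_b)
print(f"grouping A: {score_a:.3f}, grouping B: {score_b:.3f}")
```

Under this score, a grouping that places together companies whose returns move together ranks higher. In the toy data, grouping B scores higher than grouping A because it does not pair the two anti-correlated companies.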

Suggested Citation

  • Marco Molinari & Victor Shao & Vladimir Tregubiak & Abhimanyu Pandey & Mateusz Mikolajczak & Sebastian Kuznetsov Ryder Torres Pereira, 2024. "Interpretable Company Similarity with Sparse Autoencoders," Papers 2412.02605, arXiv.org, revised Dec 2024.
  • Handle: RePEc:arx:papers:2412.02605

    Download full text from publisher

    File URL: http://arxiv.org/pdf/2412.02605
    File Function: Latest version
    Download Restriction: no

    References listed on IDEAS

    1. Tim Loughran & Bill McDonald & Hayong Yun, 2009. "A Wolf in Sheep’s Clothing: The Use of Ethics-Related Terms in 10-K Reports," Journal of Business Ethics, Springer, vol. 89(1), pages 39-49, May.
    2. G. Bonanno & G. Caldarelli & F. Lillo & S. Micciché & N. Vandewalle & R. Mantegna, 2004. "Networks of equities in financial markets," The European Physical Journal B: Condensed Matter and Complex Systems, Springer;EDP Sciences, vol. 38(2), pages 363-371, March.
    3. Shuangshuang Chen & Wei Guo, 2023. "Auto-Encoders in Deep Learning—A Review with New Perspectives," Mathematics, MDPI, vol. 11(8), pages 1-54, April.
    4. Ole Peters, 2011. "Optimal leverage from non-ergodicity," Quantitative Finance, Taylor & Francis Journals, vol. 11(11), pages 1593-1602.
    5. Mico Loretan & William B English, 2000. "Evaluating changes in correlations during periods of high market volatility," BIS Quarterly Review, Bank for International Settlements, pages 29-36, June.
    6. Dimitrios Vamvourellis & Máté Toth & Snigdha Bhagat & Dhruv Desai & Dhagash Mehta & Stefano Pasquali, 2023. "Company Similarity using Large Language Models," Papers 2308.08031, arXiv.org.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Champagne, Claudia, 2014. "The international syndicated loan market network: An “unholy trinity”?," Global Finance Journal, Elsevier, vol. 25(2), pages 148-168.
    2. Sebastiano Michele Zema & Giorgio Fagiolo & Tiziano Squartini & Diego Garlaschelli, 2021. "Mesoscopic Structure of the Stock Market and Portfolio Optimization," Papers 2112.06544, arXiv.org.
    3. Viktor Stojkoski & Trifce Sandev & Lasko Basnarkov & Ljupco Kocarev & Ralf Metzler, 2020. "Generalised geometric Brownian motion: Theory and applications to option pricing," Papers 2011.00312, arXiv.org.
    4. Liu, Pu & Nguyen, Hazel T., 2020. "CEO characteristics and tone at the top inconsistency," Journal of Economics and Business, Elsevier, vol. 108(C).
    5. Kladakis, George & Chen, Lei & Bellos, Sotirios K., 2023. "Ethical bank disclosures and liquidity creation," Journal of International Financial Markets, Institutions and Money, Elsevier, vol. 84(C).
    6. Paulus, Michal & Kristoufek, Ladislav, 2015. "Worldwide clustering of the corruption perception," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 428(C), pages 351-358.
    7. Peng Yue & Qing Cai & Wanfeng Yan & Wei-Xing Zhou, 2020. "Information flow networks of Chinese stock market sectors," Papers 2004.08759, arXiv.org.
    8. Djauhari, Maman Abdurachman & Gan, Siew Lee, 2015. "Optimality problem of network topology in stocks market analysis," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 419(C), pages 108-114.
    9. Sandoval, Leonidas, 2014. "To lag or not to lag? How to compare indices of stock markets that operate on different times," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 403(C), pages 227-243.
    10. Peralta, Gustavo & Zareei, Abalfazl, 2016. "A network approach to portfolio selection," Journal of Empirical Finance, Elsevier, vol. 38(PA), pages 157-180.
    11. Castagna, Alina & Chentouf, Leila & Ernst, Ekkehard, 2017. "Economic vulnerabilities in Italy: A network analysis using similarities in sectoral employment," GLO Discussion Paper Series 50, Global Labor Organization (GLO).
    12. Artur F. Tomeczek & Tomasz M. Napiórkowski, 2024. "PageRank and Regression as a Two-Step Approach to Analysing a Network of Nasdaq Firms During a Recession: Insights from Minimum Spanning Tree Topology," Gospodarka Narodowa. The Polish Journal of Economics, Warsaw School of Economics, issue 3, pages 56-69.
    13. Yanhua Chen & Rosario N Mantegna & Athanasios A Pantelous & Konstantin M Zuev, 2018. "A dynamic analysis of S&P 500, FTSE 100 and EURO STOXX 50 indices under different exchange rates," PLOS ONE, Public Library of Science, vol. 13(3), pages 1-40, March.
    14. Teh, Boon Kin & Goo, Yik Wen & Lian, Tong Wei & Ong, Wei Guang & Choi, Wen Ting & Damodaran, Mridula & Cheong, Siew Ann, 2015. "The Chinese Correction of February 2007: How financial hierarchies change in a market crash," Physica A: Statistical Mechanics and its Applications, Elsevier, vol. 424(C), pages 225-241.
    15. Jia Xu & Jiuchang Wei & Liangdong Lu, 2019. "Strategic stakeholder management, environmental corporate social responsibility engagement, and financial performance of stigmatized firms derived from Chinese special environmental policy," Business Strategy and the Environment, Wiley Blackwell, vol. 28(6), pages 1027-1044, September.
    16. Carlos León & Geun-Young Kim & Constanza Martínez & Daeyup Lee, 2017. "Equity markets’ clustering and the global financial crisis," Quantitative Finance, Taylor & Francis Journals, vol. 17(12), pages 1905-1922, December.
    17. Anna Maria D’Arcangelis & Giulia Rotundo, 2016. "Complex Networks in Finance," Lecture Notes in Economics and Mathematical Systems, in: Pasquale Commendatore & Mariano Matilla-García & Luis M. Varela & Jose S. Cánovas (ed.), Complex Networks and Dynamics, pages 209-235, Springer.
    18. Cheng Juan Zhan & William Rea & Alethea Rea, 2016. "Stock Selection as a Problem in Phylogenetics—Evidence from the ASX," IJFS, MDPI, vol. 4(4), pages 1-19, September.
    19. Shekhtman, Louis M. & Danziger, Michael M. & Havlin, Shlomo, 2016. "Recent advances on failure and recovery in networks of networks," Chaos, Solitons & Fractals, Elsevier, vol. 90(C), pages 28-36.
    20. Irena Vodenska & Alexander P. Becker & Di Zhou & Dror Y. Kenett & H. Eugene Stanley & Shlomo Havlin, 2016. "Community Analysis of Global Financial Markets," Risks, MDPI, vol. 4(2), pages 1-15, May.
