
Parsimonious Wasserstein Text-mining

Authors
  • Gadat, Sébastien
  • Villeneuve, Stéphane

Abstract

This document introduces a novel, parsimonious method for processing textual data, based on non-negative matrix factorization (NMF) and on supervised clustering with Wasserstein barycenters to reduce the dimension of the model. This dual treatment allows a text to be represented as a probability distribution on the space of profiles, which accounts for both uncertainty and semantic interpretability through the Wasserstein distance. The full textual information of a given period is represented as a random probability measure. This opens the door to a statistical inference method that seeks to predict financial data using the information generated by the texts of a given period.
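
To make the first step of the abstract concrete, here is a minimal sketch of how an NMF can turn each text into a probability distribution over profiles. It is not the authors' implementation: the toy document-term matrix `V`, the helper names (`nmf`, `profile_distributions`), the rank, and the use of the standard Lee-Seung multiplicative updates are all assumptions made for illustration. The supervised clustering step with Wasserstein barycenters is not shown.

```python
import random

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    """Plain-Python product of two matrices stored as lists of rows."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def nmf(V, rank, n_iter=500, eps=1e-12, seed=0):
    """Approximate a non-negative matrix V (documents x terms) by W @ H,
    using Lee-Seung multiplicative updates for the squared error."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), elementwise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(hr, nr, dr)]
             for hr, nr, dr in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), elementwise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(wr, nr, dr)]
             for wr, nr, dr in zip(W, num, den)]
    return W, H

def profile_distributions(W, H):
    """Rescale weights as if each profile (row of H) summed to 1, then
    normalise each document's weights (row of W) to a probability vector."""
    scales = [sum(row) for row in H]
    dists = []
    for w_row in W:
        scaled = [w * s for w, s in zip(w_row, scales)]
        total = sum(scaled)
        dists.append([x / total for x in scaled])
    return dists

# Hypothetical toy counts: 5 documents x 4 terms, built from two word profiles.
V = [
    [2, 4, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 6, 2],
    [1, 2, 3, 1],  # a document mixing both profiles
]

W, H = nmf(V, rank=2)
dists = profile_distributions(W, H)

# Squared reconstruction error of the factorization.
R = matmul(W, H)
err = sum((v - r) ** 2 for vr, rr in zip(V, R) for v, r in zip(vr, rr))

for i, d in enumerate(dists):
    print(f"document {i}: " + ", ".join(f"{p:.3f}" for p in d))
```

Each row of `dists` is a distribution over the two learned profiles. In the approach the abstract describes, it is distributions of this kind that would then be compared and averaged with the Wasserstein distance.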

Suggested Citation

  • Gadat, Sébastien & Villeneuve, Stéphane, 2023. "Parsimonious Wasserstein Text-mining," TSE Working Papers 23-1471, Toulouse School of Economics (TSE).
  • Handle: RePEc:tse:wpaper:128497

    Download full text from publisher

    File URL: https://www.tse-fr.eu/sites/default/files/TSE/documents/doc/wp/2023/wp_tse_1471.pdf
    File Function: Full Text
    Download Restriction: no


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. David Bholat & Stephen Hans & Pedro Santos & Cheryl Schonhardt-Bailey, 2015. "Text mining for central banks," Handbooks, Centre for Central Banking Studies, Bank of England, number 33, April.
    2. Yuting Chen & Don Bredin & Valerio Potì & Roman Matkovskyy, 2022. "COVID risk narratives: a computational linguistic approach to the econometric identification of narrative risk during a pandemic," Digital Finance, Springer, vol. 4(1), pages 17-61, March.
    3. van Loon, Austin, 2022. "Three Families of Automated Text Analysis," SocArXiv htnej, Center for Open Science.
    4. Christina Bannier & Thomas Pauls & Andreas Walter, 2019. "Content analysis of business communication: introducing a German dictionary," Journal of Business Economics, Springer, vol. 89(1), pages 79-123, February.
    5. Gunnar Friede, 2019. "Why don't we see more action? A metasynthesis of the investor impediments to integrate environmental, social, and governance factors," Business Strategy and the Environment, Wiley Blackwell, vol. 28(6), pages 1260-1282, September.
    6. Andres Algaba & David Ardia & Keven Bluteau & Samuel Borms & Kris Boudt, 2020. "Econometrics Meets Sentiment: An Overview Of Methodology And Applications," Journal of Economic Surveys, Wiley Blackwell, vol. 34(3), pages 512-547, July.
    7. Giovanna Maria Dora Dore, 2023. "A Natural Language Processing Analysis of Newspapers Coverage of Hong Kong Protests Between 1998 and 2020," Social Indicators Research: An International and Interdisciplinary Journal for Quality-of-Life Measurement, Springer, vol. 169(1), pages 143-166, September.
    8. Fabienne Kiener & Ann-Sophie Gnehm & Simon Clematide & Uschi Backes-Gellner, 2019. "IT skills in vocational training curricula and labour market outcomes," Economics of Education Working Paper Series 0159, University of Zurich, Department of Business Administration (IBW), revised Sep 2022.
    9. Andrew Todd & James Bowden & Yashar Moshfeghi, 2024. "Text‐based sentiment analysis in finance: Synthesising the existing literature and exploring future directions," Intelligent Systems in Accounting, Finance and Management, John Wiley & Sons, Ltd., vol. 31(1), March.
    10. Rafael Teixeira & Mário Antunes & Diogo Gomes & Rui L. Aguiar, 2024. "Comparison of Semantic Similarity Models on Constrained Scenarios," Information Systems Frontiers, Springer, vol. 26(4), pages 1307-1330, August.
    11. Del Corso, Gianna M. & Romani, Francesco, 2019. "Adaptive nonnegative matrix factorization and measure comparisons for recommender systems," Applied Mathematics and Computation, Elsevier, vol. 354(C), pages 164-179.
    12. Bernhardt, Lea & Dewenter, Ralf & Thomas, Tobias, 2023. "Measuring partisan media bias in US newscasts from 2001 to 2012," European Journal of Political Economy, Elsevier, vol. 78(C).
    13. P Fogel & C Geissler & P Cotte & G Luta, 2022. "Applying separative non-negative matrix factorization to extra-financial data," Working Papers hal-03689774, HAL.
    14. Ntentas, Raphael, 2021. "Quantifying political populism and examining the link with economic insecurity: evidence from Greece," LSE Research Online Documents on Economics 112579, London School of Economics and Political Science, LSE Library.
    15. Xiao-Bai Li & Jialun Qin, 2017. "Anonymizing and Sharing Medical Text Records," Information Systems Research, INFORMS, vol. 28(2), pages 332-352, June.
    16. Yan Luo & Linying Zhou, 2020. "Textual tone in corporate financial disclosures: a survey of the literature," International Journal of Disclosure and Governance, Palgrave Macmillan, vol. 17(2), pages 101-110, September.
    17. Lin, Annie E. & Young, Jimmy A. & Guarino, Jeannine E., 2022. "Mother-Daughter sexual abuse: An exploratory study of the experiences of survivors of MDSA using Reddit," Children and Youth Services Review, Elsevier, vol. 138(C).
    18. Bennani, Hamza, 2018. "Media coverage and ECB policy-making: Evidence from an augmented Taylor rule," Journal of Macroeconomics, Elsevier, vol. 57(C), pages 26-38.
    19. Joshua Zoen Git Hiew & Xin Huang & Hao Mou & Duan Li & Qi Wu & Yabo Xu, 2019. "BERT-based Financial Sentiment Index and LSTM-based Stock Return Predictability," Papers 1906.09024, arXiv.org, revised Jul 2022.
    20. Rybinski, Krzysztof, 2020. "The forecasting power of the multi-language narrative of sell-side research: A machine learning evaluation," Finance Research Letters, Elsevier, vol. 34(C).

    More about this item

    Keywords

    Natural Language Processing; Textual Analysis; Wasserstein distance; clustering;