Word Embeddings as Statistical Estimators

Author

Listed:
  • Neil Dey

    (North Carolina State University)

  • Matthew Singer

    (North Carolina State University)

  • Jonathan P. Williams

    (North Carolina State University; Norwegian Academy of Science and Letters)

  • Srijan Sengupta

    (North Carolina State University)

Abstract

Word embeddings are a fundamental tool in natural language processing. Currently, word embedding methods are evaluated on the basis of empirical performance on benchmark data sets, and a rigorous understanding of their theoretical properties is lacking. This paper studies word embeddings from a statistical theoretical perspective, which is essential for formal inference and uncertainty quantification. We propose a copula-based statistical model for text data and show that, under this model, the now-classical Word2Vec method can be interpreted as a statistical method for estimating the theoretical pointwise mutual information (PMI). We further illustrate the utility of this statistical model by using it to develop a missing value-based estimator as a statistically tractable and interpretable alternative to the Word2Vec approach. The estimation error of this estimator is comparable to that of Word2Vec and improves upon that of the truncation-based method proposed by Levy and Goldberg (Adv. Neural Inf. Process. Syst., 27, 2177–2185, 2014). The resulting estimator is also comparable to Word2Vec in a benchmark sentiment analysis task on the IMDb Movie Reviews data set and a part-of-speech tagging task on the OntoNotes data set.
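
As a rough illustration of the quantities named in the abstract, the sketch below computes an empirical PMI matrix from a toy word-context co-occurrence table and then applies a truncation in the style of Levy and Goldberg (commonly called shifted positive PMI) before factorizing the result with an SVD. It is not the authors' copula model or their missing value-based estimator. The toy counts, the function names (empirical_pmi, sppmi, svd_embeddings), the shift parameter k, and the square-root scaling of the singular values are all assumptions made for this example.

```python
import numpy as np


def empirical_pmi(counts):
    """Empirical PMI from a word-by-context co-occurrence count matrix.

    PMI(w, c) = log p(w, c) - log p(w) - log p(c).
    Cells with a zero count evaluate to -inf.
    Assumes every row and column has at least one nonzero count.
    """
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()               # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)      # word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)      # context marginals
    with np.errstate(divide="ignore"):         # log(0) -> -inf without a warning
        return np.log(p_wc) - np.log(p_w) - np.log(p_c)


def sppmi(pmi, k=1):
    """Truncation step: max(PMI - log k, 0), so -inf cells become 0."""
    return np.maximum(pmi - np.log(k), 0.0)


def svd_embeddings(matrix, dim):
    """Low-rank word vectors from a truncated SVD (illustrative scaling)."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :dim] * np.sqrt(s[:dim])


# Hypothetical 3-word, 3-context co-occurrence counts.
counts = np.array([[4, 0, 1],
                   [0, 3, 1],
                   [1, 1, 2]])

pmi = empirical_pmi(counts)
truncated = sppmi(pmi, k=1)
vectors = svd_embeddings(truncated, dim=2)
print(pmi)
print(truncated)
print(vectors.shape)
```

Word-context pairs that never co-occur have an empirical PMI of negative infinity, and the truncation maps them to 0. The abstract reports that the missing value-based estimator improves on this truncation step in estimation error.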

Suggested Citation

  • Neil Dey & Matthew Singer & Jonathan P. Williams & Srijan Sengupta, 2024. "Word Embeddings as Statistical Estimators," Sankhya B: The Indian Journal of Statistics, Springer;Indian Statistical Institute, vol. 86(2), pages 415-441, November.
  • Handle: RePEc:spr:sankhb:v:86:y:2024:i:2:d:10.1007_s13571-024-00331-1
    DOI: 10.1007/s13571-024-00331-1

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s13571-024-00331-1
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.

    References listed on IDEAS

    1. Junhui Wang & Xiaotong Shen & Yiwen Sun & Annie Qu, 2016. "Classification With Unstructured Predictors and an Application to Sentiment Analysis," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(515), pages 1242-1253, July.
    2. Lucy Xia & Richard Zhao & Yanhui Wu & Xin Tong, 2021. "Intentional Control of Type I Error Over Unconscious Data Distortion: A Neyman–Pearson Approach to Text Classification," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 116(533), pages 68-81, March.
    3. Yutong Li & Ruoqing Zhu & Annie Qu & Han Ye & Zhankun Sun, 2021. "Topic Modeling on Triage Notes With Semiorthogonal Nonnegative Matrix Factorization," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 116(536), pages 1609-1624, October.
    4. Xiaohan Yan & Jacob Bien, 2021. "Rare Feature Selection in High Dimensions," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 116(534), pages 887-900, April.
    5. Mingyuan Zhou & Oscar Hernan Madrid Padilla & James G. Scott, 2016. "Priors for Random Count Matrices Derived from a Family of Negative Binomial Processes," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(515), pages 1144-1156, July.
    6. Matt Taddy, 2013. "Multinomial Inverse Regression for Text Analysis," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 108(503), pages 755-770, September.
    7. Edoardo M. Airoldi & Jonathan M. Bischof, 2016. "Improving and Evaluating Topic Models and Other Models of Text," Journal of the American Statistical Association, Taylor & Francis Journals, vol. 111(516), pages 1381-1403, October.
    8. Scott Deerwester & Susan T. Dumais & George W. Furnas & Thomas K. Landauer & Richard Harshman, 1990. "Indexing by latent semantic analysis," Journal of the American Society for Information Science, Association for Information Science & Technology, vol. 41(6), pages 391-407, September.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Saskia Ter Ellen & Vegard H. Larsen & Leif Anders Thorsrud, 2022. "Narrative Monetary Policy Surprises and the Media," Journal of Money, Credit and Banking, Blackwell Publishing, vol. 54(5), pages 1525-1549, August.
    2. Matthew Gentzkow & Bryan T. Kelly & Matt Taddy, 2017. "Text as Data," NBER Working Papers 23276, National Bureau of Economic Research, Inc.
    3. Berk Wheelock, Lauren & Pachamanova, Dessislava A., 2022. "Acceptable set topic modeling," European Journal of Operational Research, Elsevier, vol. 299(2), pages 653-673.
    4. Puklavec, Žiga & Kogler, Christoph & Stavrova, Olga & Zeelenberg, Marcel, 2023. "What we tweet about when we tweet about taxes: A topic modelling approach," Journal of Economic Behavior & Organization, Elsevier, vol. 212(C), pages 1242-1254.
    5. Irina Wedel & Michael Palk & Stefan Voß, 2022. "A Bilingual Comparison of Sentiment and Topics for a Product Event on Twitter," Information Systems Frontiers, Springer, vol. 24(5), pages 1635-1646, October.
    6. Julia Cagé & Caroline Le Pennec & Elisa Mougin, 2021. "Corporate Donations and Political Rhetoric: Evidence from a National Ban," Working Papers hal-03877943, HAL.
    7. Mohammed Salem Binwahlan, 2023. "Polynomial Networks Model for Arabic Text Summarization," International Journal of Research and Scientific Innovation, International Journal of Research and Scientific Innovation (IJRSI), vol. 10(2), pages 74-84, February.
    8. Federico Maria Ferrara & Jörg S Haas & Andrew Peterson & Thomas Sattler, 2022. "Exports vs. Investment: How Public Discourse Shapes Support for External Imbalances," Post-Print hal-02569351, HAL.
    9. Curci, Ylenia & Mongeau Ospina, Christian A., 2016. "Investigating biofuels through network analysis," Energy Policy, Elsevier, vol. 97(C), pages 60-72.
    10. Chao Wei & Senlin Luo & Xincheng Ma & Hao Ren & Ji Zhang & Limin Pan, 2016. "Locally Embedding Autoencoders: A Semi-Supervised Manifold Learning Approach of Document Representation," PLOS ONE, Public Library of Science, vol. 11(1), pages 1-20, January.
    11. Dehler-Holland, Joris & Schumacher, Kira & Fichtner, Wolf, 2021. "Topic Modeling Uncovers Shifts in Media Framing of the German Renewable Energy Act," EconStor Open Access Articles and Book Chapters, ZBW - Leibniz Information Centre for Economics, vol. 2(1).
    12. Maksym Polyakov & Morteza Chalak & Md. Sayed Iftekhar & Ram Pandit & Sorada Tapsuwan & Fan Zhang & Chunbo Ma, 2018. "Authorship, Collaboration, Topics, and Research Gaps in Environmental and Resource Economics 1991–2015," Environmental & Resource Economics, Springer;European Association of Environmental and Resource Economists, vol. 71(1), pages 217-239, September.
    13. Ding, Ying, 2011. "Community detection: Topological vs. topical," Journal of Informetrics, Elsevier, vol. 5(4), pages 498-514.
    14. Klaus Gugler & Florian Szücs & Ulrich Wohak, 2023. "Start-up Acquisitions, Venture Capital and Innovation: A Comparative Study of Google, Apple, Facebook, Amazon and Microsoft," Department of Economics Working Papers wuwp340, Vienna University of Economics and Business, Department of Economics.
    15. Md Nazrul Islam & Md Mofazzal Hossain & Md Shafayet Shahed Ornob, 2024. "Business research on Industry 4.0: a systematic review using topic modelling approach," Future Business Journal, Springer, vol. 10(1), pages 1-15, December.
    16. Matthew Gentzkow & Jesse M. Shapiro & Matt Taddy, 2019. "Measuring Group Differences in High‐Dimensional Choices: Method and Application to Congressional Speech," Econometrica, Econometric Society, vol. 87(4), pages 1307-1340, July.
    17. Juan Shi & Kin Keung Lai & Ping Hu & Gang Chen, 2018. "Factors dominating individual information disseminating behavior on social networking sites," Information Technology and Management, Springer, vol. 19(2), pages 121-139, June.
    18. Ganesh Dash & Chetan Sharma & Shamneesh Sharma, 2023. "Sustainable Marketing and the Role of Social Media: An Experimental Study Using Natural Language Processing (NLP)," Sustainability, MDPI, vol. 15(6), pages 1-16, March.
    19. Paola Cerchiello & Giancarlo Nicola, 2018. "Assessing News Contagion in Finance," Econometrics, MDPI, vol. 6(1), pages 1-19, February.
    20. Shr-Wei Kao & Pin Luarn, 2020. "Topic Modeling Analysis of Social Enterprises: Twitter Evidence," Sustainability, MDPI, vol. 12(8), pages 1-20, April.
