
Measuring document similarity with weighted averages of word embeddings

Authors

  • Seegmiller, Bryan
  • Papanikolaou, Dimitris
  • Schmidt, Lawrence D.W.

Abstract

We detail a methodology for estimating the textual similarity between two documents while accounting for the possibility that two different words can have a similar meaning. We illustrate the method’s usefulness in facilitating comparisons between documents with very different formats and vocabularies by textually linking occupation task and industry output descriptions with related technologies as described in patent texts; we also examine economic applications of the resultant document similarity measures. In a final application we demonstrate that the method also works well relative to alternatives for comparing documents within the same domain by showing that pairwise textual similarity between occupations’ task descriptions strongly predicts the probability that a given worker will transition from one occupation to another. Finally, we offer some suggestions on other potential uses and guidance in implementing the method.
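The core idea named in the title, representing each document as a weighted average of its word embeddings and comparing documents by the similarity of those averages, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not state the weighting scheme or the embedding source, so the TF-IDF weights, the cosine comparison, the toy three-dimensional vectors in `TOY_EMBEDDINGS`, and all function names are assumptions made for this example.

```python
import math
from collections import Counter

import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only.
# A real application would load pretrained word vectors instead.
TOY_EMBEDDINGS = {
    "repair": np.array([0.90, 0.10, 0.00]),
    "fix": np.array([0.85, 0.15, 0.05]),
    "engine": np.array([0.10, 0.90, 0.00]),
    "motor": np.array([0.15, 0.85, 0.05]),
    "bake": np.array([0.00, 0.10, 0.90]),
    "bread": np.array([0.05, 0.00, 0.95]),
}


def idf_weights(corpus):
    """Smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(word for doc in corpus for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def document_vector(tokens, embeddings, idf):
    """TF-IDF-weighted average of the embeddings of in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in embeddings)
    if not counts:
        raise ValueError("no token in the document has an embedding")
    # Weight of each distinct word: term count times its IDF (assumed scheme).
    weights = np.array([counts[w] * idf.get(w, 1.0) for w in counts])
    vectors = np.stack([embeddings[w] for w in counts])
    return weights @ vectors / weights.sum()


def cosine_similarity(u, v):
    """Cosine of the angle between two document vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# Three tiny tokenized "documents"; the first two share no words at all.
corpus = [["repair", "engine"], ["fix", "motor"], ["bake", "bread"]]
idf = idf_weights(corpus)
vecs = [document_vector(doc, TOY_EMBEDDINGS, idf) for doc in corpus]

sim_related = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```

In this toy setup the first two documents have no words in common, so an exact word-overlap measure would score them as unrelated, while the embedding-based score is high because "repair"/"fix" and "engine"/"motor" were given nearby vectors. That is the property the abstract describes: two different words can carry a similar meaning.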

Suggested Citation

  • Seegmiller, Bryan & Papanikolaou, Dimitris & Schmidt, Lawrence D.W., 2023. "Measuring document similarity with weighted averages of word embeddings," Explorations in Economic History, Elsevier, vol. 87(C).
  • Handle: RePEc:eee:exehis:v:87:y:2023:i:c:s0014498322000729
    DOI: 10.1016/j.eeh.2022.101494

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0014498322000729
    Download Restriction: Full text for ScienceDirect subscribers only



    Citations


    Cited by:

    1. von Bodman, Nicolas, 2024. "The impact of prospectus language on IPO underpricing: A textual analysis of European IPOs," Junior Management Science (JUMS), Junior Management Science e. V., vol. 9(4), pages 1934-1963.

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Christina Langer & Simon Wiederhold, 2023. "The Value of Early-Career Skills," Working Papers 222, Bavarian Graduate Program in Economics (BGPE).
    2. John Carter Braxton & Kyle F. Herkenhoff & Jonathan Rothbaum & Lawrence Schmidt, 2021. "Changing Income Risk across the US Skill Distribution: Evidence from a Generalized Kalman Filter," Opportunity and Inclusive Growth Institute Working Papers 55, Federal Reserve Bank of Minneapolis.
    3. Kang, Yankun & Leng, Xuan & Liao, Yunxiang & Zheng, Shilin, 2024. "Information disclosure, spillovers, and knowledge accumulation," China Economic Review, Elsevier, vol. 84(C).
    4. Nicholas Bloom & Tarek Alexander Hassan & Aakash Kalyani & Josh Lerner & Ahmed Tahoun, 2021. "The diffusion of disruptive technologies," CEP Discussion Papers dp1798, Centre for Economic Performance, LSE.
    5. Marin, Giovanni & Vona, Francesco, 2023. "Finance and the reallocation of scientific, engineering and mathematical talent," Research Policy, Elsevier, vol. 52(5).
    6. Klaus Gugler & Florian Szücs & Ulrich Wohak, 2023. "Start-up Acquisitions, Venture Capital and Innovation: A Comparative Study of Google, Apple, Facebook, Amazon and Microsoft," Department of Economics Working Papers wuwp340, Vienna University of Economics and Business, Department of Economics.
    7. Hege, Ulrich & Li, Kai & Zhang, Yifei, 2025. "Climate Innovation and Carbon Emissions: Evidence from Supply Chain Networks," TSE Working Papers 25-1608, Toulouse School of Economics (TSE).
    8. Hensvik, Lena & Skans, Oskar Nordström, 2023. "The skill-specific impact of past and projected occupational decline," Labour Economics, Elsevier, vol. 81(C).
    9. Antonio Martins-Neto & Nanditha Mathew & Pierre Mohnen & Tania Treibich, 2024. "Is There Job Polarization in Developing Economies? A Review and Outlook," The World Bank Research Observer, World Bank, vol. 39(2), pages 259-288.
    10. Max Nathan & Anna Rosso, 2017. "Innovative events," Development Working Papers 429, Centro Studi Luca d'Agliano, University of Milano, revised 08 Apr 2019.
    11. Sergio Ocampo, 2019. "A task-based theory of occupations with multidimensional heterogeneity," 2019 Meeting Papers 477, Society for Economic Dynamics.
    12. David Autor & Caroline Chin & Anna Salomons & Bryan Seegmiller, 2024. "New Frontiers: The Origins and Content of New Work, 1940–2018," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 139(3), pages 1399-1465.
    13. Consoli, Davide & Marin, Giovanni & Rentocchini, Francesco & Vona, Francesco, 2023. "Routinization, within-occupation task changes and long-run employment dynamics," Research Policy, Elsevier, vol. 52(1).
    14. Koomen, Miriam & Backes-Gellner, Uschi, 2022. "Occupational tasks and wage inequality in West Germany: A decomposition analysis," Labour Economics, Elsevier, vol. 79(C).
    15. Hemelt, Steven W. & Hershbein, Brad & Martin, Shawn & Stange, Kevin M., 2023. "College majors and skills: Evidence from the universe of online job ads," Labour Economics, Elsevier, vol. 85(C).
    16. Mahyar Habibi, 2025. "Open Sourcing GPTs: Economics of Open Sourcing Advanced AI Models," Papers 2501.11581, arXiv.org.
    17. Baslandze, Salomé & Argente, David & Hanley, Douglas & Moreira, Sara, 2020. "Patents to Products: Product Innovation and Firm Dynamics," CEPR Discussion Papers 14692, C.E.P.R. Discussion Papers.
    18. Laura Battaglia & Timothy M. Christensen & Stephen Hansen & Szymon Sacher, 2024. "Inference for regression with variables generated from unstructured data," CeMMAP working papers 10/24, Institute for Fiscal Studies.
    19. Qiguo Gong, 2023. "Machine endowment cost model: task assignment between humans and machines," Palgrave Communications, Palgrave Macmillan, vol. 10(1), pages 1-8, December.
    20. Aakash Kalyani & Nicholas Bloom & Marcela Carvalho & Tarek Alexander Hassan & Josh Lerner & Ahmed Tahoun, 2021. "The Diffusion of New Technologies," NBER Working Papers 28999, National Bureau of Economic Research, Inc.
