Printed from https://ideas.repec.org/a/cup/polals/v28y2020i4p445-468_1.html

Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality

Author

Listed:
  • Mozer, Reagan
  • Miratrix, Luke
  • Kaufman, Aaron Russell
  • Anastasopoulos, L. Jason

Abstract

Matching for causal inference is a well-studied problem, but standard methods fail when the units to match are text documents: the high-dimensional and rich nature of the data renders exact matching infeasible, causes propensity scores to produce incomparable matches, and makes assessing match quality difficult. In this paper, we characterize a framework for matching text documents that decomposes existing methods into (1) the choice of text representation and (2) the choice of distance metric. We investigate how different choices within this framework affect both the quantity and quality of matches identified through a systematic multifactor evaluation experiment using human subjects. Altogether, we evaluate over 100 unique text-matching methods along with 5 comparison methods taken from the literature. Our experimental results identify methods that generate matches with higher subjective match quality than current state-of-the-art techniques. We enhance the precision of these results by developing a predictive model to estimate the match quality of pairs of text documents as a function of our various distance scores. This model, which we find successfully mimics human judgment, also allows for approximate and unsupervised evaluation of new procedures in our context. We then employ the identified best method to illustrate the utility of text matching in two applications. First, we engage with a substantive debate in the study of media bias by using text matching to control for topic selection when comparing news articles from thirteen news sources. We then show how conditioning on text data leads to more precise causal inferences in an observational study examining the effects of a medical intervention.
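The two-part decomposition described in the abstract (a text representation plus a distance metric) can be illustrated with a minimal sketch. This is not the authors' implementation or their recommended method: the bag-of-words term counts, the cosine distance, the greedy 1:1 matching rule, the `caliper` parameter, and all function names below are assumptions chosen only to make the idea concrete.

```python
import math
from collections import Counter


def term_frequency(text):
    """Representation step (assumed): lowercase bag-of-words term counts."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """Distance step (assumed): 1 - cosine similarity of two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty document as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def match_documents(treated, control, caliper=0.5):
    """Greedy 1:1 nearest-neighbour matching without replacement.

    Returns (treated_index, control_index, distance) triples. A treated
    document whose nearest remaining control is farther than `caliper`
    stays unmatched.
    """
    treated_vecs = [term_frequency(d) for d in treated]
    control_vecs = [term_frequency(d) for d in control]
    available = set(range(len(control)))
    matches = []
    for i, tv in enumerate(treated_vecs):
        if not available:
            break
        # sorted() makes tie-breaking deterministic (lowest index wins)
        j = min(sorted(available),
                key=lambda k: cosine_distance(tv, control_vecs[k]))
        d = cosine_distance(tv, control_vecs[j])
        if d <= caliper:
            matches.append((i, j, d))
            available.remove(j)
    return matches


# Toy documents, invented for illustration only
treated = ["senate passes budget bill", "storm hits coastal town"]
control = ["storm floods coastal town",
           "senate debates budget bill",
           "team wins final match"]

for i, j, d in match_documents(treated, control):
    print(f"treated[{i}] <-> control[{j}]  distance={d:.2f}")
```

Swapping `term_frequency` or `cosine_distance` for another representation or metric changes which pairs are formed, and tightening `caliper` trades the number of matches against their closeness. That is the quantity-versus-quality trade-off the abstract refers to.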

Suggested Citation

  • Mozer, Reagan & Miratrix, Luke & Kaufman, Aaron Russell & Anastasopoulos, L. Jason, 2020. "Matching with Text Data: An Experimental Evaluation of Methods for Matching Documents and of Measuring Match Quality," Political Analysis, Cambridge University Press, vol. 28(4), pages 445-468, October.
  • Handle: RePEc:cup:polals:v:28:y:2020:i:4:p:445-468_1

    Download full text from publisher

    File URL: https://www.cambridge.org/core/product/identifier/S1047198720000017/type/journal_article
    File Function: link to article abstract page
    Download Restriction: no

    Citations



    Cited by:

    1. Jiaming Zeng & Michael F. Gensheimer & Daniel L. Rubin & Susan Athey & Ross D. Shachter, 2022. "Uncovering interpretable potential confounders in electronic medical records," Nature Communications, Nature, vol. 13(1), pages 1-14, December.
    2. Hao Chen & Dylan S. Small, 2022. "New multivariate tests for assessing covariate balance in matched observational studies," Biometrics, The International Biometric Society, vol. 78(1), pages 202-213, March.
    3. Sallin, Aurélien, 2021. "Estimating returns to special education: combining machine learning and text analysis to address confounding," Economics Working Paper Series 2109, University of St. Gallen, School of Economics and Political Science.
    4. Henrika Langen, 2022. "The Impact of the #MeToo Movement on Language at Court -- A text-based causal inference approach," Papers 2209.00409, arXiv.org, revised Sep 2023.
    5. Roman Senninger & Jens Blom‐Hansen, 2021. "Meet the critics: Analyzing the EU Commission's Regulatory Scrutiny Board through quantitative text analysis," Regulation & Governance, John Wiley & Sons, vol. 15(4), pages 1436-1453, October.
    6. Margaret E. Roberts & Brandon M. Stewart & Richard A. Nielsen, 2020. "Adjusting for Confounding with Text Matching," American Journal of Political Science, John Wiley & Sons, vol. 64(4), pages 887-903, October.
    7. Aurélien Sallin, 2021. "Estimating returns to special education: combining machine learning and text analysis to address confounding," Papers 2110.08807, arXiv.org, revised Feb 2022.
