Printed from https://ideas.repec.org/a/oup/biomet/v105y2018i2p431-446..html

Theoretical limits of microclustering for record linkage

Author

Listed:
  • J E Johndrow
  • K Lum
  • D B Dunson

Abstract

There has been substantial recent interest in record linkage, where one attempts to group the records pertaining to the same entities from one or more large databases that lack unique identifiers. This can be viewed as a type of microclustering, with few observations per cluster and a very large number of clusters. We show that the problem is fundamentally hard from a theoretical perspective and, even in idealized cases, accurate entity resolution is effectively impossible unless the number of entities is small relative to the number of records and/or the separation between records from different entities is extremely large. These results suggest conservatism in interpretation of the results of record linkage, support collection of additional data to more accurately disambiguate the entities, and motivate a focus on coarser inference. For example, results from a simulation study suggest that sometimes one may obtain accurate results for population size estimation even when fine-scale entity resolution is inaccurate.

Suggested Citation

  • J E Johndrow & K Lum & D B Dunson, 2018. "Theoretical limits of microclustering for record linkage," Biometrika, Biometrika Trust, vol. 105(2), pages 431-446.
  • Handle: RePEc:oup:biomet:v:105:y:2018:i:2:p:431-446.

    Download full text from publisher

    File URL: http://hdl.handle.net/10.1093/biomet/asy003
    Download Restriction: Access to full text is restricted to subscribers.

    Citations


    Cited by:

    1. Costa, Dora L. & Yetter, Noelle & DeSomer, Heather, 2020. "Wartime health shocks and the postwar socioeconomic status and mortality of union army veterans and their children," Journal of Health Economics, Elsevier, vol. 70(C).
    2. Sarah Tahamont & Zubin Jelveh & Aaron Chalfin & Shi Yan & Benjamin Hansen, 2019. "Administrative Data Linking and Statistical Power Problems in Randomized Experiments," NBER Working Papers 25657, National Bureau of Economic Research, Inc.
    3. Dora Costa & Noelle Yetter & Heather DeSomer, 2019. "Wartime Health Shocks and the Postwar Socioeconomic Status and Mortality of Union Army Veterans and their Children," NBER Working Papers 25480, National Bureau of Economic Research, Inc.

    For technical questions regarding this item, or to correct its authors, title, abstract, bibliographic or download information, contact: Oxford University Press (email available below). General contact details of provider: https://academic.oup.com/biomet .
