
The Challenge of Pairing Big Datasets: Probabilistic Record Linkage Methods and Diagnosis of Their Empirical Viability

Author

Listed:
  • Yaohao Peng

    (Brazilian Secretariat for Economic Policy)

  • Lucas Ferreira Mation

    (Brazilian Institute of Applied Economic Research)

Abstract

In this paper, we evaluated the predictive performance of probabilistic record linkage algorithms and discussed how different configurations of blocking keys, string similarity functions, and phonetic codes affect overall predictive performance and computational complexity. We also surveyed the literature on the main deterministic and probabilistic record linkage methods, on recent advances that combine them with machine learning techniques, and on the main packages and implementations available in the open-source R language. The results can provide heuristics for integrating administrative records at the national level and have potential value for the formulation and evaluation of public policies.
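
To make the three ingredients named in the abstract concrete (blocking keys, string similarity functions, and phonetic codes), the following is a minimal illustrative sketch, not the authors' implementation. The paper works with R packages; Python's standard library is used here only for self-containment. Everything specific is an assumption made for illustration: the field names, the sample records, the m/u probabilities, the similarity threshold, the cutoff, and the use of `difflib.SequenceMatcher` as a stand-in for similarity functions such as Jaro-Winkler. Scoring follows the usual Fellegi-Sunter log-ratio form.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher


def soundex(name):
    """Simplified American Soundex: first letter plus three digits."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    name = "".join(ch for ch in name.upper() if ch.isalpha())
    if not name:
        return ""
    result = name[0]
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (result + "000")[:4]


def block_key(rec):
    # Hypothetical blocking key: birth year plus phonetic code of the surname.
    return (rec["birth_year"], soundex(rec["surname"]))


def similarity(a, b):
    # Stand-in similarity in [0, 1]; not Jaro-Winkler.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Assumed (m, u) probabilities per field: m = P(agree | match),
# u = P(agree | non-match). Illustrative values, not estimates.
M_U = {"given": (0.95, 0.05), "surname": (0.95, 0.02), "city": (0.90, 0.10)}


def match_weight(a, b, threshold=0.75):
    """Sum of Fellegi-Sunter log2 agreement/disagreement weights."""
    total = 0.0
    for field, (m, u) in M_U.items():
        if similarity(a[field], b[field]) >= threshold:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def link(left, right, cutoff=5.0):
    """Compare only pairs that share a blocking key; keep high-weight pairs."""
    blocks = defaultdict(list)
    for rec in right:
        blocks[block_key(rec)].append(rec)
    links = []
    for a in left:
        for b in blocks.get(block_key(a), []):
            weight = match_weight(a, b)
            if weight >= cutoff:
                links.append((a["id"], b["id"], round(weight, 2)))
    return links


# Made-up records for illustration only.
left = [{"id": "A1", "given": "Maria", "surname": "Silva",
         "city": "Brasilia", "birth_year": 1980}]
right = [
    {"id": "B1", "given": "Mariah", "surname": "Sylva",
     "city": "Brasilia", "birth_year": 1980},
    {"id": "B2", "given": "Joao", "surname": "Souza",
     "city": "Recife", "birth_year": 1980},
    {"id": "B3", "given": "Pedro", "surname": "Silva",
     "city": "Recife", "birth_year": 1980},
]

print(link(left, right))
```

The design trade-off mirrors the one the abstract describes: a stricter blocking key cuts the number of pairwise comparisons but can discard true matches whose key fields contain errors, which is why a phonetic code rather than the raw surname is used in the key here.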

Suggested Citation

  • Yaohao Peng & Lucas Ferreira Mation, 2020. "The Challenge of Pairing Big Datasets: Probabilistic Record Linkage Methods and Diagnosis of Their Empirical Viability," Journal of Business Cycle Research, Springer;Centre for International Research on Economic Tendency Surveys (CIRET), vol. 16(1), pages 35-57, April.
  • Handle: RePEc:spr:jbuscr:v:16:y:2020:i:1:d:10.1007_s41549-020-00043-1
    DOI: 10.1007/s41549-020-00043-1

    Download full text from publisher

    File URL: http://link.springer.com/10.1007/s41549-020-00043-1
    File Function: Abstract
    Download Restriction: Access to the full text of the articles in this series is restricted.


    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Meyer, Bruce D. & Mittag, Nikolas, 2019. "Combining Administrative and Survey Data to Improve Income Measurement," IZA Discussion Papers 12266, Institute of Labor Economics (IZA).
    2. Michele Lalla & Maddalena Cavicchioli, 2020. "Nonresponse and measurement errors in income: matching individual survey data with administrative tax data," Department of Economics 0170, University of Modena and Reggio E., Faculty of Economics "Marco Biagi".
    3. Lidia Ceriani & Vladimir Hlasny & Paolo Verme, 2021. "Bottom Incomes and the Measurement of Poverty: A Brief Assessment of the Literature," Working Papers 589, ECINEQ, Society for the Study of Economic Inequality.
    4. Martin Rama, 2019. "Challenges in Measuring Poverty and Understanding its Dynamics: A South Asian Perspective," Review of Income and Wealth, International Association for Research in Income and Wealth, vol. 65(S1), pages 2-32, November.
    5. Watson, C. Luke, 2021. "The General Equilibrium Incidence of the Earned Income Tax Credit," SocArXiv 8n3ag, Center for Open Science.
    6. Bruce D. Meyer & Nikolas Mittag, 2019. "Combining Administrative and Survey Data to Improve Income Measurement," NBER Working Papers 25738, National Bureau of Economic Research, Inc.
    7. Joshua D. Gottlieb & Maria Polyakova & Kevin Rinz & Hugh Shiplett & Victoria Udalova, 2020. "Who Values Human Capitalists' Human Capital? Healthcare Spending and Physician Earnings," Working Papers 20-23, Center for Economic Studies, U.S. Census Bureau.
    8. Gaetano Basso & Giovanni Peri, 2020. "Internal Mobility: The Greater Responsiveness of Foreign-Born to Economic Conditions," Journal of Economic Perspectives, American Economic Association, vol. 34(3), pages 77-98, Summer.
    9. James X. Sullivan, 2020. "A Cautionary Tale of Using Data From the Tail," Demography, Springer;Population Association of America (PAA), vol. 57(6), pages 2361-2368, December.
    10. Maira Colacce & Ivone Perazzo & Andrea Vigorito, 2020. "How accurately do mothers recall prenatal visits and gestational age? A validation of Uruguayan survey data," Demographic Research, Max Planck Institute for Demographic Research, Rostock, Germany, vol. 43(51), pages 1495-1508.
    11. Cullen, Claire Alexis, 2020. "Method Matters: Underreporting of Intimate Partner Violence in Nigeria and Rwanda," Policy Research Working Paper Series 9274, The World Bank.
    12. Mark Brooks & Rattiya S. Lippe & Hermann Waibel, 2020. "Comprehensive data quality studies as a component of poverty assessments," TVSEP Working Papers wp-019, Leibniz Universitaet Hannover, Institute for Environmental Economics and World Trade, Project TVSEP.
    13. Misty Heggeness & Marta Murray-Close, 2019. "Manning Up and Womaning Down: How Husbands and Wives Report Earnings When She Earns More," Opportunity and Inclusive Growth Institute Working Papers 28, Federal Reserve Bank of Minneapolis.
    14. Meyer, Bruce D. & Mittag, Nikolas, 2018. "Misreporting of Government Transfers: How Important Are Survey Design and Geography?," IZA Discussion Papers 12038, Institute of Labor Economics (IZA).
    15. Lahiri, Kajal & Monokroussos, George & Zhao, Yongchen, 2013. "The yield spread puzzle and the information content of SPF forecasts," Economics Letters, Elsevier, vol. 118(1), pages 219-221.
    16. Kerstin Bruckmeier & Katrin Hohmeyer & Stefan Schwarz, 2018. "Welfare receipt misreporting in survey data and its consequences for state dependence estimates: new insights from linked administrative and survey data," Journal for Labour Market Research, Springer;Institute for Employment Research/ Institut für Arbeitsmarkt- und Berufsforschung (IAB), vol. 52(1), pages 1-21, December.
    17. Daniel I. Tannenbaum, 2020. "The Effect of Child Support on Selection into Marriage and Fertility," Journal of Labor Economics, University of Chicago Press, vol. 38(2), pages 611-652.
    18. Hope Corman & Dhaval Dave & Nancy E. Reichman, 2018. "Evolution of the Infant Health Production Function," Southern Economic Journal, John Wiley & Sons, vol. 85(1), pages 6-47, July.
    19. Mittag, Nikolas, 2016. "Correcting for Misreporting of Government Benefits," IZA Discussion Papers 10266, Institute of Labor Economics (IZA).
    20. Albarrán, Pedro & Hidalgo-Hidalgo, Marisa & Iturbe-Ormaetxe, Iñigo, 2020. "Education and adult health: Is there a causal effect?," Social Science & Medicine, Elsevier, vol. 249(C).

    More about this item

    Keywords

    Record linkage; Blocking; Administrative records; Big data; R;

    JEL classification:

    • C52 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Model Evaluation, Validation, and Selection
    • C55 - Mathematical and Quantitative Methods - - Econometric Modeling - - - Large Data Sets: Modeling and Analysis
    • C65 - Mathematical and Quantitative Methods - - Mathematical Methods; Programming Models; Mathematical and Simulation Modeling - - - Miscellaneous Mathematical Tools
    • C80 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - General
    • C88 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs - - - Other Computer Software
