
Examining the role of training data for supervised methods of automated record linkage: Lessons for best practice in economic history

Author

Listed:
  • Feigenbaum, James J
  • Helgertz, Jonas
  • Price, Joseph

Abstract

During the past decade, scholars have produced a vast amount of research using linked historical individual-level data, shaping and changing our understanding of the past. This linked data revolution has been powered by methodological and computational advances, partly focused on supervised machine-learning methods that rely on training data. However, how much high-quality training data matters for the performance of the record linkage algorithm remains largely unknown. This paper comprehensively examines the role of training data and, by extension, improves our understanding of best practices in supervised methods of probabilistic record linkage. First, we compare the speed and costs of building training data using different methods. Second, we document high rates of conditional accuracy across the training data sets, rates that are especially high when built with access to more information. Third, we show that data constructed by record linking algorithms learning from different training-data-generation methods do not substantially differ in their accuracy, either overall or across demographic groups, though algorithms tend to perform best when their feature space aligns with the features used to build the training data. Lastly, we introduce errors in the training data and find that the examined record linking algorithms are remarkably capable of making accurate links even when working with flawed training data.

Suggested Citation

  • Feigenbaum, James J & Helgertz, Jonas & Price, Joseph, 2025. "Examining the role of training data for supervised methods of automated record linkage: Lessons for best practice in economic history," Explorations in Economic History, Elsevier, vol. 96(C).
  • Handle: RePEc:eee:exehis:v:96:y:2025:i:c:s0014498325000038
    DOI: 10.1016/j.eeh.2025.101656

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0014498325000038
    Download Restriction: Full text for ScienceDirect subscribers only



    IDEAS is a RePEc service. RePEc uses bibliographic data supplied by the respective publishers.