Printed from https://ideas.repec.org/a/eee/exehis/v80y2021ics0014498321000024.html

Combining family history and machine learning to link historical records: The Census Tree data set

Author

Listed:
  • Price, Joseph
  • Buckles, Kasey
  • Van Leeuwen, Jacob
  • Riley, Isaac

Abstract

A key challenge for research on many questions in the social sciences is that it is difficult to link records in a way that allows investigators to observe people at different points in their life or across generations. In this paper, we contribute to recent efforts to create these links with a new approach that relies on millions of record links created by individual contributors to a large, public, wiki-style family tree. We use these “true” links both to inform the decisions one needs to make when using automated methods to link records and as a training data set for use in a supervised machine learning approach. We describe our procedure and illustrate its potential by linking individuals across the 100% samples of the US censuses from 1900, 1910, and 1920. When linking adjacent censuses, we obtain an overall match rate of 62-65 percent (for over 88.9 million matches), with a false positive rate that is around 6-7 percent and with links that are similar to the population along observable characteristics. Thus, our method allows us to link records with a combination of a high match rate, precision, and representativeness that is beyond the current frontier. Finally, we demonstrate the potential of the data by estimating the degree of intergenerational transmission of literacy between father-son and mother-daughter pairs.
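The two headline metrics in the abstract, match rate and false positive rate, can be illustrated with a minimal Python sketch. It is not the authors' code. It assumes that the match rate is the share of source records that receive a link, and that the false positive rate is the share of predicted links that disagree with a known "true" link, evaluated only on records where a true link exists. The function names and record IDs are hypothetical.

```python
def match_rate(predicted_links, n_records):
    """Share of source records that received a predicted link."""
    return len(predicted_links) / n_records


def false_positive_rate(predicted_links, true_links):
    """Share of checkable predicted links that disagree with the true link.

    Only source records that also appear in true_links can be checked.
    """
    checkable = [src for src in predicted_links if src in true_links]
    if not checkable:
        raise ValueError("no predicted link overlaps the true-link set")
    wrong = sum(predicted_links[src] != true_links[src] for src in checkable)
    return wrong / len(checkable)


# Toy data with hypothetical record IDs (not taken from the Census Tree).
# Keys are records in the earlier census, values are records in the later one.
predicted = {"a1": "b1", "a2": "b2", "a3": "b9", "a4": "b4"}
truth = {"a1": "b1", "a2": "b2", "a3": "b3", "a5": "b5"}

rate = match_rate(predicted, n_records=6)   # 4 of 6 source records linked
fpr = false_positive_rate(predicted, truth)  # 1 of 3 checkable links is wrong
```

In this toy case the link for `a4` cannot be checked because the true-link set says nothing about it, which is why the error rate is computed over three links rather than four.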

Suggested Citation

  • Price, Joseph & Buckles, Kasey & Van Leeuwen, Jacob & Riley, Isaac, 2021. "Combining family history and machine learning to link historical records: The Census Tree data set," Explorations in Economic History, Elsevier, vol. 80(C).
  • Handle: RePEc:eee:exehis:v:80:y:2021:i:c:s0014498321000024
    DOI: 10.1016/j.eeh.2021.101391

    Download full text from publisher

    File URL: http://www.sciencedirect.com/science/article/pii/S0014498321000024
    Download Restriction: Full text for ScienceDirect subscribers only

    File URL: https://libkey.io/10.1016/j.eeh.2021.101391?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    References listed on IDEAS

    1. Fouka, Vasiliki, 2019. "How Do Immigrants Respond to Discrimination? The Case of Germans in the US During World War I," American Political Science Review, Cambridge University Press, vol. 113(2), pages 405-422, May.
    2. Raj Chetty & Nathaniel Hendren, 2018. "The Impacts of Neighborhoods on Intergenerational Mobility I: Childhood Exposure Effects," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 133(3), pages 1107-1162.
    3. Ran Abramitzky & Roy Mill & Santiago Pérez, 2020. "Linking individuals across historical sources: A fully automated approach," Historical Methods: A Journal of Quantitative and Interdisciplinary History, Taylor & Francis Journals, vol. 53(2), pages 94-111, April.
    4. Raj Chetty & John N. Friedman & Emmanuel Saez & Nicholas Turner & Danny Yagan, 2017. "Mobility Report Cards: The Role of Colleges in Intergenerational Mobility," Working Papers 2017-059, Human Capital and Economic Opportunity Working Group.
    5. Claudia Olivetti & M. Daniele Paserman, 2015. "In the Name of the Son (and the Daughter): Intergenerational Mobility in the United States, 1850-1940," American Economic Review, American Economic Association, vol. 105(8), pages 2695-2724, August.
    6. Ran Abramitzky & Leah Boustan & Katherine Eriksson & James Feigenbaum & Santiago Pérez, 2021. "Automated Linking of Historical Data," Journal of Economic Literature, American Economic Association, vol. 59(3), pages 865-918, September.
    7. Alexander, Rohan & Ward, Zachary, 2018. "Age at Arrival and Assimilation During the Age of Mass Migration," The Journal of Economic History, Cambridge University Press, vol. 78(3), pages 904-937, September.
    8. James Feigenbaum & Daniel P Gross, 2024. "Answering the Call of Automation: How the Labor Market Adjusted to Mechanizing Telephone Operation," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 139(3), pages 1879-1939.
    9. Beach, Brian & Ferrie, Joseph & Saavedra, Martin & Troesken, Werner, 2016. "Typhoid Fever, Water Quality, and Human Capital Formation," The Journal of Economic History, Cambridge University Press, vol. 76(1), pages 41-75, March.
    10. Joseph Price & Kasey Buckles & Jacob Van Leeuwen & Isaac Riley, 2019. "Combining Family History and Machine Learning to Link Historical Records," NBER Working Papers 26227, National Bureau of Economic Research, Inc.
    11. Mary F. Evans & Eric Helland & Jonathan Klick & Ashwin Patel, 2016. "The Developmental Effect Of State Alcohol Prohibitions At The Turn Of The Twentieth Century," Economic Inquiry, Western Economic Association International, vol. 54(2), pages 762-777, April.
    12. Catherine G. Massey, 2017. "Playing with matches: An assessment of accuracy in linked historical data," Historical Methods: A Journal of Quantitative and Interdisciplinary History, Taylor & Francis Journals, vol. 50(3), pages 129-143, July.
    13. Collins, William J. & Wanamaker, Marianne H., 2015. "The Great Migration in Black and White: New Evidence on the Selection and Sorting of Southern Migrants," The Journal of Economic History, Cambridge University Press, vol. 75(4), pages 947-992, December.
    14. James J. Feigenbaum, 2018. "Multiple Measures of Historical Intergenerational Mobility: Iowa 1915 to 1940," Economic Journal, Royal Economic Society, vol. 128(612), pages 446-481, July.
    15. repec:bla:ecinqu:v:51:y:2013:i:3:p:1795-1808 is not listed on IDEAS
    16. Sendhil Mullainathan & Jann Spiess, 2017. "Machine Learning: An Applied Econometric Approach," Journal of Economic Perspectives, American Economic Association, vol. 31(2), pages 87-106, Spring.
    17. Ran Abramitzky & Leah Platt Boustan & Katherine Eriksson, 2014. "A Nation of Immigrants: Assimilation and Economic Outcomes in the Age of Mass Migration," Journal of Political Economy, University of Chicago Press, vol. 122(3), pages 467-506.
    18. Solon, Gary, 1992. "Intergenerational Income Mobility in the United States," American Economic Review, American Economic Association, vol. 82(3), pages 393-408, June.

    Citations



    Cited by:

    1. Chan, Jeff, 2024. "The long-run effects of childhood exposure to market access shocks: Evidence from the US railroad network expansion," Explorations in Economic History, Elsevier, vol. 91(C).
    2. Wolfgang Keller & Carol H. Shiue, 2023. "Intergenerational Mobility of Daughters and Marital Sorting: New Evidence from Imperial China," NBER Working Papers 31695, National Bureau of Economic Research, Inc.
    3. Adrian Haws & David R. Just & Joseph Price, 2025. "Who (actually) gets the farm? Intergenerational farm succession in the United States," American Journal of Agricultural Economics, John Wiley & Sons, vol. 107(1), pages 3-26, January.
    4. Philipp Ager & Casper Worm Hansen & Peter Z. Lin, 2023. "Medical Technology and Life Expectancy: Evidence from the Antitoxin Treatment of Diphtheria," Working Papers 0241, European Historical Economics Society (EHES).
    5. Anna Aizer & Gabrielle Grafton & Santiago Pérez, 2025. "Daughters as Safety Net? Family Responses to Parental Employment Shocks: Evidence from Alcohol Prohibition," NBER Working Papers 33346, National Bureau of Economic Research, Inc.
    6. Youssouf Merouani & Faustine Perrin, 2022. "Gender and the long-run development process. A survey of the literature," European Review of Economic History, European Historical Economics Society, vol. 26(4), pages 612-641.
    7. Postel, Hannah M., 2022. "Record Linkage for Character-Based Surnames: Evidence from Chinese Exclusion," SocArXiv rckjp, Center for Open Science.
    8. Postel, Hannah M., 2023. "Record linkage for character-based surnames: Evidence from Chinese exclusion," Explorations in Economic History, Elsevier, vol. 87(C).
    9. Anbinder, Tyler & Connor, Dylan & Ó Gráda, Cormac & Wegge, Simone, 2021. "The Problem of False Positives in Automated Census Linking: Evidence from Nineteenth-Century New York's Irish Immigrants," CAGE Online Working Paper Series 568, Competitive Advantage in the Global Economy (CAGE).
    10. Hwang, Sam Il Myoung & Squires, Munir, 2024. "Linked samples and measurement error in historical US census data," Explorations in Economic History, Elsevier, vol. 93(C).

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Joseph Price & Kasey Buckles & Jacob Van Leeuwen & Isaac Riley, 2019. "Combining Family History and Machine Learning to Link Historical Records," NBER Working Papers 26227, National Bureau of Economic Research, Inc.
    2. Combes, Pierre-Philippe & Gobillon, Laurent & Zylberberg, Yanos, 2022. "Urban economics in a historical perspective: Recovering data with machine learning," Regional Science and Urban Economics, Elsevier, vol. 94(C).
    3. Dahl, Christian M. & Johansen, Torben S.D. & Sørensen, Emil N. & Wittrock, Simon, 2023. "HANA: A handwritten name database for offline handwritten text recognition," Explorations in Economic History, Elsevier, vol. 87(C).
    4. Collins, William J. & Zimran, Ariell, 2019. "The economic assimilation of Irish Famine migrants to the United States," Explorations in Economic History, Elsevier, vol. 74(C).
    5. Krzysztof Karbownik & Anthony Wray, 2019. "Educational, Labor-market and Intergenerational Consequences of Poor Childhood Health," NBER Working Papers 26368, National Bureau of Economic Research, Inc.
    6. Ran Abramitzky & Leah Platt Boustan & Elisa Jácome & Santiago Pérez, 2019. "Intergenerational Mobility of Immigrants over Two Centuries," Working Papers 2019-6, Princeton University, Economics Department.
    7. Inwood, Kris & Minns, Chris & Summerfield, Fraser, 2019. "Occupational income scores and immigrant assimilation. Evidence from the Canadian census," Explorations in Economic History, Elsevier, vol. 72(C), pages 114-122.
    8. Zachary Ward, 2023. "Intergenerational Mobility in American History: Accounting for Race and Measurement Error," American Economic Review, American Economic Association, vol. 113(12), pages 3213-3248, December.
    9. Martha J. Bailey & Connor Cole & Morgan Henderson & Catherine Massey, 2020. "How Well Do Automated Linking Methods Perform? Lessons from US Historical Data," Journal of Economic Literature, American Economic Association, vol. 58(4), pages 997-1044, December.
    10. Elisa Jácome & Ilyana Kuziemko & Suresh Naidu, 2021. "Mobility for All: Representative Intergenerational Mobility Estimates over the 20th Century," Working Papers 302, Princeton University, Department of Economics, Center for Economic Policy Studies.
    11. Ran Abramitzky & Leah Boustan & Katherine Eriksson & James Feigenbaum & Santiago Pérez, 2021. "Automated Linking of Historical Data," Journal of Economic Literature, American Economic Association, vol. 59(3), pages 865-918, September.
    12. Zhu, Ziming, 2022. "Like father like son? Intergenerational immobility in England, 1851-1911," Economic History Working Papers 117588, London School of Economics and Political Science, Department of Economic History.
    13. Saavedra, Martin & Twinam, Tate, 2020. "A machine learning approach to improving occupational income scores," Explorations in Economic History, Elsevier, vol. 75(C).
    14. Chong Lu, 2022. "The effect of migration on rural residents’ intergenerational subjective social status mobility in China," Quality & Quantity: International Journal of Methodology, Springer, vol. 56(5), pages 3279-3308, October.
    15. Santiago Pérez, 2019. "Southern (American) Hospitality: Italians in Argentina and the US during the Age of Mass Migration," NBER Working Papers 26127, National Bureau of Economic Research, Inc.
    16. Bertrand Garbinti & Frédérique Savignac, 2020. "Accounting for Intergenerational Wealth Mobility in France over the 20th Century: Method and Estimations," Working papers 776, Banque de France.
    17. Valerie Michelman & Joseph Price & Seth D Zimmerman, 2022. "Old Boys’ Clubs and Upward Mobility Among the Educational Elite," The Quarterly Journal of Economics, Oxford University Press, vol. 137(2), pages 845-909.
    18. Juliana Jaramillo-Echeverri, 2024. "Movilidad social en la educación: el caso de la Universidad de los Andes en Colombia entre 1949 y 2018," Cuadernos de Historia Económica 61, Banco de la Republica de Colombia.
    19. Sotiris Kampanelis & Aldo Elizalde, 2024. "Lynching and economic opportunities: Evidence from the US South," Kyklos, Wiley Blackwell, vol. 77(4), pages 977-1003, November.

    More about this item

    Keywords

    Record linking; Genealogy data; Machine learning; Intergenerational transmission;

    JEL classification:

    • N01 - Economic History - - General - - - Development of the Discipline: Historiographical; Sources and Methods
    • N11 - Economic History - - Macroeconomics and Monetary Economics; Industrial Structure; Growth; Fluctuations - - - U.S.; Canada: Pre-1913
    • N12 - Economic History - - Macroeconomics and Monetary Economics; Industrial Structure; Growth; Fluctuations - - - U.S.; Canada: 1913-
    • C8 - Mathematical and Quantitative Methods - - Data Collection and Data Estimation Methodology; Computer Programs
