
A machine learning approach to predict ethnicity using personal name and census location in Canada

Author

Listed:
  • Kai On Wong
  • Osmar R Zaïane
  • Faith G Davis
  • Yutaka Yasui

Abstract

Background: Canada is an ethnically diverse country, yet the lack of ethnicity information in many large databases impedes effective population research and interventions. Automated ethnicity classification using machine learning has shown potential to address this data gap, but its performance in Canada is largely unknown. This study developed a large-scale machine learning framework to predict ethnicity using a novel set of name and census location features.
Methods: Multiclass and binary classification machine learning pipelines were developed using the 1901 census. The 13 ethnic categories examined were Aboriginal (First Nations, Métis, Inuit, and all combined), Chinese, English, French, Irish, Italian, Japanese, Russian, Scottish, and others. Machine learning algorithms included regularized logistic regression, C-support vector, and naïve Bayes classifiers. Name features consisted of the entire name string, substrings, Double Metaphone encodings, and various name-entity patterns, while location features consisted of the entire location string and substrings of province, district, and subdistrict. Predictive performance metrics included sensitivity, specificity, positive predictive value, negative predictive value, F1, area under the receiver operating characteristic curve, and accuracy.
Results: The census had 4,812,958 unique individuals. For multiclass classification, the highest performance achieved was 76% F1 and 91% accuracy. For binary classification of Chinese, French, Italian, Japanese, Russian, and others, F1 ranged from 68% to 95% (median 87%). The lower performance for English, Irish, and Scottish (F1 of 63–67%) was likely due to their shared cultural and linguistic heritage. Adding census location features to the name-based models strongly improved Aboriginal classification (F1 increased from 50% to 84%).
Conclusions: The automated machine learning approach using only name and census location features can predict the ethnicity of Canadians, with performance varying by ethnic category.
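
For readers less familiar with the performance metrics named in the abstract, the following is a minimal sketch of how the threshold-based ones are computed for a single binary classification (one ethnic category versus all others). It uses the standard textbook definitions and is not the authors' code; the function and class names are hypothetical, and the labels are made up for illustration. The area under the ROC curve is omitted because it requires predicted scores rather than hard labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryCounts:
    """Confusion-matrix counts for one ethnic category versus all others."""
    tp: int  # predicted in category, truly in category
    fp: int  # predicted in category, truly not in category
    fn: int  # predicted not in category, truly in category
    tn: int  # predicted not in category, truly not in category


def count_outcomes(y_true, y_pred, positive):
    """Tally the four outcomes, treating `positive` as the target class."""
    tp = fp = fn = tn = 0
    for truth, guess in zip(y_true, y_pred):
        if guess == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return BinaryCounts(tp, fp, fn, tn)


def _ratio(num, den):
    """Return num / den, or None when the metric is undefined (den == 0)."""
    return num / den if den else None


def binary_metrics(c):
    """Compute the label-based metrics listed in the abstract."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),  # recall of the category
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),          # positive predictive value
        "npv": _ratio(c.tn, c.tn + c.fn),          # negative predictive value
        "f1": _ratio(2 * c.tp, 2 * c.tp + c.fp + c.fn),
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn),
    }


# Made-up labels for illustration only (not data from the study).
y_true = ["French", "French", "Irish", "English", "French", "Irish"]
y_pred = ["French", "Irish", "Irish", "English", "French", "French"]

counts = count_outcomes(y_true, y_pred, positive="French")
metrics = binary_metrics(counts)
```

F1 is the harmonic mean of sensitivity and positive predictive value, so it ignores true negatives. This is one reason a rare category can show high accuracy alongside a much lower F1, as in the multiclass figures reported above.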

Suggested Citation

  • Kai On Wong & Osmar R Zaïane & Faith G Davis & Yutaka Yasui, 2020. "A machine learning approach to predict ethnicity using personal name and census location in Canada," PLOS ONE, Public Library of Science, vol. 15(11), pages 1-16, November.
  • Handle: RePEc:plo:pone00:0241239
    DOI: 10.1371/journal.pone.0241239

    Download full text from publisher

    File URL: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0241239
    Download Restriction: no

    File URL: https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0241239&type=printable
    Download Restriction: no

    File URL: https://libkey.io/10.1371/journal.pone.0241239?utm_source=ideas
    LibKey link: if access is restricted and if your library uses this service, LibKey will redirect you to where you can use your library subscription to access this item

    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Cummins, Neil, 2024. "Ethnic Wealth Inequality in England and Wales, 1858-2018," CEPR Discussion Papers 19398, C.E.P.R. Discussion Papers.
    2. Johannes Buggle & Thierry Mayer & Seyhun Orcan Sakalli & Mathias Thoenig, 2023. "The Refugee’s Dilemma: Evidence from Jewish Migration out of Nazi Germany," The Quarterly Journal of Economics, President and Fellows of Harvard College, vol. 138(2), pages 1273-1345.
    3. Chowdhury, Shyamal & Ooi, Evarn & Slonim, Robert, 2017. "Racial discrimination and white first name adoption: a field experiment in the Australian labour market," Working Papers 2017-15, University of Sydney, School of Economics.
    4. Mujcic, Redzo & Frijters, Paul, 2013. "Still Not Allowed on the Bus: It Matters If You're Black or White!," IZA Discussion Papers 7300, Institute of Labor Economics (IZA).
    5. Keith Head & Thierry Mayer, 2008. "Detection Of Local Interactions From The Spatial Pattern Of Names In France," Journal of Regional Science, Wiley Blackwell, vol. 48(1), pages 67-95, February.
    6. Lansley, Guy & Longley, Paul, 2016. "Deriving age and gender from forenames for consumer analytics," Journal of Retailing and Consumer Services, Elsevier, vol. 30(C), pages 271-278.
    7. Jan Hanousek & Štěpán Jurajda, 2018. "Názvy společností a jejich vliv na výkonnost firem [Corporate Names and Performance]," Politická ekonomie, Prague University of Economics and Business, vol. 2018(6), pages 671-688.
    8. Collado, M. Dolores & Ortuño Ortin, Ignacio & Romeu, Andrés, 2008. "Vertical Transmission of Consumption Behavior and the Distribution of Surnames," UMUFAE Economics Working Papers 2651, DIGITUM. Universidad de Murcia.
    9. Button, Patrick & Walker, Brigham, 2020. "Employment discrimination against Indigenous Peoples in the United States: Evidence from a field experiment," Labour Economics, Elsevier, vol. 65(C).
    10. Lisa Cook, 2014. "Violence and economic activity: evidence from African American patents, 1870–1940," Journal of Economic Growth, Springer, vol. 19(2), pages 221-257, June.
    11. Olivetti, Claudia & Paserman, M. Daniele & Salisbury, Laura, 2018. "Three-generation mobility in the United States, 1850–1940: The role of maternal and paternal grandparents," Explorations in Economic History, Elsevier, vol. 70(C), pages 73-90.
    12. Nicolás Ajzenman & Bruno Ferman & Pedro C. Sant’Anna, 2023. "Discrimination in the Formation of Academic Networks: A Field Experiment on #EconTwitter," Working Papers 235, Red Nacional de Investigadores en Economía (RedNIE).
    13. Leonardo Bursztyn & Thomas Chaney & Tarek Alexander Hassan & Aakaash Rao, 2021. "The Immigrant Next Door: Long-Term Contact, Generosity, and Prejudice," NBER Working Papers 28448, National Bureau of Economic Research, Inc.
    14. Yann Algan & Clément Malgouyres & Thierry Mayer & Mathias Thoenig, 2022. "The Economic Incentives of Cultural Transmission: Spatial Evidence from Naming Patterns Across France [‘Cultural assimilation during the age of mass migration’]," The Economic Journal, Royal Economic Society, vol. 132(642), pages 437-470.
    15. Samuel Bazzi & Arya Gaduh & Alexander D. Rothenberg & Maisy Wong, 2019. "Unity in Diversity? How Intergroup Contact Can Foster Nation Building," American Economic Review, American Economic Association, vol. 109(11), pages 3978-4025, November.
    16. Ran Abramitzky & Leah Platt Boustan & Dylan Connor, 2020. "Leaving the Enclave: Historical Evidence on Immigrant Mobility from the Industrial Removal Office," Working Papers 2020-35, Princeton University. Economics Department.
    17. Nicodemo, Catia & Raya, Josep M., 2018. "Does Juan Carlos or Nelson Obtain a Larger Price Cut in the Spanish Housing Market?," IZA Discussion Papers 11811, Institute of Labor Economics (IZA).
    18. Matthew Gentzkow & Jesse M. Shapiro & Matt Taddy, 2019. "Measuring Group Differences in High‐Dimensional Choices: Method and Application to Congressional Speech," Econometrica, Econometric Society, vol. 87(4), pages 1307-1340, July.
    19. Bindler, Anna Louisa & Hjalmarsson, Randi & Machin, Stephen Jonathan & Rubio, Melissa, 2023. "Murphy's Law or luck of the Irish? Disparate treatment of the Irish in 19th century courts," LSE Research Online Documents on Economics 121339, London School of Economics and Political Science, LSE Library.
    20. Devah Pager, 2007. "The Use of Field Experiments for Studies of Employment Discrimination: Contributions, Critiques, and Directions for the Future," The ANNALS of the American Academy of Political and Social Science, vol. 609(1), pages 104-133, January.
