Printed from https://ideas.repec.org/a/nat/natcom/v13y2022i1d10.1038_s41467-022-28818-3.html

Active label cleaning for improved dataset quality under resource constraints

Authors
  • Mélanie Bernhardt

    (Health Intelligence, Microsoft Research Cambridge)

  • Daniel C. Castro

    (Health Intelligence, Microsoft Research Cambridge)

  • Ryutaro Tanno

    (Health Intelligence, Microsoft Research Cambridge)

  • Anton Schwaighofer

    (Health Intelligence, Microsoft Research Cambridge)

  • Kerem C. Tezcan

    (Health Intelligence, Microsoft Research Cambridge)

  • Miguel Monteiro

    (Health Intelligence, Microsoft Research Cambridge)

  • Shruthi Bannur

    (Health Intelligence, Microsoft Research Cambridge)

  • Matthew P. Lungren

    (Stanford University)

  • Aditya Nori

    (Health Intelligence, Microsoft Research Cambridge)

  • Ben Glocker

    (Health Intelligence, Microsoft Research Cambridge)

  • Javier Alvarez-Valle

    (Health Intelligence, Microsoft Research Cambridge)

  • Ozan Oktay

    (Health Intelligence, Microsoft Research Cambridge)

Abstract

Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation, which we term “active label cleaning”. We propose to rank instances according to the estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts’ valuable time for improving dataset quality.
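The ranking idea in the abstract can be illustrated with a small sketch. The abstract states only that instances are ranked by estimated label correctness and labelling difficulty; the specific score used here (cross-entropy of the current label under a model's predicted class probabilities, minus the entropy of those probabilities) is an assumption chosen for illustration, and the names `relabel_priority`, `rank_for_relabelling`, and `budget` are hypothetical. Under this assumed score, a sample rises in the queue when the model confidently disagrees with its label, and falls when the prediction itself is ambiguous.

```python
import math


def relabel_priority(posterior, noisy_label, eps=1e-12):
    """Illustrative score for one sample; higher means 'relabel sooner'.

    posterior: list of predicted class probabilities (assumed to sum to 1).
    noisy_label: index of the label currently stored in the dataset.
    """
    # Cross-entropy of the stored label under the model: large when the
    # model assigns low probability to that label (possible mislabelling).
    cross_entropy = -math.log(posterior[noisy_label] + eps)
    # Entropy of the prediction: large when the sample looks ambiguous,
    # used here as a stand-in for labelling difficulty.
    entropy = -sum(p * math.log(p + eps) for p in posterior)
    return cross_entropy - entropy


def rank_for_relabelling(posteriors, noisy_labels, budget):
    """Return indices of the `budget` highest-scoring samples."""
    scores = [relabel_priority(p, y) for p, y in zip(posteriors, noisy_labels)]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:budget]


# Toy data: three samples, all currently labelled as class 0.
posteriors = [
    [0.95, 0.05],  # model confidently agrees with the label
    [0.05, 0.95],  # model confidently disagrees with the label
    [0.50, 0.50],  # model is undecided
]
noisy_labels = [0, 0, 0]

selected = rank_for_relabelling(posteriors, noisy_labels, budget=2)
print(selected)  # [1, 2]
```

With a budget of two, the confidently contradicted sample is queued first and the undecided one second, while the sample whose label the model confirms is left alone. A random-selection baseline would instead draw the two indices uniformly at random.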

Suggested Citation

  • Mélanie Bernhardt & Daniel C. Castro & Ryutaro Tanno & Anton Schwaighofer & Kerem C. Tezcan & Miguel Monteiro & Shruthi Bannur & Matthew P. Lungren & Aditya Nori & Ben Glocker & Javier Alvarez-Valle & Ozan Oktay, 2022. "Active label cleaning for improved dataset quality under resource constraints," Nature Communications, Nature, vol. 13(1), pages 1-11, December.
  • Handle: RePEc:nat:natcom:v:13:y:2022:i:1:d:10.1038_s41467-022-28818-3
    DOI: 10.1038/s41467-022-28818-3

    Download full text from publisher

    File URL: https://www.nature.com/articles/s41467-022-28818-3
    File Function: Abstract
    Download Restriction: no



    Most related items

    These are the items that most often cite the same works as this one and are cited by the same works as this one.
    1. Bria Long & Judith E. Fan & Holly Huey & Zixian Chai & Michael C. Frank, 2024. "Parallel developmental changes in children’s production and recognition of line drawings of visual concepts," Nature Communications, Nature, vol. 15(1), pages 1-15, December.
