
Detective Gadget: Generic Iterative Entity Resolution over Dirty Data

Authors

  • Marcello Buoncristiano

    (Svelto!—Big Data-Cleaning and Analytics, 85100 Potenza, Italy
    These authors contributed equally to this work.)

  • Giansalvatore Mecca

    (Dipartimento di Ingegneria, Università degli Studi della Basilicata, 85100 Potenza, Italy
    These authors contributed equally to this work.)

  • Donatello Santoro

    (Dipartimento di Ingegneria, Università degli Studi della Basilicata, 85100 Potenza, Italy
    These authors contributed equally to this work.)

  • Enzo Veltri

    (Dipartimento di Ingegneria, Università degli Studi della Basilicata, 85100 Potenza, Italy
    These authors contributed equally to this work.)

Abstract

In the era of Big Data, entity resolution (ER), i.e., the process of identifying which records refer to the same real-world entity, plays a critical role in data-integration tasks. This is especially true in mission-critical applications where accuracy is mandatory, since both merging distinct entities and missing true matches must be avoided. However, existing approaches struggle with rapidly changing and dirty data, which call for iterative refinement over time. We present Detective Gadget, a novel system for iterative ER that seamlessly integrates data cleaning into the ER workflow. Detective Gadget employs an alias-based hashing mechanism for fast and scalable matching, check functions to detect and correct mismatches, and a human-in-the-loop framework to refine results through expert feedback. The system iteratively improves data quality and matching accuracy by leveraging evidence from both automated and manual decisions. Extensive experiments across diverse real-world scenarios demonstrate its effectiveness, achieving high accuracy and efficiency while adapting to evolving datasets.
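
The abstract does not spell out how the alias-based hashing or the check functions work, so the following Python sketch is only one plausible reading, not the authors' implementation. It assumes that an alias is a normalized key derived from a record, that records sharing an alias become candidate matches, and that a check function can veto a candidate pair and route it to an expert. The names aliases, birth_year_check, and resolve, as well as the record fields, are hypothetical.

```python
import re
from collections import defaultdict


def aliases(record):
    """Hypothetical alias generator: normalized keys for a record."""
    keys = set()
    # Lowercase, drop punctuation, and sort tokens so "Rossi, Anna" == "Anna Rossi".
    name = re.sub(r"[^a-z ]", "", record["name"].lower())
    tokens = sorted(name.split())
    if tokens:
        keys.add("name:" + " ".join(tokens))
    email = record.get("email", "").strip().lower()
    if email:
        keys.add("email:" + email)
    return keys


def birth_year_check(a, b):
    """Hypothetical check function: veto pairs with conflicting birth years."""
    ya, yb = a.get("birth_year"), b.get("birth_year")
    return ya is None or yb is None or ya == yb


def resolve(records, checks):
    """Bucket records by alias, then split candidate pairs by the checks."""
    index = defaultdict(list)
    for rid, rec in records.items():
        for key in aliases(rec):
            index[key].append(rid)

    matches, needs_review = set(), set()
    for ids in index.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pair = tuple(sorted((ids[i], ids[j])))
                a, b = records[pair[0]], records[pair[1]]
                if all(check(a, b) for check in checks):
                    matches.add(pair)
                else:
                    # Assumed human-in-the-loop step: an expert decides these.
                    needs_review.add(pair)
    return matches, needs_review


records = {
    1: {"name": "Anna Rossi", "email": "a.rossi@example.org", "birth_year": 1980},
    2: {"name": "Rossi, Anna", "email": "", "birth_year": 1980},
    3: {"name": "Anna Rossi", "email": "", "birth_year": 1975},
}
matches, needs_review = resolve(records, [birth_year_check])
```

In this toy run, records 1 and 2 share a name alias and pass the check, while both pairs involving record 3 are flagged for review because the birth years conflict. An iterative variant would feed expert decisions back in as new aliases or corrected values before the next pass.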

Suggested Citation

  • Marcello Buoncristiano & Giansalvatore Mecca & Donatello Santoro & Enzo Veltri, 2024. "Detective Gadget: Generic Iterative Entity Resolution over Dirty Data," Data, MDPI, vol. 9(12), pages 1-32, November.
  • Handle: RePEc:gam:jdataj:v:9:y:2024:i:12:p:139-:d:1528760

    Download full text from publisher

    File URL: https://www.mdpi.com/2306-5729/9/12/139/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/2306-5729/9/12/139/
    Download Restriction: no