
Masketeer: An Ensemble-Based Pseudonymization Tool with Entity Recognition for German Unstructured Medical Free Text

Author

Listed:
  • Martin Baumgartner

    (Center for Health and Bioresources, AIT Austrian Institute of Technology, 8020 Graz, Austria
    Institute of Neural Engineering, Graz University of Technology, 8010 Graz, Austria)

  • Karl Kreiner

    (Center for Health and Bioresources, AIT Austrian Institute of Technology, 8020 Graz, Austria)

  • Fabian Wiesmüller

    (Center for Health and Bioresources, AIT Austrian Institute of Technology, 8020 Graz, Austria
    Institute of Neural Engineering, Graz University of Technology, 8010 Graz, Austria
    Ludwig Boltzmann Institute for Digital Health and Prevention, 5020 Salzburg, Austria)

  • Dieter Hayn

    (Center for Health and Bioresources, AIT Austrian Institute of Technology, 8020 Graz, Austria
    Ludwig Boltzmann Institute for Digital Health and Prevention, 5020 Salzburg, Austria)

  • Christian Puelacher

    (Department of Internal Medicine III, Cardiology and Angiology, University Hospital Innsbruck, Medical University Innsbruck, 6020 Innsbruck, Austria)

  • Günter Schreier

    (Center for Health and Bioresources, AIT Austrian Institute of Technology, 8020 Graz, Austria
    Institute of Neural Engineering, Graz University of Technology, 8010 Graz, Austria)

Abstract

Background: The recent rise of large language models has triggered renewed interest in medical free text data, which holds critical information about patients and diseases. However, medical free text is also highly sensitive. Therefore, de-identification is typically required but is complicated since medical free text is mostly unstructured. With the Masketeer algorithm, we present an effective tool to de-identify German medical text. Methods: We used an ensemble of different masking classes to remove references to identifiable data from over 35,000 clinical notes in accordance with the HIPAA Safe Harbor Guidelines. To retain additional context for readers, we implemented an entity recognition scheme and corpus-wide pseudonymization. Results: The algorithm performed with a sensitivity of 0.943 and specificity of 0.933. Further performance analyses showed linear runtime complexity (O(n)) with both increasing text length and corpus size. Conclusions: In the future, large language models will likely be able to de-identify medical free text more effectively and thoroughly than handcrafted rules. However, such gold-standard de-identification tools based on large language models are yet to emerge. In the current absence of such, we hope to provide best practices for a robust rule-based algorithm designed with expert domain knowledge.
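The abstract names two ideas that a short example can make concrete: an ensemble of masking classes, and corpus-wide pseudonymization that keeps entity types visible to readers. The sketch below is illustrative only and is not the Masketeer implementation. The abstract does not specify the masking rules, the entity types, or the placeholder format, so the two maskers (`mask_dates`, `mask_titled_names`), their regular expressions, the `Pseudonymizer` class, and the `[TYPE_n]` placeholder scheme are all hypothetical. It also covers only two identifier categories, far fewer than the HIPAA Safe Harbor list requires.

```python
import re
from typing import Callable, Dict, List, Tuple

# A detected span: (start offset, end offset, entity type).
Span = Tuple[int, int, str]
Masker = Callable[[str], List[Span]]


def mask_dates(text: str) -> List[Span]:
    """Hypothetical masker: German-style dates such as 03.05.2021."""
    pattern = r"\b\d{1,2}\.\d{1,2}\.\d{4}\b"
    return [(m.start(), m.end(), "DATE") for m in re.finditer(pattern, text)]


def mask_titled_names(text: str) -> List[Span]:
    """Hypothetical masker: a capitalized word following a salutation or title."""
    pattern = r"\b(?:Herr|Frau|Dr\.)\s+([A-ZÄÖÜ][a-zäöüß]+)"
    return [(m.start(1), m.end(1), "NAME") for m in re.finditer(pattern, text)]


class Pseudonymizer:
    """Runs every masker on a text and replaces hits with typed placeholders.

    The lookup table lives on the instance, so the same surface string
    receives the same placeholder in every document of the corpus.
    """

    def __init__(self, maskers: List[Masker]) -> None:
        self.maskers = maskers
        self.table: Dict[Tuple[str, str], str] = {}
        self.counters: Dict[str, int] = {}

    def _pseudonym(self, entity_type: str, surface: str) -> str:
        key = (entity_type, surface)
        if key not in self.table:
            count = self.counters.get(entity_type, 0) + 1
            self.counters[entity_type] = count
            self.table[key] = f"[{entity_type}_{count}]"
        return self.table[key]

    def apply(self, text: str) -> str:
        # Pool the spans from all maskers and process them left to right.
        spans = sorted(s for masker in self.maskers for s in masker(text))
        parts: List[str] = []
        cursor = 0
        for start, end, entity_type in spans:
            if start < cursor:
                continue  # overlaps a span that was already replaced
            parts.append(text[cursor:start])
            parts.append(self._pseudonym(entity_type, text[start:end]))
            cursor = end
        parts.append(text[cursor:])
        return "".join(parts)
```

Applying one `Pseudonymizer` instance to several notes in turn would map a repeated name to the same placeholder in each note, while the type prefix tells the reader whether a masked token was a name or a date. Because each masker scans the text once, the runtime of this sketch grows roughly linearly with text length, although that says nothing about how the published algorithm achieves its reported complexity.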

Suggested Citation

  • Martin Baumgartner & Karl Kreiner & Fabian Wiesmüller & Dieter Hayn & Christian Puelacher & Günter Schreier, 2024. "Masketeer: An Ensemble-Based Pseudonymization Tool with Entity Recognition for German Unstructured Medical Free Text," Future Internet, MDPI, vol. 16(8), pages 1-16, August.
  • Handle: RePEc:gam:jftint:v:16:y:2024:i:8:p:281-:d:1450823

    Download full text from publisher

    File URL: https://www.mdpi.com/1999-5903/16/8/281/pdf
    Download Restriction: no

    File URL: https://www.mdpi.com/1999-5903/16/8/281/
    Download Restriction: no