Efficiently and Thoroughly Anonymizing a Transformer Language Model for Dutch Electronic Health Records: a Two-Step Method

Stella Verkijk, Piek Vossen

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Neural network (NN) architectures are increasingly used to model large amounts of data, such as text data available online. Transformer-based NN architectures have proven very useful for language modelling. Although many researchers study how such Language Models (LMs) work, little attention has been paid to the privacy risks of training LMs on large amounts of data and publishing them online. This paper presents a new method for anonymizing a language model, illustrated by the way in which MedRoBERTa.nl, a Dutch language model for hospital notes, was anonymized. The two-step method involves i) automatic anonymization of the training data and ii) semi-automatic anonymization of the LM's vocabulary. Using the fill-mask task, in which the model predicts which tokens are most probable in a given context, we tested how often the model predicts a name in a context where a name should appear. The model predicted a name-like token only 0.2% of the time, and any name-like token it did predict was never the name originally present in the training data. By explaining how an LM trained on highly private real-world medical data can be safely published with open access, we hope that more language resources will be published openly and responsibly so the community can profit from them.
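The leakage test described above can be sketched in a few lines. This is a hedged illustration only, not the authors' evaluation code: `name_leak_rate`, `toy_predictor`, and the small name lexicon are hypothetical stand-ins, where in practice the predictor would be a real fill-mask call (e.g. a Hugging Face fill-mask pipeline over the published model).

```python
# Sketch: estimate how often a fill-mask model's top prediction for a
# name slot is a name-like token. `predict_top_token` is a stand-in for
# a real fill-mask call; here it is a stub so the counting logic is
# self-contained.

def name_leak_rate(masked_sentences, predict_top_token, name_lexicon):
    """Fraction of name-slot contexts whose top prediction is name-like."""
    hits = 0
    for sentence in masked_sentences:
        token = predict_top_token(sentence)
        if token.strip().capitalize() in name_lexicon:
            hits += 1
    return hits / len(masked_sentences)

# Toy stand-in predictor and a tiny illustrative name list (both hypothetical).
def toy_predictor(sentence):
    return "patient"  # an anonymized model should favour generic tokens

contexts = ["<mask> was admitted yesterday.", "Spoke with <mask> about discharge."]
rate = name_leak_rate(contexts, toy_predictor, {"Jan", "Maria"})
print(rate)  # 0.0 with the toy predictor
```

A rate near zero, as with the toy predictor here, corresponds to the paper's finding that name-like tokens are predicted only 0.2% of the time.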

Original language: English
Title of host publication: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis
Place of Publication: Marseille
Publisher: European Language Resources Association (ELRA)
Pages: 1098-1103
Number of pages: 6
ISBN (Electronic): 9791095546726
Publication status: Published - Jun 2022
Event: 13th Language Resources and Evaluation Conference, LREC 2022 - Marseille, France
Duration: 20 Jun 2022 - 25 Jun 2022

Conference

Conference: 13th Language Resources and Evaluation Conference, LREC 2022
Country/Territory: France
City: Marseille
Period: 20/06/22 - 25/06/22

Bibliographical note

Funding Information:
We would like to thank the privacy office of the AUMC for taking the time to meet with us, taking an interest in our research and scrutinizing the process. Also, we would like to thank Edwin Geleijn for taking part in the meetings we had with the privacy office. The GPUs used for the creation of MedRoBERTa.nl were financed by the NWO Spinoza Project assigned to Piek Vossen (project number SPI 63-260).

Publisher Copyright:
© European Language Resources Association (ELRA), licensed under CC-BY-NC-4.0.

Keywords

  • Anonymization
  • Language Model
  • Medical Text Data
