MedRoBERTa.nl: A Language Model for Dutch Electronic Health Records

Stella Verkijk*, Piek Vossen

*Corresponding author for this work

Research output: Contribution to Journal › Article › Academic › peer-review


Abstract

This paper presents MedRoBERTa.nl as the first Transformer-based language model for Dutch medical language. We show that, using 13 GB of text data from Dutch hospital notes, pre-training from scratch results in a better domain-specific language model than further pre-training RobBERT. When extending pre-training on RobBERT, we use a domain-specific vocabulary and re-train the embedding look-up layer. We show that MedRoBERTa.nl, the model trained from scratch, outperforms general language models for Dutch on a medical odd-one-out similarity task, reaching higher performance than those models after only 10k pre-training steps. When fine-tuned, MedRoBERTa.nl also outperforms general language models for Dutch on a task classifying sentences from Dutch hospital notes that contain information about patients' mobility levels.
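The second approach the abstract describes (further pre-training RobBERT with a domain-specific vocabulary and a re-trained embedding look-up layer) can be sketched with the Hugging Face transformers library. This is a minimal illustration under stated assumptions, not the authors' code: the RobBERT checkpoint identifier, the vocabulary size, and the placeholder corpus are all assumptions.

# Hedged sketch: swap in a domain-specific vocabulary and re-initialize
# the embedding look-up layer of RobBERT before continued pre-training.
# Checkpoint id, vocab size, and the corpus below are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForMaskedLM

base = "pdelobelle/robbert-v2-dutch-base"  # RobBERT checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Placeholder standing in for the 13 GB of Dutch hospital notes.
hospital_notes = ["Patiënt mobiliseert zelfstandig op de afdeling."]

# Train a new, domain-specific subword vocabulary on the clinical text.
new_tokenizer = tokenizer.train_new_from_iterator(hospital_notes, vocab_size=52000)

# Resize the embedding matrix to the new vocabulary and re-initialize it,
# so the look-up layer is learned anew during continued
# masked-language-model pre-training.
model.resize_token_embeddings(len(new_tokenizer))
model.get_input_embeddings().weight.data.normal_(mean=0.0, std=0.02)

After this re-initialization, pre-training continues with the usual masked-language-model objective; note that the paper reports the from-scratch model (MedRoBERTa.nl) nonetheless outperforming this variant.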

Original language: English
Pages (from-to): 141-159
Number of pages: 19
Journal: Computational Linguistics in the Netherlands Journal
Volume: 11
Early online date: 31 Dec 2021
Publication status: Published - Dec 2021
Event: 31st Computational Linguistics in the Netherlands (CLIN 2021) - Virtual, Online, Belgium
Duration: 9 Jul 2021 → …

Bibliographical note

Funding Information:
We would like to thank Edwin Geleijn and the a-proof team for making this research possible. We would like to thank Wietse de Vries and Pieter Delobelle for their advice on how to train the models. The GPUs used in this research were financed by the NWO Spinoza Project assigned to Piek Vossen (project number SPI 63-260). The pilot study that created the ICF dataset that we used to fine-tune our models on was financed by the Corona Research Fund (project number 2007793 - COVID 19 Textmining).

Funding Information:
Supporting clinical decision-making in the rehabilitation of patients with Covid-19 using data science ("Ondersteunen van klinische beslissingen in de revalidatie van patiënten met Covid-19 m.b.v. datascience"), funded by the UMC Covid fund.

Publisher Copyright:
© 2021 Stella Verkijk, Piek Vossen
