Abstract
This paper presents MedRoBERTa.nl, the first Transformer-based language model for Dutch medical language. We show that, using 13GB of text from Dutch hospital notes, pre-training from scratch yields a better domain-specific language model than further pre-training RobBERT. When extending the pre-training of RobBERT, we use a domain-specific vocabulary and re-train the embedding look-up layer. We show that MedRoBERTa.nl, the model trained from scratch, outperforms general Dutch language models on a medical odd-one-out similarity task, reaching higher performance than these models after only 10k pre-training steps. When fine-tuned, MedRoBERTa.nl also outperforms general Dutch language models on a task classifying sentences from Dutch hospital notes that contain information about patients' mobility levels.
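A minimal sketch, assuming the Hugging Face `transformers` library, of the continued-pre-training setup described above: start from RobBERT, switch to a domain-specific vocabulary, and re-initialize the embedding look-up layer so it is re-trained on the hospital notes. The tokenizer path is a hypothetical placeholder; this is not the authors' exact pipeline.

```python
from transformers import RobertaForMaskedLM, RobertaTokenizerFast

# General-domain Dutch RoBERTa (RobBERT) as the starting point.
model = RobertaForMaskedLM.from_pretrained("pdelobelle/robbert-v2-dutch-base")

# Hypothetical BPE tokenizer trained on the Dutch hospital notes.
domain_tokenizer = RobertaTokenizerFast.from_pretrained("./dutch-medical-tokenizer")

# Match the embedding matrix to the new, domain-specific vocabulary size.
model.resize_token_embeddings(len(domain_tokenizer))

# Re-initialize the whole look-up layer so the embeddings are learned anew for
# the domain vocabulary, while the Transformer body keeps RobBERT's weights.
model.get_input_embeddings().weight.data.normal_(
    mean=0.0, std=model.config.initializer_range
)
```

Masked-language-model pre-training on the hospital notes would then continue from this state.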
| Original language | English |
| --- | --- |
| Pages (from-to) | 141-159 |
| Number of pages | 19 |
| Journal | Computational Linguistics in the Netherlands Journal |
| Volume | 11 |
| Early online date | 31 Dec 2021 |
| Publication status | Published - Dec 2021 |
| Event | 31st Computational Linguistics in the Netherlands Conference, CLIN 2021 - Virtual, Online, Belgium; Duration: 9 Jul 2021 → … |
Bibliographical note
Funding Information: We would like to thank Edwin Geleijn and the a-proof team for making this research possible. We would like to thank Wietse de Vries and Pieter Delobelle for their advice on how to train the models. The GPUs used in this research were financed by the NWO Spinoza Project assigned to Piek Vossen (project number SPI 63-260). The pilot study that created the ICF dataset that we used to fine-tune our models on was financed by the Corona Research Fund (project number 2007793 - COVID 19 Textmining).
Funding Information: Supporting clinical decision-making in the rehabilitation of patients with Covid-19 using data science, funded by the UMC Covid fund.
Publisher Copyright:
© 2021 Stella Verkijk, Piek Vossen