Bag & Tag'em - A new Dutch stemmer

Anne Jonker, Corné de Ruijt, Jornt R. de Gruijl

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

We propose a novel stemming algorithm that is robust and accurate relative to state-of-the-art solutions and that addresses several problems current stemmers face in the Dutch language. The main issue is that most current stemmers cannot handle third-person singular verb forms or many irregular words and conjugations unless a (nearly) brute-force approach is used. Our algorithm, Bag & Tag'em (BT), combines a new tagging module with a stemmer that applies tag-specific sets of rigid rules. The tagging module is developed and evaluated using three algorithms: Multinomial Logistic Regression (MLR), Neural Network (NN) and Extreme Gradient Boosting (XGB). The stemming module's performance is compared with that of current state-of-the-art stemming algorithms for Dutch. Although there is still room for improvement, the BT algorithm performs well: it is more accurate than current stemmers and faster than brute-force-like algorithms. The code and data used for this paper can be found at: https://github.com/Anne-Jonker/Bag-Tag-em.
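The tag-then-stem idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lookup-based `toy_tagger` merely stands in for a trained tagging model (MLR, NN or XGB), and the `RULES` table, the tag names and the minimum-length check are all hypothetical choices made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical tag-specific suffix rules: (suffix, replacement), tried in order.
# These are illustrative only and are not the rule sets from the paper.
RULES: Dict[str, List[Tuple[str, str]]] = {
    "VERB": [("en", ""), ("t", "")],
    "NOUN": [("en", ""), ("s", "")],
}


def toy_tagger(word: str) -> str:
    """Stand-in for a trained tagger: a tiny hand-written lexicon (assumption)."""
    lexicon = {
        "loopt": "VERB",   # third-person singular of "lopen"
        "werkt": "VERB",   # third-person singular of "werken"
        "boeken": "NOUN",  # plural of "boek"
        "krant": "NOUN",   # singular noun that happens to end in "t"
    }
    return lexicon.get(word, "OTHER")


def stem(word: str, tagger: Callable[[str], str] = toy_tagger) -> str:
    """Tag the word first, then apply only the rules belonging to that tag."""
    tag = tagger(word)
    for suffix, replacement in RULES.get(tag, []):
        # Keep at least three characters so very short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: len(word) - len(suffix)] + replacement
    return word  # no rule matched, or the tag has no rules


if __name__ == "__main__":
    for w in ["loopt", "werkt", "boeken", "krant"]:
        print(w, "->", stem(w))
```

In this sketch the final "t" is stripped from the verb "loopt" but kept on the noun "krant", because the "t" rule exists only in the hypothetical verb rule set; a tag-blind suffix stripper would have to treat both words the same way.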

Original language: English
Title of host publication: LREC 2020
Subtitle of host publication: Proceedings of the 12th International Conference on Language Resources and Evaluation
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 3868-3876
Number of pages: 9
ISBN (Electronic): 9791095546344
ISBN (Print): 9791095546344
Publication status: Published - May 2020
Event: 12th International Conference on Language Resources and Evaluation, LREC 2020 - Marseille, France
Duration: 11 May 2020 to 16 May 2020

Conference

Conference: 12th International Conference on Language Resources and Evaluation, LREC 2020
Country/Territory: France
City: Marseille
Period: 11/05/20 to 16/05/20

Keywords

  • Dutch
  • PoS tagging
  • Stemming
