Lisa Beinborn, Yuval Pinter
Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review
Subword tokenization has become the de-facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on the performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation of the tokenizer output with the response time and accuracy of human performance on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and a worse coverage of derivational morphemes, in contrast with prior work.
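The correlation analysis described in the abstract can be illustrated with a minimal sketch. It assumes one simple operationalization, the number of subword pieces per word correlated with that word's mean response time, which is not necessarily the measure used by the authors. The function names, the input format, and the toy values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def split_count_vs_response_time(segmentations, response_times):
    """Correlate the number of subword pieces per word with its response time.

    segmentations: dict mapping a word to its list of subword pieces
    response_times: dict mapping a word to a mean response time
    Only words present in both dicts are used.
    """
    words = sorted(set(segmentations) & set(response_times))
    n_pieces = [len(segmentations[w]) for w in words]
    times = [response_times[w] for w in words]
    return pearson(n_pieces, times)


# Made-up toy values for illustration only; NOT data from the paper.
toy_segmentations = {
    "cat": ["cat"],
    "walking": ["walk", "ing"],
    "unhappiness": ["un", "happi", "ness"],
}
toy_response_times_ms = {"cat": 500.0, "walking": 600.0, "unhappiness": 700.0}

r = split_count_vs_response_time(toy_segmentations, toy_response_times_ms)
print(f"Pearson r on toy data: {r:.2f}")
```

In this sketch, a positive coefficient would mean that words split into more pieces tend to have longer response times in the supplied data. The same helper could be run once per tokenizer and vocabulary size to compare the resulting coefficients.
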
Original language | English |
---|---|
Title of host publication | Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing |
Editors | Houda Bouamor, Juan Pino, Kalika Bali |
Publisher | Association for Computational Linguistics (ACL) |
Pages | 4478-4486 |
Number of pages | 9 |
ISBN (Electronic) | 9798891760608 |
Publication status | Published - 2023 |
Event | 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - Hybrid, Singapore, Singapore. Duration: 6 Dec 2023 → 10 Dec 2023 |
Conference | 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 |
Country/Territory | Singapore |
City | Hybrid, Singapore |
Period | 6/12/23 → 10/12/23 |
Lisa Beinborn’s work was supported by the Dutch National Science Organisation (NWO) through the VENI program (VI.Veni.211C.039). Yuval Pinter’s work was supported by a Google gift intended for work on Meaningful Subword Text Tokenization. We thank the reviewers for their thoughtful comments. We thank Joshua Snell and Omri Uzan for comments on early drafts.
Funders | Funder number |
---|---|
Dutch National Science Organisation | |
Nederlandse Organisatie voor Wetenschappelijk Onderzoek | |