Abstract
Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on downstream task performance, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze how the tokenizer output correlates with human response time and accuracy on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes, in contrast with prior work.
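As a rough illustration of the kind of analysis the abstract describes, the sketch below correlates a tokenizer-derived quantity with a human response measure. It rests on several assumptions that are not taken from the paper: the tokenizer-derived quantity is simply the number of subword pieces per word, `toy_tokenize` is a hypothetical greedy longest-match stand-in for a real subword tokenizer, the vocabulary is invented, and the response times are made-up placeholder values rather than experimental data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def toy_tokenize(word, vocab):
    """Hypothetical greedy longest-match segmentation (not a real tokenizer).

    Falls back to single characters when no vocabulary entry matches.
    """
    pieces = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


# Invented vocabulary and placeholder response times (ms); NOT data from the paper.
VOCAB = {"un", "believ", "able", "cat", "walk", "ing"}
PLACEHOLDER_RT_MS = {"cat": 520.0, "walking": 580.0, "unbelievable": 650.0}

words = list(PLACEHOLDER_RT_MS)
num_pieces = [len(toy_tokenize(w, VOCAB)) for w in words]
response_times = [PLACEHOLDER_RT_MS[w] for w in words]

r = pearson(num_pieces, response_times)
print(f"correlation between piece count and response time: {r:.3f}")
```

In a real analysis, the piece counts would come from trained tokenizers at different vocabulary sizes, and the response times and accuracies from a lexical decision dataset. The measure used in the paper may differ from the raw piece count assumed here.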
| Original language | English |
|---|---|
| Title of host publication | Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing |
| Editors | Houda Bouamor, Juan Pino, Kalika Bali |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 4478-4486 |
| Number of pages | 9 |
| ISBN (Electronic) | 9798891760608 |
| Publication status | Published - 2023 |
| Event | 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - Hybrid, Singapore, Singapore. Duration: 6 Dec 2023 → 10 Dec 2023 |
Conference
| Conference | 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 |
|---|---|
| Country/Territory | Singapore |
| City | Hybrid, Singapore |
| Period | 6/12/23 → 10/12/23 |
Bibliographical note
Publisher Copyright: © 2023 Association for Computational Linguistics.
Funding
Lisa Beinborn’s work was supported by the Dutch National Science Organisation (NWO) through the VENI program (Vl.Veni.211C.039). Yuval Pinter’s work was supported by a Google gift intended for work on Meaningful Subword Text Tokenization. We thank the reviewers for their thoughtful comments. We thank Joshua Snell and Omri Uzan for comments on early drafts.
| Funders |
|---|
| Dutch National Science organisation |
| Nederlandse Organisatie voor Wetenschappelijk Onderzoek |
Research output
Analyzing Cognitive Plausibility of Subword Tokenization
Beinborn, L. & Pinter, Y., 20 Oct 2023, p. 1-9, 9 p. Research output: Working paper / Preprint › Preprint › Academic