Analyzing Cognitive Plausibility of Subword Tokenization

Lisa Beinborn, Yuval Pinter

Research output: Working paper / Preprint › Academic


Abstract

Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on downstream task performance, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation of the tokenizer output with the response time and accuracy of human performance on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and worse coverage of derivational morphemes, in contrast with prior work.
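The evaluation paradigm described above correlates a tokenizer's output (e.g., how many subwords a word is split into) with human lexical-decision measurements. A minimal sketch of that idea, using invented illustrative data and a hand-rolled Pearson correlation (the paper's exact statistics and datasets are not specified here):

```python
# Hypothetical sketch: correlate per-word subword split counts with human
# lexical-decision response times. All words, split counts, and response
# times below are invented illustrative values, not the paper's data.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# word -> (number of subwords under some tokenizer, mean response time in ms)
data = {
    "cat":        (1, 540.0),
    "unhappy":    (2, 610.0),
    "untangle":   (3, 640.0),
    "rethinking": (3, 655.0),
}

counts = [c for c, _ in data.values()]
times = [t for _, t in data.values()]
r = pearson(counts, times)
print(f"correlation between split count and response time: {r:.2f}")
```

Under this paradigm, a tokenizer whose split counts track human response times more closely would be considered more cognitively plausible; the same correlation can be computed against accuracy instead of response time.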
Original language: English
Pages: 1-9
Number of pages: 9
DOIs
Publication status: Published - 20 Oct 2023

Bibliographical note

EMNLP 2023 (main). Published in arXiv.

Keywords

  • cs.CL

  • Analyzing Cognitive Plausibility of Subword Tokenization

    Beinborn, L. & Pinter, Y., 2023. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Bouamor, H., Pino, J. & Bali, K. (eds.). Association for Computational Linguistics (ACL), pp. 4478-4486 (9 p.).

    Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

    Open Access
