Analyzing Cognitive Plausibility of Subword Tokenization

Lisa Beinborn, Yuval Pinter

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Subword tokenization has become the de facto standard for tokenization, although comparative evaluations of subword vocabulary quality across languages are scarce. Existing evaluation studies focus on the effect of a tokenization algorithm on performance in downstream tasks, or on engineering criteria such as the compression rate. We present a new evaluation paradigm that focuses on the cognitive plausibility of subword tokenization. We analyze the correlation of the tokenizer output with the response time and accuracy of human performance on a lexical decision task. We compare three tokenization algorithms across several languages and vocabulary sizes. Our results indicate that the UnigramLM algorithm yields less cognitively plausible tokenization behavior and a worse coverage of derivational morphemes, in contrast with prior work.
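As a rough illustration of the kind of analysis the abstract describes, the sketch below correlates the number of subword pieces per word with a per-word response time. It is not the authors' code: the greedy longest-match splitter, the vocabulary, the word list, and the response-time values are all hypothetical placeholders chosen only to make the example runnable, and Pearson correlation is assumed here as the correlation measure.

```python
from math import sqrt


def toy_tokenize(word, vocab):
    """Greedy longest-match split (a stand-in, not BPE/WordPiece/UnigramLM).

    Falls back to a single character when no vocabulary entry matches.
    """
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)


# Hypothetical vocabulary and placeholder response times (ms); NOT real data.
vocab = {"un", "happi", "ness", "cat", "read", "able"}
response_ms = {"cat": 500.0, "readable": 600.0, "unhappiness": 700.0}

words = list(response_ms)
n_pieces = [len(toy_tokenize(w, vocab)) for w in words]
times = [response_ms[w] for w in words]
print(pearson(n_pieces, times))
```

With real lexical decision data, `toy_tokenize` would be replaced by the tokenizer under evaluation and `response_ms` by the measured per-word averages; the same comparison could be repeated for accuracy.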

Original language: English
Title of host publication: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Editors: Houda Bouamor, Juan Pino, Kalika Bali
Publisher: Association for Computational Linguistics (ACL)
Pages: 4478-4486
Number of pages: 9
ISBN (Electronic): 9798891760608
Publication status: Published - 2023
Event: 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - Hybrid, Singapore, Singapore
Duration: 6 Dec 2023 → 10 Dec 2023

Conference

Conference: 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023
Country/Territory: Singapore
City: Hybrid, Singapore
Period: 6/12/23 → 10/12/23

Bibliographical note

Publisher Copyright:
©2023 Association for Computational Linguistics.

Funding

Lisa Beinborn’s work was supported by the Dutch National Science Organisation (NWO) through the VENI program (Vl.Veni.211C.039). Yuval Pinter’s work was supported by a Google gift intended for work on Meaningful Subword Text Tokenization. We thank the reviewers for their thoughtful comments. We thank Joshua Snell and Omri Uzan for comments on early drafts.

Funders
Dutch National Science organisation
Google
Nederlandse Organisatie voor Wetenschappelijk Onderzoek

    • Analyzing Cognitive Plausibility of Subword Tokenization

      Beinborn, L. & Pinter, Y., 20 Oct 2023, p. 1-9, 9 p.

      Research output: Working paper / Preprint › Preprint › Academic
