Interpreting the Methods that Interpret Language Models

  • Jonathan Ben Kamp

Research output: PhD Thesis › PhD-Thesis - Research and graduation internal

Abstract

Transformer-based language models achieve strong performance across a wide range of tasks, but their internal decision-making processes remain opaque. Model interpretability (or explainability) addresses this opacity by providing explanations of model behaviour, but current methods are imperfect and may individually deliver only partial insights. In particular, current model explanations can be considered approximations of a latent true explanation. As a result, interpretability research is challenged by several fundamental properties of these approximate explanations that directly influence their reliability and generalisability. This thesis investigates these challenges through a series of empirical studies on the interpretability of artificial language classifiers trained on a range of different tasks. A particular focus is placed on post-hoc feature attribution: a type of model explanation method that provides token-level insights about the input text that receives a model prediction. Different feature attribution methods tend to assign varying relevance scores to the input tokens for the same model prediction, implying that the explanations are misaligned. The reasons behind this misalignment can be traced back to biases, i.e. inherent systematic deviations of models and methods. Adopting a methodological triangulation perspective centred on latent model explanations, this thesis addresses these biases and sheds light on the interaction between different modelling configurations, different explanation methods, and the different metrics that are used to interpret these methods.

The empirical investigation spans five experimental chapters that address interpretability from complementary angles. The first study investigates the task-specific knowledge of models through global explanations of model bias. Specifically, it analyses the robustness of token- and sentence-level classifiers through adversarial perturbations and through analyses of subpopulations of the data. The remaining chapters focus on different aspects of local (input-level) explanations, such as the estimation of comprehensive sets of explanation tokens, and the role of syntactic, lexical, and positional biases that emerge through token relevance scores. These analyses involve natural language experiments that explore the behaviour of explanation methods in real-world scenarios, as well as controlled experimental setups that enable the isolation of bias from confounding factors. The final studies explore the metrics used to evaluate the explanation methods, showing that commonly used faithfulness metrics, which aim to measure the distance between the observed and the true model explanation, may capture unintended effects, specifically in low-performing or task-dependent training scenarios.

Across these studies, this work demonstrates that the field of interpretability should concern not only the interpretability of the underlying model: explanation methods should themselves be subjected to careful interpretation. Methodological choices, which range from the granularity settings of inputs and explanations to the evaluation metrics of bias and faithfulness, matter for the reliability of such methods. While the true model explanation may remain concealed, the findings of this thesis show that methodological awareness can reduce the distance between current approximations and that true explanation, while revealing hidden biases and offering nuanced conclusions.

Original language: English
Qualification: PhD
Awarding Institution
  • Vrije Universiteit Amsterdam
Supervisors/Advisors
  • Fokkens, Antske, Supervisor
  • Beinborn, Lisa, Co-supervisor
Award date: 18 Sept 2026
Print ISBNs: 9789465375816
DOIs
Publication status: Published - 18 Sept 2026

Keywords

  • Natural Language Processing
  • Model Interpretability
  • Explainable Artificial Intelligence
  • Robustness Testing
  • Feature Attribution
  • Faithfulness & Plausibility
  • Input Rationalisation
  • Attention Regularisation
