Benchmarking Municipal AI Chatbot Performance: Mixed Methods Insights into Competence, Integrity, and Algorithmic Discrimination in Dutch Public Administration

Research output: Working paper / Preprint › Preprint › Academic

Abstract

This study introduces the Public-sector Chatbot Performance (PCP) framework, a novel and comprehensive approach to systematically assess AI chatbot performance in public administration. The framework evaluates both technical competence (factual accuracy, completeness, and source reliability) and normative integrity (lawfulness, transparency, equality, and privacy). To demonstrate the applicability of the PCP framework, we benchmark the full set of municipal chatbot systems currently deployed in Dutch local governments, alongside two leading proprietary large language models (LLMs): ChatGPT-4o and Gemini 2.5 Pro. Using a pragmatic mixed methods approach, we developed 26 prompts with systematic user-based variation to explore algorithmic bias, resulting in a dataset of n=326 user-chatbot interactions. Quantitative analysis revealed that ChatGPT-4o achieved a composite performance score of 95.7%, significantly outperforming all municipal systems. Municipal chatbots exhibited notable shortcomings in competence and integrity, with some failing to meet basic standards of lawful and equal service provision. Exploratory qualitative analysis further uncovered algorithmic opacity, discretionary advice in violation of Dutch good governance regulations, and discriminatory responses based on “ethnic” usernames. These insights challenge assumptions about neutrality in public sector AI and underscore the need for ethical benchmarks in chatbot evaluation. The PCP framework offers actionable guidance for policymakers, technologists, and scholars committed to responsible digital governance.
Original language: English
Publisher: SocArXiv
Pages: 1
Number of pages: 50
Publication status: Published - 16 Sept 2025

UN SDGs

This output contributes to the following UN Sustainable Development Goals (SDGs)

  1. SDG 16 - Peace, Justice and Strong Institutions

Keywords

  • Artificial Intelligence (AI)
  • Municipal Chatbots
  • Mixed Methods Research
  • Performance
  • Benchmarking
  • Evaluation
  • Digital Governance
  • Algorithmic Discrimination
  • Ethics in Artificial Intelligence

VU Research Profile

  • Governance for Society
