An Empirical Characterization of Outages and Incidents in Public Services for Large Language Models

Xiaoyu Chu, Sacheendra Talluri, Qingxian Lu, Alexandru Iosup

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

People and businesses increasingly rely on public LLM services, such as ChatGPT, DALL·E, and Claude. Understanding their outages, and particularly measuring their failure-recovery processes, is becoming a pressing problem. However, only limited studies exist in this emerging area. To address this problem, we conduct an empirical characterization of outages and failure recovery in public LLM services. We collect and prepare datasets for 8 commonly used LLM services across 3 major LLM providers, including market leaders OpenAI and Anthropic. We conduct a detailed analysis of the statistical properties of failure recovery, temporal patterns, co-occurrence, and the impact range of outage-causing incidents. We make over 10 observations, including: (1) failures in OpenAI's ChatGPT take longer to resolve but occur less frequently than those in Anthropic's Claude; (2) OpenAI and Anthropic service failures exhibit strong weekly and monthly periodicity; and (3) OpenAI services offer better failure isolation than Anthropic services. Our research explains LLM failure characteristics and thus enables optimization in building and using LLM systems. FAIR data and code are publicly available at https://zenodo.org/records/14018219 and https://github.com/atlarge-research/llm-service-analysis.
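As a rough illustration of the kind of failure-recovery statistics the abstract refers to, the sketch below computes a mean time to resolve and a mean time between failures from a list of incident records. It is not the authors' code: the records are made-up placeholders rather than values from the published dataset, the function names are hypothetical, and the gap definition (resolution of one incident to the start of the next) is only one common convention that the paper may not share.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records as (start, resolved) timestamps.
# These values are illustrative only and do not come from the dataset.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 12, 0)),
]


def mean_time_to_resolve(records):
    """Average incident duration in hours."""
    return mean((end - start).total_seconds() for start, end in records) / 3600


def mean_time_between_failures(records):
    """Average gap in hours from one resolution to the next incident start.

    Assumes at least two records; incidents are sorted by start time first.
    """
    ordered = sorted(records)
    gaps = [
        (nxt[0] - prev[1]).total_seconds()
        for prev, nxt in zip(ordered, ordered[1:])
    ]
    return mean(gaps) / 3600


mttr_hours = mean_time_to_resolve(incidents)
mtbf_hours = mean_time_between_failures(incidents)
print(f"MTTR: {mttr_hours:.2f} h, MTBF: {mtbf_hours:.2f} h")
```

Comparing these two numbers per service is one way to express a finding such as "longer to resolve but less frequent": a higher mean time to resolve together with a higher mean time between failures.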

Original language: English
Title of host publication: ICPE '25
Subtitle of host publication: Proceedings of the 16th ACM/SPEC International Conference on Performance Engineering
Publisher: Association for Computing Machinery, Inc
Pages: 69-80
Number of pages: 12
ISBN (Electronic): 9798400710735
Publication status: Published - 2025
Event: 16th ACM/SPEC International Conference on Performance, ICPE 2025 - Toronto, Canada
Duration: 5 May 2025 to 9 May 2025

Conference

Conference: 16th ACM/SPEC International Conference on Performance, ICPE 2025
Country/Territory: Canada
City: Toronto
Period: 5/05/25 to 9/05/25

Bibliographical note

Publisher Copyright:
© 2025 Copyright held by the owner/author(s).

Keywords

  • anthropic
  • character.ai
  • failure characterization
  • failure-recovery
  • llm
  • openai
  • operational data analytics
  • reliability
