Abstract
People and businesses increasingly rely on public LLM services, such as ChatGPT, DALL·E, and Claude. Understanding their outages, and particularly measuring their failure-recovery processes, is becoming a pressing problem. However, only limited studies exist in this emerging area. To address this problem, in this work we conduct an empirical characterization of outages and failure recovery in public LLM services. We collect and prepare datasets for 8 commonly used LLM services across 3 major LLM providers, including market leaders OpenAI and Anthropic. We conduct a detailed analysis of the statistical properties of failure recovery, temporal patterns, co-occurrence, and the impact range of outage-causing incidents. We make over 10 observations, among which: (1) Failures in OpenAI's ChatGPT take longer to resolve but occur less frequently than those in Anthropic's Claude; (2) OpenAI and Anthropic service failures exhibit strong weekly and monthly periodicity; and (3) OpenAI services offer better failure isolation than Anthropic services. Our research explains LLM failure characteristics and thus enables optimization in building and using LLM systems. FAIR data and code are publicly available at https://zenodo.org/records/14018219 and https://github.com/atlarge-research/llm-service-analysis.
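Observation (1) contrasts how long failures take to resolve with how often they occur. A minimal sketch of how such quantities are commonly computed is shown here, using mean time to recovery and mean time between failures. The incident records, the function names, and the choice to measure gaps between consecutive start times are illustrative assumptions, not the paper's data, code, or definitions.

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records as (start, resolved) pairs.
# These values are made up for illustration and are not from the paper's dataset.
INCIDENTS = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 10, 30)),
    (datetime(2024, 1, 8, 10, 0), datetime(2024, 1, 8, 12, 0)),
]


def mean_time_to_recovery(incidents):
    """Average time from incident start to resolution, in hours."""
    return mean((end - start).total_seconds() / 3600 for start, end in incidents)


def mean_time_between_failures(incidents):
    """Average gap between consecutive incident start times, in hours.

    Assumption: gaps are measured start-to-start; other definitions
    (for example resolution-to-next-start) are also in use.
    Requires at least two incidents.
    """
    starts = sorted(start for start, _ in incidents)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(starts, starts[1:])]
    return mean(gaps)


if __name__ == "__main__":
    print(f"MTTR: {mean_time_to_recovery(INCIDENTS):.2f} h")
    print(f"MTBF: {mean_time_between_failures(INCIDENTS):.2f} h")
```

Under these definitions, a service with a higher mean time to recovery but a higher mean time between failures would match the pattern reported for ChatGPT relative to Claude: slower recovery, less frequent failures.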
| Original language | English |
|---|---|
| Title of host publication | ICPE '25 |
| Subtitle of host publication | Proceedings of the 16th ACM/SPEC International Conference on Performance Engineering |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 69-80 |
| Number of pages | 12 |
| ISBN (Electronic) | 9798400710735 |
| Publication status | Published - 2025 |
| Event | 16th ACM/SPEC International Conference on Performance Engineering, ICPE 2025 - Toronto, Canada; Duration: 5 May 2025 → 9 May 2025 |
Conference
| Conference | 16th ACM/SPEC International Conference on Performance Engineering, ICPE 2025 |
|---|---|
| Country/Territory | Canada |
| City | Toronto |
| Period | 5/05/25 → 9/05/25 |
Bibliographical note
Publisher Copyright: © 2025 Copyright held by the owner/author(s).
Keywords
- anthropic
- character.ai
- failure characterization
- failure-recovery
- llm
- openai
- operational data analytics
- reliability