Abstract
Reliable job execution is important in High Performance Computing (HPC) clusters. Understanding the failure distribution and failure patterns of jobs helps HPC cluster managers design better systems and helps users design fault-tolerant systems. Machine learning is an increasingly popular workload on HPC clusters. However, there is little information on the failure characteristics of machine learning jobs on HPC clusters, or on how they differ from the workloads such clusters were previously used for. The goal of our work is to improve the understanding of machine learning job failures in HPC clusters. We collect and analyze job data spanning the whole of 2022 and covering over 2 million jobs. We analyze basic statistical characteristics, the time patterns of failures, the resource waste caused by failures, and the autocorrelation of failures. Among our findings, machine learning jobs fail at a higher rate than non-ML jobs and waste much more CPU-time per job when they fail.
| Original language | English |
|---|---|
| Title of host publication | ICPE 2023 Companion |
| Subtitle of host publication | Companion of the 2023 ACM/SPEC International Conference on Performance Engineering |
| Publisher | Association for Computing Machinery, Inc |
| Pages | 263-268 |
| Number of pages | 6 |
| ISBN (Electronic) | 9798400700729 |
| DOIs | |
| Publication status | Published - Apr 2023 |
| Event | 14th Annual ACM/SPEC International Conference on Performance Engineering, ICPE 2023 - Coimbra, Portugal. Duration: 15 Apr 2023 → 19 Apr 2023 |
Conference
| Conference | 14th Annual ACM/SPEC International Conference on Performance Engineering, ICPE 2023 |
|---|---|
| Country/Territory | Portugal |
| City | Coimbra |
| Period | 15/04/23 → 19/04/23 |
Bibliographical note
Funding Information: We thank the Dutch national supercomputing center SURF for providing us the data. We thank the China Scholarship Council (CSC) for supporting Xiaoyu Chu. We thank the projects NWO Top2 OffSense, EU H2020 GraphMassivizer, and EU MCSA-RISE CLOUDSTARS for co-funding this project.
Publisher Copyright:
© 2023 Owner/Author.
Funding
| Funders | Funder number |
|---|---|
| China Scholarship Council | |
| European Commission | 101086248 |
UN SDGs
This output contributes to the following UN Sustainable Development Goals (SDGs)
- SDG 8 Decent Work and Economic Growth
Keywords
- failure characterization
- HPC datacenters
- job failure
- machine learning
- reliability
- time correlation failures
Fingerprint
Dive into the research topics of 'How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster'. Together they form a unique fingerprint.