How Do ML Jobs Fail in Datacenters? Analysis of a Long-Term Dataset from an HPC Cluster

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Reliable job execution is important in High Performance Computing (HPC) clusters. Understanding the failure distribution and failure patterns of jobs helps HPC cluster managers design better systems, and helps users design fault-tolerant systems. Machine learning is an increasingly popular workload on HPC clusters. However, there is little information on the failure characteristics of machine learning jobs on HPC clusters, or on how they differ from the workloads such clusters were previously used for. The goal of our work is to improve the understanding of machine learning job failures in HPC clusters. We collect and analyze job data spanning the whole of 2022, covering over 2 million jobs. We analyze basic statistical characteristics, the time patterns of failures, the resource waste caused by failures, and the autocorrelation of failures. Among our findings, machine learning jobs fail at a higher rate than non-ML jobs, and waste much more CPU-time per job when they fail.

Original language: English
Title of host publication: ICPE 2023 Companion
Subtitle of host publication: Companion of the 2023 ACM/SPEC International Conference on Performance Engineering
Publisher: Association for Computing Machinery, Inc
Pages: 263-268
Number of pages: 6
ISBN (Electronic): 9798400700729
Publication status: Published - Apr 2023
Event: 14th Annual ACM/SPEC International Conference on Performance Engineering, ICPE 2023 - Coimbra, Portugal
Duration: 15 Apr 2023 to 19 Apr 2023

Conference

Conference: 14th Annual ACM/SPEC International Conference on Performance Engineering, ICPE 2023
Country/Territory: Portugal
City: Coimbra
Period: 15/04/23 to 19/04/23

Bibliographical note

Funding Information:
We thank the Dutch national supercomputing center SURF for providing us with the data. We thank the China Scholarship Council (CSC) for supporting Xiaoyu Chu. We thank the projects NWO Top2 OffSense, EU H2020 GraphMassivizer, and EU MCSA-RISE CLOUDSTARS for co-funding this project.

Publisher Copyright:
© 2023 Owner/Author.

Funding

Funder: China Scholarship Council
Funder: European Commission (funder number 101086248)

    UN SDGs

    This output contributes to the following UN Sustainable Development Goals (SDGs)

    1. SDG 8 - Decent Work and Economic Growth

    Keywords

    • failure characterization
    • HPC datacenters
    • job failure
    • machine learning
    • reliability
    • time correlation failures
