GradeML: Towards holistic performance analysis for machine learning workflows

Tim Hegeman, Matthijs Jansen, Alexandru Iosup, Animesh Trivedi

Research output: Chapter in Book / Report / Conference proceeding > Conference contribution > Academic > peer-review

Abstract

Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has been put into making ML model training fast and efficient, e.g., by proposing new ML frameworks (such as TensorFlow, PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelines, distributed training). Matching this trend, considerable effort has also been put into performance analysis tools focusing on ML model training. However, as we identify in this work, ML model training rarely happens in isolation and is instead one step in a larger ML workflow. Therefore, it is surprising that there exists no performance analysis tool that covers the entire life-cycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance analysis tool for ML workflows. We analyze the state of practice and the state of the art, presenting quantitative evidence about the performance of existing performance analysis tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration in ML systems. Finally, we propose first steps towards implementing our vision as GradeML, a holistic performance analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.

Original language: English
Title of host publication: ICPE 2021
Subtitle of host publication: Companion of the ACM/SPEC International Conference on Performance Engineering
Publisher: Association for Computing Machinery, Inc
Pages: 57-63
Number of pages: 7
ISBN (Electronic): 9781450383318
DOIs
Publication status: Published - Apr 2021
Event: 2021 ACM/SPEC International Conference on Performance Engineering, ICPE 2021 - Virtual, Online, France
Duration: 19 Apr 2021 to 21 Apr 2021

Conference

Conference: 2021 ACM/SPEC International Conference on Performance Engineering, ICPE 2021
Country/Territory: France
City: Virtual, Online
Period: 19/04/21 to 21/04/21

Bibliographical note

Funding Information:
Work supported by NWO projects MagnaData and OffSense.

Publisher Copyright:
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.

Copyright:
Copyright 2021 Elsevier B.V., All rights reserved.

Keywords

  • Data gathering
  • GradeML
  • Machine learning workflow
  • MLDevOps
  • Modeling
  • Performance analysis
