Abstract
Today, machine learning (ML) workloads are nearly ubiquitous. Over the past decade, much effort has gone into making ML model training fast and efficient, e.g., by proposing new ML frameworks (such as TensorFlow and PyTorch), leveraging hardware support (TPUs, GPUs, FPGAs), and implementing new execution models (pipelines, distributed training). Matching this trend, considerable effort has also gone into performance analysis tools that focus on ML model training. However, as we identify in this work, ML model training rarely happens in isolation and is instead one step in a larger ML workflow. It is therefore surprising that no performance analysis tool covers the entire life-cycle of ML workflows. Addressing this large conceptual gap, we envision in this work a holistic performance analysis tool for ML workflows. We analyze the state-of-practice and the state-of-the-art, presenting quantitative evidence about the performance of existing performance analysis tools. We formulate our vision for holistic performance analysis of ML workflows along four design pillars: a unified execution model, lightweight collection of performance data, efficient data aggregation and presentation, and close integration in ML systems. Finally, we propose first steps towards implementing our vision as GradeML, a holistic performance analysis tool for ML workflows. Our preliminary work and experiments are open source at https://github.com/atlarge-research/grademl.
Original language | English |
---|---|
Title of host publication | ICPE 2021 |
Subtitle of host publication | Companion of the ACM/SPEC International Conference on Performance Engineering |
Publisher | Association for Computing Machinery, Inc |
Pages | 57-63 |
Number of pages | 7 |
ISBN (Electronic) | 9781450383318 |
Publication status | Published - Apr 2021 |
Event | 2021 ACM/SPEC International Conference on Performance Engineering, ICPE 2021 - Virtual, Online, France. Duration: 19 Apr 2021 → 21 Apr 2021 |
Conference
Conference | 2021 ACM/SPEC International Conference on Performance Engineering, ICPE 2021 |
---|---|
Country/Territory | France |
City | Virtual, Online |
Period | 19/04/21 → 21/04/21 |
Bibliographical note
Funding Information: Work supported by NWO projects MagnaData and OffSense.
Publisher Copyright:
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
Keywords
- Data gathering
- GradeML
- Machine learning workflow
- MLDevOps
- Modeling
- Performance analysis