mCAP: Memory-Centric Partitioning for Large-Scale Pipeline-Parallel DNN Training

Henk Dreuning*, Henri E. Bal, Rob V. van Nieuwpoort

*Corresponding author for this work

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review


Abstract

Memory usage is becoming an increasingly pressing bottleneck in the training process of Deep Neural Networks (DNNs), especially when training on Graphics Processing Units (GPUs). Existing solutions for multi-GPU training setups partition the neural network over the GPUs in a way that favors training throughput over memory usage, and thus maximum trainable network size. We propose mCAP, a partitioning solution for pipeline-parallel DNN training that focuses specifically on memory usage. It evenly distributes Deep Learning models over the available resources with respect to per-device peak memory usage. Our partitioning approach uses a novel incremental profiling strategy to extract per-layer memory usage statistics. A model-based predictor uses the profiling data to recommend a partitioning that balances peak memory usage. Our approach is DL-framework agnostic and orthogonal to existing memory optimizations found in large-scale DNN training systems. Our results show that our approach enables training of neural networks that are 1.55 times larger than existing partitioning solutions in terms of the number of parameters.
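The core idea of the abstract — cut the layer sequence into contiguous stages so that per-device peak memory is balanced — can be illustrated with a simplified sketch. This is not mCAP's actual model-based predictor; it assumes per-layer peak-memory estimates (which mCAP would obtain via its incremental profiling) are already available, and it minimizes the maximum stage memory via binary search on the memory budget:

```python
# Illustrative sketch, NOT mCAP's actual algorithm: split a list of
# per-layer peak-memory estimates into `num_gpus` contiguous stages so
# that the largest per-stage memory sum is minimized. The per-layer
# figures are assumed inputs (mCAP derives them by incremental profiling).

def min_max_stage_memory(layer_mem, num_gpus):
    """Binary-search the smallest per-stage memory budget that admits
    a contiguous partition into at most `num_gpus` stages."""
    def stages_needed(budget):
        stages, current = 1, 0
        for m in layer_mem:
            if m > budget:
                return float("inf")  # a single layer exceeds the budget
            if current + m > budget:
                stages += 1          # start a new stage on this device
                current = m
            else:
                current += m
        return stages

    lo, hi = max(layer_mem), sum(layer_mem)
    while lo < hi:
        mid = (lo + hi) // 2
        if stages_needed(mid) <= num_gpus:
            hi = mid                 # budget feasible: try smaller
        else:
            lo = mid + 1             # infeasible: need more memory
    return lo

def partition(layer_mem, num_gpus):
    """Greedily cut the layers into stages under the optimal budget."""
    budget = min_max_stage_memory(layer_mem, num_gpus)
    stages, current = [[]], 0
    for m in layer_mem:
        if current + m > budget:
            stages.append([])
            current = 0
        stages[-1].append(m)
        current += m
    return stages
```

For example, six layers with memory footprints `[4, 2, 7, 3, 3, 5]` split over three GPUs yields stages `[[4, 2], [7, 3], [3, 5]]` with a balanced peak of 10 units per device, rather than a throughput-driven split that could leave one device holding far more.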

Original language: English
Title of host publication: Euro-Par 2022: Parallel Processing
Subtitle of host publication: 28th International Conference on Parallel and Distributed Computing, Glasgow, UK, August 22–26, 2022, Proceedings
Editors: José Cano, Phil Trinder
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 155-170
Number of pages: 16
ISBN (Electronic): 9783031125973
ISBN (Print): 9783031125966
DOIs
Publication status: Published - 2022
Event: 28th International European Conference on Parallel and Distributed Computing, Euro-Par 2022 - Glasgow, United Kingdom
Duration: 22 Aug 2022 – 26 Aug 2022

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 13440 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 28th International European Conference on Parallel and Distributed Computing, Euro-Par 2022
Country/Territory: United Kingdom
City: Glasgow
Period: 22/08/22 – 26/08/22

Bibliographical note

Funding Information:
The authors thank the anonymous reviewers for their valuable feedback. This work is part of the Efficient Deep Learning (EDL) programme (grant number P16-25), financed by the Dutch Research Council (NWO). This work was carried out on the Dutch national e-infrastructure with the support of SURF Cooperative. The datasets generated and/or analysed during the current study are available in the Figshare repository: https://doi.org/10.6084/m9.figshare.20000960 [4].

Publisher Copyright:
© 2022, Springer Nature Switzerland AG.

Keywords

  • Deep Learning
  • HPC
  • Pipeline Parallelism
