FA2: Fast, Accurate Autoscaling for Serving Deep Learning Inference with SLA Guarantees

Kamran Razavi, Manisha Luthra, Boris Koldehofe, Max Mühlhäuser, Lin Wang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Deep learning (DL) inference has become an essential building block in modern intelligent applications. Due to the high computational intensity of DL, it is critical to scale DL inference serving systems in response to fluctuating workloads to achieve resource efficiency. Meanwhile, intelligent applications often require strict service level agreements (SLAs), which must be guaranteed when the system is scaled. The problem is complex and has so far been tackled only in simple scenarios. This paper describes FA2, a fast and accurate autoscaler concept for DL inference serving systems. In contrast to related works, FA2 adopts a general, carefully devised two-phase approach. Specifically, it first captures the autoscaling challenges in a comprehensive graph-based model. Then, FA2 applies targeted graph transformation and makes autoscaling decisions with an efficient algorithm based on dynamic programming. We implemented FA2 in a prototype and evaluated it. Compared with state-of-the-art autoscaling solutions, our experiments showed that FA2 achieves significant resource reductions (19% under CPUs and 25% under GPUs, on average) combined with low SLA violations (less than 1.5%). FA2 performed close to the theoretical optimum, matching the optimal decisions (those requiring the least resources) exactly in 96.8% of all cases in our evaluation.
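The abstract only names the second phase at a high level (an efficient dynamic-programming algorithm over a graph model of the serving pipeline). The snippet below is a minimal, hypothetical illustration of that general idea, not FA2's actual algorithm: it assumes a simple chain of inference stages, an M/M/1-style latency estimate per stage, and made-up stage parameters, and uses dynamic programming to split an end-to-end latency budget across the stages while minimizing the total number of replicas.

```python
# Hypothetical sketch, NOT FA2's algorithm: split an end-to-end latency budget
# across a chain of inference stages with dynamic programming, minimizing the
# total replica count. Stage parameters and the latency model are made up.
import math

# (stage name, base latency in ms, per-replica capacity in req/s) -- hypothetical
STAGES = [
    ("preprocess", 5.0, 200.0),
    ("resnet50", 25.0, 40.0),
    ("postprocess", 4.0, 250.0),
]
ARRIVAL_RATE = 100.0   # offered load in req/s
SLA_MS = 120.0         # end-to-end latency budget
STEP_MS = 5.0          # budget discretization step
MAX_REPLICAS = 32      # cap on replicas per stage

def stage_latency(base_ms, capacity, replicas, rate):
    """Rough per-stage latency: service time plus an M/M/1-style queueing term."""
    rho = rate / (replicas * capacity)
    if rho >= 1.0:
        return math.inf  # stage would be overloaded at this replica count
    return base_ms * (1.0 + rho / (1.0 - rho))

def min_replicas(base_ms, capacity, rate, budget_ms):
    """Smallest replica count whose estimated latency fits within budget_ms."""
    for r in range(1, MAX_REPLICAS + 1):
        if stage_latency(base_ms, capacity, r, rate) <= budget_ms:
            return r
    return None  # infeasible within the replica cap

def plan(stages, rate, sla_ms, step_ms):
    """DP over (stages processed, budget slots used) -> minimal total replicas."""
    slots = int(sla_ms // step_ms)
    dp = {0: (0, [])}  # budget slots used -> (total replicas, per-stage plan)
    for name, base_ms, cap in stages:
        nxt = {}
        for used, (total, alloc) in dp.items():
            for extra in range(1, slots - used + 1):
                r = min_replicas(base_ms, cap, rate, extra * step_ms)
                if r is None:
                    continue
                key = used + extra
                cand = (total + r, alloc + [(name, r)])
                if key not in nxt or cand[0] < nxt[key][0]:
                    nxt[key] = cand
        dp = nxt
    return min(dp.values(), key=lambda v: v[0], default=None)

if __name__ == "__main__":
    best = plan(STAGES, ARRIVAL_RATE, SLA_MS, STEP_MS)
    if best is None:
        print("no feasible allocation within the SLA")
    else:
        total, alloc = best
        print(f"total replicas: {total}, per-stage: {alloc}")
```

Per the abstract, FA2 itself operates on general inference graphs and applies targeted graph transformations before making its dynamic-programming decisions; the chain-only sketch above deliberately omits that first phase.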

Original language: English
Title of host publication: 2022 IEEE 28th Real-Time and Embedded Technology and Applications Symposium (RTAS)
Subtitle of host publication: [Proceedings]
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 146-159
Number of pages: 14
ISBN (Electronic): 9781665499989
ISBN (Print): 9781665499996
DOIs
Publication status: Published - 29 Jun 2022
Event: 28th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2022 - Milan, Italy
Duration: 4 May 2022 - 6 May 2022

Publication series

Name: Proceedings of the IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS
Volume: 2022-May
ISSN (Print): 1545-3421

Conference

Conference: 28th IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2022
Country/Territory: Italy
City: Milan
Period: 4/05/22 - 6/05/22

Bibliographical note

Funding Information:
We would like to thank the anonymous reviewers and our anonymous shepherd for their valuable comments and suggestions. This work has been funded by the German Research Foundation (DFG) within the Collaborative Research Center (CRC) 1053 MAKI.

Publisher Copyright:
© 2022 IEEE.

Keywords

  • Autoscaling
  • Inference Serving Systems
