Tardis: A Fault-Tolerant Design for Network Control Planes

Zhenyu Zhou, Theophilus A. Benson, Marco Canini, Balakrishnan Chandrasekaran

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Guaranteeing high availability of networks virtually hinges on the ability to handle and recover from bugs and failures. Yet, despite the advances in verification, testing, and debugging, production networks remain susceptible to large-scale failures, often due to deterministic bugs. This paper explores the use of input transformations as a viable method for recovering from such deterministic bugs. In particular, we introduce Tardis, an online system that overcomes deterministic faults in two steps: it uses a blend of program analysis and runtime program data to systematically determine the fault-triggering input events, and it uses domain-specific models to automatically generate transformations of those inputs that are both safe and semantically equivalent. We evaluated Tardis on several production network control plane applications (CPAs), including six SDN CPAs and several popular BGP CPAs, using 71 realistic bugs. We observe that Tardis improves recovery time by 7.44%, introduces a 25% CPU and 0.5% memory overhead, and recovers from 77.26% of the injected realistic and representative bugs, more than twice that of existing solutions.
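
The recovery idea described in the abstract can be illustrated with a toy sketch. This is not the Tardis implementation: the event format and the names `buggy_handler`, `split_batch`, and `handle_with_recovery` are hypothetical, the bug is contrived, and the fault-triggering input is assumed to be already known rather than located through program analysis. The sketch only shows the general pattern of rolling back state and replaying a semantically equivalent form of the input.

```python
import copy


def buggy_handler(state, event):
    """Hypothetical control plane app with a deterministic bug:
    it crashes on any batch event that carries more than two links."""
    links = event["links"]
    if len(links) > 2:
        raise RuntimeError("deterministic bug: oversized batch")
    for link in links:
        state[link] = event["status"]


def split_batch(event):
    """Hypothetical transformation: turn one batch event into one event
    per link. Applied in order, the pieces produce the same final state
    as the original batch would have."""
    return [{"links": [link], "status": event["status"]}
            for link in event["links"]]


def restore(state, checkpoint):
    """Reset the live state dict to a saved checkpoint, in place."""
    state.clear()
    state.update(copy.deepcopy(checkpoint))


def handle_with_recovery(state, event, handler, transformations):
    """Process the event as-is; if the handler fails, roll back and retry
    with each transformed input sequence until one succeeds.
    Returns the name of the variant that worked."""
    checkpoint = copy.deepcopy(state)
    try:
        handler(state, event)
        return "original"
    except Exception:
        pass  # fall through to the transformed variants

    for transform in transformations:
        restore(state, checkpoint)  # undo any partial updates
        try:
            for piece in transform(event):
                handler(state, piece)
            return transform.__name__
        except Exception:
            continue

    restore(state, checkpoint)
    raise RuntimeError("no transformation avoided the fault")


if __name__ == "__main__":
    state = {}
    event = {"links": ["a", "b", "c"], "status": "down"}
    used = handle_with_recovery(state, event, buggy_handler, [split_batch])
    print(used, state)
```

In this sketch the equivalence of the transformed input is asserted by construction; per the abstract, Tardis derives safe and semantically equivalent transformations from domain-specific models.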

Original language: English
Title of host publication: SOSR 2021
Subtitle of host publication: Proceedings of the 2021 ACM SIGCOMM Symposium on SDN Research (SOSR)
Publisher: Association for Computing Machinery, Inc
Pages: 108-121
Number of pages: 14
ISBN (Electronic): 9781450390842
DOIs
Publication status: Published - Oct 2021
Event: 2021 ACM SIGCOMM Symposium on SDN Research, SOSR 2021 - Virtual, Online, United States
Duration: 20 Sept 2021 to 21 Sept 2021

Conference

Conference: 2021 ACM SIGCOMM Symposium on SDN Research, SOSR 2021
Country/Territory: United States
City: Virtual, Online
Period: 20/09/21 to 21/09/21

Bibliographical note

Funding Information:
We thank the anonymous reviewers and our shepherd, Ryan Beckett, for their insightful comments. We also thank Ayush Bhardwaj for helping us with designing our experiments. This work was supported by NSF award CNS-1749785.

Publisher Copyright:
© 2021 ACM.

Keywords

  • control plane
  • failure recovery
  • Software Defined Networks
  • transformation
