FIRestarter: Practical Software Crash Recovery with Targeted Library-level Fault Injection

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

Despite advances in software testing, many bugs still plague deployed software, leading to crashes and thus service disruption in high-availability production applications. Existing crash recovery solutions are either limited to transient faults or require manual annotations to target predetermined persistent bugs. Moreover, existing solutions are generally inefficient, hindering practical deployment.

In this paper, we present FIRestarter (Fault Injection-based Restarter), an efficient and automatic crash recovery solution for commodity user applications. To eliminate the need for manual annotations, FIRestarter injects targeted software faults at the library interface to automatically trigger error handling code for standard library calls already part of the application. In particular, when a crash occurs, we roll back the application state before the last recoverable library call, inject a fault, and restart execution forcing the call to immediately return a predetermined error code. This strategy allows the application to automatically bypass the crashing code upon such a restart and exploits existing error-handling code to recover from even persistent bugs. Moreover, since library calls lie pervasively throughout the code, our design provides a large recovery surface despite the automated approach. Finally, FIRestarter's recovery windows are small and frequent compared to traditional checkpoint-restart, which enables new optimizations such as the ability to support rollback by means of hybrid hardware/software transactional memory instrumentation and improve performance. We apply FIRestarter to a number of event-driven server applications and show our solution achieves near-instantaneous, state-preserving crash recovery in the face of even persistent crashes. On popular web servers, our evaluation results show a recovery surface of at least 77%, with low performance overhead of at most 17%.
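The recovery flow described in the abstract (checkpoint before a library call, roll back on a crash, then re-execute with the call returning an error code) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not FIRestarter's implementation: the names `Crash`, `recoverable_call`, and `handle_request` are hypothetical, a Python exception stands in for a fatal signal, and a deep copy of a dictionary stands in for the transactional-memory rollback that the paper applies to native code.

```python
import copy


class Crash(Exception):
    """Hypothetical stand-in for a fatal signal such as SIGSEGV."""


def recoverable_call(state, real_call, error_code, body):
    """Run `body` after a library call, inside a simulated recovery window.

    state      -- mutable dict modelling application state
    real_call  -- zero-argument function modelling the library call
    error_code -- value the call is forced to return after a crash
    body       -- application code that consumes the call's return value
    """
    snapshot = copy.deepcopy(state)      # checkpoint taken before the call
    result = real_call()
    try:
        return body(state, result)       # normal execution
    except Crash:
        state.clear()
        state.update(snapshot)           # roll back to the pre-call state
        return body(state, error_code)   # injected fault: call "fails"


def handle_request(state, fd):
    if fd < 0:                           # error handling the app already has
        state["errors"] += 1
        return "503 unavailable"
    state["served"] += 1                 # side effect undone by the rollback
    raise Crash("persistent bug in request processing")


state = {"served": 0, "errors": 0}
response = recoverable_call(state, real_call=lambda: 3, error_code=-1,
                            body=handle_request)
print(response, state)
```

In this toy run the crashing path is bypassed: the partial update to `served` is rolled back and the application's own error branch produces the response, which mirrors the idea of reusing existing error-handling code rather than adding manual recovery annotations.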

Original language: English
Title of host publication: 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021
Subtitle of host publication: [Proceedings]
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 363-375
Number of pages: 13
ISBN (Electronic): 9781665435727
Publication status: Published - 6 Aug 2021
Event: 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021 - Virtual, Online, Taiwan, Province of China
Duration: 21 Jun 2021 to 24 Jun 2021

Conference

Conference: 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021
Country/Territory: Taiwan, Province of China
City: Virtual, Online
Period: 21/06/21 to 24/06/21

Bibliographical note

Funding Information:
This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and No. 825377 (UNICORE), by the Netherlands Organisation for Scientific Research through grants NWO 639.021.753 VENI "PantaRhei" and NWO 628.001.030 "TROPICS", and by the Office of Naval Research (ONR) under awards N00014-16-1-2261 and N00014-17-1-2788.

Acknowledgements:
We would like to thank our shepherd, Domenico Cotroneo, and the anonymous reviewers for their valuable feedback. This paper reflects only the authors’ view. The funding agencies are not responsible for any use that may be made of the information it contains.

Publisher Copyright:
© 2021 IEEE.

Keywords

  • Adaptive Transactions
  • Crash Recovery
  • Persistent Fault Recovery
  • Persistent Faults
  • Recoverability
  • Reliability
  • Survivability
  • Transactional Memory
