Abstract
Despite advances in software testing, many bugs still plague deployed software, leading to crashes and thus service disruption in high-availability production applications. Existing crash recovery solutions are either limited to transient faults or require manual annotations to target predetermined persistent bugs. Moreover, existing solutions are generally inefficient, hindering practical deployment. In this paper, we present FIRestarter (Fault Injection-based Restarter), an efficient and automatic crash recovery solution for commodity user applications. To eliminate the need for manual annotations, FIRestarter injects targeted software faults at the library interface to automatically trigger error handling code for standard library calls already part of the application. In particular, when a crash occurs, we roll back the application state before the last recoverable library call, inject a fault, and restart execution forcing the call to immediately return a predetermined error code. This strategy allows the application to automatically bypass the crashing code upon such a restart and exploits existing error-handling code to recover from even persistent bugs. Moreover, since library calls lie pervasively throughout the code, our design provides a large recovery surface despite the automated approach. Finally, FIRestarter's recovery windows are small and frequent compared to traditional checkpoint-restart, which enables new optimizations such as the ability to support rollback by means of hybrid hardware/software transactional memory instrumentation and improve performance. We apply FIRestarter to a number of event-driven server applications and show our solution achieves near-instantaneous, state-preserving crash recovery in the face of even persistent crashes. On popular web servers, our evaluation results show a recovery surface of at least 77%, with low performance overhead of at most 17%.
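The recovery strategy in the abstract (checkpoint before a library call, roll back on a crash, then re-run with the call forced to fail) can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the paper's implementation: FIRestarter targets native applications and uses transactional memory for rollback, whereas here a Python exception stands in for the crash and a deep copy stands in for the checkpoint. All names (`Crash`, `recoverable_region`, `fake_alloc`, `make_handler`, `serve`) are hypothetical.

```python
import copy


class Crash(Exception):
    """Stands in for a fatal signal such as SIGSEGV (assumption of this sketch)."""


def recoverable_region(state, lib_call, error_code, continuation):
    """Run lib_call and the code after it; on a crash, roll back and inject a fault."""
    checkpoint = copy.deepcopy(state)           # recovery window opens before the call
    try:
        result = lib_call(state)
        return continuation(state, result)
    except Crash:
        state.clear()
        state.update(checkpoint)                # roll back to just before the library call
        return continuation(state, error_code)  # injected fault: the call "fails" at once


def fake_alloc(state):
    """Hypothetical library call that normally succeeds."""
    state["allocs"] += 1
    return bytearray(16)


def make_handler(request):
    def handler(state, buf):
        if buf is None:                         # error handling the application already has
            state["errors"] += 1
            return "503 out of memory"
        state["in_flight"] = request            # partial update that rollback must undo
        if request == "bad":                    # persistent bug: crashes on every attempt
            raise Crash("simulated invalid memory access")
        state["served"] += 1
        state["in_flight"] = None
        return "200 ok"
    return handler


def serve(state, request):
    return recoverable_region(state, fake_alloc, None, make_handler(request))


state = {"allocs": 0, "served": 0, "errors": 0, "in_flight": None}
responses = [serve(state, r) for r in ["a", "bad", "b"]]
print(responses)
print(state)
```

In this simulation the request that triggers the persistent bug is answered through the existing out-of-memory path, the partial state update is undone, and the server keeps handling later requests. Simply retrying the same request would crash again, which is why the sketch diverts execution into the error path instead of re-executing the call normally.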
Original language | English |
---|---|
Title of host publication | 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021 |
Subtitle of host publication | [Proceedings] |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 363-375 |
Number of pages | 13 |
ISBN (Electronic) | 9781665435727 |
Publication status | Published - 6 Aug 2021 |
Event | 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021 - Virtual, Online, Taiwan, Province of China. Duration: 21 Jun 2021 → 24 Jun 2021 |
Conference
Conference | 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021 |
---|---|
Country/Territory | Taiwan, Province of China |
City | Virtual, Online |
Period | 21/06/21 → 24/06/21 |
Bibliographical note
Funding Information: This work was supported by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 786669 (ReAct) and No. 825377 (UNICORE), by the Netherlands Organisation for Scientific Research through grants NWO 639.021.753 VENI "PantaRhei" and NWO 628.001.030 "TROPICS", and by the Office of Naval Research (ONR) under awards N00014-16-1-2261 and N00014-17-1-2788.
Acknowledgements: We would like to thank our shepherd, Domenico Cotroneo, and the anonymous reviewers for their valuable feedback. This paper reflects only the authors’ view. The funding agencies are not responsible for any use that may be made of the information it contains.
Publisher Copyright:
© 2021 IEEE.
Keywords
- Adaptive Transactions
- Crash Recovery
- Persistent Fault Recovery
- Persistent Faults
- Recoverability
- Reliability
- Survivability
- Transactional Memory