Transparent Fault-tolerance in Parallel Orca Programs

M.F. Kaashoek, R. Michiels, H.E. Bal, A.S. Tanenbaum

Research output: Chapter in Book / Report / Conference proceeding › Conference contribution › Academic › peer-review

Abstract

With the advent of large-scale parallel computing systems, making parallel programs fault-tolerant becomes an important problem, because the probability of a failure increases with the number of processors. In this paper, we describe a very simple scheme for rendering a class of parallel Orca programs fault-tolerant. Also, we discuss our experience with implementing this scheme on Amoeba. Our approach works for parallel applications that are not interactive. The approach is based on making a globally consistent checkpoint from time to time and rolling back to the last checkpoint when a processor fails. Making a consistent global checkpoint is easy in Orca, because its implementation is based on reliable broadcast. The advantages of our approach are its simplicity, ease of implementation, low overhead, and transparency to the Orca programmer.
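The checkpoint-and-rollback idea in the abstract can be illustrated with a minimal sketch. It is not the authors' Orca or Amoeba implementation. It assumes that every process receives the same broadcast messages in the same order, and that a checkpoint request is itself one of those messages, so all processes save their state at the same point in the message stream. The names `Process`, `Sequencer`, `CHECKPOINT`, and `demo` are hypothetical and exist only for this illustration.

```python
import copy

# Hypothetical marker message; it is not taken from the paper.
CHECKPOINT = "CHECKPOINT"


class Process:
    """A worker whose state changes only by applying broadcast messages in order."""

    def __init__(self, pid):
        self.pid = pid
        self.state = {"sum": 0}
        self.checkpoint = copy.deepcopy(self.state)
        self.checkpoint_seq = 0  # sequence number of the last checkpoint message

    def deliver(self, seq, msg):
        if msg == CHECKPOINT:
            # Every process sees this marker at the same position in the
            # message stream, so the saved states are mutually consistent.
            self.checkpoint = copy.deepcopy(self.state)
            self.checkpoint_seq = seq
        else:
            self.state["sum"] += msg

    def rollback(self):
        self.state = copy.deepcopy(self.checkpoint)


class Sequencer:
    """Toy stand-in for reliable, totally ordered broadcast (an assumption)."""

    def __init__(self, processes):
        self.processes = processes
        self.next_seq = 1

    def broadcast(self, msg):
        seq = self.next_seq
        self.next_seq += 1
        for p in self.processes:
            p.deliver(seq, msg)
        return seq

    def recover(self):
        # On a processor failure, all processes return to the last checkpoint.
        for p in self.processes:
            p.rollback()


def demo():
    procs = [Process(i) for i in range(3)]
    net = Sequencer(procs)
    net.broadcast(1)
    net.broadcast(2)
    net.broadcast(CHECKPOINT)
    net.broadcast(5)  # work done after the checkpoint
    before = [p.state["sum"] for p in procs]
    net.recover()  # simulate a failure followed by rollback
    after = [p.state["sum"] for p in procs]
    return before, after


if __name__ == "__main__":
    print(demo())
```

In this toy run, the work done after the checkpoint (the message `5`) is lost on rollback and would have to be re-executed. That matches the abstract's restriction to non-interactive applications, where repeating computation has no externally visible effect.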
Original language: English
Title of host publication: Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems III
Pages: 297-312
Publication status: Published - 1992

Fingerprint

Parallel processing systems
Fault tolerance
Transparency

Cite this

Kaashoek, M. F., Michiels, R., Bal, H. E., & Tanenbaum, A. S. (1992). Transparent Fault-tolerance in Parallel Orca Programs. In Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems III (pp. 297-312)
Kaashoek, M.F. ; Michiels, R. ; Bal, H.E. ; Tanenbaum, A.S. / Transparent Fault-tolerance in Parallel Orca Programs. Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems III. 1992. pp. 297-312
@inproceedings{81c2f9d5f0034f22bfe3ca0d9d915c5e,
title = "Transparent Fault-tolerance in Parallel Orca Programs",
abstract = "With the advent of large-scale parallel computing systems, making parallel programs fault-tolerant becomes an important problem, because the probability of a failure increases with the number of processors. In this paper, we describe a very simple scheme for rendering a class of parallel Orca programs fault-tolerant. Also, we discuss our experience with implementing this scheme on Amoeba. Our approach works for parallel applications that are not interactive. The approach is based on making a globally consistent checkpoint from time to time and rolling back to the last checkpoint when a processor fails. Making a consistent global checkpoint is easy in Orca, because its implementation is based on reliable broadcast. The advantages of our approach are its simplicity, ease of implementation, low overhead, and transparency to the Orca programmer.",
author = "M.F. Kaashoek and R. Michiels and H.E. Bal and A.S. Tanenbaum",
year = "1992",
language = "English",
pages = "297--312",
booktitle = "Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems III",

}

Kaashoek, MF, Michiels, R, Bal, HE & Tanenbaum, AS 1992, Transparent Fault-tolerance in Parallel Orca Programs. in Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems III. pp. 297-312.

Transparent Fault-tolerance in Parallel Orca Programs. / Kaashoek, M.F.; Michiels, R.; Bal, H.E.; Tanenbaum, A.S.

Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems III. 1992. p. 297-312.


TY - GEN
T1 - Transparent Fault-tolerance in Parallel Orca Programs
AU - Kaashoek, M.F.
AU - Michiels, R.
AU - Bal, H.E.
AU - Tanenbaum, A.S.
PY - 1992
Y1 - 1992

AB - With the advent of large-scale parallel computing systems, making parallel programs fault-tolerant becomes an important problem, because the probability of a failure increases with the number of processors. In this paper, we describe a very simple scheme for rendering a class of parallel Orca programs fault-tolerant. Also, we discuss our experience with implementing this scheme on Amoeba. Our approach works for parallel applications that are not interactive. The approach is based on making a globally consistent checkpoint from time to time and rolling back to the last checkpoint when a processor fails. Making a consistent global checkpoint is easy in Orca, because its implementation is based on reliable broadcast. The advantages of our approach are its simplicity, ease of implementation, low overhead, and transparency to the Orca programmer.

M3 - Conference contribution
SP - 297
EP - 312
BT - Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems III
ER -

Kaashoek MF, Michiels R, Bal HE, Tanenbaum AS. Transparent Fault-tolerance in Parallel Orca Programs. In Proceedings of the Symposium on Experiences with Distributed and Multiprocessor Systems III. 1992. p. 297-312