Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments

G. Wrzesinska, R.V. van Nieuwpoort, J. Maassen, T. Kielmann, H.E. Bal

Research output: Contribution to Journal › Article › Academic › peer-review

Abstract

Divide-and-conquer is a well-suited programming paradigm for parallel Grid applications. Our Satin system efficiently schedules the fine-grained tasks of a divide-and-conquer application across multiple clusters in a grid. To accommodate long-running applications, we present a fault-tolerance mechanism for Satin that has negligible overhead during normal execution, while minimizing the amount of redundant work done after a crash of one or more nodes. We study the impact of our fault-tolerance mechanism on application efficiency, both on the Dutch DAS-2 system and using the European testbed of the EC-funded project GridLab. © 2006 SAGE Publications.
Original language: English
Pages (from-to): 103-114
Journal: High performance computing applications
Volume: 20
Issue number: 1
DOI: 10.1177/1094342006062528
Publication status: Published - 2006

Fingerprint

Fault-tolerant
Scheduling
Grid
Fault tolerance
Divide and conquer
Crash
Testbeds
Testbed
Divides
Schedule
Programming
Paradigm
Vertex of a graph

Bibliographical note

WrzesinskaHPA05

Cite this

@article{67d66e4d29d745c3a6263d2ee31770c1,
title = "Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments",
abstract = "Divide-and-conquer is a well-suited programming paradigm for parallel Grid applications. Our Satin system efficiently schedules the fine-grained tasks of a divide-and-conquer application across multiple clusters in a grid. To accommodate long-running applications, we present a fault-tolerance mechanism for Satin that has negligible overhead during normal execution, while minimizing the amount of redundant work done after a crash of one or more nodes. We study the impact of our fault-tolerance mechanism on application efficiency, both on the Dutch DAS-2 system and using the European testbed of the EC-funded project GridLab. {\textcopyright} 2006 SAGE Publications.",
author = "G. Wrzesinska and {van Nieuwpoort}, R.V. and J. Maassen and T. Kielmann and H.E. Bal",
note = "WrzesinskaHPA05",
year = "2006",
doi = "10.1177/1094342006062528",
language = "English",
volume = "20",
pages = "103--114",
journal = "High performance computing applications",
issn = "1094-3420",
publisher = "SAGE Publications Inc.",
number = "1",

}

Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments. / Wrzesinska, G.; van Nieuwpoort, R.V.; Maassen, J.; Kielmann, T.; Bal, H.E.

In: High performance computing applications, Vol. 20, No. 1, 2006, p. 103-114.


TY - JOUR

T1 - Fault-tolerant Scheduling of Fine-grained Tasks in Grid Environments

AU - Wrzesinska, G.

AU - van Nieuwpoort, R.V.

AU - Maassen, J.

AU - Kielmann, T.

AU - Bal, H.E.

N1 - WrzesinskaHPA05

PY - 2006

Y1 - 2006

N2 - Divide-and-conquer is a well-suited programming paradigm for parallel Grid applications. Our Satin system efficiently schedules the fine-grained tasks of a divide-and-conquer application across multiple clusters in a grid. To accommodate long-running applications, we present a fault-tolerance mechanism for Satin that has negligible overhead during normal execution, while minimizing the amount of redundant work done after a crash of one or more nodes. We study the impact of our fault-tolerance mechanism on application efficiency, both on the Dutch DAS-2 system and using the European testbed of the EC-funded project GridLab. © 2006 SAGE Publications.

AB - Divide-and-conquer is a well-suited programming paradigm for parallel Grid applications. Our Satin system efficiently schedules the fine-grained tasks of a divide-and-conquer application across multiple clusters in a grid. To accommodate long-running applications, we present a fault-tolerance mechanism for Satin that has negligible overhead during normal execution, while minimizing the amount of redundant work done after a crash of one or more nodes. We study the impact of our fault-tolerance mechanism on application efficiency, both on the Dutch DAS-2 system and using the European testbed of the EC-funded project GridLab. © 2006 SAGE Publications.

U2 - 10.1177/1094342006062528

DO - 10.1177/1094342006062528

M3 - Article

VL - 20

SP - 103

EP - 114

JO - High performance computing applications

JF - High performance computing applications

SN - 1094-3420

IS - 1

ER -