Abstract
Divide-and-conquer is a well-suited programming paradigm for parallel Grid applications. Our Satin system efficiently schedules the fine-grained tasks of a divide-and-conquer application across multiple clusters in a grid. To accommodate long-running applications, we present a fault-tolerance mechanism for Satin that has negligible overhead during normal execution, while minimizing the amount of redundant work done after a crash of one or more nodes. We study the impact of our fault-tolerance mechanism on application efficiency, both on the Dutch DAS-2 system and using the European testbed of the EC-funded project GridLab. © 2006 SAGE Publications.
Original language | English |
---|---|
Pages (from-to) | 103-114 |
Journal | High performance computing applications |
Volume | 20 |
Issue number | 1 |
Publication status | Published - 2006 |