Abstract
This paper describes a distributed algorithm for scheduling parallel programs represented by (macro-) dataflow graphs on multicomputer systems such that they are executed in a fault-tolerant way. Fault tolerance is based on dynamic redundancy comprising checkpointing, self-diagnosis and rollback recovery. The schedule is computed dynamically during the runtime of the process system. It works in a completely distributed way by making nodes which have finished a task responsible for allocating their ready task successors. The basic idea for achieving fault tolerance is to keep all input data sets of a task as checkpoints on different nodes in such a way that after a node failure the lost task can automatically be restarted on a remaining intact node. So, fail-soft behavior is realized in a fully distributed and user-transparent way. The algorithm is described in detail for the 1-fault case and some performance measurements on a multi-transputer system are given. Furthermore a graphical programming environment is presented which supports the programmer in all phases of program design by applying the abstract dataflow model of parallel computation.
| Original language | English |
|---|---|
| Title of host publication | Fault-Tolerant Parallel and Distributed Systems |
| Number of pages | 15 |
| Place of Publication | Boston, MA |
| Publisher | Springer US |
| Publication date | 1998 |
| Pages | 357-371 |
| ISBN (Print) | 978-1-4613-7488-6 |
| ISBN (Electronic) | 978-1-4615-5449-3 |
| DOIs | |
| Publication status | Published - 1998 |