Abstract
This paper describes a distributed algorithm for scheduling parallel programs, represented by (macro-)dataflow graphs, on multicomputer systems such that they execute in a fault-tolerant way. Fault tolerance is based on dynamic redundancy comprising checkpointing, self-diagnosis, and rollback recovery. The schedule is computed dynamically at runtime of the process system. Scheduling is completely distributed: each node that has finished a task is responsible for allocating that task's ready successors. The basic idea for achieving fault tolerance is to keep all input data sets of a task as checkpoints on different nodes, so that after a node failure the lost task can automatically be restarted on a remaining intact node. Fail-soft behavior is thus realized in a fully distributed and user-transparent way. The algorithm is described in detail for the 1-fault case, and some performance measurements on a multi-transputer system are given. Furthermore, a graphical programming environment is presented that supports the programmer in all phases of program design by applying the abstract dataflow model of parallel computation.
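The checkpointing idea in the abstract can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the paper's algorithm: the names `pick_nodes`, `run_graph`, `GRAPH`, and `FUNCS` are hypothetical, the least-loaded placement rule is invented for illustration, and the distributed allocation by finishing nodes is collapsed into one sequential loop. It assumes at least three nodes and at most one node failure (the 1-fault case).

```python
def pick_nodes(alive, load):
    """Return (executor, checkpoint holder): two distinct intact nodes.

    Assumption: the least-loaded node executes the task and the next
    least-loaded node keeps the checkpoint. This rule is illustrative only.
    """
    ranked = sorted(alive, key=lambda n: (load[n], n))
    return ranked[0], ranked[1]


def run_graph(graph, funcs, num_nodes, crash_task=None):
    """Run a dataflow graph; the node executing `crash_task` fails once.

    graph: task name -> list of predecessor task names
    funcs: task name -> function taking the list of input values
    """
    alive = set(range(num_nodes))
    load = {n: 0 for n in alive}
    results, trace = {}, []
    pending = dict(graph)
    while pending:
        # A task is ready once all of its predecessors have produced output.
        ready = [t for t, preds in pending.items()
                 if all(p in results for p in preds)]
        if not ready:
            raise ValueError("graph contains a cycle")
        for task in ready:
            inputs = [results[p] for p in pending[task]]
            executor, holder = pick_nodes(alive, load)
            # The input data set is kept as a checkpoint on a different node.
            checkpoint = (holder, list(inputs))
            if task == crash_task:
                # Simulated failure: the executor is lost mid-task.
                alive.discard(executor)
                trace.append((task, executor, "lost"))
                # Rollback: restart on the intact node holding the checkpoint.
                executor, inputs = checkpoint
            results[task] = funcs[task](inputs)
            load[executor] += 1
            trace.append((task, executor, "done"))
            del pending[task]
    return results, trace


# Hypothetical diamond-shaped graph: a feeds b and c, which both feed d.
GRAPH = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"]}
FUNCS = {
    "a": lambda xs: 2,
    "b": lambda xs: xs[0] + 1,
    "c": lambda xs: xs[0] * 10,
    "d": lambda xs: sum(xs),
}

clean_results, _ = run_graph(GRAPH, FUNCS, num_nodes=3)
faulty_results, faulty_trace = run_graph(GRAPH, FUNCS, num_nodes=3, crash_task="c")
```

In this sketch the run with a simulated failure produces the same results as the failure-free run, because the lost task is recomputed from its checkpointed inputs on a node other than the one that failed.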
Field | Value |
---|---|
Original language | English |
Title of host publication | Fault-Tolerant Parallel and Distributed Systems |
Number of pages | 15 |
Place of Publication | Boston, MA |
Publisher | Springer US |
Publication date | 1998 |
Pages | 357-371 |
ISBN (Print) | 978-1-4613-7488-6 |
ISBN (Electronic) | 978-1-4615-5449-3 |
Publication status | Published - 1998 |