Fault-Tolerant Dynamic Task Scheduling Based on Dataflow Graphs

Erik Maehle, Franz-J. Markus


This paper describes a distributed algorithm for scheduling parallel programs represented by (macro-) dataflow graphs on multicomputer systems such that they are executed in a fault-tolerant way. Fault tolerance is based on dynamic redundancy comprising checkpointing, self-diagnosis and rollback recovery. The schedule is computed dynamically during the runtime of the process system. It works in a completely distributed way by making nodes which have finished a task responsible for allocating their ready task successors. The basic idea for achieving fault tolerance is to keep all input data sets of a task as checkpoints on different nodes in such a way that after a node failure the lost task can automatically be restarted on a remaining intact node. So, fail-soft behavior is realized in a fully distributed and user-transparent way. The algorithm is described in detail for the 1-fault case and some performance measurements on a multi-transputer system are given. Furthermore a graphical programming environment is presented which supports the programmer in all phases of program design by applying the abstract dataflow model of parallel computation.
Original languageEnglish
Title of host publicationFault-Tolerant Parallel and Distributed Systems
Number of pages15
Place of PublicationBoston, MA
PublisherSpringer US
Publication date1998
ISBN (Print)978-1-4613-7488-6
ISBN (Electronic)978-1-4615-5449-3
Publication statusPublished - 1998


Dive into the research topics of 'Fault-Tolerant Dynamic Task Scheduling Based on Dataflow Graphs'. Together they form a unique fingerprint.

Cite this