Fault-tolerant routing, reconfiguration and backward error recovery for parallel systems

Bernd Bieker*, Geert Deconinck, Erik Maehle, Johan Vounckx

*Corresponding author for this work

Abstract

Despite improvements in hardware design, parallel systems lack dependability because of the large number of components they consist of. One way to introduce fault tolerance into such systems is backward error recovery, in which failed modules can be replaced by spares. This work describes an approach to building a fault-tolerant parallel system. To this end, it presents system reconfiguration and recovery based on checkpointing and rollback, as well as a fault-tolerant routing algorithm. Acceptance of fault tolerance is improved by integrating a user-transparent routing, reconfiguration, checkpointing and rollback protocol. Furthermore, the restriction to a fail-silent failure model (used in many approaches) is relaxed in our work towards fail-time-bounded behavior.

Original language: English
Journal: Computer Systems Science and Engineering
Volume: 12
Issue number: 4
Pages (from-to): 245-253
Number of pages: 9
ISSN: 0267-6192
Publication status: Published - 01.07.1997
