Fault-tolerant routing, reconfiguration and backward error recovery for parallel systems

Bernd Bieker*, Geert Deconinck, Erik Maehle, Johan Vounckx

*Corresponding author for this work

Abstract

Despite improvements in hardware design, parallel systems lack dependability because of the large number of components they comprise. One way to introduce fault tolerance into such systems is backward error recovery, in which failed modules can be replaced by spares. This work describes an approach to building a fault-tolerant parallel system. To this end, system reconfiguration and recovery based on checkpointing and rollback are presented, together with a fault-tolerant routing algorithm. Acceptance of fault tolerance is improved by integrating user-transparent routing, reconfiguration, checkpointing and rollback protocols. Furthermore, the restriction to a fail-silent failure model (used in many approaches) is relaxed in our work towards fail-time-bounded behavior.

Original language: English
Journal: Computer Systems Science and Engineering
Volume: 12
Issue number: 4
Pages (from-to): 245-253
Number of pages: 9
ISSN: 0267-6192
Publication status: Published - 1 July 1997
