User-Transparent Checkpointing and Restart for Parallel Computers

Bernd Bieker, Erik Maehle

Abstract

The continuously increasing demand for computing power can only be satisfied by parallel computers. Because such systems contain a large number of components, the probability of a fault increases as well. Fault-tolerance techniques can be used to avoid system downtime and thus improve dependability. This paper presents a technique for user-transparent backward error recovery in message-passing systems. Since application programming for parallel systems is a difficult task, the goal is a user-transparent approach to introducing fault tolerance. This approach not only relieves the programmer of the additional burden of fault tolerance, it also helps to administer a parallel system: long-running applications can be forced to stop so that higher-priority programs can execute. Afterwards the stopped application can continue without any loss in overall execution time. A sample implementation is available on a real parallel system (a Parsytec PowerXplorer running the Parix operating system software). The paper presents the basic concept and measurements of the resulting runtime overhead.
Original language: English
Title of host publication: Fault-Tolerant Parallel and Distributed Systems
Number of pages: 15
Place of publication: Boston, MA
Publisher: Springer US
Publication date: 1998
Pages: 385-399
ISBN (Print): 978-1-4613-7488-6
ISBN (Electronic): 978-1-4615-5449-3
Publication status: Published - 1998
