User-Transparent Checkpointing and Restart for Parallel Computers

Bernd Bieker, Erik Maehle

Abstract

The continuously increasing demand for computing power can only be satisfied by parallel computers. Due to the large number of components in such systems, the probability of a fault increases as well. Fault-tolerance techniques can be used to circumvent system downtimes and thus improve dependability. This paper presents a technique for user-transparent backward error recovery for message-passing systems. Since application programming for parallel systems is a difficult task, a user-transparent approach to introducing fault tolerance is the goal. This approach not only relieves the programmer of the additional burden of fault tolerance, it also helps to administer a parallel system by forcing long-running applications to stop and letting higher-priority programs execute. Afterwards the stopped application can continue without any loss in the overall execution time. A sample implementation is available on a real parallel system (Parsytec PowerXplorer running the operating system software Parix). The paper presents the basic concept and measurements of the resulting runtime overhead.
Original language: English
Title of host publication: Fault-Tolerant Parallel and Distributed Systems
Number of pages: 15
Place of publication: Boston, MA
Publisher: Springer US
Publication date: 1998
Pages: 385-399
ISBN (Print): 978-1-4613-7488-6
ISBN (Electronic): 978-1-4615-5449-3
Publication status: Published - 1998
