Abstract
The continuously increasing demand for computing power can only be satisfied by parallel computers. Due to the large number of components in such systems, the probability of a fault increases as well. Fault-tolerance techniques can be used to circumvent system downtimes and thus improve dependability. This paper presents a technique for user-transparent backward error recovery in message passing systems. Since application programming for parallel systems is a difficult task, the goal is a user-transparent approach to introducing fault tolerance. This approach not only relieves the programmer of the additional burden of fault tolerance, it also helps to administer a parallel system: long-running applications can be forced to stop so that higher-priority programs can execute. Afterwards, the stopped application can continue without any loss in the overall execution time. A sample implementation is available on a real parallel system (Parsytec PowerXplorer running the operating system software Parix). The basic concept and measurements of the resulting runtime overhead are presented.
Original language | English |
---|---|
Title of host publication | Fault-Tolerant Parallel and Distributed Systems |
Number of pages | 15 |
Place of Publication | Boston, MA |
Publisher | Springer US |
Publication date | 1998 |
Pages | 385-399 |
ISBN (Print) | 978-1-4613-7488-6 |
ISBN (Electronic) | 978-1-4615-5449-3 |
Publication status | Published - 1998 |