Abstract
Cooperative applications are widely used, e.g. for parallel computation or distributed information processing. While such applications meet users' demands and offer improved performance, they are also more exposed to faults in any of the computer nodes they use. Often a single fault can cause the entire application to fail. On the other hand, the redundancy in distributed systems can be utilized for fast fault detection and recovery. We therefore follow an approach based on duplicating each application process to detect crashes and malfunctions of single computer nodes. We concentrate on two aspects of efficient fault tolerance: fast fault detection, and recovery without significantly delaying the application's progress. The contribution of this work is twofold. First, we present a new fault detection protocol for duplicated processes. Second, we enhance a roll-forward recovery scheme so that it is applicable to a set of cooperative processes in conformity with the protocol.
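The general idea of detecting faults by duplicating a process can be illustrated with a minimal sketch. This is not the protocol proposed in the paper, whose details the abstract does not give. It only shows the generic duplicate-and-compare pattern, under the assumptions that a raised exception stands in for a node crash and that replica outputs can be compared for equality. The names `run_replica` and `detect_fault` are hypothetical.

```python
from typing import Any, Callable, Optional, Tuple


def run_replica(fn: Callable[[Any], Any], arg: Any) -> Tuple[bool, Optional[Any]]:
    """Run one replica. Assumption: an exception models a crash of its node."""
    try:
        return True, fn(arg)
    except Exception:
        return False, None


def detect_fault(
    replica_a: Callable[[Any], Any],
    replica_b: Callable[[Any], Any],
    arg: Any,
) -> Tuple[str, Optional[Any]]:
    """Execute the same step on two replicas and classify the outcome.

    Returns a (status, value) pair. The value is only set when at least
    one replica finished and no disagreement was observed.
    """
    ok_a, out_a = run_replica(replica_a, arg)
    ok_b, out_b = run_replica(replica_b, arg)

    if not ok_a and not ok_b:
        return "both_crashed", None
    if not ok_a:
        return "crash_a", out_b  # the surviving replica's result is kept
    if not ok_b:
        return "crash_b", out_a
    if out_a != out_b:
        # Two replicas alone cannot tell which one is wrong.
        return "mismatch", None
    return "ok", out_a
```

With only two replicas, a mismatch reveals that a fault occurred but not which replica is faulty, so some further step, such as a third execution, is needed to decide which result to keep.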
Original language | English |
---|---|
Title of host publication | Proceedings IEEE International Computer Performance and Dependability Symposium. IPDS 2000 |
Number of pages | 10 |
Publisher | IEEE |
Publication date | 01.01.2000 |
Pages | 48-57 |
ISBN (Print) | 0-7695-0553-8 |
Publication status | Published - 01.01.2000 |
Event | The 4th IEEE International Computer Performance and Dependability Symposium, Chicago, United States; Duration: 27.03.2000 → 29.03.2000; Conference number: 56663 |