A Scalable Implementation of Fault Tolerance for Massively Parallel Systems

Geert Deconinck, J. Vounckx, Rudy Lauwereins, Jörn Altmann, F. Balbach, Mario Dal Cin, J. G. Silva, Henrique Madeira, B. Bieker, E. Maehle

Abstract

For massively parallel systems, the probability of a system failure due to a random hardware fault becomes statistically very significant because of the huge number of components. Besides, fault injection experiments show that multiple failures go undetected, leading to incorrect results. Hence, massively parallel systems require the ability to tolerate these faults that will occur. The FTMPS project presents a scalable implementation to integrate the different steps towards fault tolerance into existing HPC systems. On the initial parallel system, only 40% of (randomly injected) faults do not cause the application to crash or produce wrong results. In the resulting FTMPS prototype, more than 80% of these faults are correctly detected and recovered. The resulting overhead for the application is only between 10 and 20%. Evaluation of the different, co-operating fault tolerance modules shows the flexibility and the scalability of the approach.

1 Introduction

The huge number of components in a massively parallel system significantly increases the probability of a single component failure. However, the failure of a single entity need not make the whole system useless. Hence, massively parallel systems require fault tolerance; i.e. they require the ability to cope with these faults that, statistically, will occur. ESPRIT project 6731 (FTMPS) implemented a practical approach to Fault Tolerant Massively Parallel Systems [1, 2]. In this paper, the structure of the developed FTMPS software modules and tools, their scalable implementation and their important results are explained.

Section 1 explains the structure of the FTMPS modules and the target system; besides, fault injection experiments and field data highlight the motivation. Section 2 elaborates the different fault tolerance modules: local error detection and system-level diagnosis trigger the system reconfiguration modules; application recovery is based on checkpointing and rollback; support for the operator is given via a set of front-end tools. For the different modules, the emphasis is on the scalability of the approach and on the results. Section 3 shows how the integrated, yet modular and flexible FTMPS approach significantly improved the fault tolerance capabilities of a massively parallel system: the resulting prototype is able to handle a significantly larger percentage of (randomly injected) faults correctly than the initial system.

1.1 The FTMPS approach

The integrated FTMPS software modules consist of several building blocks for achieving fault tolerance, as shown in Figure 1. The co-operating software modules run on the host and on the different nodes of the massively parallel target system. Error detection and local diagnosis are done on every processing element within the parallel multiprocessor; these modules run concurrently to the applications. Application recovery is based on checkpointing and rollback: the application itself starts the user-driven checkpointing (UDCP) or the hybrid checkpointing (HCP); a minimal sketch of this interface is given below. The local diagnosis and checkpointing modules have counterparts running at the host: a global diagnosis module and a checkpoint controller responsible for the recovery-line management. In addition, a recovery controller is responsible for the system reconfiguration after a permanent failure of a component: possibly the application processes are remapped to spare nodes and new routing tables must be set up.

[Figure 1: FTMPS building blocks.]
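To make the checkpointing interface concrete, the following minimal sketch shows how an application process might drive user-driven checkpointing and rollback. It is an illustration under our own assumptions: the ftmps_* function names and the in-memory "stable store" are hypothetical stand-ins, not the actual FTMPS API, which moves checkpoint data to host-managed storage.

    #include <stdio.h>

    struct solver_state {
        long   iteration;     /* progress of the computation */
        double grid[256];     /* application data worth preserving */
    };

    /* Hypothetical UDCP-style calls, stubbed with an in-memory "stable
     * store" so the sketch is self-contained. */
    static struct solver_state stable_store;
    static int have_checkpoint = 0;

    static int ftmps_checkpoint(const struct solver_state *s)
    {
        stable_store = *s;        /* snapshot the local state */
        have_checkpoint = 1;
        return 0;
    }

    static int ftmps_restart_pending(void)
    {
        return have_checkpoint;   /* real code: ask the recovery controller */
    }

    static void ftmps_restore(struct solver_state *s)
    {
        *s = stable_store;        /* roll back to the last recovery line */
    }

    static void solver(struct solver_state *s, long max_iter)
    {
        if (ftmps_restart_pending())
            ftmps_restore(s);     /* resume from the last consistent state */

        for (; s->iteration < max_iter; s->iteration++) {
            /* ... one compute/communication step ... */
            if (s->iteration % 100 == 0)
                ftmps_checkpoint(s);  /* user-driven: the application picks
                                         a moment where its state is safe */
        }
    }

    int main(void)
    {
        struct solver_state s = { 0, { 0.0 } };
        solver(&s, 1000);
        printf("finished at iteration %ld\n", s.iteration);
        return 0;
    }

The point this illustrates is that, with user-driven checkpointing, the application chooses both the moment of the checkpoint and the minimal state to be saved, which helps keep the checkpointing overhead low.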
An interface to the operator, the application controller (AC), is provided by the operator site software (OSS). This OSS keeps track of the relations between failures and applications by means of the error log controller (ELC). In addition, a statistical tool for the evaluation of the databases is available, as well as a system visualisation tool. These different modules of the FTMPS software are described in more detail in section 2.

The entire FTMPS software was set up to be adaptable to a wide range of massively parallel systems. Therefore, a unifying system model (USM) was introduced [1, 2]: systems that can be represented by the USM can be used as a target for the FTMPS software. The USM is based on two parts: the data-net (D-net) and the control-net (C-net). The latter is used by the system software (initialisation, monitoring, etc.) whereas the former is used by the applications. The D-net is divided into partitions for the applications (space sharing). Every partition consists of one or more reconfiguration entities (REs), which are the smallest entities that are used for reconfiguration. An RE can contain spare processing elements for replacing a failed node within that RE. If no (more) spares are available, the entire RE is indicated as being failed and will be replaced by an entire spare RE; see the sketch at the end of this subsection.

The FTMPS concepts are valid for different massively parallel systems. The prototypes of the FTMPS modules have been developed on two different Parsytec machines: the GCel-Xplorer, based on a 2D grid of T805 transputers, and the GC/PP-PowerXplorer, based on a 2D grid of PowerPC-601 and T805 processors. These massively parallel systems are connected via a host to the user/operator environment and the disks. In this paper, we only consider the fault tolerance aspects of the multiprocessor, and consider the host and disks to be reliable. Two considerations drive this decision. First, the number of processors (and hence the probability of a fault) is much larger on the massively parallel system than on the host. Second, there exist a lot of well-known fault tolerance methods for uniprocessors, and for implementing stable storage. Alternatively, if no fault-tolerant host is available, extra fault tolerance measures should be applied to the control-net. The communication concept used within the target system is synchronous message passing; the processing elements are able to handle processes at least at two priority levels. Target applications come from scientific number-crunching domains without real-time constraints.
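The RE-based replacement policy described above can be summarised in a few lines. The sketch below is an illustrative model under our own naming, not FTMPS code: a failed processing element is first replaced by a spare node inside its own RE, and only when no spares remain is the entire RE marked as failed, to be replaced by a spare RE.

    #include <stdio.h>

    struct re {                   /* reconfiguration entity (RE) */
        int failed;               /* 1 once the RE as a whole is given up */
        int spare_nodes;          /* unused spare processing elements */
    };

    /* Handle a permanent node failure inside one RE.  Returns 0 if the
     * node could be replaced locally, 1 if the whole RE must be
     * replaced by a spare RE from the partition. */
    static int handle_node_failure(struct re *re)
    {
        if (re->spare_nodes > 0) {
            re->spare_nodes--;    /* remap the process to a spare node;
                                     new routing tables are then set up */
            return 0;
        }
        re->failed = 1;           /* no spares left: fail the entire RE */
        return 1;
    }

    int main(void)
    {
        struct re re = { 0, 2 };  /* an RE with two spare nodes */
        for (int fault = 1; fault <= 3; fault++) {
            if (handle_node_failure(&re))
                printf("fault %d: RE exhausted, replace by spare RE\n", fault);
            else
                printf("fault %d: remapped to spare node (%d left)\n",
                       fault, re.spare_nodes);
        }
        return 0;
    }

Keeping replacement local to one RE is consistent with the scalability aim: most failures are absorbed inside a single RE and do not disturb the rest of the partition.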
1.2 Fault injection experiments as motivation for fault tolerance

In the FTMPS project, fault injection has been used to experimentally evaluate the target system. Faults were injected in the parallel machines used (Parsytec PowerPC-based PowerXplorers) at the beginning and at the end of the project, so that the improvement brought by the FTMPS software modules and tools could be measured.

To inject faults, a software-based fault injector was developed, called Xception. It relies on the advanced debugging facilities included in the trap-handling subsystem of the PowerPC-601 processor, and works in two phases. First, it uses the breakpoint mechanism to interrupt the normal program flow when a user-chosen trigger condition is reached (for instance, a certain address is accessed or a time-out has expired). Second, it interferes with the execution of one of the next instructions such that it simulates a fault in one of the functional units of the processor or main memory. For instance, to inject a fault in the integer arithmetic and logic unit (ALU) of the processor, Xception works as follows: when the trigger condition is reached, it executes the program in single-step mode until an instruction that uses the ALU is executed (e.g. an addition), and changes the destination register in a user-specified way. A typical change is a random bit flip. Then, the program continues at full speed. This technique has several advantages. Being totally software-based, it can be easily adapted to many systems, as long as the processor used has the required built-in debug capabilities, as all modern processors do. Besides, the program subjected to the injection is executed at full speed, and does not have to be changed in any way. For a detailed description of the injector see [3].

Of the experiments made at the beginning of the project, Table 1 shows the results for two programs: Matmult, a simple program that multiplies matrices, and Ising, a bigger program that simulates the spin of particles. The outcome of the experiments was classified according to the following four categories:

- Nothing detected, correct results. This corresponds to those faults that are absorbed by the natural redundancy of the machine or the application.
- Nothing detected, wrong results. The worst situation: since nothing unusual happens, the user thinks that it was a good run, but unfortunately the output is wrong. If the results do not appear "strange" to the user, the program is not rerun.
- Error detected. The program is aborted with an error notification, e.g. indicating a memory protection fault.
- System crash. The system hangs and has to be rebooted.

              Correct   Wrong   Detected   Crash
    Matmult     23%      25%      48%        4%
    Ising       57%       6%      35%        2%

Table 1: Experiments with a standard machine: 3000 faults for Matmult, 4000 for Ising. All faults were transient, and consisted of two simultaneous bit flips affecting one machine instruction.

These results just show that faults indeed have bad consequences, but say nothing about the fault rate to expect in a machine. For that, we can look at statistics for the MTBF (mean time between failures) published by several computing centres that run massively parallel machines. For instance, the Oak Ridge National Laboratory (ORNL), in the USA, has published the following data about two Intel machines, an XP/S5 with 64 processors and an XP/S 150 with 1024 nodes:

    Feb. 1995   March 1995   April 1995
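Returning to the fault model used in these experiments: each injected fault consisted of two simultaneous bit flips affecting one machine instruction. The fragment below is a toy stand-alone illustration of just that corruption step, applied to a 32-bit value such as an ALU destination register; it is not Xception itself, which performs the corruption from within the PowerPC trap-handling machinery.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    /* Flip two distinct, randomly chosen bits of a 32-bit word: the
     * transient fault model used for the experiments of Table 1. */
    static uint32_t inject_double_bit_flip(uint32_t value)
    {
        int b1 = rand() % 32;
        int b2;
        do {
            b2 = rand() % 32;     /* pick a second, different bit */
        } while (b2 == b1);
        return value ^ (UINT32_C(1) << b1) ^ (UINT32_C(1) << b2);
    }

    int main(void)
    {
        srand((unsigned)time(NULL));
        uint32_t reg = 42;        /* stand-in for an ALU destination register */
        uint32_t faulty = inject_double_bit_flip(reg);
        printf("before: 0x%08x  after: 0x%08x\n",
               (unsigned)reg, (unsigned)faulty);
        return 0;
    }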
Original language: English
Pages: 205-212
Number of pages: 8
Publication status: Published - 01.03.1996
Event: Proc. Second Int. Euromicro Conf. on Massively Parallel Computing Systems (MPCS) - Ischia, Italy
Duration: 06.05.1996 - 09.05.1996

Conference

Conference: Proc. Second Int. Euromicro Conf. on Massively Parallel Computing Systems (MPCS)
Country/Territory: Italy
City: Ischia
Period: 06.05.96 - 09.05.96
