TY - CONF
T1 - Transparent Migration and Rollback for Unmodified Applications in Workstation Clusters
AU - Petri, Stefan
AU - Bolz, Matthias
AU - Langendörfer, Horst
PY - 1998
Y1 - 1998
N2 - Programmers and users of compute-intensive scientific applications often do not want to (or even cannot) code load balancing and fault tolerance into their programs. The PBEAM system [PL95, PSLS96] uses a global virtual name space to provide migration and rollback transparency in user space for distributed groups of processes on workstations. Applications always use the same virtual names for the operating system objects, independent of their current real location. The system calls are interposed and their parameters translated between the name spaces. Unlike other migration mechanisms, PBEAM does not require the applications to be written for a specific programming model or communication library. The first approach to executing applications in the virtual name space was to link the programs with a modified system library. In this paper we describe the design and implementation of a separate system call interposition process [Bol97] that accesses the application via the debugging interface. The main advantage of this approach is that it can handle even unmodified (e.g. commercially bought) application programs. We compare measured performance figures with previous similar approaches [MS88, PSLS96] and the modified system library. 1 Motivation and Introduction Networks of Workstations (NOWs) are becoming increasingly attractive as platforms for parallel compute-intensive applications, because their price/performance ratio is significantly better than that of massively parallel systems (MPPs) [KHG96, KLB98]. (Footnote: At the time of writing funded by DFG contract SFB 342 at Institute for Computer Science, Munich University of Technology. Informatik-Bericht 98-02, TU Braunschweig.) However, in contrast to MPPs, most NOWs operate in multi-user mode: the nodes are shared between applications, and interactive users must not be disturbed by resource-hungry background computations.
These constraints make it desirable to use a dynamic load balancing facility that moves work from overloaded to idle nodes, or evacuates machines that are claimed for other purposes, e.g. by interactive users. Additionally, the high probability of machine failures [LCP91, CCS95] necessitates fault tolerance measures for long-running applications. Programmers and users of compute-intensive scientific applications are mainly interested in getting their problem solved. They expect load balancing and fault tolerance as services of the underlying operating or run-time system and do not want to care about them in their application code. A solution for these problems is an application-transparent migration and checkpointing system. Unfortunately, off-the-shelf workstation platforms provide these services only minimally, if at all. In the next section, we give a brief overview of our PBEAM load balancing system for distributed applications on clusters of workstations. Because the basic problem of “freezing” the application’s state and reviving it afterwards is the same in both cases, checkpointing and migration, PBEAM, like other systems (e.g. [LS92, Ste96]), handles them both. After that we focus on our approach to applying PBEAM transparently to unmodified binary application programs, and show some performance figures. We conclude after a short comparison to related work. For a broader discussion of concepts and more technical details beyond this paper we refer to [PL95, PSLS96]. A shorter version of this report has been published in the Proceedings of the 2nd Workshop on Runtime Support for Parallel Processing at the 12th International Parallel Processing Symposium (IPPS’98) [PBL98]. 2 Overview of PBEAM The goals of this project are to provide application-transparent process migration and checkpointing/rollback for distributed applications on clusters of workstations, running in user space on unmodified Unix systems.
The state of a distributed computation consists of the states of its processes and the states of the communication links between them. We distinguish between a process’s internal and external state [PL95]. The internal state consists of the address space and register contents, including program counter and stack pointer. The external state comprises the process’s relation to the world outside its address space: the allocated resources, related processes and the communication peers. A process can manipulate its external state only through services of the system kernel. These services operate on objects like files, processes, communication endpoints, etc. They are invoked through system calls, whose arguments name the objects to operate on (file names, file descriptors, process numbers, transport addresses, etc.). These names depend on the current location and time of the process execution. If a process is moved from one node to another, or an application is restarted from a checkpoint, the names for the objects will change. However, to provide transparency, the application must be able to work with the same names regardless of being moved in space or time. [Figure 1: The virtual name space between kernel and application.] Therefore, PBEAM introduces a system-wide virtual name space for process IDs, transport addresses and file names. A virtual process table and a virtual address table are maintained for mapping between the virtual and current real names. The applications’ system calls are interposed, the parameters are translated from the virtual into the underlying current real name space, and then the real system service is executed. In the same manner the return values are translated back from the real into the virtual name space. To avoid name collisions, the processes are assigned globally unique virtual process IDs.
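The translation step described above can be illustrated with a small sketch. It is not taken from the PBEAM sources: the class `VirtualProcessTable`, the function `interposed_kill`, and the `real_kill` callback are hypothetical names chosen for illustration, and the table lives in a single process here rather than in a name space administration daemon.

```python
import itertools


class VirtualProcessTable:
    """Hypothetical table mapping virtual process IDs to (node, real pid)."""

    def __init__(self):
        self._next_vpid = itertools.count(1)  # source of unique virtual IDs
        self._to_real = {}                    # vpid -> (node, real_pid)
        self._to_virtual = {}                 # (node, real_pid) -> vpid

    def register(self, node, real_pid):
        """Assign a fresh virtual ID to a newly created process."""
        vpid = next(self._next_vpid)
        self._to_real[vpid] = (node, real_pid)
        self._to_virtual[(node, real_pid)] = vpid
        return vpid

    def to_real(self, vpid):
        """Translate a system call parameter: virtual -> current real name."""
        return self._to_real[vpid]

    def to_virtual(self, node, real_pid):
        """Translate a return value: current real name -> virtual name."""
        return self._to_virtual[(node, real_pid)]

    def rebind(self, vpid, node, real_pid):
        """After migration or restart, point the same vpid at the new real name."""
        del self._to_virtual[self._to_real[vpid]]
        self._to_real[vpid] = (node, real_pid)
        self._to_virtual[(node, real_pid)] = vpid


def interposed_kill(table, vpid, signal, real_kill):
    """Interpose a kill-like call: translate the parameter, then run the real service."""
    node, real_pid = table.to_real(vpid)
    return real_kill(node, real_pid, signal)
```

In this sketch the application only ever sees the value returned by `register`; a migration changes the table entry through `rebind`, so later calls such as `interposed_kill` reach the process at its new location without the application noticing.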
The system call interposition also tracks the changes to the external state. While interposing system calls with a modified system kernel promises the best transparency and performance, the dependency on a special operating system version prevents a system from being used widely [Jon92, PL95]. An early design decision was therefore not to make any system kernel modifications [PL95, Jon92]. Basically, there are two possibilities to perform the system call interposition outside the kernel. Our first approach was a modified system call library. The system call functions are replaced by or wrapped into our own versions [CM92, LS92, Jon92]. It requires that the application can be linked with the modified library. The other possibility is to control the application processes by a separate controller process via the debugging interface. Most modern Unix flavors offer extensions to the “traditional” debugger interface that allow a controlled process to be stopped just before and after execution of a system call, i.e. when entering and exiting kernel mode, respectively [Sun90, FG91]. The main advantage of this approach is that it can work with applications that are available in binary form only, already completely linked, e.g. commercially bought programs. In the PBEAM system we have separated the system call interposition component from the name space administration (the PBEAM daemon “pbeamd” in fig. 2), thus making it easy to switch between different versions of both. To explore scalability, we have also implemented a centralized and a distributed version of the name space administration [PSB+97]. Other components of the system are a scheduler and a dispatcher that executes the decisions made by the scheduler. The system call interposition component has three main tasks; it gets its name from the first one: (i) During normal operation, it does the name space translations, as described above.
(ii) When doing a checkpoint or migration, it has to capture the internal and external state, save it onto disk, via TCP to a checkpoint server, or to another machine
UR - https://www.semanticscholar.org/paper/Transparent-Migration-and-Rollback-for-Unmodified-Petri-Bolz/d120195f7f9c0b4cc92e071a10d6c02b0df011af
M3 - Conference Papers
SP - 1
EP - 23
ER -