Publications
A flexible framework for fault tolerance in the grid
Abstract
This paper presents a failure detection service (FDS) and a flexible failure handling framework (Grid-WFS) as a fault tolerance mechanism for the Grid. The FDS detects both task crashes and user-defined exceptions. A major challenge in providing such a generic failure detection service on the Grid is detecting those failures without requiring any modification to either the Grid protocol or the local policy of each Grid node. This paper describes how to overcome this challenge with a notification mechanism based on interpreting the notification messages delivered from the underlying Grid resources. Grid-WFS, built on top of the FDS, allows users to achieve failure recovery in a variety of ways depending on the requirements and constraints of their applications. Central to the framework is flexibility in handling failures. This paper describes how to achieve the flexibility by …
- Date
- 2003
- Authors
- Soonwook Hwang, Carl Kesselman
- Journal
- Journal of Grid Computing
- Volume
- 1
- Pages
- 251-272
- Publisher
- Kluwer Academic Publishers