Application-level checkpointing techniques for parallel programs

Abstract

In its simplest form, checkpointing is the act of saving a program’s computation state in a form external to the running program, for example to a filesystem. The checkpoint files can then be used to resume computation after the original process(es) fail, ideally with minimal loss of computing work. A checkpoint can be taken using a variety of techniques at every level of the system, from special hardware/architectural checkpointing features to modification of the user’s source code. This survey discusses the techniques used in application-level checkpointing, with special attention to techniques for checkpointing parallel and distributed applications.

Date: 2006
Authors: John Walters, Vipin Chaudhary
Conference: Distributed Computing and Internet Technology
Pages: 221-234
Publisher: Springer Berlin/Heidelberg

