Replication-based fault tolerance for MPI applications

Abstract

As computational clusters increase in size, their mean time to failure reduces drastically. Typically, checkpointing is used to minimize the loss of computation. Most checkpointing techniques, however, require central storage for storing checkpoints. This results in a bottleneck and severely limits the scalability of checkpointing, while also proving to be too expensive for dedicated checkpointing networks and storage systems. We propose a scalable replication-based MPI checkpointing facility. Our reference implementation is based on LAM/MPI; however, it is directly applicable to any MPI implementation. We extend the existing state of fault-tolerant MPI with asynchronous replication, eliminating the need for central or network storage. We evaluate centralized storage, a Sun-X4500-based solution, an EMC storage area network (SAN), and the Ibrix commercial parallel file system and show that they are not scalable …

Date: 2008
Authors: John Paul Walters, Vipin Chaudhary
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 20
Issue: 7
Pages: 997-1010
Publisher: IEEE
