Publications

Failure prediction and localization in large scientific workflows

Abstract

Scientific workflows provide a portable representation for scientific applications' coordinated input, output, and execution management for highly parallel executions of interdependent computations, as well as support for sharing and validating the results. As scientific workflows scale to hundreds of thousands of distinct tasks, failures due to software and hardware faults become increasingly common. Real-time execution monitoring provides a foundation for improving the transparency and resilience of the workflows in the face of stochastic and systematic faults. Building on previous work on early detection of these failure scenarios, we describe methods for guiding remediation to stochastic errors through predictions of the impact on application performance. To complement this analysis, we also describe techniques for isolating systematic sources of failures. We evaluate our methods on a representative sample of …

Date
November 14, 2011
Authors
Taghrid Samak, Dan Gunter, Monte Goode, Ewa Deelman, Gaurang Mehta, Fabio Silva, Karan Vahi
Book
Proceedings of the 6th workshop on Workflows in support of large-scale science
Pages
107-116