Graph neural networks for detecting anomalies in scientific workflows

Abstract

Identifying and addressing anomalies in complex, distributed systems can be challenging for reliable execution of scientific workflows. We model these workflows as directed acyclic graphs (DAGs), where the nodes and edges of the DAGs represent jobs and their dependencies, respectively. We develop graph neural networks (GNNs) to learn patterns in the DAGs and to detect anomalies at the node (job) and graph (workflow) levels. We investigate workflow-specific GNN models that are trained on a particular workflow and workflow-agnostic GNN models that are trained across the workflows. Our GNN models, which incorporate both individual job features and topological information from the workflow, show improved accuracy and efficiency compared to conventional learning methods for detecting anomalies. While joint trained with multiple scientific workflows, our GNN models reached an accuracy more than 80 …

Date: January 1, 1970
Authors: Hongwei Jin, Krishnan Raghavan, George Papadimitriou, Cong Wang, Anirban Mandal, Mariam Kiran, Ewa Deelman, Prasanna Balaprakash
Journal: The International Journal of High Performance Computing Applications
Volume: 37
Issue: 3-4
Pages: 394-411
Publisher: SAGE Publications

View Paper