Publications

SWARM: Reimagining scientific workflow management systems in a distributed world

Abstract

Modern scientific workflows process massive amounts of data from diverse instruments and sensors, leveraging geographically distributed, heterogeneous compute and storage resources (from leadership-class systems to edge devices) connected by high-performance networks. The diversity of resources introduces challenges in harnessing their full potential, with resilience issues arising across applications, system software, networks, storage, and hardware. Today, workflow management systems (WMS) coordinate the execution of computation and data management tasks across target resources. However, the centralized nature of these systems makes them vulnerable to faults and scalability issues that may result in failures of entire computational campaigns. This paper introduces a novel agentic framework for workflow management, fully distributing and decentralizing the WMS functions and modeling them as swarm …

Date
December 26, 2024
Authors
Prasanna Balaprakash, Krishnan Raghavan, Franck Cappello, Ewa Deelman, Anirban Mandal, Hongwei Jin, Imtiaz Mahmud, Komal Thareja, Shixun Wu, Pawel Zuk, Mariam Kiran, Zizhong Chen, Sheng Di, Kesheng Wu
Journal
The International Journal of High Performance Computing Applications
Pages
10943420251339317
Publisher
SAGE Publications