
Designing Scientific Software One Workflow at a Time
Ewa Deelman (PI), Yolanda Gil (co-PI)
Funded by the National Science Foundation under CCF-0725332
Computational workflows have recently emerged as
an effective paradigm to manage large-scope terascale scientific analyses and
are a crucial technology to scale up to petascale levels. Workflows have been used
for decades to manage business processes in complex organizations by providing a
formalism to specify tasks, their dependencies, their requirements, and
products, and to track task execution over time. Similarly, computational
workflows represent computations that are often executed in geographically
distributed settings, their interdependencies, their requirements, and their
data products. In the last few years, research focused on the creation and
execution of computational workflows has resulted in great gains in
productivity, feasibility, and scalability of quite complex scientific
analyses. Existing workflow systems have been demonstrated in a variety of
scientific applications where workflow creation draws from catalogs of hundreds
of distributed software components and data sources, where the generation of
workflows of thousands of interrelated computing processes is automated, and
where the execution of workflows takes place on high-end computing resources and
often spans several months. Some workflow systems have been deployed for
routine use in scientific collaboratories in many scientific disciplines
including astronomy, earthquake science, physics, and biology.
The goal of this work is to develop the foundations for a science of design of scientific processes embodied in the new artifact that is the computational workflow.
Current areas of research:
Tutorial materials that cover workflow design issues are also available.
Examples of workflows under current investigation are also available. Some of the examples show only compute job dependencies, while others show data dependencies as well.
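The relationship between the two kinds of dependencies can be illustrated with a minimal Python sketch. It is not taken from the project's workflow examples: the job names, file names, and the `job_dependencies` helper are hypothetical and chosen only for illustration. The sketch assumes each job declares the files it reads and writes, derives the compute job dependencies from those data dependencies, and then computes one valid execution order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical workflow: each job lists the files it consumes and produces.
# These are the data dependencies.
jobs = {
    "extract": {"inputs": {"raw.dat"}, "outputs": {"clean.dat"}},
    "analyze": {"inputs": {"clean.dat"}, "outputs": {"stats.dat"}},
    "plot": {"inputs": {"clean.dat", "stats.dat"}, "outputs": {"figure.png"}},
}


def job_dependencies(jobs):
    """Derive job-to-job dependencies from the declared data dependencies."""
    # Map every produced file to the job that produces it.
    producer = {}
    for name, job in jobs.items():
        for filename in job["outputs"]:
            producer[filename] = name

    # A job depends on the producers of its inputs. Inputs that no job
    # produces (here, raw.dat) are treated as pre-existing data.
    return {
        name: {producer[f] for f in job["inputs"] if f in producer}
        for name, job in jobs.items()
    }


deps = job_dependencies(jobs)

# Order the jobs so that every job runs after the jobs it depends on.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

A view that shows only compute job dependencies corresponds to `deps` alone, while a view that also shows data dependencies keeps the file names that connect the jobs.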
To articulate the tasks involved in workflow design in terms of paradigms that are more familiar to scientists, we developed a comparison of workflow design with distributed and parallel programming. This comparison describes several challenges facing programmers of the heterogeneous distributed multi-core systems that are becoming more common in scientific computing. When programming such complex systems, programmers are burdened with a variety of concerns, such as partitioning computation across functional units, data movement and synchronization, managing a diversity of programming models for different devices, and reusing existing legacy and library software. We observe that many of these challenges also arise in programming applications for large-scale, heterogeneous distributed computing environments, and that solutions used in practice, as well as future research directions in workflow systems for distributed computing, can be adapted to support programmers who develop code for multi-core systems. Further, optimization decisions are inherently complex because of the large search spaces of possible solutions and the difficulty of predicting performance on increasingly complex architectures. Cognitive techniques are well suited to managing systems of such complexity. We investigated how recent trends in using cognitive techniques for code mapping and optimization support this point, and how cognitive techniques could provide a fundamentally new programming paradigm for complex heterogeneous systems, in which programmers design self-configuring applications and the system automates optimization decisions and manages the allocation of heterogeneous resources to codes. This work is reported in a journal article that appeared in the February 2008 issue of the Proceedings of the IEEE.
An investigation into desirable properties of
workflows that must be incorporated in good workflow design produced an article
with a compilation of principles and approaches to validate workflows. In
particular, we analyzed the general applicability and utility of two techniques:
1) knowledge-rich descriptions of individual process model components and their
constraints; and 2) verification of partial
To highlight the advantages of using semantic representations in workflow design, and to disseminate results and foster adoption of workflow technologies, we have written a summary overview in the form of a journal article. The article describes the need for workflow systems in current cyberinfrastructure environments, the benefits that existing workflow systems have already demonstrated, and the additional benefits that are possible if workflow systems are adopted as common cyberinfrastructure components and extended with semantic representations. Workflow systems today can assist scientists by automating non-experiment-critical tasks, systematically exploring the hypothesis space, managing parallelism and execution on distributed shared resources, and enabling low-cost reproducibility. If more broadly adopted, workflow systems will have an empowering and leveling effect in terms of the scientific processes supported. Today, semantic representations of scientific datasets are becoming more commonly used in cyberinfrastructure architectures to enable integration and reasoning over data. Similarly, knowledge-rich representations of workflows capture scientific principles and constraints that will enable a variety of artificial intelligence techniques to be brought to bear for validation, automation, hypothesis generation, and guarantees of data quality and pedigree. Knowledge-rich workflow systems open the doors to significant new capabilities for automated discovery, ever more integrative research that broadens the scope of scientific endeavors, education in science at all levels, and novel paradigms for interaction of scientists with cyberinfrastructure to fully exploit its capabilities. This summary overview will appear in the November 2008 issue of the journal Scientific Programming.
A tutorial on computational workflows that included a section on workflow design is available at http://www.isi.edu/~gil/AAAI08TutorialSlides.