Designing Scientific Software One Workflow at a Time

 

      Ewa Deelman (PI), Yolanda Gil (co-PI)

 

 

Funded by the National Science Foundation under CCF-0725332

 

Project Goal

 

Computational workflows have recently emerged as an effective paradigm to manage large-scope terascale scientific analyses and are a crucial technology to scale up to petascale levels.  Workflows were used for decades to manage business processes in complex organizations by providing a formalism to specify tasks, their dependencies, their requirements, and products, and to track task execution over time. Similarly, computational workflows represent computations that are often executed in geographically distributed settings, their interdependencies, their requirements, and their data products.  In the last few years, research focused on the creation and execution of computational workflows has resulted in great gains in productivity, feasibility, and scalability of quite complex scientific analyses.  Existing workflow systems have been demonstrated in a variety of scientific applications where workflow creation draws from catalogs of hundreds of distributed software components and data sources, where the generation of workflows of thousands of interrelated computing processes is automated, and where the execution of workflows takes place on high-end computing resources and often spans several months.  Some workflow systems have been deployed for routine use in scientific collaboratories in many scientific disciplines including astronomy, earthquake science, physics, and biology (e.g., National Virtual Observatory, the Southern California Earthquake Center, and the Laser Interferometer Gravitational-wave Observatory).

 

The goal of this work is to develop the foundations for a science of design of scientific processes embodied in the new artifact that is the computational workflow.

Current areas of research:

  1. gathering of existing workflows

  2. comparison of workflows with other programming paradigms

  3. investigation of what makes a “good” workflow

  4. examination of benefits of using semantics for workflow descriptions

Tutorial materials that cover workflow design issues are also available.

Gathering of existing workflows

Examples of workflows under current investigation can be found here.  The examples include workflows that show only compute job dependencies as well as workflows that also show data dependencies.
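
The distinction between the two kinds of dependencies can be illustrated with a minimal sketch. It is not taken from the collected examples: the task names, file names, and the `job_dependencies` helper are all hypothetical. It assumes each task declares the data it reads and writes, and shows how job-level dependencies can be derived from those data dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical three-step workflow: each task lists the files it reads and writes.
tasks = {
    "extract": {"inputs": ["raw.dat"], "outputs": ["clean.dat"]},
    "analyze": {"inputs": ["clean.dat"], "outputs": ["stats.dat"]},
    "plot": {"inputs": ["clean.dat", "stats.dat"], "outputs": ["figure.png"]},
}


def job_dependencies(tasks):
    """Derive job-level edges from data dependencies.

    A task depends on whichever task produces one of its inputs.
    Inputs with no producer (such as raw.dat) are treated as source data.
    """
    producer = {f: name for name, t in tasks.items() for f in t["outputs"]}
    return {
        name: {producer[f] for f in t["inputs"] if f in producer}
        for name, t in tasks.items()
    }


deps = job_dependencies(tasks)

# One valid execution order that respects every dependency.
order = list(TopologicalSorter(deps).static_order())
print(deps)
print(order)
```

A workflow that records only `deps` captures the compute job dependencies; one that also keeps the `inputs` and `outputs` lists captures the data dependencies from which they follow.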

Comparison of the workflow programming model with other programming paradigms

To articulate the tasks involved in the design of workflows in terms of other paradigms that scientists are traditionally more familiar with, we developed a comparison of workflow design with distributed and parallel programming.  This comparison describes several challenges facing programmers of heterogeneous distributed multi-core systems that are becoming more common in scientific computing.  When programming such complex systems, programmers are burdened with a variety of concerns, such as computation partitioning across functional units, data movement and synchronization, managing a diversity of programming models for different devices, and reusing existing legacy and library software.  We observe that many of these challenges are also faced in programming applications for large-scale, heterogeneous distributed computing environments, and that solutions used in practice as well as future research directions in workflow systems for distributed computing can be adapted to support programmers who develop code for multi-core systems.  Further, optimization decisions are inherently complex due to large search spaces of possible solutions and the difficulty of predicting performance on increasingly complex architectures.  Cognitive techniques are well-suited for managing systems of such complexity.  We investigated how recent trends of using cognitive techniques for code mapping and optimization support this point, and how cognitive techniques could provide a fundamentally new programming paradigm for complex heterogeneous systems, where programmers design self-configuring applications and the system automates optimization decisions and manages the allocation of heterogeneous resources to codes.  This work is reported in a journal article that appeared in the February 2008 issue of the Proceedings of the IEEE.

 

Good Workflows

An investigation into desirable properties of workflows that must be incorporated in good workflow design produced an article with a compilation of principles and approaches to validate workflows. In particular, we analyzed the general applicability and utility of two techniques: 1) knowledge-rich descriptions of individual process model components and their constraints; and 2) verification of partial (user-authored) process models based on artificial intelligence planning techniques. These techniques have been used to determine whether partial process models make sense within the background knowledge that the systems have, to notify the user of issues to be resolved in the current model, and to suggest to the user what actions could be taken next. We developed a compilation of validation checks that systems can make to assist users, showing desirable properties that are independent of the workflow representation and of the kind of user activity during workflow design supported by the system.  The result of this work is a journal article that will appear in the Journal of Theoretical and Experimental Artificial Intelligence.
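
To give a flavor of what a validation check on a partial workflow might look like, the following is a minimal sketch. It is an assumption-laden illustration, not the planning-based verification described in the article: it performs only simple structural checks, and the `validate` function, the task names, and the file names are all hypothetical.

```python
def validate(tasks, available_data):
    """Return a list of issues found in a partial, user-authored workflow.

    Two simple structural checks (illustrative only):
      1. every output is produced by at most one task;
      2. every input is either produced by some task or available as source data.
    """
    issues = []

    # Map each data product to the tasks that claim to produce it.
    producers = {}
    for name, task in tasks.items():
        for f in task["outputs"]:
            producers.setdefault(f, []).append(name)

    for f, names in sorted(producers.items()):
        if len(names) > 1:
            issues.append(f"{f}: produced by more than one task ({', '.join(names)})")

    for name, task in tasks.items():
        for f in task["inputs"]:
            if f not in producers and f not in available_data:
                issues.append(f"{name}: no source for input {f}")

    return issues


# Hypothetical partial workflow: the user has not yet said where
# known_sites.vcf comes from.
partial = {
    "align": {"inputs": ["reads.fq", "genome.fa"], "outputs": ["aligned.bam"]},
    "call": {"inputs": ["aligned.bam", "known_sites.vcf"], "outputs": ["variants.vcf"]},
}

issues = validate(partial, available_data={"reads.fq", "genome.fa"})
for issue in issues:
    print(issue)
```

Each reported issue is something a system could show the user as an item to resolve; a richer system could go further and suggest a component or dataset that would supply the missing input.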

 

Semantics in Workflow Design

To highlight the advantages of using semantic representations in workflow design and to disseminate results and foster adoption of workflow technologies, we have created a summary overview in the form of a journal article that describes the need for workflow systems in current cyberinfrastructure environments, the benefits that existing workflow systems have already demonstrated, and the possible additional benefits if workflow systems are adopted as common cyberinfrastructure components and are extended with semantic representations.  Workflow systems today can assist scientists by automating non-experiment-critical tasks, systematically exploring the hypothesis space, managing parallelism and execution in distributed shared resources, and enabling low-cost reproducibility.  If more broadly adopted, workflow systems will have an empowering and leveling effect in terms of the scientific processes supported.  Today, semantic representations of scientific datasets are becoming more commonly used in cyberinfrastructure architectures to enable integration and reasoning over data. Similarly, knowledge-rich representations of workflows capture scientific principles and constraints that will enable a variety of artificial intelligence techniques to be brought to bear for validation, automation, hypothesis generation, and guarantees of data quality and pedigree. Knowledge-rich workflow systems open the doors to significant new capabilities for automated discovery, ever more integrative research that broadens the scope of scientific endeavors, education in science at all levels, and novel paradigms for interaction of scientists with cyberinfrastructure to fully exploit its capabilities. This summary overview will appear in the November 2008 issue of the Scientific Programming journal.

 

Tutorial Materials

A tutorial on computational workflows that included a section on workflow design is available at http://www.isi.edu/~gil/AAAI08TutorialSlides.