WINGS is a workflow system that assists scientists with the design of computational experiments. A computational experiment specifies how selected datasets are to be processed by a series of software components in a particular configuration. Earth scientists use computational experiments to estimate seismic hazard through simulations of earthquake forecasts. Biologists use computational experiments for analysis of gene expression microarray data or molecular interaction networks and pathways. Social scientists analyze large social networks to discover structural regularities based on mining relations among individuals.
We use workflows to represent computational experiments. Workflows represent application components and their dependencies in terms of the dataflow among them. Existing workflow systems assist users with individual aspects of the process, for example assembling workflows out of large component libraries, optimizing execution performance, or sharing workflows. None of these systems provides comprehensive support for workflow design and exploration. To learn more about the state of the art in workflow systems, please visit http://www.isi.edu/nsf-workflows06.
Our WINGS workflow system uses workflow representations to manage various aspects of workflow creation and execution. WINGS reasons about dataset and component constraints to create and validate workflows and to generate metadata for new data products. WINGS has novel capabilities that support the design of computational experiments as workflows. WINGS can:
The architecture of WINGS is shown here:
WINGS operates at the domain level, so users never see details of the execution environment. Users design workflow templates by selecting components and specifying the dataflow among them. WINGS helps users validate templates by enforcing the constraints specified for the workflow components. Other users can explore workflows by reusing existing templates to create workflow requests, or seeds. WINGS assists them by automatically and systematically generating the possible workflows that are consistent with the request. This capability is the focus of this paper.
WINGS generates workflows in three stages. First, it selects components for each step in the workflow. Then it selects datasets from the catalogs to elaborate the initial workflow request and its template. Finally, it configures the parameters for each component in the workflow. Once the workflow candidates are fully elaborated, WINGS expands each workflow to specify any parallel computations over dataset collections (but not where they will take place). For all new data products, it generates metadata attributes by propagating metadata from the input data through the descriptions and constraints specified for each component. The entire workflow creation and generation process is recorded in detail in a provenance catalog for later inspection.
WINGS can convert any valid and complete workflow into a format for submission to an execution engine. Currently, WINGS can submit workflows in a scripted format for execution on the local host, or submit them to Pegasus, which manages their execution on shared distributed resources in an efficient and scalable manner. Pegasus selects computational resources, optimizes the workflow structure, submits the workflow for execution, monitors its progress, and resolves possible failures.
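The three generation stages described above can be illustrated with a minimal Python sketch. It is not WINGS code: the catalogs, the request layout, and every name in it (COMPONENT_CATALOG, DATASET_CATALOG, PARAMETER_RULES, generate_candidates) are assumptions made up for illustration, and the real system reasons over semantic descriptions rather than Python dictionaries.

```python
from itertools import product

# Hypothetical catalogs, invented for this sketch (not the WINGS data model).
COMPONENT_CATALOG = {
    # abstract step type -> concrete components that can implement it
    "Sampler": ["RandomSampler"],
    "Classifier": ["DecisionTree", "NaiveBayes"],
}
DATASET_CATALOG = [
    {"id": "weather-2006", "type": "TrainingData", "domain": "weather"},
    {"id": "labor-2005", "type": "TrainingData", "domain": "labor"},
]
PARAMETER_RULES = {
    # component -> function from the chosen dataset's metadata to parameters
    "RandomSampler": lambda dataset: {"seed": 0},
    "DecisionTree": lambda dataset: {"max_depth": 5},
    "NaiveBayes": lambda dataset: {},
}

def select_components(template):
    """Stage 1: enumerate one concrete component for every step."""
    steps = list(template["steps"])
    choices = [COMPONENT_CATALOG[template["steps"][s]] for s in steps]
    for combo in product(*choices):
        yield dict(zip(steps, combo))

def select_datasets(request):
    """Stage 2: find catalog datasets that satisfy the request constraints."""
    wanted = request["input_constraints"]
    return [d for d in DATASET_CATALOG
            if all(d.get(k) == v for k, v in wanted.items())]

def configure_parameters(binding, dataset):
    """Stage 3: set the parameters of each selected component."""
    return {step: PARAMETER_RULES[comp](dataset)
            for step, comp in binding.items()}

def generate_candidates(request):
    """Combine the three stages into a list of candidate workflows."""
    candidates = []
    for binding in select_components(request["template"]):
        for dataset in select_datasets(request):
            candidates.append({
                "components": binding,
                "input": dataset["id"],
                "parameters": configure_parameters(binding, dataset),
            })
    return candidates

request = {
    "template": {"steps": {"sample": "Sampler", "model": "Classifier"}},
    "input_constraints": {"type": "TrainingData", "domain": "weather"},
}
candidates = generate_candidates(request)
```

In this toy setting the request yields one candidate per classifier, each bound to the single dataset that matches the constraints. Exhaustive enumeration is used here only for brevity; the sketch makes no claim about how WINGS prunes or orders its search.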
WINGS uses W3C's OWL, RDF, and SWRL to represent workflows and their associated constraints. The core classes are defined in OWL, and workflow templates, requests, and candidates are expressed in RDF. Many constraints are represented as RDF triples; more complex constraints are represented in SWRL.
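To make the split between triple-style and rule-style constraints concrete, here is a small Python sketch that stands in for the RDF and SWRL machinery. The "ex:" vocabulary, the dataset identifiers, and both helper functions are hypothetical; the rule is written as a plain Python function, not in SWRL syntax, and none of this reflects the actual WINGS ontologies.

```python
# Simple constraints as (subject, predicate, object) triples.
# The "ex:" vocabulary is invented for illustration only.
TEMPLATE_CONSTRAINTS = {
    ("?trainingData", "ex:hasDomain", "weather"),
    ("?trainingData", "ex:isDiscrete", "true"),
}

DATASET_METADATA = {
    ("weather-2006", "ex:hasDomain", "weather"),
    ("weather-2006", "ex:isDiscrete", "true"),
    ("labor-2005", "ex:hasDomain", "labor"),
    ("labor-2005", "ex:isDiscrete", "true"),
}

def satisfies(dataset, variable, constraints, metadata):
    """Check that a dataset's metadata covers every triple on a variable."""
    required = {(p, o) for s, p, o in constraints if s == variable}
    actual = {(p, o) for s, p, o in metadata if s == dataset}
    return required <= actual

def propagate_domain(input_id, output_id, metadata):
    """Rule-style constraint: the output inherits the input's domain."""
    return {(output_id, p, o) for s, p, o in metadata
            if s == input_id and p == "ex:hasDomain"}

matches = [d for d in ("weather-2006", "labor-2005")
           if satisfies(d, "?trainingData", TEMPLATE_CONSTRAINTS, DATASET_METADATA)]
derived = propagate_domain("weather-2006", "model-001", DATASET_METADATA)
```

The first function plays the role of a simple triple constraint used for validation, and the second plays the role of a more complex rule that generates metadata for a new data product.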
WINGS assumes external catalogs of software components and datasets that can be accessed through services. This is an important feature, since scientific environments consist of distributed services that provide access to the data and algorithms necessary for data analysis. In many collaborations, each institution contributes resources (data, instruments, computers, software) by making them available to others while continuing to maintain them. Examples include the National Virtual Observatory, the Earth System Grid, the Biomedical Informatics Research Network, and the Cancer Biomedical Informatics Grid. Data providers may offer services to access data sources. Many organizations can play the role of data provider, and as a result data may be accessible through various catalogs in distributed remote locations. Other organizations may provide algorithms, services, models, or implemented codes that can process data and can be used as components of workflows. Each provider maintains the resources it contributes. Therefore, the WINGS architecture is designed to interface with external data and component catalogs.
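One way to picture this design is a narrow catalog interface that a workflow system codes against, with each provider supplying its own implementation behind a service. The Python sketch below is an assumption-laden illustration: DataCatalog, InMemoryCatalog, FederatedCatalog, and find_datasets are hypothetical names, and the in-memory class merely stands in for a remote catalog service.

```python
from typing import Protocol

class DataCatalog(Protocol):
    """Hypothetical interface; not the actual WINGS catalog API."""
    def find_datasets(self, constraints: dict) -> list: ...

class InMemoryCatalog:
    """Stand-in for a remote catalog service maintained by one provider."""
    def __init__(self, provider, records):
        self.provider = provider
        self.records = records

    def find_datasets(self, constraints):
        # Return matching records, tagged with the provider that holds them.
        return [
            {**record, "provider": self.provider}
            for record in self.records
            if all(record.get(k) == v for k, v in constraints.items())
        ]

class FederatedCatalog:
    """Queries several provider catalogs through the same interface."""
    def __init__(self, catalogs):
        self.catalogs = catalogs

    def find_datasets(self, constraints):
        results = []
        for catalog in self.catalogs:
            results.extend(catalog.find_datasets(constraints))
        return results

catalog = FederatedCatalog([
    InMemoryCatalog("institution-a", [{"id": "seismic-01", "type": "HazardMap"}]),
    InMemoryCatalog("institution-b", [
        {"id": "seismic-02", "type": "HazardMap"},
        {"id": "genes-07", "type": "Microarray"},
    ]),
])
found = catalog.find_datasets({"type": "HazardMap"})
```

Because the caller depends only on find_datasets, a provider can change how it stores or serves its data without affecting workflow generation, which is the property the external-catalog design is after.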