Pegasus is a flexible framework that enables the mapping of complex scientific workflows onto the grid.
 

Grid applications today are no longer monolithic codes; rather, they are built from existing application components. In general, these applications are defined by workflows, where the activities in the workflow are individual application components and the dependencies between the activities reflect the data and/or control flow dependencies between the components.

 

A workflow can be described in an abstract form, in which the workflow activities are independent of the Grid resources used to execute them. We call this an abstract workflow. Abstracting away the resource descriptions makes workflows portable: one can describe the workflow in terms of the computations that need to take place without identifying the particular resources that will perform them.

A simple abstract workflow with two jobs, d1 and d2. The first job takes file a as input and produces file b; the second takes file b and produces the final result c.
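
The two-job example can be written down as plain data. The following is a minimal sketch, not Pegasus's own workflow format: the Job class and the dependencies helper are hypothetical names introduced for illustration. It shows that an abstract workflow names only computations and files, and that the job ordering follows from which job produces the file another job consumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical abstract job: a name plus the logical files it reads and writes."""
    name: str
    inputs: frozenset
    outputs: frozenset


def dependencies(jobs):
    """Map each job name to the set of job names whose outputs it consumes."""
    producer = {f: j.name for j in jobs for f in j.outputs}
    return {
        j.name: {producer[f] for f in j.inputs if f in producer}
        for j in jobs
    }


# No sites, hosts, or physical paths appear here: only logical names.
abstract_workflow = [
    Job("d1", frozenset({"a"}), frozenset({"b"})),
    Job("d2", frozenset({"b"}), frozenset({"c"})),
]

deps = dependencies(abstract_workflow)
print(deps)
```

Because file a has no producer in the workflow, it is treated as an external input that must already exist somewhere on the grid.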
 

Pegasus takes the abstract workflow and maps it to the available grid resources. Pegasus may map the entire workflow at once or only portions of it. Mapping only part of the workflow at a time allows subsequent mappings to adapt to the changing execution environment.

During the mapping, Pegasus checks whether the workflow is feasible, that is, whether the input data for the workflow exists. Pegasus may also reduce the abstract workflow based on the available intermediate data products, thus eliminating redundant computations.

Here, Pegasus finds that file b is already available and thus reduces the workflow, assuming that it is easier to access the data than to reproduce it.
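
The feasibility check and the reduction can be sketched together as a backward walk from the requested files. This is an illustrative sketch under stated assumptions, not Pegasus's implementation: reduce_workflow, the jobs dictionary layout, and the use of a plain set to stand in for a replica lookup are all hypothetical.

```python
def reduce_workflow(jobs, available, targets):
    """Return the names of the jobs that still have to run.

    jobs:      {job name: (set of input files, set of output files)}
    available: files that already have a replica (stand-in for a replica query)
    targets:   files the user ultimately wants
    """
    producer = {f: name for name, (_, outs) in jobs.items() for f in outs}
    needed_jobs = set()
    seen = set()
    pending = list(targets)
    while pending:
        f = pending.pop()
        if f in seen or f in available:
            continue  # already handled, or a replica exists: no need to recompute
        seen.add(f)
        if f not in producer:
            # Neither stored anywhere nor derivable: the workflow is infeasible.
            raise ValueError(f"infeasible: no replica and no producer for {f!r}")
        name = producer[f]
        if name not in needed_jobs:
            needed_jobs.add(name)
            pending.extend(jobs[name][0])  # this job's inputs are now needed too
    return needed_jobs


jobs = {
    "d1": ({"a"}, {"b"}),
    "d2": ({"b"}, {"c"}),
}

# File b already exists, so only d2 has to run.
reduced = reduce_workflow(jobs, available={"a", "b"}, targets={"c"})
print(reduced)
```

The sketch always prefers an existing replica over recomputation, which mirrors the assumption stated in the caption above; a real planner could weigh access cost against compute cost instead.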

 

The concrete workflow that results from the mapping also includes the data movement needed to stage data in and out of the computations. Other nodes in the concrete workflow may represent data publication activities, where newly derived data products are published into the Grid environment.

The concrete workflow can be given to Condor's DAGMan for execution.

The final concrete (executable) workflow stages the input data to site B, executes the computation, transfers the data to the user-specified location U, and registers the results so that they can be found again.
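
The shape of such a concrete workflow can be illustrated with a short sketch. Everything here is an assumption made for illustration: the concretize function, the tuple layout of the nodes, and the source site "A" holding file b are hypothetical and do not reflect the format Pegasus actually emits.

```python
def concretize(job, inputs, outputs, site, replicas, output_site):
    """Expand one abstract job into an ordered list of concrete nodes.

    replicas maps each input file to the site that currently holds it
    (a stand-in for a replica lookup).
    """
    nodes = []
    for f in sorted(inputs):
        source = replicas[f]
        if source != site:
            # The data is elsewhere: move it to the compute site first.
            nodes.append(("stage_in", f, source, site))
    nodes.append(("compute", job, site))
    for f in sorted(outputs):
        # Deliver each result to the user's location, then record where it lives.
        nodes.append(("stage_out", f, site, output_site))
        nodes.append(("register", f, output_site))
    return nodes


# Job d2 mapped to site B; file b is assumed to live at a hypothetical site A.
plan = concretize(
    "d2", inputs={"b"}, outputs={"c"},
    site="B", replicas={"b": "A"}, output_site="U",
)
for node in plan:
    print(node)
```

The list order encodes the dependencies: stage-in precedes the computation, and registration follows the transfer so that the registered location is one where the file actually exists.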

 

Since it is often inefficient to map out the entire workflow ahead of time, Pegasus supports a deferred mode that first sets a planning horizon (by partitioning the workflow) and then maps and executes individual partitions before refining the rest of the workflow.
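
One way to picture the deferred mode is a loop that maps and runs one partition at a time. The sketch below is only an illustration: partitioning by dependency level is an assumed strategy chosen for simplicity, and partition_by_level, run_deferred, map_partition, and execute are hypothetical names rather than Pegasus interfaces.

```python
def partition_by_level(deps):
    """Split a workflow into partitions of jobs whose parents are all in earlier partitions.

    deps: {job name: set of parent job names}
    """
    remaining = {job: set(parents) for job, parents in deps.items()}
    done = set()
    partitions = []
    while remaining:
        ready = sorted(j for j, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("cycle detected: no job is ready to run")
        partitions.append(ready)
        done.update(ready)
        for j in ready:
            del remaining[j]
    return partitions


def run_deferred(deps, map_partition, execute):
    """Map each partition only when it is about to run, then execute it."""
    for partition in partition_by_level(deps):
        plan = map_partition(partition)  # sees the environment as it is now
        execute(plan)


deps = {"d1": set(), "d2": {"d1"}, "d3": {"d1"}, "d4": {"d2", "d3"}}
log = []
run_deferred(
    deps,
    map_partition=lambda jobs: [(j, "site-chosen-late") for j in jobs],
    execute=log.append,
)
print(log)
```

Because map_partition is called inside the loop, each partition is planned against the state of the environment after the previous partitions have finished, which is the point of deferring the mapping.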

 

Pegasus possesses well-defined APIs and clients for:
    ▪ Information gathering that provides information about the available resources
    ▪ Replica query mechanism that locates replicas of data files
    ▪ Transformation catalog query mechanism that locates executables and their properties
    ▪ Resource selection mechanism that performs
            ▪ Compute site selection
            ▪ Replica selection
    ▪ Data transfer mechanism that stages data in and out of the computation
Pegasus can support a variety of workflow executors.