Research

My main area of research is developing solutions for the management of scientific workflows in distributed environments.

Data analysis within scientific collaborations is a large-scale and rigorous process in which large amounts of data (on the order of terabytes) are analyzed. Applications are being built not as monolithic entities designed by a single individual, but rather as complex workflows composed of application components. Often these components are designed, developed, and tested collaboratively. Because of the size of the data and the complexity of the analysis, large computer clusters are being used to store the data sets and execute the workflows. As the size of the data and of the analysis grows, scientific collaborations are pooling their resources into distributed systems such as the grid. The grid is a distributed system that seamlessly connects resources (compute resources, storage, instruments, etc.) across the wide area network and provides software, such as the Globus Toolkit, to securely submit jobs remotely, transfer data, operate apparatus, and perform other remote operations.

Managing the data and analysis in a systematic, robust, and collaborative fashion is now a focus of many IT projects. The National Science Foundation is currently funding several projects, such as the Grid Physics Network (GriPhyN), the National Virtual Observatory (NVO), and the Southern California Earthquake Center/IT (SCEC/IT), to provide software that will aid scientists in discovering data and metadata (descriptive information about data products), in setting up and executing complex analyses, and in storing and sharing information about newly derived results.

My work focuses on designing software solutions that help scientists in a variety of disciplines easily execute complex analyses on distributed and heterogeneous resources. In particular, I have been working on the development of the Pegasus system, which maps abstract workflows onto the grid. Abstract workflows describe the analysis in terms of logical transformations and data without identifying the resources needed to execute the workflow.
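To make the notion of an abstract workflow concrete, here is a minimal sketch in Python. It is only an illustration and not the Pegasus input format: the class AbstractJob, the function dependencies, and the file and transformation names are all hypothetical. Each job names a logical transformation and logical files, and no site, host, or physical path appears anywhere. The ordering between jobs is derived purely from the data flow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AbstractJob:
    """One step of an abstract workflow (hypothetical structure)."""
    name: str
    transformation: str   # logical transformation name, not an executable path
    inputs: tuple = ()    # logical file names consumed
    outputs: tuple = ()   # logical file names produced


def dependencies(jobs):
    """Derive (parent, child) edges from the data flow between logical files."""
    producer = {lfn: job.name for job in jobs for lfn in job.outputs}
    edges = set()
    for job in jobs:
        for lfn in job.inputs:
            parent = producer.get(lfn)
            if parent is not None and parent != job.name:
                edges.add((parent, job.name))
    return sorted(edges)


# Illustrative two-step analysis; all names are made up.
workflow = [
    AbstractJob("extract", "extract", inputs=("raw.dat",), outputs=("events.dat",)),
    AbstractJob("analyze", "analyze", inputs=("events.dat",), outputs=("result.dat",)),
]

print(dependencies(workflow))  # [('extract', 'analyze')]
```

Because the description carries no resource information, the same workflow can be mapped to different sites at different times.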

Mapping the abstract workflow description to an executable form involves finding the resources that are available and can perform the computations, the data that is used in the workflow, and the necessary software. Pegasus consults various grid information services to find this information. Pegasus also reuses existing intermediate data products where possible, thus potentially reducing the size of the workflow. As part of the mapping, Pegasus augments the workflow with data transfer nodes that stage data in and out of the computations, data registration nodes that update various catalogs on the grid, and, more recently, nodes that stage in statically linked binaries. The result of the mapping is a concrete workflow that can be executed on the grid by software systems such as Condor's DAGMan.
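The mapping steps described above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual Pegasus algorithm or API. The function map_workflow and its arguments are hypothetical: replicas stands in for a replica catalog (logical file name to site), tc for a transformation catalog (transformation to candidate sites). Site selection here simply takes the first candidate, and jobs are assumed to be listed in topological order.

```python
def map_workflow(jobs, replicas, tc, final_outputs, output_site):
    """Turn an abstract workflow into a flat list of concrete nodes (sketch).

    jobs: list of dicts with name, transformation, inputs, outputs,
          assumed to be in topological order.
    replicas: {logical file name: site} for data that already exists.
    tc: {transformation: [sites where it can run]}.
    """
    # 1. Reduction: walk backwards from the requested outputs and keep a job
    #    only if it produces a needed file that has no existing replica.
    needed = {lfn for lfn in final_outputs if lfn not in replicas}
    kept = []
    for job in reversed(jobs):
        if needed & set(job["outputs"]):
            kept.append(job)
            needed |= {lfn for lfn in job["inputs"] if lfn not in replicas}
    kept.reverse()

    # 2. Site selection and 3. augmentation with transfer/registration nodes.
    concrete = []
    location = dict(replicas)  # current site of every known file
    for job in kept:
        site = tc[job["transformation"]][0]  # naive choice: first candidate
        for lfn in job["inputs"]:
            # Assumes every input is either replicated or produced upstream.
            if location[lfn] != site:
                concrete.append(("transfer", lfn, location[lfn], site))
        concrete.append(("compute", job["name"], site))
        for lfn in job["outputs"]:
            location[lfn] = site

    # Stage out and register the newly derived final products.
    for lfn in final_outputs:
        if lfn not in replicas:
            if location[lfn] != output_site:
                concrete.append(("transfer", lfn, location[lfn], output_site))
            concrete.append(("register", lfn, output_site))
    return concrete


# Illustrative inputs; site and file names are made up.
jobs = [
    {"name": "extract", "transformation": "extract",
     "inputs": ["raw.dat"], "outputs": ["events.dat"]},
    {"name": "analyze", "transformation": "analyze",
     "inputs": ["events.dat"], "outputs": ["result.dat"]},
]
replicas = {"raw.dat": "siteA", "events.dat": "siteA"}
tc = {"extract": ["siteA"], "analyze": ["siteB"]}

plan = map_workflow(jobs, replicas, tc, ["result.dat"], "siteA")
for node in plan:
    print(node)
```

In this example the intermediate product events.dat already exists, so the extract job is pruned and only analyze is scheduled, surrounded by a stage-in transfer, a stage-out transfer, and a registration node. A real planner would also have to handle missing inputs, duplicate transfers, and smarter site selection.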

Some of the issues that I am exploring are:

The Pegasus team is composed of Gaurang Mehta, Mei-Hui Su, and Karan Vahi.

A PhD student, Gurmeet Singh, is also working on the project.


This work is supported by the National Science Foundation under the following projects: SDCI-Pegasus, GriPhyN, iVDGL, NVO and SCEC.