Current parallelizing compilers preserve the semantics of the
original serial program by preserving the data dependences.
These compilers attempt to identify independent pieces of
computation (two pieces of computation are independent if
neither writes a piece of data that the other accesses), then
generate code that executes independent pieces concurrently.
A significant limitation of this approach is the difficulty
of performing dependence analysis that is precise enough to
expose substantial amounts of concurrency. While researchers
have been able to develop reasonably effective algorithms for
loop nests that manipulate dense matrices
using affine access functions, there has been little
progress towards the successful automatic analysis of programs that
manipulate pointer-based data structures. Researchers have attempted
to build totally automatic systems, but the most
promising approaches require the programmer to provide annotations
that specify information about the global topology of the manipulated
data structures. A second, more fundamental limitation of
data-dependence based approaches is an inability to parallelize
computations that manipulate graph-like data structures. The aliases
inherently present in these data structures preclude the static
discovery of independent pieces of code, forcing the compiler to
generate serial code. Finally, preserving the data dependences for
programs that periodically update shared data structures can
artificially limit the amount of exposed concurrency, since
tasks must delay updates until they are sure that each update
will not change the relative order of reads and writes to the
shared data structure.
Commutativity analysis is a static analysis framework for detecting
commuting operations. Two operations commute when they generate the
same result regardless of the order in which they execute.
Commutativity analysis eliminates many of the limitations
of existing data-dependence based approaches Instead of preserving
the relative order of individual reads and writes to single words
of memory, commutativity analysis aggregates both data and computation
into large grain units. It then statically analyzes the computation at
this granularity to discover when pieces of the computation commute i.e.
generate the same result regardless of the order in which they execute.
If all of the operations required to perform a given computation commute,
the compiler can automatically generate parallel code. While the resulting
parallel program may violate the data dependences of the original serial
program, it is still guaranteed to generate the same result.
This approach has several interesting properties. Because commutativity
analysis does not rely on information about the topology of the manipulated
data structures, a compiler that uses commutativity analysis does not need
to analyze the parts of the code that build the data structure. Commutativity
analysis is therefore suitable for incomplete computations, such as
applications that manipulate persistent data (e.g. object-oriented database
applications) or applications that manipulate geographically dispersed data
(e.g. mobile computations in the World Wide Web). In these cases, the code
that originally built the data structure may not be available or may no
longer exist. Commutativity analysis is also particularly suited for
computations that manipulate graphs. This is an important aspect of
commutativity analysis because computations that manipulate these data
structures are inherently beyond the reach of data dependence analysis.
We took a systems-oriented approach to evaluate the ability of
commutativity analysis to extract and exploit significant amounts
of parallelism in complete applications. We built a parallelizing
compiler that uses commutativity analysis as its primary analysis
paradigm. This compiler takes as input an unannotated sequential
program written in a subset of C++ and generates parallel code for
a shared-memory multiprocessor. Using this parallelizing compiler,
we automatically parallelize several applications, including the
Barnes-Hut N-body solver, a liquid water simulation code, and a
seismic modeling code. For all of these applications, commutativity
analysis is able to expose a substantial amount of concurrency and
the generated parallel code exhibits very good performance. On 16
processors, the automatically parallelized version of Barnes-Hut
runs between 11 and 12 times faster than the original sequential
version; for the water simulation application, the automatically
parallelized version runs between 7 and 8 times faster than the
original sequential version; for the seismic modeling application,
the automatically parallelized version runs about 12 times faster
than the original sequential version.
These experimental results are very encouraging. These applications
are very dynamic in nature - they either manipulate sophisticated
pointer-based data structures or exhibit very irregular data access
patterns. Exploiting coarse-grain parallelism in applications with
these characteristics has been recognized as a very hard problem.
We are not aware of any other compiler or compilation technique capable
of extracting significant amounts of concurrency for these computations.
Furthermore, the positive experimental results provide concrete evidence
of the practical significance of commutativity analysis as a technique
for automatic parallelizing compilers.