The purpose of the performance tuning experiment
is to apply the optimizations you have learned in class to a
computational kernel, and measure the performance impact of the
optimizations you perform. There are two distinct parts to the
assignment. First, you will optimize the sequential version of a
computational kernel, using the code optimizations presented in class
(or other optimizations of your choice) to improve the code's
utilization of registers and cache. For this part, we will provide a
high-level sequential description of the kernel, written in Fortran or
C, and instrumented to output performance measurements. In the
second part, you will optimize the performance of a parallel version of
the same kernel, written in Fortran with OpenMP directives. (For
this assignment, you can use Fortran or C, depending on your
preference.) This document describes the assignment in more
detail.
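As an illustration of the kind of register and cache optimization meant above, here is a minimal sketch of loop tiling applied to a plain matrix multiply. It is not the assignment kernel: the function names, the row-major layout, and the `tile` parameter are assumptions chosen for the example, and the best tile size on a given machine has to be found by experiment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smaller of two sizes. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Reference version: C += A * B for n x n row-major matrices. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = c[i * n + j];
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Tiled version: the same arithmetic, but performed one
 * tile x tile block at a time so that the blocks of A, B and C
 * being worked on are more likely to stay in cache. */
void matmul_tiled(size_t n, size_t tile,
                  const double *a, const double *b, double *c)
{
    assert(tile > 0); /* a zero tile size would never advance the loops */

    for (size_t ii = 0; ii < n; ii += tile) {
        size_t i_end = min_size(ii + tile, n);
        for (size_t kk = 0; kk < n; kk += tile) {
            size_t k_end = min_size(kk + tile, n);
            for (size_t jj = 0; jj < n; jj += tile) {
                size_t j_end = min_size(jj + tile, n);
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        /* Loop-invariant value hoisted into a local,
                         * which the compiler can keep in a register. */
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Calling `matmul_tiled(n, 32, a, b, c)` should produce the same result as `matmul_naive(n, a, b, c)`; only the order in which memory is visited changes. The `min_size` bounds handle problem sizes that are not a multiple of the tile size.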
The experiments will be conducted on the HPC
cluster, on nodes with two dual-core processors each. (This instruction will
be updated to tell you how to force execution on such nodes.)
A makefile (Makefile) has been created to help you build the files. The targets are:
create input: make create-input
sequential Fortran version: make lu-seq
OpenMP Fortran version: make lu-par
sequential C version (NEW!): make lu-Cseq
OpenMP C version (NEW!): make lu-Cpar
To run the sequential Fortran version, type the following:
lu-seq <problem size>
To run the parallel Fortran version, type the following:
lu-par <problem size> <number of procs>
To run the sequential C version, type the following:
lu-Cseq <problem size>
To run the parallel C version, type the following:
lu-Cpar <problem size> <number of procs>
Data input sizes
We will use three problem sizes: 32, 128 and 1253.
To create each one, type the following:
create-input <problem size>
Each will have different performance properties. We recommend 32
for testing, as it will have a much shorter execution time than the
others.
Submitting jobs to run on hpc
Until now, we have been using the frontend machine of the HPC
cluster. Now we want to run on specific nodes of type V20z:
dual-processor, dual-core AMD Opteron at 2.0 GHz with
4 GB of memory. You should compile the code on the frontend machine.
You can use the following command line to execute on
these systems interactively:
qsub -I -d . -l nodes=1:ppn=4:V20z
To run in batch mode, which is preferred for timing, you'll need to
create a PBS file, such as the one provided here.
Measuring performance
The Fortran version measures performance using the Fortran 90 intrinsic
system_clock. The C version uses the standard library function clock().
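For readers who want to time their own code regions in C, the following is a minimal sketch of how clock() can be used. The helper names (`cpu_seconds`, `time_kernel`, `sample_kernel`) and the idea of averaging over several runs are assumptions made for this example; they are not taken from the provided instrumentation.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Convert two clock() readings into seconds of processor time. */
double cpu_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* Hypothetical stand-in for the code being measured. */
void sample_kernel(void)
{
    volatile double x = 0.0; /* volatile keeps the loop from being removed */
    for (int i = 0; i < 1000000; i++)
        x += 1.0;
}

/* Hypothetical helper: run `kernel` `repeats` times and return the
 * average processor time per run, in seconds. */
double time_kernel(void (*kernel)(void), int repeats)
{
    assert(kernel != NULL && repeats > 0);

    clock_t start = clock();
    for (int r = 0; r < repeats; r++)
        kernel();
    clock_t end = clock();

    return cpu_seconds(start, end) / repeats;
}
```

A call such as `time_kernel(sample_kernel, 10)` returns the average time per run. Note that clock() reports processor time rather than wall-clock time, and on many systems (Linux with glibc, for example) it adds up the time used by all threads of the process, so it is worth checking what the reported number means before comparing sequential and parallel runs.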
Reporting the experiment
To turn in your code, leave the final results in the subdirectory
entitled submit.
You can submit up to 3 versions for sequential and 3 for parallel, one
per input data set size. Alternatively, you can submit a single sequential
version and a single parallel version that work best across all 3 data sets.
Here is how to name your results:
best_seq32.F, best_seq128.F, best_seq1253.F, best_par32.F,
best_par128.F, best_par1253.F
Please include a README file indicating how many processors you used
for the parallel versions. Also include any other information
relevant to the experiment. Your report should also include
answers to the following questions:
- What is the speedup of your optimized sequential code with respect
to the sample sequential code?
- What is the speedup of your parallel code with respect to the sample
parallel code?
- What optimizations did you apply? For each optimization, discuss the
gains (or losses) in performance, and how the optimization parameters
(for example, tile sizes or unroll factors) affected performance.
- Describe any interactions between optimizations.
- How did data set size impact your results?
- Based on your experiments, can you identify the performance bottlenecks
of this computational kernel? How are they related to the architecture?
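The speedup figures asked for above can be computed as the ratio of the sample code's execution time to your optimized code's execution time. The following is a small sketch under that assumption; the function names are hypothetical, and the parallel efficiency helper is an optional extra that the assignment does not require.

```c
#include <assert.h>

/* Speedup of an optimized run relative to a baseline run.
 * Both times must be in the same unit (for example, seconds). */
double speedup(double t_baseline, double t_optimized)
{
    assert(t_baseline > 0.0 && t_optimized > 0.0);
    return t_baseline / t_optimized;
}

/* Optional extra, not requested by the assignment: the fraction of
 * ideal speedup achieved when running on `num_procs` processors. */
double parallel_efficiency(double achieved_speedup, int num_procs)
{
    assert(num_procs > 0);
    return achieved_speedup / num_procs;
}
```

For example, if the sample code takes 8.0 seconds and the optimized code takes 2.0 seconds, `speedup(8.0, 2.0)` is 4.0; a value below 1.0 means the change made the code slower.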
Grading
- Correct optimized sequential code: 10%
- Correct parallelization: 20%
- Sequential speedup: 20%
- Parallel speedup: 20%
- Report: 30%