CS503 - Performance Tuning Experiment

                                                                                    DUE: Wednesday, March 26, 2008, 8PM PST

The purpose of the performance tuning experiment is to apply the optimizations you have learned in class to a computational kernel, and to measure the performance impact of the optimizations you perform.  There are two distinct parts to the assignment.  First, you will optimize the sequential version of a computational kernel, using the code optimizations presented in class (or other optimizations of your choice) to improve the code's utilization of registers and cache. For this part, we will provide a high-level sequential description of the kernel, written in Fortran or C, and instrumented to output performance measurements.  In the second part, you will optimize the performance of a parallel version of the same kernel, written in Fortran with OpenMP directives.  (For this assignment, you can use Fortran or C, depending on your preference.)  This document describes the assignment in more detail.

Computational kernel

The computational kernel to be used in the experiment is LU decomposition, a matrix factorization based on the well-known Gaussian elimination method. In LU decomposition, a square matrix is factored into two matrices L and U, where L is lower triangular with ones on its diagonal and U is upper triangular.


c LU DECOMPOSITION (WITHOUT PIVOTING)

    DO K = 1, N-1
        DO I = K+1, N
            A(I,K) = A(I,K)/A(K,K)
        ENDDO
        DO I = K+1, N
            DO J = K+1, N
                A(I,J) = A(I,J) - A(I,K) * A(K,J)
            ENDDO
        ENDDO
    ENDDO
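For readers working in C, the same loop nest can be sketched as follows. This is only an illustration under stated assumptions, not the provided lu-C code: the function name lu_decompose and the flat row-major storage (element (i,j) at a[i*n + j]) are choices made here, and indices are shifted to start at 0.

```c
#include <stddef.h>

/* In-place LU decomposition without pivoting, mirroring the Fortran loop nest.
 * Assumption: a is an n-by-n matrix stored row-major in a flat array, so
 * element (i,j) is a[i*n + j]. The name lu_decompose is hypothetical.
 * On return, the strict lower triangle holds the multipliers of L (its unit
 * diagonal is implied) and the upper triangle, diagonal included, holds U.
 * Every pivot a[k*n + k] is assumed to be nonzero; no check is made. */
void lu_decompose(double *a, size_t n)
{
    for (size_t k = 0; k + 1 < n; k++) {
        /* Divide the column below the pivot by the pivot. */
        for (size_t i = k + 1; i < n; i++)
            a[i * n + k] /= a[k * n + k];

        /* Subtract multiplier * pivot row from each trailing row. */
        for (size_t i = k + 1; i < n; i++)
            for (size_t j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
    }
}
```

Note that C stores arrays row by row while Fortran stores them column by column, so the loop order that gives unit-stride access differs between the two languages.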



The initial versions of the computational kernels are available at lu-C, lu-seq, and lu-par. Input files of a particular size can be created by calling create-input. These versions have been tested and executed in the environment where the experiment will be conducted.  The OpenMP version only creates the correct number of threads; it does not parallelize the computation.  You will need to use OpenMP directives to parallelize the computation yourself (see OpenMP and lecture notes for more information).
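As a starting point for the parallel part, one possible placement of OpenMP directives is sketched below in C. It is an assumption-laden illustration, not the provided lu-par code: the name lu_decompose_omp and the flat row-major layout are hypothetical, and no claim is made that this placement performs well. The K loop stays sequential because each step depends on the previous one, while the iterations of each I loop within a step are independent.

```c
/* Hypothetical sketch: LU decomposition without pivoting, with the two
 * I loops of each step shared among OpenMP threads.
 * Assumption: a is n-by-n, row-major, element (i,j) at a[i*n + j].
 * If the compiler is run without OpenMP support, the pragmas are ignored
 * and the code runs sequentially with the same result. */
void lu_decompose_omp(double *a, int n)
{
    for (int k = 0; k < n - 1; k++) {
        /* Each row's multiplier is computed independently. */
        #pragma omp parallel for
        for (int i = k + 1; i < n; i++)
            a[i * n + k] /= a[k * n + k];

        /* Each trailing row is updated independently; row k is read only. */
        #pragma omp parallel for
        for (int i = k + 1; i < n; i++)
            for (int j = k + 1; j < n; j++)
                a[i * n + j] -= a[i * n + k] * a[k * n + j];
    }
}
```

Opening and closing a parallel region twice per step has a cost, so measuring alternatives (for example, a single region per step or a different schedule) is part of the experiment.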

You are free to employ whatever optimizations you can think of to reorder computation or memory accesses, but you cannot change the algorithm. 
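To make "reordering computation without changing the algorithm" concrete, here is a minimal sketch of one such transformation: the inner J loop of the trailing update unrolled by a factor of 4, with a cleanup loop for the leftover iterations. The function name update_unrolled4, the unroll factor, and the row-major flat layout are all assumptions chosen for illustration; whether this helps on the target machine is something to measure, not assume.

```c
#include <stddef.h>

/* Hypothetical sketch: trailing-submatrix update for step k, with the
 * j loop unrolled by 4. Every element receives exactly the same single
 * update as in the original loop nest, so the algorithm is unchanged.
 * Assumption: a is n-by-n, row-major, element (i,j) at a[i*n + j], and
 * column k below the pivot already holds the multipliers. */
void update_unrolled4(double *a, size_t n, size_t k)
{
    const double *rk = &a[k * n];          /* pivot row, read only */

    for (size_t i = k + 1; i < n; i++) {
        double *ri = &a[i * n];            /* row being updated */
        const double m = ri[k];            /* multiplier, loaded once */
        size_t j = k + 1;

        /* Main body: four updates per iteration. */
        for (; j + 3 < n; j += 4) {
            ri[j]     -= m * rk[j];
            ri[j + 1] -= m * rk[j + 1];
            ri[j + 2] -= m * rk[j + 2];
            ri[j + 3] -= m * rk[j + 3];
        }

        /* Cleanup: remaining 0 to 3 columns. */
        for (; j < n; j++)
            ri[j] -= m * rk[j];
    }
}
```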

Environment

The experiments will be conducted on a dual-processor, dual-core node of the HPC system. (This instruction will be updated to tell you how to force execution on such nodes.)
A makefile, Makefile, has been created to help you build the programs.
create input:         make create-input
sequential Fortran version:         make lu-seq
OpenMP Fortran version:         make lu-par
sequential C version (NEW!):         make lu-Cseq
OpenMP C version (NEW!):         make lu-Cpar

To run the sequential Fortran version, type the following:    lu-seq <problem size>
To run the parallel Fortran version, type the following:        lu-par <problem size> <number of procs>
To run the sequential C version, type the following:    lu-Cseq <problem size>
To run the parallel C version, type the following:        lu-Cpar <problem size> <number of procs>

Data input sizes

We will use three problem sizes:     32, 128 and 1253. 
To create each one, type the following:         create-input <problem size>
Each will have different performance properties.  We recommend 32 for testing, as it will have a much shorter execution time than the others.

Submitting jobs to run on hpc

Until now, we have been using the frontend machine of the HPC cluster. Now we want to run on specific nodes: V20z, dual-processor dual-core AMD Opteron 2.0 GHz with 4GB memory. You should compile the code on the frontend machine. You can use the following command line to execute on these systems interactively:
        qsub -I -d . -l nodes=1:ppn=4:V20z
To run in batch mode, which is preferred for timing, you'll need to create a PBS file, such as the one provided here.

Measuring performance

We measure performance using the Fortran 90 intrinsic system_clock. The C versions use the standard library function clock().
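For the C case, converting two clock() readings into seconds can be sketched as below. The helper name elapsed_seconds is hypothetical, and the provided code may record its timings differently.

```c
#include <time.h>

/* Hypothetical helper: convert two clock() readings into seconds.
 * clock() reports processor time used by the program, in units of
 * 1/CLOCKS_PER_SEC seconds.
 *
 * Typical use:
 *     clock_t t0 = clock();
 *     ... run the kernel ...
 *     clock_t t1 = clock();
 *     printf("%f s\n", elapsed_seconds(t0, t1));
 */
double elapsed_seconds(clock_t start, clock_t end)
{
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Because clock() reports processor time rather than wall-clock time, on many systems it adds up the time used by all threads of a process, which is worth keeping in mind when reading timings from multi-threaded runs.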

Reporting the experiment

To turn in your code, leave the final versions in the subdirectory named submit.  You can submit up to 3 versions for sequential and 3 for parallel, one per input data set size.  Alternatively, you can submit a single sequential version and a single parallel version that work best across all 3 data sets.  Here is how to name your files:
                 best_seq32.F, best_seq128.F, best_seq1253.F, best_par32.F, best_par128.F, best_par1253.F
Please include a README file indicating how many processors you used for the parallel versions.  Also include any other information relevant to the experiment.  Your report also should include answers to the following questions:
  1. What is the speedup of your optimized sequential code with respect to the sample sequential code?
  2. What is the speedup of your parallel code with respect to the sample parallel code?
  3. What optimizations did you apply? For each optimization, discuss the gains (or losses) in performance and how the optimization parameters (for example, tile sizes or unroll factors) affected performance.
  4. Describe any interactions between optimizations.
  5. How did data set size impact your results?
  6. Based on your experiments, can you identify the performance bottlenecks of this computational kernel? How are they related to the architecture?
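For questions 1 and 2, speedup is taken here to mean the execution time of the sample code divided by the execution time of your optimized code, both measured on the same problem size. A minimal sketch of that calculation follows; the function name speedup and the convention of returning 0.0 for a non-positive optimized time are assumptions made for this illustration.

```c
/* Hypothetical helper: speedup of an optimized version over a baseline.
 * t_baseline  - execution time of the sample code, in seconds
 * t_optimized - execution time of your version, in seconds
 * A value above 1.0 means the optimized version is faster.
 * Returns 0.0 if t_optimized is not positive (e.g. a timing below the
 * clock resolution), so the caller can detect an unusable measurement. */
double speedup(double t_baseline, double t_optimized)
{
    if (t_optimized <= 0.0)
        return 0.0;
    return t_baseline / t_optimized;
}
```

For example, a sample run of 10.0 seconds and an optimized run of 2.5 seconds give a speedup of 4.0.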

Grading