# Benchmarking the Performance of Implicit and Explicit Parallel Implementations of the Matrix Transpose
This repository contains code for benchmarking the matrix transpose operation, comparing a sequential implementation, an implicitly parallel implementation using SIMD instructions, and an explicitly parallel implementation using the OpenMP library. Cache performance and memory bandwidth are measured on the HPC cluster with the Likwid performance analysis tool.
## Requirements

- GCC 9.1.0 (module `gcc91` on the HPC cluster)
- OpenMP 4.5
- Likwid 4.3.4
- GNU Make 3.8.2
- PBS job scheduler (for the HPC cluster)
## Running Locally

Clone the repository and make the scripts executable:

```
git clone https://github.com/timmfy/parco-d1
chmod +x scripts/*
```

Then run

```
./scripts/run_locally.sh [OPTIONS]
```

where [OPTIONS] are among the following:
```
--block-size-list <int,int,...>  Run the test for the list of block sizes 2^<int>
                                 (default: 2^4); a parameter of the block size for
                                 the cache blocking/cache oblivious algorithms
-h, --help                       Display this information
--group <string>                 Set the performance group for likwid (default: CACHES)
--threads-list <int,int,...>     Run the test for the list of numbers of threads 2^<int>
                                 (default: 2^2, max: 2^6); a parameter for OpenMP
--profiling <string>             Run the specified test with profiling (seq, imp, omp)
--size-list <int,int,...>        Run the test for the list of matrix sizes 2^<int>
                                 (default: 2^10, max: 2^12)
--runs <int>                     Set the number of runs (default: 1)
--symm <int>                     Generate a symmetric matrix (1 for a symmetric matrix,
                                 0 for a random one; default: 0)
```
With no options, the script runs the test with a 2^4 x 2^4 block size, 2^2 threads, a 2^10 x 2^10 matrix, and 1 run:
```
./scripts/run_locally.sh
```

Example that runs the test for block sizes 2^4, 2^5, 2^6, for 2^2, 2^3, 2^4 threads, for matrix sizes 2^10, 2^11, 2^12, and does 5 runs:

```
./scripts/run_locally.sh --block-size-list 4,5,6 --threads-list 2,3,4 --size-list 10,11,12 --runs 5
```

Example that runs the test for block size 2^6 and matrix size 2^12, and does 20 runs with likwid-perfctr profiling of the CACHES group for the implicit parallel implementation:

```
./scripts/run_locally.sh --profiling imp --runs 20 --block-size-list 6 --size-list 12 --group CACHES
```

Note that the profiling option requires Likwid to be installed (see the section Running on the HPC Cluster).
## Compilation

It is recommended to use the Makefile to compile the code:
```
make all N=<size>
```

where `<size>` is the size of the square matrix (a power of 2) used for the benchmarking.
Without the Makefile, the following sequence of commands can be used:
```
gcc -fopenmp -DN=<size> -O2 -Iinclude -c src/seq.c -o seq.o
gcc -fopenmp -DN=<size> -O2 -ftree-vectorize -funroll-loops -fopt-info -Iinclude -c src/imp_par.c -o imp_par.o
gcc -fopenmp -DN=<size> -Iinclude -c src/omp_par.c -o omp_par.o
gcc -fopenmp -DN=<size> -Iinclude -c src/seq_test.c -o seq_test.o
gcc -fopenmp -DN=<size> -Iinclude -c src/imp_par_test.c -o imp_par_test.o
gcc -fopenmp -DN=<size> -Iinclude -c src/omp_par_test.c -o omp_par_test.o
gcc -fopenmp -DN=<size> -Iinclude -c src/main.c -o main.o
gcc -fopenmp main.o seq.o imp_par.o omp_par.o seq_test.o imp_par_test.o omp_par_test.o -o main
```

Then, run with the following command:
```
./main --block-size <blockSize> --runs <runs> --symm <symm> <tests> --threads <threads>
```

If the code was compiled with the Makefile, run using

```
./bin/main --block-size <blockSize> --runs <runs> --symm <symm> <tests> --threads <threads>
```

where `<blockSize>` is the block size for the cache blocking/cache oblivious algorithms, `<runs>` is the number of runs, `<threads>` is the number of threads for OpenMP, and `<symm>` is a flag that generates a symmetric matrix. `<tests>` is a flag that specifies which tests to run: seq (`-s`) for the sequential test, imp (`-i`) for the implicit parallel test, and omp (`-o`) for the OpenMP test. The flags can be combined (e.g., `-si` for the sequential and implicit parallel tests or `-sio` for all tests).
For example, the following command runs the sequential and implicit parallel tests with block size 16 (2^4), 5 runs, and 4 (2^2) threads:

```
./main --block-size 4 --runs 5 --symm 0 -si --threads 2
```

The standard output is the time in seconds that it took to transpose the matrix.
## Running on the HPC Cluster

Running the code on the HPC cluster requires either an interactive session or the use of the PBS scheduler.
The following command requests an interactive session with 64 cores (the maximum number of cores used for the benchmarks in this project):

```
qsub -I -q short_cpuQ -l select=1:ncpus=64:mem=1gb
```

Load the necessary modules and define a wrapper function for the `gcc` command:
```
module load gcc91
module load likwid-4.3.4
gcc() { gcc-9.1.0 "$@"; }
```

Set the number of threads for OpenMP:

```
export OMP_NUM_THREADS=64
```

Then, proceed the same way as when running locally, either with the scripts or with manual compilation.
The following command creates a PBS script `parco-d1-job.pbs` that runs the code with the specified options:

```
./scripts/generate_pbs.sh [OPTIONS]
```

where [OPTIONS] are the same as for the local execution.
Then, submit the job to the scheduler:

```
qsub parco-d1-job.pbs
```

The output is stored in the following files:
- `parco-d1-job.out` contains the output of the job (if run on the HPC cluster)
- `parco-d1-job.err` contains the compiler messages and errors (if run on the HPC cluster)
- `summary_comparison.txt` contains the average speedup for each configuration and input parameters (if run with multiple runs using a script)
- `summary_profiling.txt` contains the performance metrics for the configuration that was run with profiling (if run with profiling using a script)

The `parco-d1-job.out` file (if it exists) also contains the redirected output of the code execution (for example, the gcc optimization information).