STREAM memory benchmark on CentOS 8 with GCC and Intel compiler

Posted on Thu 13 February 2020 by Pavlo Khmel

The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.

More information: https://www.cs.virginia.edu/stream/

GCC

Install GCC and download the source code:

yum install gcc -y
mkdir STREAM
cd STREAM
curl -O http://www.cs.virginia.edu/stream/FTP/Code/stream.c
curl -O http://www.cs.virginia.edu/stream/FTP/Code/mysecond.c

Compile. The two -D options override defaults in stream.c:

-DSTREAM_ARRAY_SIZE=100000000 - sets the array size to 100M elements (the default is 10M)

-DNTIMES=20 - runs each kernel 20 times and reports the best result (the default is 10)

gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=20 stream.c -o stream
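Before picking an array size it helps to know how much memory it implies. The sketch below assumes what the STREAM output itself reports: three arrays of 8-byte elements. The function name stream_footprint is made up for this example.

```shell
# Print the memory needed for a given STREAM_ARRAY_SIZE,
# assuming 3 arrays of 8-byte (double) elements.
stream_footprint() {
  awk -v n="$1" 'BEGIN {
    per = n * 8 / 1048576          # MiB per array
    printf "%.1f MiB per array, %.1f MiB total\n", per, 3 * per
  }'
}

stream_footprint 100000000
```

For 100000000 elements this gives 762.9 MiB per array and 2288.8 MiB in total, the same figures STREAM prints at startup.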

Run:

# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 100000000 (elements), Offset = 0 (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 21951 microseconds.
   (= 21951 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           78635.2     0.020807     0.020347     0.021635
Scale:          53727.5     0.030174     0.029780     0.030693
Add:            61043.9     0.039484     0.039316     0.039651
Triad:          61609.6     0.039064     0.038955     0.039306
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
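
The "Best Rate" column can be recomputed from the "Min time" column. STREAM counts the bytes each kernel moves per iteration (2 arrays for Copy and Scale, 3 for Add and Triad) and divides by the best time, with 1 MB = 10^6 bytes. A minimal sketch, where stream_rate is a name chosen for this example:

```shell
# Recompute a STREAM rate in MB/s.
# Arguments: arrays touched by the kernel, array size in elements, min time in seconds.
stream_rate() {
  awk -v w="$1" -v n="$2" -v t="$3" 'BEGIN {
    printf "%.1f\n", w * 8 * n / t / 1000000
  }'
}

stream_rate 2 100000000 0.020347   # Copy
stream_rate 3 100000000 0.038955   # Triad
```

The results land within about 1 MB/s of the reported values. The small difference comes from the min time being rounded to six decimals in the printed table.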

OMP_NUM_THREADS can be used to control the number of threads. Example:

OMP_NUM_THREADS=16 ./stream
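
To see how bandwidth scales with the thread count, the run can be repeated in a loop. This is only a sketch: the helper names triad_rate and scan_threads are invented here, and the parsing assumes the output format shown above.

```shell
# Extract the Triad "Best Rate MB/s" value from STREAM output on stdin.
triad_rate() {
  awk '/^Triad:/ { print $2 }'
}

# Run a STREAM binary once per thread count and print one line per run.
# Usage: scan_threads ./stream 1 2 4 8
scan_threads() {
  bin=$1
  shift
  for t in "$@"; do
    rate=$(OMP_NUM_THREADS=$t "$bin" | triad_rate)
    echo "$t threads: Triad $rate MB/s"
  done
}
```

For example, scan_threads ./stream 1 2 4 8 16 prints one Triad rate per thread count.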

Intel compiler

Download Intel Parallel Studio XE 2020 and install it. This example uses the Cluster Edition:

tar xf parallel_studio_xe_2020_cluster_edition.tgz 
cd parallel_studio_xe_2020_cluster_edition/
./install.sh 

Download source code:

mkdir STREAM
cd STREAM
curl -O http://www.cs.virginia.edu/stream/FTP/Code/stream.c
curl -O http://www.cs.virginia.edu/stream/FTP/Code/mysecond.c

Compile:

source /opt/intel/compilers_and_libraries_2020.0.166/linux/bin/compilervars.sh intel64
icc -qopenmp-link=static -qopenmp -O3 -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20 -o stream_i2020u0 stream.c

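Run. The output below reports the default array size (10M elements) and 10 iterations, so it was produced by a binary built without the -DSTREAM_ARRAY_SIZE and -DNTIMES options shown above, not by stream_i2020u0. Keep this in mind when comparing it with the GCC run: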

# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1844 microseconds.
   (= 1844 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           79920.0     0.002019     0.002002     0.002038
Scale:          80206.6     0.002010     0.001995     0.002046
Add:            83165.3     0.002897     0.002886     0.002906
Triad:          83275.4     0.002903     0.002882     0.002924
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Portable example with large memory

This example uses Intel compiler release 2019 Update 5.

Compile three binaries that need roughly 2 GiB, 40 GiB, and 60 GiB of memory in total (three arrays of 8-byte elements):

source /opt/intel/compilers_and_libraries_2019.5.281/linux/bin/compilervars.sh intel64
icc -mcmodel=medium -O3 -axAVX -qopenmp -DSTREAM_ARRAY_SIZE=100000000   -DNTIMES=10 stream.c -o stream_i2019u5_2g
icc -mcmodel=medium -O3 -axAVX -qopenmp -DSTREAM_ARRAY_SIZE=1800000000  -DNTIMES=10 stream.c -o stream_i2019u5_40g
icc -mcmodel=medium -O3 -axAVX -qopenmp -DSTREAM_ARRAY_SIZE=2700000000  -DNTIMES=10 stream.c -o stream_i2019u5_60g
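
The array sizes above can be derived from a memory target. A rough sketch, assuming three arrays of 8-byte elements (24 bytes per element) and a shell with 64-bit integer arithmetic; elements_for_gib is a hypothetical helper name:

```shell
# Print the STREAM_ARRAY_SIZE whose three arrays together fill the given
# number of GiB (24 bytes per element; integer division rounds down).
elements_for_gib() {
  echo $(( $1 * 1024 * 1024 * 1024 / 24 ))
}

elements_for_gib 40
elements_for_gib 60
```

The values 1800000000 and 2700000000 used in the compile commands are round numbers slightly above the exact results for 40 GiB and 60 GiB.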

Copy the Intel runtime libraries so the binaries can run on machines without the compiler installed:

mkdir lib_i2019u5
cp /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libimf.so ./lib_i2019u5/
cp /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libsvml.so ./lib_i2019u5/
cp /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libirng.so ./lib_i2019u5/
cp /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libintlc.so.5 ./lib_i2019u5/
cp /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libiomp5.so ./lib_i2019u5/
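
On the target machine it may be worth checking that no shared library is still unresolved. The sketch below filters ldd output for "not found" entries; missing_libs is a name made up for this example, and which libraries a given binary needs depends on the build.

```shell
# Print the names of shared libraries that ldd reports as "not found".
# Reads ldd output on stdin.
missing_libs() {
  awk '/=> not found/ { print $1 }'
}

# Usage on the target machine:
#   LD_LIBRARY_PATH=./lib_i2019u5/ ldd ./stream_i2019u5_2g | missing_libs
```

Empty output means every library was resolved with the given LD_LIBRARY_PATH.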

Run:

export LD_LIBRARY_PATH=./lib_i2019u5/
export KMP_AFFINITY=compact
./stream_i2019u5_2g

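-mcmodel=medium is needed to lift the 2 GiB limit on statically allocated data, which the larger arrays exceed.

Setting KMP_AFFINITY binds the OpenMP threads to cores, so the results fluctuate less from run to run.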
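
One way to judge the fluctuation is to repeat the run a few times and compare the Triad rates. This is a sketch under the assumption that the output format matches the listings above; rate_spread and triad_runs are hypothetical helper names.

```shell
# Summarise rates read from stdin (one per line): minimum, maximum,
# and the spread as a percentage of the maximum.
rate_spread() {
  awk 'NR == 1 { min = max = $1 + 0 }
       { v = $1 + 0; if (v < min) min = v; if (v > max) max = v }
       END { if (NR > 0) printf "min=%.1f max=%.1f spread=%.1f%%\n", min, max, (max - min) / max * 100 }'
}

# Run a STREAM binary several times and print the Triad rate of each run.
# Usage: triad_runs ./stream_i2019u5_2g 5 | rate_spread
triad_runs() {
  i=0
  while [ "$i" -lt "$2" ]; do
    "$1" | awk '/^Triad:/ { print $2 }'
    i=$((i + 1))
  done
}
```

Running the pipeline once with KMP_AFFINITY unset and once with KMP_AFFINITY=compact gives two spread figures to compare.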