The STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.
More information: https://www.cs.virginia.edu/stream/
GCC
Install gcc and download the source code (mysecond.c is only needed for the Fortran version; the C build below uses stream.c alone):
yum install gcc -y
mkdir STREAM
cd STREAM
curl -O http://www.cs.virginia.edu/stream/FTP/Code/stream.c
curl -O http://www.cs.virginia.edu/stream/FTP/Code/mysecond.c
Compile with the following options:
-DSTREAM_ARRAY_SIZE=100000000 - sets the array size to 100M elements (the default is 10M)
-DNTIMES=20 - runs each kernel 20 times and reports the best result (the default is 10)
gcc -O3 -fopenmp -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=20 stream.c -o stream
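The array size determines how much memory a run needs: STREAM works on three arrays, and the run output reports 8 bytes per element. The arithmetic can be sketched as follows; `stream_footprint_mib` is a hypothetical helper name used only for illustration and is not part of STREAM.

```shell
# Hypothetical helper (not part of STREAM): memory footprint in MiB for a
# given STREAM_ARRAY_SIZE, assuming three arrays of 8-byte elements.
stream_footprint_mib() {
    awk -v n="$1" 'BEGIN {
        per_array = n * 8 / (1024 * 1024)
        printf "per array: %.1f MiB, total: %.1f MiB\n", per_array, 3 * per_array
    }'
}

stream_footprint_mib 100000000
# prints: per array: 762.9 MiB, total: 2288.8 MiB
```

These figures match the "Memory per array" and "Total memory required" lines that the benchmark prints for this array size.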
Run:
# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 100000000 (elements), Offset = 0 (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 20 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 21951 microseconds.
(= 21951 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 78635.2 0.020807 0.020347 0.021635
Scale: 53727.5 0.030174 0.029780 0.030693
Add: 61043.9 0.039484 0.039316 0.039651
Triad: 61609.6 0.039064 0.038955 0.039306
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
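The "Best Rate" column follows from the "Min time" column: STREAM divides the bytes moved per kernel pass by the best time, counting two arrays for Copy and Scale (one read, one written) and three for Add and Triad, with 1 MB taken as 10^6 bytes. A minimal sketch of that calculation, where `stream_rate_mbs` is a hypothetical helper name:

```shell
# Hypothetical helper: bandwidth in MB/s (1 MB = 10^6 bytes).
# Arguments: arrays touched per pass, array size in elements, min time in seconds.
# Assumes 8-byte elements, as reported in the run output.
stream_rate_mbs() {
    awk -v arrays="$1" -v n="$2" -v t="$3" 'BEGIN {
        printf "%.1f\n", arrays * 8 * n / t / 1e6
    }'
}

stream_rate_mbs 2 100000000 0.020347   # Copy
stream_rate_mbs 3 100000000 0.038955   # Triad
```

The results agree with the table to within about 1 MB/s; the small difference comes from the printed times being rounded to six decimal places.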
The OMP_NUM_THREADS environment variable controls the number of threads. Example:
OMP_NUM_THREADS=16 ./stream
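To see how bandwidth changes with the thread count, the same variable can be set in a loop. The sketch below assumes the output format shown above; `triad_rate` and `sweep_threads` are hypothetical helper names, not part of STREAM.

```shell
# Hypothetical helper: read STREAM output on stdin and print the Triad
# "Best Rate MB/s" value (second field of the line starting with "Triad:").
triad_rate() {
    awk '$1 == "Triad:" { print $2 }'
}

# Hypothetical helper: run a STREAM binary once per thread count.
# Arguments: path to the binary, followed by the thread counts to try.
sweep_threads() {
    bin=$1
    shift
    for t in "$@"; do
        printf '%s threads: %s MB/s (Triad)\n' "$t" \
            "$(OMP_NUM_THREADS=$t "$bin" | triad_rate)"
    done
}
```

For example, `sweep_threads ./stream 1 2 4 8` prints one line per thread count. Triad is used here only as a representative kernel; any of the four rows could be extracted the same way.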
Intel compiler
Download and install Intel Parallel Studio XE 2020. This example uses the Cluster Edition:
tar xf parallel_studio_xe_2020_cluster_edition.tgz
cd parallel_studio_xe_2020_cluster_edition/
./install.sh
Download the source code:
mkdir STREAM
cd STREAM
curl -O http://www.cs.virginia.edu/stream/FTP/Code/stream.c
curl -O http://www.cs.virginia.edu/stream/FTP/Code/mysecond.c
Compile:
source /opt/intel/compilers_and_libraries_2020.0.166/linux/bin/compilervars.sh intel64
icc -qopenmp-link=static -qopenmp -O3 -DSTREAM_ARRAY_SIZE=80000000 -DNTIMES=20 -o stream_i2020u0 stream.c
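Run. The compile line above produces stream_i2020u0; the sample output below was recorded from a binary named stream built with the default array size (10M elements) and the default NTIMES (10), which is why its array size and iteration count differ from the compile line:
# ./stream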
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 1844 microseconds.
(= 1844 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 79920.0 0.002019 0.002002 0.002038
Scale: 80206.6 0.002010 0.001995 0.002046
Add: 83165.3 0.002897 0.002886 0.002906
Triad: 83275.4 0.002903 0.002882 0.002924
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
Portable example with large memory
This example uses Intel compiler release 2019 Update 5.
Compile three binaries whose three arrays total roughly 2 GiB, 40 GiB, and 60 GiB (8 bytes per element):
source /opt/intel/compilers_and_libraries_2019.5.281/linux/bin/compilervars.sh intel64
icc -mcmodel=medium -O3 -axAVX -qopenmp -DSTREAM_ARRAY_SIZE=100000000 -DNTIMES=10 stream.c -o stream_i2019u5_2g
icc -mcmodel=medium -O3 -axAVX -qopenmp -DSTREAM_ARRAY_SIZE=1800000000 -DNTIMES=10 stream.c -o stream_i2019u5_40g
icc -mcmodel=medium -O3 -axAVX -qopenmp -DSTREAM_ARRAY_SIZE=2700000000 -DNTIMES=10 stream.c -o stream_i2019u5_60g
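Copy the Intel runtime libraries next to the binaries so that they can be moved to a host without the compiler installed: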
mkdir lib_i2019u5
cp /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libimf.so ./lib_i2019u5/
cp /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libsvml.so ./lib_i2019u5/
cp /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libirng.so ./lib_i2019u5/
cp /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libintlc.so.5 ./lib_i2019u5/
cp /opt/intel/compilers_and_libraries_2019.5.281/linux/compiler/lib/intel64_lin/libiomp5.so ./lib_i2019u5/
Run:
export LD_LIBRARY_PATH=./lib_i2019u5/
export KMP_AFFINITY=compact
./stream_i2019u5_2g
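-mcmodel=medium is needed to remove the 2 GiB restriction on statically allocated data, which the larger arrays exceed.
Setting KMP_AFFINITY binds the OpenMP threads to fixed places on the machine, so they do not migrate between cores and the results fluctuate less from run to run.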
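Because LD_LIBRARY_PATH is set to a relative path above, the run only works from the directory that contains lib_i2019u5. A small wrapper can resolve the library directory relative to the binary instead. This is a sketch under the assumption that lib_i2019u5 sits next to the binary; `run_portable` is a hypothetical name.

```shell
# Hypothetical wrapper: run a STREAM binary with the copied Intel runtime
# libraries located relative to the binary rather than the current directory.
# Assumes lib_i2019u5/ sits in the same directory as the binary.
run_portable() {
    bin=$1
    # Absolute path of the directory that holds the binary.
    dir=$(cd "$(dirname "$bin")" && pwd) || return 1
    # Prepend the bundled libraries, keeping any existing search path.
    LD_LIBRARY_PATH="$dir/lib_i2019u5${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" \
    KMP_AFFINITY=compact \
        "$dir/$(basename "$bin")"
}
```

With this in place, `run_portable /path/to/stream_i2019u5_2g` can be called from any working directory; the path shown is a placeholder.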