Contents
- Intro
- HPL + Intel MKL + Intel MPI
- HPL + ATLAS + MPICH2
- HPL + GotoBLAS2 + Open MPI
0. Intro
HPL is a portable implementation of the High-Performance Linpack (HPLinpack) benchmark, used to provide data for the TOP500 list: http://www.top500.org/
To get the best result in flops, I tried different combinations of linear algebra library, MPI library, and compiler.
I used Intel CPUs, and the best results came from MKL + Intel MPI + icc.
Below I show how to compile and run the LINPACK benchmark with the different libraries.
I have tried:
Linear algebra libraries
- ATLAS
- GotoBLAS2
- MKL (intel)
MPI libraries
- MPICH2
- Open MPI
- Intel MPI
Compilers
- gcc, g77
- icc (intel)
OS: CentOS 6 or Rocks 6
1. HPL + Intel MKL + Intel MPI
Get Intel evaluation products: http://software.intel.com/en-us/intel-software-evaluation-center
Intel Composer XE 2013 for Linux
Download file: l_ccompxe_2013.1.117.tgz + license file
tar xaf l_ccompxe_2013.1.117.tgz
cd l_ccompxe_2013.1.117
./install.sh
Default installation path: /opt/intel/composer_xe_2013.1.117
Intel Math Kernel Library (Intel MKL) 11.0 for Linux
Download file: l_mkl_11.0.1.117.tgz + license file
tar xzf l_mkl_11.0.1.117.tgz
cd l_mkl_11.0.1.117
./install.sh
Default installation path: /opt/intel/composer_xe_2013.1.117
Intel MPI Library 4.1 for Linux
Download file: l_mpi_p_4.1.0.024.tgz + license file
tar xzf l_mpi_p_4.1.0.024.tgz
cd l_mpi_p_4.1.0.024
./install.sh
Default installation path: /opt/intel/impi/4.1.0.024
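Optionally, instead of exporting PATH and LD_LIBRARY_PATH by hand (as done in the Compilation step below), the Intel products ship environment scripts that should set up the same variables; a sketch assuming the default installation paths above:
# source the compiler/MKL and Intel MPI environments for 64-bit builds
source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64
source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh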
Linpack
Download latest Linpack: http://www.netlib.org/benchmark/hpl/
tar xzf hpl-2.1.tar.gz
mv hpl-2.1 hpl_intel_mkl
cd hpl_intel_mkl/
Create the file Make.Linux_intel64 (the name has to match the arch= value passed to make below)
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -fs
MKDIR = mkdir -p
RM = /bin/rm -f
TOUCH = touch
ARCH = Linux_intel64
TOPdir = $(HOME)/hpl_intel_mkl
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
HPLlibHybrid = /opt/intel/composer_xe_2013.1.117/mkl/benchmarks/mp_linpack/lib_hybrid/intel64/libhpl_hybrid.a
LAdir = /opt/intel
LAinc = -I$(LAdir)/mkl/include
LAlib = -L$(LAdir)/mkl/lib/intel64 -Wl,--start-group $(LAdir)/mkl/lib/intel64/libmkl_intel_lp64.a $(LAdir)/mkl/lib/intel64/libmkl_intel_thread.a $(LAdir)/mkl/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -ldl $(HPLlibHybrid)
F2CDEFS = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
HPL_OPTS = -DASYOUGO -DHYBRID
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CC = mpiicc
CCNOOPT = $(HPL_DEFS) -O0 -w -nocompchk
MKLINCDIR = -I"/opt/intel/mkl/include"
CCFLAGS = $(HPL_DEFS) $(MKLINCDIR) -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -openmp -nocompchk
LINKER = $(CC)
LINKFLAGS = $(CCFLAGS) -openmp -mt_mpi $(STATICFLAG) -nocompchk
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
Compilation
export PATH=/opt/intel/impi/4.1.0.024/bin64/:/opt/intel/composer_xe_2013.1.117/bin/intel64/:$PATH
export LD_LIBRARY_PATH=/opt/intel/impi/4.1.0.024/lib64/:/opt/intel/composer_xe_2013.1.117/mkl/lib/intel64/:$LD_LIBRARY_PATH
make arch=Linux_intel64
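If the build succeeds, the binary ends up in bin/Linux_intel64/. A quick sanity check that it was produced and picked up the Intel MPI runtime (the grep result is what I would expect, not verbatim output):
ls -l bin/Linux_intel64/xhpl
ldd bin/Linux_intel64/xhpl | grep -i mpi   # should list the Intel MPI shared libraries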
Preparation for benchmark
cd bin/Linux_intel64/
The default HPL.dat is not tuned; generate sensible values with the online tool at www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html
For example, with 4 nodes, each with 4 CPU cores and 15000 MB of RAM (a sketch of the Ns calculation follows the HPL.dat listing below):
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
8 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
79232 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
4 Ps
4 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0 Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0 number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
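The Ns value of 79232 is not arbitrary: the usual rule of thumb (and what the advancedclustering.com calculator does) is to take roughly 80% of the total cluster RAM and round N down to a multiple of NB. A small sketch reproducing it for this example (4 nodes x 15000 MB, NB = 128; the 80% factor is the calculator's assumption):
awk 'BEGIN {
    mem_mb = 4 * 15000                        # 4 nodes x 15000 MB each
    nb = 128                                  # block size NB
    n = sqrt(0.80 * mem_mb * 1048576 / 8)     # ~80% of RAM, 8 bytes per double
    print int(n / nb) * nb                    # prints 79232
}'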
Create host list file ./hosts
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
Create file /etc/profile.d/intel.sh
#!/bin/bash
export PATH=/opt/intel/impi/4.1.0.024/bin64/:/opt/intel/composer_xe_2013.1.117/bin/intel64/:$PATH
export LD_LIBRARY_PATH=/opt/intel/impi/4.1.0.024/lib64/:/opt/intel/composer_xe_2013.1.117/mkl/lib/intel64/:$LD_LIBRARY_PATH
export I_MPI_FABRICS=shm:tcp
All nodes should have the following (a copy sketch follows this list):
- Intel MPI Library 4.1 for Linux
- copy /root/hpl_intel_mkl/
- /etc/profile.d/intel.sh
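A minimal copy sketch from the build host, assuming passwordless root SSH to every node (the Intel MPI runtime itself still has to be installed on each node separately):
for node in 192.168.0.1 192.168.0.2 192.168.0.3 192.168.0.4; do
    scp -r /root/hpl_intel_mkl "$node":/root/
    scp /etc/profile.d/intel.sh "$node":/etc/profile.d/
done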
Benchmarking
# -n 16 is the total number of CPU cores in the cluster
export I_MPI_FABRICS=shm:tcp
mpiexec.hydra -f hosts -n 16 ./xhpl
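If you prefer not to rely on /etc/profile.d/intel.sh for the fabric setting, mpiexec.hydra can push it to all ranks itself with the standard -genv option (a variant of the command above):
mpiexec.hydra -genv I_MPI_FABRICS shm:tcp -f hosts -n 16 ./xhpl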
When the benchmark has finished, the result can be found on the first compute node (192.168.0.1) in the file /root/hpl_intel_mkl/bin/Linux_intel64/HPL.out
My result was 143 Gflops.
Full HPL.out
================================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 79232
NB : 128
PMAP : Row-major process mapping
P : 4
Q : 4
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 79232 128 4 4 2311.44 1.435e+02
HPL_pdgesv() start time Wed Jan 2 21:53:09 2013
HPL_pdgesv() end time Wed Jan 2 22:31:40 2013
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0034949 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
2. HPL + ATLAS + MPICH2
yum --enablerepo=* install atlas blas lapack mpich2 atlas-devel mpich2-devel gcc gcc-c++ make
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar -xvzf hpl-2.1.tar.gz
cd hpl-2.1
# use template
cp setup/Make.Linux_PII_CBLAS ./
Edit Make.Linux_PII_CBLAS so that it contains:
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -fs
MKDIR = mkdir -p
RM = /bin/rm -f
TOUCH = touch
ARCH = Linux_PII_CBLAS
TOPdir = $(HOME)/hpl-2.1
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
LAdir = /usr/lib64/atlas
LAinc =
LAlib = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
F2CDEFS =
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
HPL_OPTS = -DHPL_CALL_CBLAS
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CC = /usr/bin/mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
LINKER = /usr/bin/mpicc
LINKFLAGS = $(CCFLAGS)
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
Compile
make arch=Linux_PII_CBLAS
Preparation
cd bin/Linux_PII_CBLAS
Edit HPL.dat; generate tuned values with the online tool at www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html
Prepare compute nodes
scp -r hpl-2.1 my-compute-0000:/root/
scp -r hpl-2.1 my-compute-0001:/root/
scp -r hpl-2.1 my-compute-0002:/root/
scp -r hpl-2.1 my-compute-0003:/root/
On each server (or use the loop sketch after these commands):
chkconfig iptables off; service iptables stop
yum --enablerepo=* install atlas blas lapack mpich2
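The per-node preparation can also be scripted from the central server; a sketch assuming passwordless root SSH and the example host names above:
for node in my-compute-0000 my-compute-0001 my-compute-0002 my-compute-0003; do
    scp -r /root/hpl-2.1 "$node":/root/
    ssh "$node" 'chkconfig iptables off; service iptables stop; yum -y --enablerepo=* install atlas blas lapack mpich2'
done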
Run benchmark
$ ssh central-server
$ cd hpl-2.1/bin/Linux_PII_CBLAS
$ cat hosts
my-compute-0000
my-compute-0001
my-compute-0002
my-compute-0003
$ mpiexec.hydra -f hosts -n 16 ./xhpl
The result is in my-compute-0000:/root/hpl-2.1/bin/Linux_PII_CBLAS/HPL.out
3. HPL + GotoBLAS2 + Open MPI
Download GotoBLAS2 http://www.tacc.utexas.edu/tacc-projects/gotoblas2
tar xzf GotoBLAS2-1.13.tar.gz
cd GotoBLAS2
make
# GotoBLAS2 did not detect the Core i7 CPU automatically, so set the target explicitly:
make TARGET=NEHALEM
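If the build finishes, the library lands in the source directory; a quick check (the exact file names depend on the detected or forced target, so treat this as an expectation rather than verbatim output):
ls -l /root/GotoBLAS2/libgoto2*   # expect libgoto2.a / libgoto2.so symlinks to the NEHALEM build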
Compile HPL
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar -xvzf hpl-2.1.tar.gz
mv hpl-2.1 hpl_gotoblas2
cd /root/hpl_gotoblas2/
Create Make.Linux_gotoblas2
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -sf
MKDIR = mkdir -p
RM = /bin/rm -f
TOUCH = touch
ARCH = Linux_gotoblas2
TOPdir = $(HOME)/hpl_gotoblas2
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
MPdir =
MPinc =
MPlib =
LAdir = /root/GotoBLAS2
LAinc =
LAlib = -L$(LAdir) -lgoto2
F2CDEFS =
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
HPL_OPTS = -DHPL_CALL_CBLAS
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CC = mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
LINKER = mpif77
LINKFLAGS = $(CCFLAGS)
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
Compilation
make arch=Linux_gotoblas2
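Because LAlib links with -lgoto2 rather than the static archive directly, the linker will normally pick the shared libgoto2.so, so xhpl has to find it at run time (that is what the profile.d file in the next step is for). A quick check, assuming GotoBLAS2 was built in /root/GotoBLAS2:
export LD_LIBRARY_PATH=/root/GotoBLAS2:$LD_LIBRARY_PATH
ldd bin/Linux_gotoblas2/xhpl | grep goto2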
Preparation
Create the file /etc/profile.d/gotoblas2.sh on all compute nodes.
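A minimal sketch of its contents, assuming GotoBLAS2 lives in /root/GotoBLAS2 on every node (adjust the path if you copied the library elsewhere):
#!/bin/bash
# assumption: libgoto2.so is in /root/GotoBLAS2 on this node
export LD_LIBRARY_PATH=/root/GotoBLAS2:$LD_LIBRARY_PATH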
Change HPL.dat; the same tuned values as in section 1 are used here (online tool: www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html)
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
8 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
79232 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
4 Ps
4 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0 Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0 number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
Create file hosts
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
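Note: with Open MPI a plain list of host names usually counts as one slot per node, so starting 16 ranks may oversubscribe or refuse to start, depending on the Open MPI version; a hostfile variant assuming 4 cores per node:
192.168.0.1 slots=4
192.168.0.2 slots=4
192.168.0.3 slots=4
192.168.0.4 slots=4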
Copy everything to the compute nodes:
scp -r /root/GotoBLAS2 192.168.0.1:/root/
...
scp -r /root/hpl_gotoblas2 192.168.0.1:/root/
...
Run benchmark
cd /root/hpl_gotoblas2/bin/Linux_gotoblas2/
mpiexec --hostfile hosts -np 16 ./xhpl
Get result
# ssh 192.168.0.1
# cat /root/hpl_gotoblas2/bin/Linux_gotoblas2/HPL.out
....
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 79232 128 4 4 3242.79 1.023e+02
.....
My result was 102 Gflops.