HPC LINPACK benchmark

Posted on Sat 05 January 2013 by Pavlo Khmel


  0. Intro
  1. HPL + Intel MKL + Intel MPI
  2. HPL + ATLAS + MPICH2
  3. HPL + GotoBLAS2 + Open MPI

0. Intro

HPL is a portable implementation of the HPLinpack benchmark, which is used to provide data for the Top500 list (http://www.top500.org/).
To get the best result in flops, I tried different combinations of linear algebra library, MPI library, and compiler.
I used Intel CPUs and got the best results with MKL + Intel MPI + icc.
Below I show how to compile and run the LINPACK benchmark with different libraries.

I have tried:

Linear algebra libraries

  • GotoBLAS2
  • MKL (Intel)

MPI library

  • MPICH2
  • Open MPI
  • Intel MPI


Compiler

  • gcc, g77
  • icc (Intel)

OS: CentOS 6 or Rocks 6

1. HPL + Intel MKL + Intel MPI

Get Intel evaluation products: http://software.intel.com/en-us/intel-software-evaluation-center

Intel Composer XE 2013 for Linux
Download file: l_ccompxe_2013.1.117.tgz + license file

tar xaf l_ccompxe_2013.1.117.tgz
cd l_ccompxe_2013.1.117

Default installation path: /opt/intel/composer_xe_2013.1.117

Intel Math Kernel Library (Intel MKL) 11.0 for Linux
Download file: l_mkl_11.0.1.117.tgz + license file

tar xzf l_mkl_11.0.1.117.tgz
cd l_mkl_11.0.1.117

Default installation path: /opt/intel/composer_xe_2013.1.117

Intel MPI Library 4.1 for Linux
Download file: l_mpi_p_4.1.0.024.tgz + license file

tar xzf l_mpi_p_4.1.0.024.tgz
cd l_mpi_p_4.1.0.024

Default installation path: /opt/intel/impi/


Download latest Linpack: http://www.netlib.org/benchmark/hpl/

tar xzf hpl-2.1.tar.gz
mv hpl-2.1 hpl_intel_mkl
cd hpl_intel_mkl/

Create file Make.Linux_intel64 (the name must match the arch= value passed to make below)

SHELL        = /bin/sh
CD           = cd
CP           = cp
LN_S         = ln -fs
MKDIR        = mkdir -p
RM           = /bin/rm -f
TOUCH        = touch
ARCH         = Linux_intel64
TOPdir       = $(HOME)/hpl_intel_mkl
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
HPLlib       = $(LIBdir)/libhpl.a
HPLlibHybrid = /opt/intel/composer_xe_2013.1.117/mkl/benchmarks/mp_linpack/lib_hybrid/intel64/libhpl_hybrid.a
LAdir        = /opt/intel
LAinc        = -I$(LAdir)/mkl/include
LAlib        = -L$(LAdir)/mkl/lib/intel64 -Wl,--start-group $(LAdir)/mkl/lib/intel64/libmkl_intel_lp64.a $(LAdir)/mkl/lib/intel64/libmkl_intel_thread.a $(LAdir)/mkl/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -ldl $(HPLlibHybrid)
F2CDEFS      = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
CC           = mpiicc
CCNOOPT      = $(HPL_DEFS) -O0 -w -nocompchk
MKLINCDIR    = -I"/opt/intel/mkl/include"
CCFLAGS      = $(HPL_DEFS) $(MKLINCDIR) -O3  -w -ansi-alias -i-static -z noexecstack -z relro -z now -openmp -nocompchk
LINKER       = $(CC)
LINKFLAGS    = $(CCFLAGS) -openmp -mt_mpi $(STATICFLAG) -nocompchk
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo


export PATH=/opt/intel/impi/$PATH
export LD_LIBRARY_PATH=/opt/intel/impi/$LD_LIBRARY_PATH
make arch=Linux_intel64

Preparation for benchmark

cd bin/Linux_intel64/

The default HPL.dat is not tuned for your cluster. Generate one with the online tool at www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html

For example, for 4 nodes, each with 4 CPU cores and 15000 MB of RAM:

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any)
8            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
79232         Ns
1            # of NBs
128           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
4            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB
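
The Ns value in the HPL.dat above can be reproduced by hand. This is a minimal sketch, assuming the common rule of thumb that the matrix should fill about 80% of total RAM and that N is rounded down to a multiple of NB. The function name hpl_n, the 80% default and the reading of "MB" as MiB are assumptions, not something taken from the online tool.

```shell
# Hypothetical helper: estimate the HPL problem size N.
# Usage: hpl_n NODES MEM_MB_PER_NODE NB [PERCENT_OF_RAM]
# Assumes memory is given in MiB and each matrix element is an 8-byte double.
hpl_n() {
    LC_ALL=C awk -v nodes="$1" -v mb="$2" -v nb="$3" -v pct="${4:-80}" 'BEGIN {
        doubles = nodes * mb * 1048576 / 8     # matrix elements that fit in total RAM
        n = int(sqrt(doubles * pct / 100))     # side of a square matrix using pct% of RAM
        print int(n / nb) * nb                 # round down to a multiple of NB
    }'
}

hpl_n 4 15000 128    # 4 nodes, 15000 MB each, NB=128
```

For the example cluster this prints 79232, the Ns value used above.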

Create host list file ./hosts
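
The host list is a plain text file with one hostname per line. A minimal sketch, assuming four nodes named my-compute-0000 to my-compute-0003; these names and the helper make_hosts are placeholders, so substitute your own node names.

```shell
# Hypothetical helper: print one hostname per line for nodes 0..COUNT-1.
# The "my-compute-" prefix is a placeholder, not a required naming scheme.
make_hosts() {
    count=$1
    i=0
    while [ "$i" -lt "$count" ]; do
        printf 'my-compute-%04d\n' "$i"
        i=$((i + 1))
    done
}

make_hosts 4    # prints the four node names
```

Redirect the output to the file to create it: make_hosts 4 > hosts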

Create file /etc/profile.d/intel.sh

export PATH=/opt/intel/impi/$PATH
export LD_LIBRARY_PATH=/opt/intel/impi/$LD_LIBRARY_PATH
export I_MPI_FABRICS=shm:tcp

All nodes should have:
- Intel MPI Library 4.1 for Linux
- a copy of /root/hpl_intel_mkl/
- /etc/profile.d/intel.sh


Run the benchmark; -n 16 is the total number of CPU cores in the cluster:

export I_MPI_FABRICS=shm:tcp
mpiexec.hydra -f hosts -n 16 ./xhpl
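
HPL needs at least P × Q MPI processes, so the value passed to -n is normally P × Q from HPL.dat (4 × 4 = 16 here). A minimal sketch for picking a near-square grid with P ≤ Q, which is the shape usually recommended for HPL; the function name grid_pq is hypothetical.

```shell
# Hypothetical helper: split a process count into the most square P x Q grid, P <= Q.
# Usage: grid_pq NPROCS   -> prints "P Q"
grid_pq() {
    np=$1
    p=1
    i=1
    # Try every candidate up to sqrt(np) and keep the largest divisor found.
    while [ $((i * i)) -le "$np" ]; do
        if [ $((np % i)) -eq 0 ]; then
            p=$i
        fi
        i=$((i + 1))
    done
    echo "$p $((np / p))"
}

grid_pq 16    # 4 nodes x 4 cores
```

For 16 processes this prints "4 4", matching the Ps and Qs values in the HPL.dat example; for 12 processes it would print "3 4".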

When the benchmark has finished, the result is on the first compute node in the file /root/hpl_intel_mkl/bin/Linux_intel64/HPL.out

My result is 143 Gflops
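
The Gflops figure can be cross-checked from N and the wall time. A minimal sketch, assuming the operation count 2/3·N^3 + 3/2·N^2 that HPL uses when reporting Gflops; the function name hpl_gflops is hypothetical.

```shell
# Hypothetical helper: recompute the HPL Gflops rate from problem size and time.
# Usage: hpl_gflops N TIME_SECONDS
hpl_gflops() {
    LC_ALL=C awk -v n="$1" -v t="$2" 'BEGIN {
        flops = (2 / 3) * n * n * n + (3 / 2) * n * n   # operations for an N x N solve
        printf "%.1f\n", flops / t / 1e9                # operations per second, in Gflops
    }'
}

hpl_gflops 79232 2311.44    # N and Time from HPL.out
```

With N = 79232 and Time = 2311.44 this prints 143.5, in line with the 1.435e+02 reported by HPL.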

Full HPL.out

HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   79232
NB     :     128
PMAP   : Row-major process mapping
P      :       4
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words


- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

T/V                N    NB     P     Q               Time                 Gflops
WR11C2R4       79232   128     4     4            2311.44              1.435e+02
HPL_pdgesv() start time Wed Jan  2 21:53:09 2013

HPL_pdgesv() end time   Wed Jan  2 22:31:40 2013

||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0034949 ...... PASSED

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.

End of Tests.

2. HPL + ATLAS + MPICH2


yum --enablerepo=* install atlas blas lapack mpich2 atlas-devel mpich2-devel gcc gcc-c++ make
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar -xvzf hpl-2.1.tar.gz
cd hpl-2.1
# use template
cp setup/Make.Linux_PII_CBLAS ./

File Make.Linux_PII_CBLAS

SHELL        = /bin/sh
CD           = cd
CP           = cp
LN_S         = ln -fs
MKDIR        = mkdir -p
RM           = /bin/rm -f
TOUCH        = touch
ARCH         = Linux_PII_CBLAS
TOPdir       = $(HOME)/hpl-2.1
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
HPLlib       = $(LIBdir)/libhpl.a
LAdir        = /usr/lib64/atlas
LAinc        =
LAlib        = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
F2CDEFS      =
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
CC           = /usr/bin/mpicc
CCFLAGS      = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
LINKER       = /usr/bin/mpicc
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo


make arch=Linux_PII_CBLAS


cd bin/Linux_PII_CBLAS

Edit HPL.dat using the online tool at www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html

Prepare compute nodes

scp -r hpl-2.1 my-compute-0000:/root/
scp -r hpl-2.1 my-compute-0001:/root/
scp -r hpl-2.1 my-compute-0002:/root/
scp -r hpl-2.1 my-compute-0003:/root/
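
With more nodes, the same copy can be driven from a host list instead of one scp line per node. A minimal sketch, assuming a file with one hostname per line; push_hpl and DRY_RUN are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: copy the hpl-2.1 tree to /root/ on every host in a file.
# Usage: push_hpl HOSTFILE
# Set DRY_RUN=1 to print the scp commands instead of running them.
push_hpl() {
    hostfile=$1
    while IFS= read -r host; do
        [ -n "$host" ] || continue                      # skip empty lines
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "scp -r hpl-2.1 $host:/root/"
        else
            scp -r hpl-2.1 "$host:/root/" < /dev/null   # keep scp away from the loop's stdin
        fi
    done < "$hostfile"
}
```

Running DRY_RUN=1 push_hpl hosts first shows the commands that would be executed; drop DRY_RUN=1 to do the actual copy.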

On each server

chkconfig iptables off; service iptables stop
yum --enablerepo=* install atlas blas lapack mpich2

Run benchmark

ssh central-server
cd hpl-2.1/bin/Linux_PII_CBLAS
cat hosts
mpiexec.hydra -f hosts -n 16 ./xhpl

The result is in my-compute-0000:/root/hpl-2.1/bin/Linux_PII_CBLAS/HPL.out

3. HPL + GotoBLAS2 + Open MPI

Download GotoBLAS2 http://www.tacc.utexas.edu/tacc-projects/gotoblas2

tar xzf GotoBLAS2-1.13.tar.gz
cd GotoBLAS2
# GotoBLAS2 didn't detect Core i7 automatically.

Compile HPL

wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar -xvzf hpl-2.1.tar.gz
mv hpl-2.1 hpl_gotoblas2
cd /root/hpl_gotoblas2/

Create Make.Linux_gotoblas2

SHELL        = /bin/sh
CD           = cd
CP           = cp
LN_S         = ln -sf
MKDIR        = mkdir -p
RM           = /bin/rm -f
TOUCH        = touch
ARCH         = Linux_gotoblas2
TOPdir       = $(HOME)/hpl_gotoblas2
INCdir       = $(TOPdir)/include
BINdir       = $(TOPdir)/bin/$(ARCH)
LIBdir       = $(TOPdir)/lib/$(ARCH)
HPLlib       = $(LIBdir)/libhpl.a
MPdir        =
MPinc        =
MPlib        =
LAdir        = /root/GotoBLAS2
LAinc        =
LAlib        = -L$(LAdir) -lgoto2
F2CDEFS      =
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS     = $(HPLlib) $(LAlib) $(MPlib)
CC           = mpicc
CCFLAGS      = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
LINKER       = mpif77
ARCHIVER     = ar
ARFLAGS      = r
RANLIB       = echo


make arch=Linux_gotoblas2


Create file /etc/profile.d/gotoblas2.sh on all compute nodes


cd bin/Linux_gotoblas2/

Edit HPL.dat using the online tool at www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html

Create file hosts

Copy both directories to the compute nodes

scp -r /root/GotoBLAS2
scp -r /root/hpl_gotoblas2

Run benchmark

cd /root/hpl_gotoblas2/bin/Linux_gotoblas2/
mpiexec --hostfile hosts -np 16 ./xhpl

Get result

# ssh
# cat /root/hpl_gotoblas2/bin/Linux_gotoblas2/HPL.out
T/V                N    NB     P     Q               Time                 Gflops
WR11C2R4       79232   128     4     4            3242.79              1.023e+02

102 Gflops
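
Instead of reading the whole HPL.out, the result line can be pulled out with awk. A minimal sketch, assuming result lines start with the encoded variant (for example WR11C2R4) and have seven columns as in the output above; the function name hpl_results is hypothetical.

```shell
# Hypothetical helper: print N, time and Gflops for each result line in an HPL output file.
# Usage: hpl_results HPL.out
hpl_results() {
    # Result lines look like: WR11C2R4  79232  128  4  4  3242.79  1.023e+02
    LC_ALL=C awk '$1 ~ /^W[RC][0-9]/ && NF == 7 {
        printf "N=%d time=%.2fs gflops=%.1f\n", $2, $6, $7
    }' "$1"
}
```

For the run above, hpl_results HPL.out prints: N=79232 time=3242.79s gflops=102.3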