Contents
- Intro
- HPL + Intel MKL + Intel MPI
- HPL + ATLAS + MPICH2
- HPL + GotoBLAS2 + Open MPI
0. Intro
HPL is a portable implementation of the High-Performance Linpack (HPLinpack) benchmark, used to provide data for the TOP500 list: http://www.top500.org/
To get the best result in flops, I tried different combinations of linear algebra library, MPI library, and compiler.
I used Intel CPUs, and the best results came from MKL + Intel MPI + icc.
Below I show how to compile and run the LINPACK benchmark with the different libraries.
I have tried:
Linear algebra libraries
- ATLAS
- GotoBLAS2
- MKL (intel)
MPI libraries
- MPICH2
- Open MPI
- Intel MPI
Compilers
- gcc, g77
- icc (intel)
OS: CentOS 6 or Rocks 6
1. HPL + Intel MKL + Intel MPI
Get Intel evaluation products: http://software.intel.com/en-us/intel-software-evaluation-center
Intel Composer XE 2013 for Linux
Download file: l_ccompxe_2013.1.117.tgz + license file
tar xaf l_ccompxe_2013.1.117.tgz
cd l_ccompxe_2013.1.117
./install.sh
Default installation path: /opt/intel/composer_xe_2013.1.117
Intel Math Kernel Library (Intel MKL) 11.0 for Linux
Download file: l_mkl_11.0.1.117.tgz + license file
tar xzf l_mkl_11.0.1.117.tgz
cd l_mkl_11.0.1.117
./install.sh
Default installation path: /opt/intel/composer_xe_2013.1.117
Intel MPI Library 4.1 for Linux
Download file: l_mpi_p_4.1.0.024.tgz + license file
tar xzf l_mpi_p_4.1.0.024.tgz
cd l_mpi_p_4.1.0.024
./install.sh
Default installation path: /opt/intel/impi/4.1.0.024
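Optionally, instead of exporting PATH and LD_LIBRARY_PATH by hand (as done in the Compilation step below), the Intel products ship environment scripts that should set up the same variables; a sketch assuming the default installation paths above:
# source the compiler/MKL and Intel MPI environments for 64-bit builds
source /opt/intel/composer_xe_2013.1.117/bin/compilervars.sh intel64
source /opt/intel/impi/4.1.0.024/bin64/mpivars.sh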
Linpack
Download latest Linpack: http://www.netlib.org/benchmark/hpl/
tar xzf hpl-2.1.tar.gz
mv hpl-2.1 hpl_intel_mkl
cd hpl_intel_mkl/
Create the file Make.Linux_intel64 (the name has to match the arch= value passed to make below)
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -fs
MKDIR = mkdir -p
RM = /bin/rm -f
TOUCH = touch
ARCH = Linux_intel64
TOPdir = $(HOME)/hpl_intel_mkl
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
HPLlibHybrid = /opt/intel/composer_xe_2013.1.117/mkl/benchmarks/mp_linpack/lib_hybrid/intel64/libhpl_hybrid.a
LAdir = /opt/intel
LAinc = -I$(LAdir)/mkl/include
LAlib = -L$(LAdir)/mkl/lib/intel64 -Wl,--start-group $(LAdir)/mkl/lib/intel64/libmkl_intel_lp64.a $(LAdir)/mkl/lib/intel64/libmkl_intel_thread.a $(LAdir)/mkl/lib/intel64/libmkl_core.a -Wl,--end-group -lpthread -ldl $(HPLlibHybrid)
F2CDEFS = -DAdd__ -DF77_INTEGER=int -DStringSunStyle
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
HPL_OPTS = -DASYOUGO -DHYBRID
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CC = mpiicc
CCNOOPT = $(HPL_DEFS) -O0 -w -nocompchk
MKLINCDIR = -I"/opt/intel/mkl/include"
CCFLAGS = $(HPL_DEFS) $(MKLINCDIR) -O3 -w -ansi-alias -i-static -z noexecstack -z relro -z now -openmp -nocompchk
LINKER = $(CC)
LINKFLAGS = $(CCFLAGS) -openmp -mt_mpi $(STATICFLAG) -nocompchk
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
Compilation
export PATH=/opt/intel/impi/4.1.0.024/bin64/:/opt/intel/composer_xe_2013.1.117/bin/intel64/:$PATH
export LD_LIBRARY_PATH=/opt/intel/impi/4.1.0.024/lib64/:/opt/intel/composer_xe_2013.1.117/mkl/lib/intel64/:$LD_LIBRARY_PATH
make arch=Linux_intel64
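If the build succeeds, the binary ends up in bin/Linux_intel64/. A quick sanity check that it was produced and picked up the Intel MPI runtime (the grep result is what I would expect, not verbatim output):
ls -l bin/Linux_intel64/xhpl
ldd bin/Linux_intel64/xhpl | grep -i mpi   # should list the Intel MPI shared libraries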
Preparation for benchmark
cd bin/Linux_intel64/
The default HPL.dat is not tuned; generate sensible values with the online tool at www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html
For example, with 4 nodes, each with 4 CPU cores and 15000 MB of RAM (a sketch of the Ns calculation follows the HPL.dat listing below):
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
8 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
79232 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
4 Ps
4 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0 Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0 number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
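The Ns value of 79232 is not arbitrary: the usual rule of thumb (and what the advancedclustering.com calculator does) is to take roughly 80% of the total cluster RAM and round N down to a multiple of NB. A small sketch reproducing it for this example (4 nodes x 15000 MB, NB = 128; the 80% factor is the calculator's assumption):
awk 'BEGIN {
    mem_mb = 4 * 15000                        # 4 nodes x 15000 MB each
    nb = 128                                  # block size NB
    n = sqrt(0.80 * mem_mb * 1048576 / 8)     # ~80% of RAM, 8 bytes per double
    print int(n / nb) * nb                    # prints 79232
}'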
Create host list file ./hosts
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
Create file /etc/profile.d/intel.sh
#!/bin/bash
export PATH=/opt/intel/impi/4.1.0.024/bin64/:/opt/intel/composer_xe_2013.1.117/bin/intel64/:$PATH
export LD_LIBRARY_PATH=/opt/intel/impi/4.1.0.024/lib64/:/opt/intel/composer_xe_2013.1.117/mkl/lib/intel64/:$LD_LIBRARY_PATH
export I_MPI_FABRICS=shm:tcp
All nodes should have the following (a copy sketch follows this list):
- Intel MPI Library 4.1 for Linux
- copy /root/hpl_intel_mkl/
- /etc/profile.d/intel.sh
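A minimal copy sketch from the build host, assuming passwordless root SSH to every node (the Intel MPI runtime itself still has to be installed on each node separately):
for node in 192.168.0.1 192.168.0.2 192.168.0.3 192.168.0.4; do
    scp -r /root/hpl_intel_mkl "$node":/root/
    scp /etc/profile.d/intel.sh "$node":/etc/profile.d/
done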
Benchmarking
# -n 16 is the total number of CPU cores in the cluster
export I_MPI_FABRICS=shm:tcp
mpiexec.hydra -f hosts -n 16 ./xhpl
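If you prefer not to rely on /etc/profile.d/intel.sh for the fabric setting, mpiexec.hydra can push it to all ranks itself with the standard -genv option (a variant of the command above):
mpiexec.hydra -genv I_MPI_FABRICS shm:tcp -f hosts -n 16 ./xhpl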
When the benchmark has finished, the result can be found on the first compute node (192.168.0.1) in the file /root/hpl_intel_mkl/bin/Linux_intel64/HPL.out
My result was 143 Gflops.
Full HPL.out
================================================================================
HPLinpack 2.1 -- High-Performance Linpack benchmark -- October 26, 2012
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 79232
NB : 128
PMAP : Row-major process mapping
P : 4
Q : 4
PFACT : Right
NBMIN : 4
NDIV : 2
RFACT : Crout
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 79232 128 4 4 2311.44 1.435e+02
HPL_pdgesv() start time Wed Jan 2 21:53:09 2013
HPL_pdgesv() end time Wed Jan 2 22:31:40 2013
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0034949 ...... PASSED
================================================================================
Finished 1 tests with the following results:
1 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
2. HPL + ATLAS + MPICH2
yum --enablerepo=* install atlas blas lapack mpich2 atlas-devel mpich2-devel gcc gcc-c++ make
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar -xvzf hpl-2.1.tar.gz
cd hpl-2.1
# use template
cp setup/Make.Linux_PII_CBLAS ./
Edit Make.Linux_PII_CBLAS so that it contains:
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -fs
MKDIR = mkdir -p
RM = /bin/rm -f
TOUCH = touch
ARCH = Linux_PII_CBLAS
TOPdir = $(HOME)/hpl-2.1
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
LAdir = /usr/lib64/atlas
LAinc =
LAlib = $(LAdir)/libcblas.a $(LAdir)/libatlas.a
F2CDEFS =
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
HPL_OPTS = -DHPL_CALL_CBLAS
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CC = /usr/bin/mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
LINKER = /usr/bin/mpicc
LINKFLAGS = $(CCFLAGS)
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
Compile
make arch=Linux_PII_CBLAS
Preparation
cd bin/Linux_PII_CBLAS
Edit HPL.dat; generate tuned values with the online tool at www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html
Prepare compute nodes
scp -r hpl-2.1 my-compute-0000:/root/
scp -r hpl-2.1 my-compute-0001:/root/
scp -r hpl-2.1 my-compute-0002:/root/
scp -r hpl-2.1 my-compute-0003:/root/
On each server (or use the loop sketch after these commands):
chkconfig iptables off; service iptables stop
yum --enablerepo=* install atlas blas lapack mpich2
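The per-node preparation can also be scripted from the central server; a sketch assuming passwordless root SSH and the example host names above:
for node in my-compute-0000 my-compute-0001 my-compute-0002 my-compute-0003; do
    scp -r /root/hpl-2.1 "$node":/root/
    ssh "$node" 'chkconfig iptables off; service iptables stop; yum -y --enablerepo=* install atlas blas lapack mpich2'
done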
Run benchmark
$ ssh central-server
$ cd hpl-2.1/bin/Linux_PII_CBLAS
$ cat hosts
my-compute-0000
my-compute-0001
my-compute-0002
my-compute-0003
$ mpiexec.hydra -f hosts -n 16 ./xhpl
The result is in my-compute-0000:/root/hpl-2.1/bin/Linux_PII_CBLAS/HPL.out
3. HPL + GotoBLAS2 + Open MPI
Download GotoBLAS2 http://www.tacc.utexas.edu/tacc-projects/gotoblas2
tar xzf GotoBLAS2-1.13.tar.gz
cd GotoBLAS2
make
# GotoBLAS2 did not detect the Core i7 CPU automatically, so set the target explicitly:
make TARGET=NEHALEM
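If the build finishes, the library lands in the source directory; a quick check (the exact file names depend on the detected or forced target, so treat this as an expectation rather than verbatim output):
ls -l /root/GotoBLAS2/libgoto2*   # expect libgoto2.a / libgoto2.so symlinks to the NEHALEM build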
Compile HPL
wget http://www.netlib.org/benchmark/hpl/hpl-2.1.tar.gz
tar -xvzf hpl-2.1.tar.gz
mv hpl-2.1 hpl_gotoblas2
cd /root/hpl_gotoblas2/
Create Make.Linux_gotoblas2
SHELL = /bin/sh
CD = cd
CP = cp
LN_S = ln -sf
MKDIR = mkdir -p
RM = /bin/rm -f
TOUCH = touch
ARCH = Linux_gotoblas2
TOPdir = $(HOME)/hpl_gotoblas2
INCdir = $(TOPdir)/include
BINdir = $(TOPdir)/bin/$(ARCH)
LIBdir = $(TOPdir)/lib/$(ARCH)
HPLlib = $(LIBdir)/libhpl.a
MPdir =
MPinc =
MPlib =
LAdir = /root/GotoBLAS2
LAinc =
LAlib = -L$(LAdir) -lgoto2
F2CDEFS =
HPL_INCLUDES = -I$(INCdir) -I$(INCdir)/$(ARCH) $(LAinc) $(MPinc)
HPL_LIBS = $(HPLlib) $(LAlib) $(MPlib)
HPL_OPTS = -DHPL_CALL_CBLAS
HPL_DEFS = $(F2CDEFS) $(HPL_OPTS) $(HPL_INCLUDES)
CC = mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops
LINKER = mpif77
LINKFLAGS = $(CCFLAGS)
ARCHIVER = ar
ARFLAGS = r
RANLIB = echo
Compilation
make arch=Linux_gotoblas2
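Because LAlib links with -lgoto2 rather than the static archive directly, the linker will normally pick the shared libgoto2.so, so xhpl has to find it at run time (that is what the profile.d file in the next step is for). A quick check, assuming GotoBLAS2 was built in /root/GotoBLAS2:
export LD_LIBRARY_PATH=/root/GotoBLAS2:$LD_LIBRARY_PATH
ldd bin/Linux_gotoblas2/xhpl | grep goto2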
Preparation
Create the file /etc/profile.d/gotoblas2.sh on all compute nodes.
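A minimal sketch of its contents, assuming GotoBLAS2 lives in /root/GotoBLAS2 on every node (adjust the path if you copied the library elsewhere):
#!/bin/bash
# assumption: libgoto2.so is in /root/GotoBLAS2 on this node
export LD_LIBRARY_PATH=/root/GotoBLAS2:$LD_LIBRARY_PATH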
Change HPL.dat; the same tuned values as in section 1 are used here (online tool: www.advancedclustering.com/faq/how-do-i-tune-my-hpldat-file.html)
HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out output file name (if any)
8 device out (6=stdout,7=stderr,file)
1 # of problems sizes (N)
79232 Ns
1 # of NBs
128 NBs
0 PMAP process mapping (0=Row-,1=Column-major)
1 # of process grids (P x Q)
4 Ps
4 Qs
16.0 threshold
1 # of panel fact
2 PFACTs (0=left, 1=Crout, 2=Right)
1 # of recursive stopping criterium
4 NBMINs (>= 1)
1 # of panels in recursion
2 NDIVs
1 # of recursive panel fact.
1 RFACTs (0=left, 1=Crout, 2=Right)
1 # of broadcast
1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1 # of lookahead depth
1 DEPTHs (>=0)
2 SWAP (0=bin-exch,1=long,2=mix)
64 swapping threshold
0 L1 in (0=transposed,1=no-transposed) form
0 U in (0=transposed,1=no-transposed) form
1 Equilibration (0=no,1=yes)
8 memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0 Number of additional problem sizes for PTRANS
1200 10000 30000 values of N
0 number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64 values of NB
Create file hosts
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
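Note: with Open MPI a plain list of host names usually counts as one slot per node, so starting 16 ranks may oversubscribe or refuse to start, depending on the Open MPI version; a hostfile variant assuming 4 cores per node:
192.168.0.1 slots=4
192.168.0.2 slots=4
192.168.0.3 slots=4
192.168.0.4 slots=4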
Copy everything to the compute nodes:
scp -r /root/GotoBLAS2 192.168.0.1:/root/
...
scp -r /root/hpl_gotoblas2 192.168.0.1:/root/
...
Run benchmark
cd /root/hpl_gotoblas2/bin/Linux_gotoblas2/
mpiexec --hostfile hosts -np 16 ./xhpl
Get result
# ssh 192.168.0.1
# cat /root/hpl_gotoblas2/bin/Linux_gotoblas2/HPL.out
....
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WR11C2R4 79232 128 4 4 3242.79 1.023e+02
.....
My result was 102 Gflops.