- Setup servers
- Benchmark perftest
- Benchmark nccl-tests
- Benchmark NVIDIA HPCG
- Benchmark PyTorch ResNet50
- Benchmark OSU
GPUDirect RDMA (Remote Direct Memory Access) allows data to move directly between GPUs, and between GPUs and other devices such as network cards, without the CPU copying the data.
Test setup:
* Hardware: 2 x Dell PowerEdge XE9680 servers (8 x NVIDIA H100, 10 x Mellanox InfiniBand ConnectX-6 cards)
* OS: Rocky Linux 9.4 minimal
* Hostnames (IP): alpha (10.3.1.5), delta (10.3.1.6)
NOTE 1: I'll run these benchmarks without a shared filesystem. In this example, all software is compiled on both servers.
1. Setup servers
Disable IOMMU. This option can appear under different names in the BIOS: AMD-Vi, VT-d, IOMMU, or SR-IOV.
Disable ACS. The procedure differs between servers and vendors; check the server manual.
I followed this Dell manual: https://www.dell.com/support/manuals/en-us/poweredge-xe9680/xe9680_ism_pub/processor-settings?guid=guid-71fdb36a-23ad-4720-b453-7347fd93e697&lang=en-us
On a Dell XE9680 with Xeon Platinum 8470: disabling Virtualization Technology disables ACS in the BIOS, but ACS will still show as enabled in the OS.
Check command (as root):
# lspci -vvv | grep ACSCtl
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
. . .
ACS is enabled if any line contains a plus sign: "SrcValid+".
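A quick one-liner (as root) to count devices that still have source validation enabled; it should drop to zero after the fix below:
lspci -vvv 2>/dev/null | grep -c 'SrcValid+'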
Create script file /usr/local/sbin/acs-disable to disable ACS:
#!/bin/bash
for BDF in $(lspci -d "*:*:*" | awk '{print $1}'); do
    # skip devices that do not support ACS
    sudo setpci -v -s "${BDF}" ECAP_ACS+0x6.w > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        continue
    fi
    # clear the ACS control register
    sudo setpci -v -s "${BDF}" ECAP_ACS+0x6.w=0000
done
Make it executable:
chmod 700 /usr/local/sbin/acs-disable
To disable ACS on every boot, create the service file /etc/systemd/system/acs-disable.service:
[Unit]
Description=ACS disable
After=default.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/acs-disable
[Install]
WantedBy=default.target
Reload systemd, then enable and start the service:
systemctl daemon-reload
systemctl enable acs-disable
systemctl start acs-disable
Check again (as root):
# lspci -vvv | grep ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
. . .
To avoid mismatches between drivers and the Linux kernel release, I installed the kernel-related packages from the installation ISO image.
Mount the DVD ISO image:
mount /root/Rocky-9.4-x86_64-dvd.iso /mnt/ -o loop
Create the repository file /etc/yum.repos.d/iso.repo:
[baseos-iso]
name=Rocky Linux $releasever - BaseOS ISO
baseurl=file:///mnt/BaseOS/
gpgcheck=0
enabled=0
[appstream-iso]
name=Rocky Linux $releasever - AppStream ISO
baseurl=file:///mnt/AppStream/
gpgcheck=0
enabled=0
Install the kernel packages from the ISO repositories:
dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso install kernel-devel
dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso install kernel-abi-stablelists kernel-rpm-macros kernel-core kernel-modules kernel-modules-core kernel-modules-extra kernel-devel-matched
Install development tools:
dnf -y groupinstall 'Development Tools'
dnf -y install epel-release
dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso,epel install dkms
dnf -y install pciutils-devel
Set the hostname (alpha on the first server, delta on the second) and disable the firewall:
hostnamectl set-hostname alpha
systemctl disable firewalld
systemctl stop firewalld
Disable SELinux on Rocky 9.x:
grubby --update-kernel ALL --args selinux=0
And/or edit the file /etc/selinux/config:
# grep '^SELINUX=' /etc/selinux/config
SELINUX=disabled
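SELinux is fully off only after a reboot; verify afterwards with:
getenforce
Expected output: Disabled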
Install the InfiniBand drivers:
curl -LJO https://www.mellanox.com/downloads/DOCA/DOCA_v2.8.0/host/doca-host-2.8.0-204000_24.07_rhel94.x86_64.rpm
rpm -i doca-host-2.8.0-204000_24.07_rhel94.x86_64.rpm
dnf clean all
dnf -y --enablerepo=crb install doca-all
Install CUDA
curl -LJO https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda-repo-rhel9-12-6-local-12.6.2_560.35.03-1.x86_64.rpm
rpm -i cuda-repo-rhel9-12-6-local-12.6.2_560.35.03-1.x86_64.rpm
dnf clean all
dnf -y install cuda
dnf -y install nvidia-gds
dnf -y install nvidia-persistenced
dnf -y install nvidia-fabric-manager
systemctl enable nvidia-persistenced
systemctl enable nvidia-fabricmanager
Reboot:
reboot
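After the reboot, a quick sanity check that the driver and the NVLink fabric manager came up:
nvidia-smi
systemctl status nvidia-fabricmanager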
Load the nvidia_peermem kernel module:
modprobe nvidia_peermem
Make the nvidia_peermem kernel module load on boot:
echo "nvidia_peermem" > /etc/modules-load.d/nvidia_peermem.conf
Check:
# lsmod | grep nvidia_peermem
nvidia_peermem 24576 0
nvidia 9760768 97 nvidia_uvm,nvidia_peermem,nvidia_fs,nvidia_modeset
ib_uverbs 217088 3 nvidia_peermem,rdma_ucm,mlx5_ib
Build GDRCopy, a low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology, and install the RPMs:
curl -LJO https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.2.tar.gz
tar xf gdrcopy-2.4.2.tar.gz
cd gdrcopy-2.4.2/packages/
CUDA=/usr/local/cuda-12.6 ./build-rpm-packages.sh
dnf -y install ucx-gdrcopy ucx-cuda gdrcopy-2.4.2-1.el9.x86_64.rpm gdrcopy-devel-2.4.2-1.el9.noarch.rpm gdrcopy-kmod-2.4.2-1dkms.el9.noarch.rpm
Check that the kernel module is loaded:
# lsmod | grep gdrdrv
gdrdrv 36864 0
nvidia 9760768 98 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
Check that gdr_copy will be used by UCX; example output:
ucx_info -d | grep gdr
# Memory domain: gdr_copy
# Component: gdr_copy
# Transport: gdr_copy
Check that IOMMU is disabled.
/usr/local/cuda/gds/tools/gdscheck -p
Check gdrcopy_sanity:
# gdrcopy_sanity
Total: 28, Passed: 26, Failed: 0, Waived: 2
List of waived tests:
invalidation_access_after_free_cumemalloc
invalidation_access_after_free_vmmalloc
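(Optional) The gdrcopy package also installs a small local bandwidth tool, useful as a quick check before moving on to the network benchmarks:
gdrcopy_copybw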
Set IPv4 address for the first InfiniBand card:
nmtui
Show system topology:
# nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 PIX PIX NODE NODE NODE SYS SYS SYS SYS SYS 0,2,4,6,8,10 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE PIX NODE NODE SYS SYS SYS SYS SYS 0,2,4,6,8,10 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PIX NODE SYS SYS SYS SYS SYS 0,2,4,6,8,10 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 NODE NODE NODE NODE PIX SYS SYS SYS SYS SYS 0,2,4,6,8,10 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS SYS PIX PIX NODE NODE NODE 1,3,5,7,9,11 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS SYS NODE NODE PIX NODE NODE 1,3,5,7,9,11 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS SYS NODE NODE NODE PIX NODE 1,3,5,7,9,11 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS SYS NODE NODE NODE NODE PIX 1,3,5,7,9,11 1 N/A
NIC0 PIX NODE NODE NODE SYS SYS SYS SYS X PIX NODE NODE NODE SYS SYS SYS SYS SYS
NIC1 PIX NODE NODE NODE SYS SYS SYS SYS PIX X NODE NODE NODE SYS SYS SYS SYS SYS
NIC2 NODE PIX NODE NODE SYS SYS SYS SYS NODE NODE X NODE NODE SYS SYS SYS SYS SYS
NIC3 NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE X NODE SYS SYS SYS SYS SYS
NIC4 NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE NODE X SYS SYS SYS SYS SYS
NIC5 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS SYS X PIX NODE NODE NODE
NIC6 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS SYS PIX X NODE NODE NODE
NIC7 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS SYS NODE NODE X NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS SYS NODE NODE NODE X NODE
NIC9 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS SYS NODE NODE NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9
The topology is important for the benchmarks: it determines which GPU and NIC combinations are fastest. Now we can benchmark all GPU and NIC combinations between the two servers.
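To map each mlx5_X device to its network interface and link state, the ibdev2netdev utility from the Mellanox/DOCA driver stack is handy (interface names below are illustrative):
# ibdev2netdev
mlx5_0 port 1 ==> ib0 (Up)
mlx5_1 port 1 ==> ib1 (Up)
...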
2. Benchmark perftest
Compile perftest with CUDA support:
curl -LJO https://github.com/linux-rdma/perftest/archive/refs/tags/24.07.0-0.44.tar.gz
tar xf perftest-24.07.0-0.44.tar.gz
cd perftest-24.07.0-0.44
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make
cd ..
Benchmark bandwidth between the two nodes using the first NIC and the first GPU, then repeat for the other GPU and NIC combinations.
Server alpha:
./perftest-24.07.0-0.44/ib_write_bw -d mlx5_0 --use_cuda=0 -a --report_gbits
. . .
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
8388608 5000 98.76 98.76 0.001472
. . .
Client delta:
./perftest-24.07.0-0.44/ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.3.1.5 --report_gbits
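To sweep all 10 NICs without retyping the command, the same binary can be driven from a small loop on both hosts; a minimal sketch (hypothetical helper, not part of perftest) where each server iteration must be listening before the matching client iteration starts:
#!/bin/bash
# sweep-ib-bw.sh -- hypothetical helper, run on both hosts:
#   server (alpha): ./sweep-ib-bw.sh
#   client (delta): ./sweep-ib-bw.sh 10.3.1.5
# Sweeps all 10 HCAs against GPU 0; for peak numbers pair each NIC
# with its PIX GPU from 'nvidia-smi topo -m'.
SERVER="$1"
for i in $(seq 0 9); do
    echo "=== mlx5_${i} ==="
    ./perftest-24.07.0-0.44/ib_write_bw -d "mlx5_${i}" --use_cuda=0 -a --report_gbits ${SERVER}
    sleep 2   # give the server side time to restart its listener
done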
3. Benchmark nccl-tests
This benchmark needs OpenMPI compiled with UCX, and UCX must be built with CUDA and GDRCopy support. The Mellanox InfiniBand driver ships with UCX, which was already installed.
Check with these commands:
ucx_info -v | grep cuda
ucx_info -v | grep gdrcopy
Build OpenMPI
curl -O https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.gz
tar xf openmpi-5.0.5.tar.gz
cd openmpi-5.0.5
./configure --prefix=/root/openmpi505 --with-cuda=/usr/local/cuda/ --with-ucx
make -j
make install
cd ..
export PATH=/root/openmpi505/bin:$PATH
export LD_LIBRARY_PATH=/root/openmpi505/lib:$LD_LIBRARY_PATH
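Optionally confirm that this OpenMPI build picked up CUDA support:
ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
The matching line should end with :true.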
Build NVIDIA NCCL library:
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build
cd ..
export LD_LIBRARY_PATH=/root/nccl/build/lib/:$LD_LIBRARY_PATH
(Optional) Command to build RPMs: make pkg.redhat.build
Build nccl-tests:
curl -LJO https://github.com/NVIDIA/nccl-tests/archive/refs/tags/v2.13.11.tar.gz
tar xf nccl-tests-2.13.11.tar.gz
cd nccl-tests-2.13.11
make MPI=1 MPI_HOME=/root/openmpi505 CUDA_HOME=/usr/local/cuda/ NCCL_HOME=/root/nccl/build/
cd ..
Set environment variables on both servers. Add these lines to the /root/.bashrc file:
export PATH=/root/openmpi505/bin:$PATH
export LD_LIBRARY_PATH=/root/openmpi505/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/root/nccl/build/lib/:$LD_LIBRARY_PATH
Log out and log back in to reload the Bash environment.
Add both servers to the /etc/hosts file:
...
10.3.1.5 alpha
10.3.1.6 delta
Generate an SSH key on alpha and copy the public key to delta:
ssh-keygen
ssh-copy-id delta
Start on 2 servers with 16 GPUs:
mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 ./nccl-tests-2.13.11/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
Result:
...
# Avg bus bandwidth : 73.3687
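To verify that NCCL actually goes over GPUDirect RDMA, rerun with debug logging and look for transport lines containing GDRDMA (a sketch; same test command as above):
mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,NET ./nccl-tests-2.13.11/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 2>&1 | grep GDRDMA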
4. Benchmark NVIDIA HPCG
This benchmark uses the same /root/.bashrc and OpenMPI setup as the nccl-tests benchmark.
Download source code:
git clone https://github.com/NVIDIA/nvidia-hpcg
cd nvidia-hpcg/
Disable Grace (NVIDIA Grace CPU support) in build_sample.sh:
# grep "^export USE_GRACE=" ./build_sample.sh
export USE_GRACE=0
Build:
NCCL_PATH=/root/nccl/build/ CUDA_PATH=/usr/local/cuda/ MPI_PATH=/root/openmpi505/ USE_CUDA=1 USE_GRACE=0 ./build_sample.sh
cd ..
Start on 2 servers with 16 GPUs:
mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 ./nvidia-hpcg/bin/xhpcg 256 256 256 1900
Result:
Final Summary::HPCG result is VALID with a GFLOP/s rating of=7630.44
5. Benchmark PyTorch ResNet50
ResNet50 is an image classification model. The benchmark number is the training speed of ResNet50 on the ImageNet dataset, measured in images/second (not counting data loading).
Download the ImageNet dataset here: https://image-net.org/challenges/LSVRC/2012/2012-downloads.php You need to create an account, and the download takes a long time (possibly 24 hours). These 2 files are needed:
ILSVRC2012_img_train.tar (138 GB)
ILSVRC2012_img_val.tar (6.3 GB)
mkdir /localscratch/
mkdir /localscratch/results
mkdir /localscratch/train
cd /localscratch/train
tar xf /root/ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
mkdir /localscratch/val
cd /localscratch/val
tar xf /root/ILSVRC2012_img_val.tar
curl -O https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
bash ./valprep.sh
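A quick sanity check of the extracted layout (ImageNet-1k has 1000 classes, so both directories should contain 1000 subdirectories):
ls /localscratch/train | wc -l
ls /localscratch/val | wc -l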
cd /root
dnf -y install pip
pip install --user nvidia-dali-cuda120
pip install --user git+https://github.com/NVIDIA/dllogger#egg=dllogger
pip install --user pynvml
pip install --user torch torchvision torchaudio
git clone https://github.com/NVIDIA/DALI.git
Start on 2 servers with 16 GPUs.
Server alpha command:
torchrun --master_addr=10.3.1.5 --master_port=1234 --nproc_per_node=8 --nnodes=2 --node_rank=0 /root/DALI/docs/examples/use_cases/pytorch/efficientnet/main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch
Server delta command:
torchrun --master_addr=10.3.1.5 --master_port=1234 --nproc_per_node=8 --nnodes=2 --node_rank=1 /root/DALI/docs/examples/use_cases/pytorch/efficientnet/main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch
Result:
...
Experiment ended
DLL ..... - Summary: ...... train.compute_ips : 66772.19 images/s ......
....
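(Optional) The same script can train on the extracted ImageNet data instead of synthetic inputs by swapping the data backend; a hedged variant of the alpha command, assuming the dali-gpu backend name (check main.py --help; the report filename here is arbitrary):
torchrun --master_addr=10.3.1.5 --master_port=1234 --nproc_per_node=8 --nnodes=2 --node_rank=0 /root/DALI/docs/examples/use_cases/pytorch/efficientnet/main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --no-checkpoints --training-only --data-backend dali-gpu --workspace /localscratch/results --report-file bench_report_dali.json /localscratch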
6. Benchmark OSU
Download and build:
export PATH=/usr/local/cuda/bin:$PATH
curl -O https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.5.tar.gz
tar xf osu-micro-benchmarks-7.5.tar.gz
cd osu-micro-benchmarks-7.5
./configure CC=`which mpicc` CXX=`which mpicxx` --enable-cuda --with-cuda-include=/usr/local/cuda/include --with-cuda-libpath=/usr/local/cuda/lib64
make
I'm running different GPU and NIC combinations according to the 'nvidia-smi topo -m' output:
Single direction, GPU 0, NIC 0 (PIX):
# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_0:1 -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D
...
# Size Bandwidth (MB/s)
...
4194304 12339.46
Single direction, one GPU, two NICs 0,1 (PIX, PIX):
# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D
...
# Size Bandwidth (MB/s)
...
4194304 24667.45
Single direction, one GPU, two NICs 3,4 (NODE, NODE):
# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_3:1,mlx5_4:1 -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D
...
# Size Bandwidth (MB/s)
...
4194304 24592.39
Single direction, one GPU, two NICs 7,9 (SYS, SYS):
# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_9:1,mlx5_7:1 -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D
...
# Size Bandwidth (MB/s)
...
4194304 22855.04
Single direction, auto GPU and NICs:
# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D
...
# Size Bandwidth (MB/s)
...
4194304 24667.16
Bi-Directional, auto GPU and NICs:
# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bibw D D
# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size Bandwidth (MB/s)
1 3.98
2 7.98
4 16.25
8 32.46
16 64.65
32 129.24
64 237.45
128 493.33
256 915.73
512 1703.28
1024 2993.79
2048 5184.55
4096 7407.15
8192 14501.31
16384 18774.54
32768 37192.35
65536 42738.37
131072 45578.91
262144 47329.40
524288 48296.48
1048576 48789.81
2097152 49016.33
4194304 49155.11
Auto-selection does a good job. The PIX combination is the fastest, followed by NODE; SYS is the slowest.
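To sweep every NIC one at a time without retyping each command, a small helper loop works; a minimal sketch assuming the paths from this section (tail keeps only the 4194304-byte line):
#!/bin/bash
# hypothetical sweep script: GPU 0 on both ends, one HCA at a time
for i in $(seq 0 9); do
    echo "=== mlx5_${i} ==="
    mpirun --allow-run-as-root -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_${i}:1 -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D | tail -n 1
done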