Enable GPUDirect RDMA and benchmark with perftest, nccl-test, NVIDIA HPCG and PyTorch ResNet50

Posted on Mon 11 November 2024 by Pavlo Khmel
  1. Setup servers
  2. Benchmark perftest
  3. Benchmark nccl-tests
  4. Benchmark NVIDIA HPCG
  5. Benchmark PyTorch ResNet50

GPUDirect RDMA (Remote Direct Memory Access) allows direct data transfer between GPUs, and between GPUs and other devices such as network cards, eliminating the need for the CPU to handle the transfers.

Test setup:
* Hardware: 2 servers Dell PowerEdge XE9680 (8 x NVIDIA H100, 10 x Mellanox InfiniBand ConnectX-6 cards)
* OS: Rocky Linux 9.4 minimal
* Hostnames (IP): alpha (10.3.1.5), delta (10.3.1.6)

NOTE 1: I'll run these benchmarks without a shared filesystem. In this example, all software is compiled on both servers.

NOTE 2: Each benchmark will be run twice:

Without GPUDirect RDMA:

rmmod nvidia_peermem; rmmod gdrdrv

With GPUDirect RDMA:

modprobe nvidia_peermem; modprobe gdrdrv
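
A quick way to confirm which mode is currently active before each run (a simple check, not part of the original procedure):

lsmod | grep -E 'nvidia_peermem|gdrdrv'   # no output means the GPUDirect RDMA path is off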

1. Setup servers

Disable IOMMU. This option can have different names in BIOS: AMD-Vi, VT-d, IOMMU, or SR-IOV.

Disable ACS. This will differ between servers and vendors; check the server manual.

I followed this Dell manual: https://www.dell.com/support/manuals/en-us/poweredge-xe9680/xe9680_ism_pub/processor-settings?guid=guid-71fdb36a-23ad-4720-b453-7347fd93e697&lang=en-us

On a Dell XE9680 with Xeon Platinum 8470: disabling Virtualization Technology disables ACS in the BIOS, but ACS will still show as enabled in the OS.

Check command:

# lspci -vvv | grep ACSCtl
        ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
. . .

ACS is enabled if there are lines with a plus sign: "SrcValid+".
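
To quickly count how many devices still have ACS enabled (a convenience one-liner, not from the original steps):

lspci -vvv 2>/dev/null | grep -c "SrcValid+"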

Create script file /usr/local/sbin/acs-disable to disable ACS:

#!/bin/bash
for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do
  # skip if it doesn't support ACS
  sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
  if [ $? -ne 0 ]; then
    continue
  fi
  sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
done

Make executable:

chmod 700 /usr/local/sbin/acs-disable

Disable ACS on boot. Create the service file /etc/systemd/system/acs-disable.service:

[Unit]
Description=ACS disable
After=default.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/acs-disable

[Install]
WantedBy=default.target

Reload systemd, then enable and start the service:

systemctl daemon-reload
systemctl enable acs-disable
systemctl start acs-disable

Check again:

# lspci -vvv | grep ACSCtl
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
        ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
. . .

To avoid issues with mismatched drivers and Linux kernel releases, I install kernel-related packages from the installation DVD ISO image.

Mount the DVD ISO image:

mount /root/Rocky-9.4-x86_64-dvd.iso /mnt/ -o loop

Create the repository file /etc/yum.repos.d/iso.repo:

[baseos-iso]
name=Rocky Linux $releasever - BaseOS ISO
baseurl=file:///mnt/BaseOS/
gpgcheck=0
enabled=0

[appstream-iso]
name=Rocky Linux $releasever - AppStream ISO
baseurl=file:///mnt/AppStream/
gpgcheck=0
enabled=0

Install kernel packages from the ISO repositories:

dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso install kernel-devel
dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso install kernel-abi-stablelists kernel-rpm-macros kernel-core kernel-modules kernel-modules-core kernel-modules-extra kernel-devel-matched

Install development tools:

dnf -y groupinstall 'Development Tools'
dnf -y install epel-release
dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso,epel install dkms
dnf -y install pciutils-devel

Set the hostname (alpha here, delta on the second server) and disable the firewall:

hostnamectl set-hostname alpha
systemctl disable firewalld
systemctl stop firewalld

Disable SELinux on Rocky 9.x:

grubby --update-kernel ALL --args selinux=0

Or/and edit file /etc/selinux/config:

# grep '^SELINUX=' /etc/selinux/config
SELINUX=disabled
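
After a reboot, the SELinux status can be verified (a simple check):

getenforce   # should report Disabled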

Install InfiniBand drivers

curl -LJO https://www.mellanox.com/downloads/DOCA/DOCA_v2.8.0/host/doca-host-2.8.0-204000_24.07_rhel94.x86_64.rpm
rpm -i doca-host-2.8.0-204000_24.07_rhel94.x86_64.rpm
dnf clean all
dnf -y --enablerepo=crb install doca-all

Install CUDA

curl -LJO https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda-repo-rhel9-12-6-local-12.6.2_560.35.03-1.x86_64.rpm
rpm -i cuda-repo-rhel9-12-6-local-12.6.2_560.35.03-1.x86_64.rpm
dnf clean all
dnf -y install cuda
dnf -y install nvidia-gds
dnf -y install nvidia-persistenced
dnf -y install nvidia-fabric-manager
systemctl enable nvidia-persistenced
systemctl enable nvidia-fabricmanager

Reboot:

reboot
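
After the reboot, a quick sanity check that the GPUs and InfiniBand ports came up (using tools installed by the CUDA and DOCA packages above):

nvidia-smi                        # all 8 H100 GPUs should be listed
ibstat | grep -E 'State|Rate'     # InfiniBand ports should show State: Active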

Load the kernel module nvidia_peermem:

modprobe nvidia_peermem

Make the nvidia_peermem kernel module load on boot:

echo "nvidia_peermem" > /etc/modules-load.d/nvidia_peermem.conf

Check:

# lsmod | grep nvidia_peermem
nvidia_peermem         24576  0
nvidia               9760768  97 nvidia_uvm,nvidia_peermem,nvidia_fs,nvidia_modeset
ib_uverbs             217088  3 nvidia_peermem,rdma_ucm,mlx5_ib

Build GDRCopy, a low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology.

Build the gdrcopy RPMs and install them:

curl -LJO https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.2.tar.gz
tar xf gdrcopy-2.4.2.tar.gz 
cd gdrcopy-2.4.2/packages/
CUDA=/usr/local/cuda-12.6 ./build-rpm-packages.sh 
dnf -y install ucx-gdrcopy ucx-cuda gdrcopy-2.4.2-1.el9.x86_64.rpm gdrcopy-devel-2.4.2-1.el9.noarch.rpm gdrcopy-kmod-2.4.2-1dkms.el9.noarch.rpm

Check that the kernel module is loaded:

# lsmod | grep gdrdrv
gdrdrv                 36864  0
nvidia               9760768  98 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset

Check that gdr_copy will be used by UCX. Example output:

ucx_info -d | grep gdr
# Memory domain: gdr_copy
#     Component: gdr_copy
#      Transport: gdr_copy

Check that IOMMU is disabled:

/usr/local/cuda/gds/tools/gdscheck -p

Check gdrcopy_sanity:

# gdrcopy_sanity 
Total: 28, Passed: 26, Failed: 0, Waived: 2

List of waived tests:
    invalidation_access_after_free_cumemalloc
    invalidation_access_after_free_vmmalloc
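
gdrcopy also ships a small copy-bandwidth microbenchmark, which can be run as an optional extra check:

gdrcopy_copybw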

Set IPv4 address for the first InfiniBand card:

nmtui
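
As a non-interactive alternative to nmtui, nmcli can configure the address as well. The interface name below is only a placeholder; check ip link for the real one:

# example only: replace ibp26s0 with the actual InfiniBand interface name
nmcli connection add type infiniband con-name ib0 ifname ibp26s0 ipv4.method manual ipv4.addresses 10.3.1.5/24
nmcli connection up ib0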

Show system topology:

nvidia-smi topo -m

Now we can benchmark all combinations of GPUs and NICs between the 2 servers.
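
To see which HCA names are available and how they map to network interfaces (useful when choosing the -d argument below), the Mellanox driver packages provide:

ibstat -l        # list HCA names, e.g. mlx5_0 ... mlx5_9
ibdev2netdev     # map each mlx5_X device to its netdev interface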

2. Benchmark perftest

Compile perftest with CUDA

curl -LJO https://github.com/linux-rdma/perftest/archive/refs/tags/24.07.0-0.44.tar.gz
tar xf perftest-24.07.0-0.44.tar.gz
cd perftest-24.07.0-0.44
./autogen.sh 
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make
cd ..

Benchmark bandwidth between the 2 nodes using the first NIC and the first GPU. Repeat for other GPU and NIC combinations.

Server alpha:

./perftest-24.07.0-0.44/ib_write_bw -d mlx5_0 --use_cuda=0 -a --report_gbits
 . . .
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 8388608    5000             98.76              98.76            0.001472
. . .

Client delta:

./perftest-24.07.0-0.44/ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.3.1.5 --report_gbits
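
To test another GPU/NIC pairing, change -d and --use_cuda together on both sides. The device numbers below are only illustrative; use nvidia-smi topo -m to pick a NIC and GPU that share a PCIe switch:

./perftest-24.07.0-0.44/ib_write_bw -d mlx5_4 --use_cuda=4 -a --report_gbits              # on alpha
./perftest-24.07.0-0.44/ib_write_bw -d mlx5_4 --use_cuda=4 -a 10.3.1.5 --report_gbits     # on delta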

3. Benchmark nccl-tests

This benchmark needs OpenMPI compiled with UCX, and UCX must be compiled with CUDA and GDRCopy support. The Mellanox InfiniBand driver ships with UCX, which was already installed above.

Check with these commands:

ucx_info -v | grep cuda
ucx_info -v | grep gdrcopy

Build OpenMPI

curl -O https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.gz
tar xf openmpi-5.0.5.tar.gz
cd openmpi-5.0.5
./configure --prefix=/root/openmpi505 --with-cuda=/usr/local/cuda/ --with-ucx
make -j
make install
cd ..
export PATH=/root/openmpi505/bin:$PATH
export LD_LIBRARY_PATH=/root/openmpi505/lib:$LD_LIBRARY_PATH
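
To confirm the freshly built OpenMPI picked up UCX and CUDA (a quick check; component names may vary between releases):

ompi_info | grep -i ucx                            # the pml/osc ucx components should be listed
ompi_info --parsable --all | grep -i cuda_support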

Build NVIDIA NCCL library

git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build
cd ..
export LD_LIBRARY_PATH=/root/nccl/build/lib/:$LD_LIBRARY_PATH

(Optional) Command to build RPMs: make pkg.redhat.build

Build nccl-tests:

curl -LJO https://github.com/NVIDIA/nccl-tests/archive/refs/tags/v2.13.11.tar.gz
tar xf nccl-tests-2.13.11.tar.gz 
cd nccl-tests-2.13.11
make MPI=1 MPI_HOME=/root/openmpi505 CUDA_HOME=/usr/local/cuda/ NCCL_HOME=/root/nccl/build/
cd ..

Set environment variables on both servers. Add these lines to the /root/.bashrc file:

export PATH=/root/openmpi505/bin:$PATH
export LD_LIBRARY_PATH=/root/openmpi505/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/root/nccl/build/lib/:$LD_LIBRARY_PATH

Log out and log back in to reload the Bash environment.

Add both servers to the /etc/hosts file:

...
10.3.1.5 alpha
10.3.1.6 delta

Generate an SSH key on alpha and copy the public key to delta:

ssh-keygen
ssh-copy-id delta

Start on 2 servers with 16 GPUs:

mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 ./nccl-tests-2.13.11/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

Result:

...
# Avg bus bandwidth    : 73.3687 
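
To confirm that NCCL really goes over GPUDirect RDMA, the same run can be repeated with NCCL debug logging enabled (environment variables are forwarded with mpirun -x); look for "GDRDMA" in the NET/IB transport lines of the output:

mpirun --allow-run-as-root -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -n 16 --host alpha:8,delta:8 ./nccl-tests-2.13.11/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1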

4. Benchmark NVIDIA HPCG

This benchmark will use the same /root/.bashrc and OpenMPI as in nccl-tests benchmark.

Download source code:

git clone https://github.com/NVIDIA/nvidia-hpcg
cd nvidia-hpcg/

Disable Grace CPU support in the build script:

# grep "^export USE_GRACE=" ./build_sample.sh
export USE_GRACE=0

Build:

NCCL_PATH=/root/nccl/build/ CUDA_PATH=/usr/local/cuda/ MPI_PATH=/root/openmpi505/ USE_CUDA=1 USE_GRACE=0 ./build_sample.sh
cd ..

Start on 2 servers with 16 GPUs:

mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 ./nvidia-hpcg/bin/xhpcg 256 256 256 1900

Result:

Final Summary::HPCG result is VALID with a GFLOP/s rating of=7630.44

5. Benchmark PyTorch ResNet50

ResNet50 is an image classification model. The benchmark number is the training speed of ResNet50 on the ImageNet dataset. Training speed is measured in images/second, not counting data loading.

Download the ImageNet dataset here: https://image-net.org/challenges/LSVRC/2012/2012-downloads.php You need to create an account, and the download takes a long time (maybe 24 hours). These 2 files:

ILSVRC2012_img_train.tar (138G)
ILSVRC2012_img_val.tar (6.3G)

mkdir /localscratch/
mkdir /localscratch/results
mkdir /localscratch/train
cd /localscratch/train
tar xf /root/ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
mkdir /localscratch/val
cd /localscratch/val
tar xf /root/ILSVRC2012_img_val.tar
curl -O https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
bash ./valprep.sh
cd /root
dnf install pip
pip install --user nvidia-dali-cuda120
pip install --user git+https://github.com/NVIDIA/dllogger#egg=dllogger
pip install --user pynvml
pip install --user torch torchvision torchaudio
git clone https://github.com/NVIDIA/DALI.git
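
Before starting the distributed run, a quick check on each node that PyTorch sees the GPUs:

python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"   # expect: True 8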

Start on 2 servers with 16 GPUs.

Server alpha command:

torchrun --master_addr=10.3.1.5 --master_port=1234 --nproc_per_node=8 --nnodes=2 --node_rank=0 /root/DALI/docs/examples/use_cases/pytorch/efficientnet/main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch

Server delta command:

torchrun --master_addr=10.3.1.5 --master_port=1234 --nproc_per_node=8 --nnodes=2 --node_rank=1 /root/DALI/docs/examples/use_cases/pytorch/efficientnet/main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch

Result:

...
Experiment ended
DLL ..... - Summary: ...... train.compute_ips : 66772.19 images/s ......
....