Enable GPUDirect RDMA and benchmark with perftest, nccl-test, NVIDIA HPCG, PyTorch ResNet50, OSU

Posted on Mon 11 November 2024 by Pavlo Khmel
  1. Setup servers
  2. Benchmark perftest
  3. Benchmark nccl-tests
  4. Benchmark NVIDIA HPCG
  5. Benchmark PyTorch ResNet50
  6. Benchmark OSU

GPUDirect RDMA (Remote Direct Memory Access) allows direct data transfers between GPUs, and between GPUs and other devices such as network cards, without the CPU having to stage the data.

Test setup:

  * Hardware: 2 x Dell PowerEdge XE9680 servers (8 x NVIDIA H100, 10 x Mellanox InfiniBand ConnectX-6 cards each)
  * OS: Rocky Linux 9.4 minimal
  * Hostnames (IP): alpha (10.3.1.5), delta (10.3.1.6)

NOTE 1: I'll run these benchmarks without a shared filesystem. In this example, all software is compiled on both servers.
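
Since there is no shared filesystem, a small helper script can keep the two servers in sync when running the same commands. This is only a sketch, assuming passwordless root SSH between alpha and delta (configured later in this post):

#!/bin/bash
# run-both.sh - run the same command locally and on the other server (hypothetical helper)
CMD="$*"
echo "=== local: $(hostname -s) ==="
bash -c "$CMD"
# pick the peer host based on the local hostname
OTHER=$([ "$(hostname -s)" = "alpha" ] && echo delta || echo alpha)
echo "=== remote: ${OTHER} ==="
ssh "${OTHER}" "$CMD"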

1. Setup servers

Disable IOMMU. This option can appear under different names in the BIOS: AMD-Vi, VT-d, IOMMU, or SR-IOV.

Disable ACS. The procedure differs between servers and vendors; check the server manual.

I followed this Dell manual: https://www.dell.com/support/manuals/en-us/poweredge-xe9680/xe9680_ism_pub/processor-settings?guid=guid-71fdb36a-23ad-4720-b453-7347fd93e697&lang=en-us

On the Dell XE9680 with Xeon Platinum 8470, disabling Virtualization Technology disables ACS in the BIOS, but ACS can still show up as enabled in the OS.

Check command (only root user):

# lspci -vvv | grep ACSCtl
 ACSCtl:    SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
 ACSCtl:    SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
 ACSCtl:    SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
 ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
. . .

ACS is still enabled on a device if its line shows a plus sign, e.g. "SrcValid+".
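
A quick way to count how many devices still report ACS source validation as enabled (a simple one-liner; after the acs-disable service below has run it should print 0):

lspci -vvv 2>/dev/null | grep -c "SrcValid+"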

Create script file /usr/local/sbin/acs-disable to disable ACS:

#!/bin/bash
# Disable ACS by clearing the ACS control register on every PCI device that supports it.
for BDF in $(lspci -d "*:*:*" | awk '{print $1}'); do
  # skip devices that don't expose the ACS extended capability
  sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
  if [ $? -ne 0 ]; then
    continue
  fi
  # clear all ACS control bits (SrcValid, TransBlk, ...)
  sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
done

Make executable:

chmod 700 /usr/local/sbin/acs-disable

Disable ACS on boot. Create the service file /etc/systemd/system/acs-disable.service:

[Unit]
Description=ACS disable
After=default.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/acs-disable

[Install]
WantedBy=default.target

Reload systemd, then enable and start the service:

systemctl daemon-reload
systemctl enable acs-disable
systemctl start acs-disable

Check again (only root user):

# lspci -vvv | grep ACSCtl
 ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
 ACSCtl:    SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
. . .

To avoid mismatches between the drivers and the Linux kernel release, I installed the kernel-related packages from the install ISO image.

Mount the DVD ISO image:

mount /root/Rocky-9.4-x86_64-dvd.iso /mnt/ -o loop

Create the repository file /etc/yum.repos.d/iso.repo:

[baseos-iso]
name=Rocky Linux $releasever - BaseOS ISO
baseurl=file:///mnt/BaseOS/
gpgcheck=0
enabled=0

[appstream-iso]
name=Rocky Linux $releasever - AppStream ISO
baseurl=file:///mnt/AppStream/
gpgcheck=0
enabled=0

Install the kernel packages from the ISO repositories:

dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso install kernel-devel
dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso install kernel-abi-stablelists kernel-rpm-macros kernel-core kernel-modules kernel-modules-core kernel-modules-extra kernel-devel-matched
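
A quick sanity check: the installed kernel-devel version should match the kernel that will be running after the reboot below, otherwise the DKMS builds for the NVIDIA and DOCA modules can fail:

# the two versions should match
uname -r
rpm -q kernel-devel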

Install development tools:

dnf -y groupinstall 'Development Tools'
dnf -y install epel-release
dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso,epel install dkms
dnf -y install pciutils-devel

Set the hostname (alpha on the first server, delta on the second) and disable the firewall:

hostnamectl set-hostname alpha
systemctl disable firewalld
systemctl stop firewalld

Disable SELinux on Rocky 9.x:

grubby --update-kernel ALL --args selinux=0

And/or edit the file /etc/selinux/config:

# grep '^SELINUX=' /etc/selinux/config
SELINUX=disabled
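
After the reboot later in this section, the SELinux state can be verified:

# should print "Disabled"
getenforce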

Install the InfiniBand drivers:

curl -LJO https://www.mellanox.com/downloads/DOCA/DOCA_v2.8.0/host/doca-host-2.8.0-204000_24.07_rhel94.x86_64.rpm
rpm -i doca-host-2.8.0-204000_24.07_rhel94.x86_64.rpm
dnf clean all
dnf -y --enablerepo=crb install doca-all
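
Optionally, verify that the InfiniBand links come up after the driver installation (assuming ibstat from infiniband-diags is pulled in by doca-all):

ibstat | grep -E "CA '|State:|Rate:"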

Install CUDA and the related NVIDIA packages:

curl -LJO https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda-repo-rhel9-12-6-local-12.6.2_560.35.03-1.x86_64.rpm
rpm -i cuda-repo-rhel9-12-6-local-12.6.2_560.35.03-1.x86_64.rpm
dnf clean all
dnf -y install cuda
dnf -y install nvidia-gds
dnf -y install nvidia-persistenced
dnf -y install nvidia-fabric-manager
systemctl enable nvidia-persistenced
systemctl enable nvidia-fabricmanager

Reboot:

reboot
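
After the reboot, a quick check that the driver and all 8 GPUs came up and that the fabric manager is running:

nvidia-smi --query-gpu=index,name,driver_version --format=csv
systemctl status nvidia-fabricmanager --no-pager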

Load the nvidia_peermem kernel module:

modprobe nvidia_peermem

Make the nvidia_peermem kernel module load on boot:

echo "nvidia_peermem" > /etc/modules-load.d/nvidia_peermem.conf

Check:

# lsmod | grep nvidia_peermem
nvidia_peermem         24576  0
nvidia               9760768  97 nvidia_uvm,nvidia_peermem,nvidia_fs,nvidia_modeset
ib_uverbs             217088  3 nvidia_peermem,rdma_ucm,mlx5_ib

Build GDRCopy, a low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology, and install the resulting RPMs:

curl -LJO https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.2.tar.gz
tar xf gdrcopy-2.4.2.tar.gz 
cd gdrcopy-2.4.2/packages/
CUDA=/usr/local/cuda-12.6 ./build-rpm-packages.sh 
dnf -y install ucx-gdrcopy ucx-cuda gdrcopy-2.4.2-1.el9.x86_64.rpm gdrcopy-devel-2.4.2-1.el9.noarch.rpm gdrcopy-kmod-2.4.2-1dkms.el9.noarch.rpm

Check that the gdrdrv kernel module is loaded:

# lsmod | grep gdrdrv
gdrdrv                 36864  0
nvidia               9760768  98 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset

Check that gdr_copy is available to UCX; example output:

ucx_info -d | grep gdr
# Memory domain: gdr_copy
#     Component: gdr_copy
#      Transport: gdr_copy

Check that IOMMU is disabled.

/usr/local/cuda/gds/tools/gdscheck -p

Check gdrcopy_sanity:

# gdrcopy_sanity 
Total: 28, Passed: 26, Failed: 0, Waived: 2

List of waived tests:
    invalidation_access_after_free_cumemalloc
    invalidation_access_after_free_vmmalloc

Set IPv4 address for the first InfiniBand card:

nmtui
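
The same can be scripted with nmcli instead of nmtui. This is only a sketch: the interface name ibp26s0 is an assumption, check the real name with "ip link" on each server, and use 10.3.1.6/24 on delta:

nmcli connection add type infiniband ifname ibp26s0 con-name ib0 ipv4.method manual ipv4.addresses 10.3.1.5/24
nmcli connection up ib0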

Show system topology:

# nvidia-smi topo -m
    GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X  NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX PIX NODE    NODE    NODE    SYS SYS SYS SYS SYS 0,2,4,6,8,10    0 N/A
GPU1    NV18     X  NV18    NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX NODE    NODE    SYS SYS SYS SYS SYS 0,2,4,6,8,10    0 N/A
GPU2    NV18    NV18     X  NV18    NV18    NV18    NV18    NV18    NODE    NODE    NODE    PIX NODE    SYS SYS SYS SYS SYS 0,2,4,6,8,10    0 N/A
GPU3    NV18    NV18    NV18     X  NV18    NV18    NV18    NV18    NODE    NODE    NODE    NODE    PIX SYS SYS SYS SYS SYS 0,2,4,6,8,10    0 N/A
GPU4    NV18    NV18    NV18    NV18     X  NV18    NV18    NV18    SYS SYS SYS SYS SYS PIX PIX NODE    NODE    NODE    1,3,5,7,9,11    1 N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X  NV18    NV18    SYS SYS SYS SYS SYS NODE    NODE    PIX NODE    NODE    1,3,5,7,9,11    1 N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X  NV18    SYS SYS SYS SYS SYS NODE    NODE    NODE    PIX NODE    1,3,5,7,9,11    1 N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X  SYS SYS SYS SYS SYS NODE    NODE    NODE    NODE    PIX 1,3,5,7,9,11    1 N/A
NIC0    PIX NODE    NODE    NODE    SYS SYS SYS SYS  X  PIX NODE    NODE    NODE    SYS SYS SYS SYS SYS 
NIC1    PIX NODE    NODE    NODE    SYS SYS SYS SYS PIX  X  NODE    NODE    NODE    SYS SYS SYS SYS SYS 
NIC2    NODE    PIX NODE    NODE    SYS SYS SYS SYS NODE    NODE     X  NODE    NODE    SYS SYS SYS SYS SYS 
NIC3    NODE    NODE    PIX NODE    SYS SYS SYS SYS NODE    NODE    NODE     X  NODE    SYS SYS SYS SYS SYS 
NIC4    NODE    NODE    NODE    PIX SYS SYS SYS SYS NODE    NODE    NODE    NODE     X  SYS SYS SYS SYS SYS 
NIC5    SYS SYS SYS SYS PIX NODE    NODE    NODE    SYS SYS SYS SYS SYS  X  PIX NODE    NODE    NODE 
NIC6    SYS SYS SYS SYS PIX NODE    NODE    NODE    SYS SYS SYS SYS SYS PIX  X  NODE    NODE    NODE 
NIC7    SYS SYS SYS SYS NODE    PIX NODE    NODE    SYS SYS SYS SYS SYS NODE    NODE     X  NODE    NODE 
NIC8    SYS SYS SYS SYS NODE    NODE    PIX NODE    SYS SYS SYS SYS SYS NODE    NODE    NODE     X  NODE 
NIC9    SYS SYS SYS SYS NODE    NODE    NODE    PIX SYS SYS SYS SYS SYS NODE    NODE    NODE    NODE     X  

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9

The topology matters for the benchmarks: it shows which GPU and NIC combinations should be the fastest.

Now we can benchmark GPU and NIC combinations between the two servers.

2. Benchmark perftest

Compile perftest with CUDA support:

curl -LJO https://github.com/linux-rdma/perftest/archive/refs/tags/24.07.0-0.44.tar.gz
tar xf perftest-24.07.0-0.44.tar.gz
cd perftest-24.07.0-0.44
./autogen.sh 
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make
cd ..

Benchmark the bandwidth between the two nodes with the first NIC and the first GPU. Other GPU and NIC combinations can be benchmarked the same way (see the sweep sketch after the client command below).

Server alpha:

./perftest-24.07.0-0.44/ib_write_bw -d mlx5_0 --use_cuda=0 -a --report_gbits
 . . .
 #bytes     #iterations    BW peak[Gb/sec]    BW average[Gb/sec]   MsgRate[Mpps]
 8388608    5000             98.76              98.76        0.001472
. . .

Client delta:

./perftest-24.07.0-0.44/ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.3.1.5 --report_gbits
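
Other combinations can be swept with a small loop on the client side. This is a sketch: the server command on alpha has to be restarted with the matching -d and --use_cuda values before each iteration, and the NIC/GPU pairs below are taken from the PIX entries in the topology output above:

# each entry is "NIC GPU" (PIX pairs from 'nvidia-smi topo -m')
for PAIR in "mlx5_0 0" "mlx5_2 1" "mlx5_3 2" "mlx5_4 3"; do
  set -- $PAIR
  echo "=== NIC $1 / GPU $2 ==="
  ./perftest-24.07.0-0.44/ib_write_bw -d $1 --use_cuda=$2 -a 10.3.1.5 --report_gbits
done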

3. Benchmark nccl-tests

This benchmark needs OpenMPI compiled with UCX, and UCX must be built with CUDA and GDRCopy support. The Mellanox InfiniBand driver ships with UCX, which was already installed above.

Check with these commands:

ucx_info -v | grep cuda
ucx_info -v | grep gdrcopy

Build OpenMPI:

curl -O https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.gz
tar xf openmpi-5.0.5.tar.gz
cd openmpi-5.0.5
./configure --prefix=/root/openmpi505 --with-cuda=/usr/local/cuda/ --with-ucx
make -j
make install
cd ..
export PATH=/root/openmpi505/bin:$PATH
export LD_LIBRARY_PATH=/root/openmpi505/lib:$LD_LIBRARY_PATH

Build NVIDIA NCCL library:

git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build
cd ..
export LD_LIBRARY_PATH=/root/nccl/build/lib/:$LD_LIBRARY_PATH

(Optional) Command to build RPMs: make pkg.redhat.build

Build nccl-tests:

curl -LJO https://github.com/NVIDIA/nccl-tests/archive/refs/tags/v2.13.11.tar.gz
tar xf nccl-tests-2.13.11.tar.gz 
cd nccl-tests-2.13.11
make MPI=1 MPI_HOME=/root/openmpi505 CUDA_HOME=/usr/local/cuda/ NCCL_HOME=/root/nccl/build/
cd ..

Set environment variables on both servers. Add these lines to the /root/.bashrc file:

export PATH=/root/openmpi505/bin:$PATH
export LD_LIBRARY_PATH=/root/openmpi505/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/root/nccl/build/lib/:$LD_LIBRARY_PATH

Log out and log back in to reload the Bash environment.

Add both servers to the /etc/hosts file:

...
10.3.1.5 alpha
10.3.1.6 delta

On alpha, generate an SSH key and copy the public key to delta:

ssh-keygen
ssh-copy-id delta

Start on 2 servers with 16 GPUs:

mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 ./nccl-tests-2.13.11/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

Result:

...
# Avg bus bandwidth    : 73.3687 
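
NCCL selects the closest HCA for each GPU automatically. To verify or restrict the NIC selection, NCCL environment variables can be passed through mpirun; a sketch, where the HCA list is an assumption based on the PIX entries in the NIC legend above:

mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 \
  -x NCCL_DEBUG=INFO \
  -x NCCL_IB_HCA=mlx5_0,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_7,mlx5_8,mlx5_9 \
  ./nccl-tests-2.13.11/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1

NCCL_DEBUG=INFO prints which NICs and transports (IB, GPUDirect RDMA) are actually used for each GPU.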

4. Benchmark NVIDIA HPCG

This benchmark uses the same /root/.bashrc settings and OpenMPI build as the nccl-tests benchmark.

Download source code:

git clone https://github.com/NVIDIA/nvidia-hpcg
cd nvidia-hpcg/

Disable Grace CPU support. Edit build_sample.sh so that USE_GRACE is set to 0 (or use the one-liner below):

# grep "^export USE_GRACE=" ./build_sample.sh
export USE_GRACE=0
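
The same edit can be done non-interactively, for example:

sed -i 's/^export USE_GRACE=.*/export USE_GRACE=0/' ./build_sample.sh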

Build:

NCCL_PATH=/root/nccl/build/ CUDA_PATH=/usr/local/cuda/ MPI_PATH=/root/openmpi505/ USE_CUDA=1 USE_GRACE=0 ./build_sample.sh
cd ..

Start on 2 servers with 16 GPUs:

mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 ./nvidia-hpcg/bin/xhpcg 256 256 256 1900

Result:

Final Summary::HPCG result is VALID with a GFLOP/s rating of=7630.44

5. Benchmark PyTorch ResNet50

ResNet50 is an image classification model. The benchmark number is the training speed of ResNet50 on the ImageNet dataset, measured in images per second, not counting data loading.

Download the ImageNet dataset from https://image-net.org/challenges/LSVRC/2012/2012-downloads.php. You need to create an account, and the download can take a long time (maybe 24 hours). These 2 files are needed:

ILSVRC2012_img_train.tar (138G)
ILSVRC2012_img_val.tar (6.3G)

mkdir /localscratch/
mkdir /localscratch/results
mkdir /localscratch/train
cd /localscratch/train
tar xf /root/ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
mkdir /localscratch/val
cd /localscratch/val
tar xf /root/ILSVRC2012_img_val.tar
curl -O https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
bash ./valprep.sh
cd /root
dnf install pip
pip install --user nvidia-dali-cuda120
pip install --user git+https://github.com/NVIDIA/dllogger#egg=dllogger
pip install --user pynvml
pip install --user torch torchvision torchaudio
git clone https://github.com/NVIDIA/DALI.git
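
Before starting the distributed run, a quick sanity check that PyTorch sees all 8 GPUs and that the NCCL backend is available (run on both servers):

python3 -c "import torch; print('GPUs:', torch.cuda.device_count()); print('NCCL:', torch.distributed.is_nccl_available())"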

Start on 2 servers with 16 GPUs:

Server alpha command:

torchrun --master_addr=10.3.1.5 --master_port=1234 --nproc_per_node=8 --nnodes=2 --node_rank=0 /root/DALI/docs/examples/use_cases/pytorch/efficientnet/main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch

Server delta command:

torchrun --master_addr=10.3.1.5 --master_port=1234 --nproc_per_node=8 --nnodes=2 --node_rank=1 /root/DALI/docs/examples/use_cases/pytorch/efficientnet/main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch

Result:

...
Experiment ended
DLL ..... - Summary: ...... train.compute_ips : 66772.19 images/s ......
....

6. Benchmark OSU

Download and build:

export PATH=/usr/local/cuda/bin:$PATH
curl -O https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-7.5.tar.gz
tar xf osu-micro-benchmarks-7.5.tar.gz 
cd osu-micro-benchmarks-7.5
./configure CC=`which mpicc` CXX=`which mpicxx` --enable-cuda --with-cuda-include=/usr/local/cuda/include --with-cuda-libpath=/usr/local/cuda/lib64
make

I'm running different GPU and NIC combinations according to the 'nvidia-smi topo -m' output:

Single direction, GPU 0, NIC 0 (PIX):

# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_0:1 -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D
...
# Size      Bandwidth (MB/s)
...
4194304             12339.46

Single direction, One GPU, Two NICs 0,1 (PIX, PIX):

# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_0:1,mlx5_1:1 -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D
...
# Size      Bandwidth (MB/s)
...
4194304             24667.45

Single direction, One GPU, Two NICs 3,4 (NODE, NODE):

# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_3:1,mlx5_4:1 -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D
...
# Size      Bandwidth (MB/s)
...
4194304             24592.39

Single direction, One GPU, Two NICs 7,9 (SYS, SYS):

# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0 -x UCX_NET_DEVICES=mlx5_9:1,mlx5_7:1 -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D
...
# Size      Bandwidth (MB/s)
...
4194304             22855.04

Single direction, auto GPU and NICs:

# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bw D D
...
# Size      Bandwidth (MB/s)
...
4194304             24667.16

Bi-Directional, auto GPU and NICs:

# mpirun --allow-run-as-root -x LD_LIBRARY_PATH -H alpha:1,delta:1 ./osu-micro-benchmarks-7.5/c/mpi/pt2pt/standard/osu_bibw D D

# OSU MPI-CUDA Bi-Directional Bandwidth Test v7.5
# Datatype: MPI_CHAR.
# Size      Bandwidth (MB/s)
1                       3.98
2                       7.98
4                      16.25
8                      32.46
16                     64.65
32                    129.24
64                    237.45
128                   493.33
256                   915.73
512                  1703.28
1024                 2993.79
2048                 5184.55
4096                 7407.15
8192                14501.31
16384               18774.54
32768               37192.35
65536               42738.37
131072              45578.91
262144              47329.40
524288              48296.48
1048576             48789.81
2097152             49016.33
4194304             49155.11

Auto-select does a good job. The PIX combination is the fastest, followed by NODE; SYS is the slowest.