- Set up servers
- Benchmark perftest
- Benchmark nccl-tests
- Benchmark NVIDIA HPCG
- Benchmark PyTorch ResNet50
GPUDirect RDMA (Remote Direct Memory Access) allows data to be transferred directly between GPUs, and between GPUs and other devices such as network cards, without the CPU having to handle the transfers.
Test setup:
- Hardware: 2 x Dell PowerEdge XE9680 servers (8 x NVIDIA H100, 10 x Mellanox InfiniBand ConnectX-6 cards)
- OS: Rocky Linux 9.4 minimal
- Hostnames (IP): alpha (10.3.1.5), delta (10.3.1.6)
NOTE 1: I'll run these benchmarks without a shared filesystem. In this example, all software is compiled on both servers.
NOTE 2: Each benchmark will be run twice:
Without GPUDirect RDMA:
rmmod nvidia_peermem; rmmod gdrdrv
With GPUDirect RDMA:
modprobe nvidia_peermem; modprobe gdrdrv
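Before each run, a quick way to confirm which mode is active (both modules listed means the GPUDirect RDMA path is enabled, no output means it is disabled):
lsmod | grep -E 'nvidia_peermem|gdrdrv'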
1. Set up servers
Disable IOMMU. This option can have different names in the BIOS: AMD-Vi, VT-d, IOMMU, or SR-IOV.
Disable ACS. The procedure differs between servers and vendors; check the server manual.
I followed this Dell manual: https://www.dell.com/support/manuals/en-us/poweredge-xe9680/xe9680_ism_pub/processor-settings?guid=guid-71fdb36a-23ad-4720-b453-7347fd93e697&lang=en-us
On a Dell XE9680 with Xeon Platinum 8470, disabling Virtualization Technology disables ACS in the BIOS, but ACS will still show as enabled in the OS.
Check command:
# lspci -vvv | grep ACSCtl
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid+ TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
. . .
ACS is enabled if any line contains the plus sign "SrcValid+".
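A quick way to count how many PCI functions still have ACS source validation enabled (this should drop to zero after the steps below):
lspci -vvv 2>/dev/null | grep -c 'SrcValid+'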
Create script file /usr/local/sbin/acs-disable to disable ACS:
#!/bin/bash
for BDF in $(lspci -d "*:*:*" | awk '{print $1}'); do
    # skip devices that do not support ACS
    sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        continue
    fi
    sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
done
Make executable:
chmod 700 /usr/local/sbin/acs-disable
Disable ACS on boot. Create the service file /etc/systemd/system/acs-disable.service:
[Unit]
Description=ACS disable
After=default.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/acs-disable
[Install]
WantedBy=default.target
Reload systemd, then enable and start the service:
systemctl daemon-reload
systemctl enable acs-disable
systemctl start acs-disable
Check again:
# lspci -vvv | grep ACSCtl
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
. . .
To avoid issues with mismatched drivers and Linux kernel releases, I install the kernel-related packages from the installation DVD ISO image.
Mount DVD iso image:
mount /root/Rocky-9.4-x86_64-dvd.iso /mnt/ -o loop
Create the repository file /etc/yum.repos.d/iso.repo:
[baseos-iso]
name=Rocky Linux $releasever - BaseOS ISO
baseurl=file:///mnt/BaseOS/
gpgcheck=0
enabled=0
[appstream-iso]
name=Rocky Linux $releasever - AppStream ISO
baseurl=file:///mnt/AppStream/
gpgcheck=0
enabled=0
dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso install kernel-devel
dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso install kernel-abi-stablelists kernel-rpm-macros kernel-core kernel-modules kernel-modules-core kernel-modules-extra kernel-devel-matched
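To confirm that the installed kernel-devel actually matches the running kernel (the whole point of installing from the ISO), compare the two versions:
uname -r
rpm -q kernel-devel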
Install development tools:
dnf -y groupinstall 'Development Tools'
dnf -y install epel-release
dnf -y --disablerepo=* --enablerepo=baseos-iso,appstream-iso,epel install dkms
dnf -y install pciutils-devel
Set the hostname and disable the firewall:
hostnamectl set-hostname alpha
systemctl disable firewalld
systemctl stop firewalld
Disable SELinux on Rocky 9.x:
grubby --update-kernel ALL --args selinux=0
And/or edit the file /etc/selinux/config:
# grep '^SELINUX=' /etc/selinux/config
SELINUX=disabled
Install InfiniBand drivers
curl -LJO https://www.mellanox.com/downloads/DOCA/DOCA_v2.8.0/host/doca-host-2.8.0-204000_24.07_rhel94.x86_64.rpm
rpm -i doca-host-2.8.0-204000_24.07_rhel94.x86_64.rpm
dnf clean all
dnf -y --enablerepo=crb install doca-all
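After installing the driver you can verify that the InfiniBand ports come up, for example with ibstat from the installed OFED/DOCA tools (ports show State: Active once the subnet manager sees them):
ibstat | grep -E 'CA |State|Rate'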
Install CUDA
curl -LJO https://developer.download.nvidia.com/compute/cuda/12.6.2/local_installers/cuda-repo-rhel9-12-6-local-12.6.2_560.35.03-1.x86_64.rpm
rpm -i cuda-repo-rhel9-12-6-local-12.6.2_560.35.03-1.x86_64.rpm
dnf clean all
dnf -y install cuda
dnf install nvidia-gds
dnf -y install nvidia-persistenced
dnf -y install nvidia-fabric-manager
systemctl enable nvidia-persistenced
systemctl enable nvidia-fabricmanager
Reboot:
reboot
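After the reboot, confirm that the NVIDIA driver is loaded, all 8 GPUs are visible and the services are running:
nvidia-smi
systemctl status nvidia-persistenced nvidia-fabricmanager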
Load the kernel module nvidia_peermem:
modprobe nvidia_peermem
Make the kernel module nvidia_peermem load on boot:
echo "nvidia_peermem" > /etc/modules-load.d/nvidia_peermem.conf
Check:
# lsmod | grep nvidia_peermem
nvidia_peermem 24576 0
nvidia 9760768 97 nvidia_uvm,nvidia_peermem,nvidia_fs,nvidia_modeset
ib_uverbs 217088 3 nvidia_peermem,rdma_ucm,mlx5_ib
Build GDRCopy, a low-latency GPU memory copy library based on NVIDIA GPUDirect RDMA technology.
Build gdrcopy and install RPMs
curl -LJO https://github.com/NVIDIA/gdrcopy/archive/refs/tags/v2.4.2.tar.gz
tar xf gdrcopy-2.4.2.tar.gz
cd gdrcopy-2.4.2/packages/
CUDA=/usr/local/cuda-12.6 ./build-rpm-packages.sh
dnf -y install ucx-gdrcopy ucx-cuda gdrcopy-2.4.2-1.el9.x86_64.rpm gdrcopy-devel-2.4.2-1.el9.noarch.rpm gdrcopy-kmod-2.4.2-1dkms.el9.noarch.rpm
Check that kernel module is loaded:
# lsmod | grep gdrdrv
gdrdrv 36864 0
nvidia 9760768 98 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
Check that gdr_copy will be used by UCX; example output:
ucx_info -d | grep gdr
# Memory domain: gdr_copy
# Component: gdr_copy
# Transport: gdr_copy
Check that IOMMU is disabled:
/usr/local/cuda/gds/tools/gdscheck -p
Check gdrcopy_sanity:
# gdrcopy_sanity
Total: 28, Passed: 26, Failed: 0, Waived: 2
List of waived tests:
invalidation_access_after_free_cumemalloc
invalidation_access_after_free_vmmalloc
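GDRCopy also ships a small bandwidth tool, which is an easy sanity check that the gdrdrv path performs as expected (it defaults to the first GPU; see gdrcopy_copybw -h for options):
gdrcopy_copybw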
Set an IPv4 address on the first InfiniBand card:
nmtui
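If you prefer a non-interactive setup, the same can be done with nmcli. A sketch for alpha, assuming the first InfiniBand interface is named ibp26s0 (a placeholder; get the real name from "ip link"):
nmcli connection add type infiniband con-name ib0 ifname ibp26s0 ipv4.method manual ipv4.addresses 10.3.1.5/24
nmcli connection up ib0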
Show system topology:
nvidia-smi topo -m
Now we can benchmark all combinations of GPUs and NICs between the two servers.
2. Benchmark perftest
Compile perftest with CUDA
curl -LJO https://github.com/linux-rdma/perftest/archive/refs/tags/24.07.0-0.44.tar.gz
tar xf perftest-24.07.0-0.44.tar.gz
cd perftest-24.07.0-0.44
./autogen.sh
./configure CUDA_H_PATH=/usr/local/cuda/include/cuda.h
make
cd ..
Benchmark bandwidth between the two nodes using the first NIC and the first GPU, then repeat for the other GPU and NIC combinations (see the loop sketch after the client command below).
Server alpha:
./perftest-24.07.0-0.44/ib_write_bw -d mlx5_0 --use_cuda=0 -a --report_gbits
. . .
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
8388608 5000 98.76 98.76 0.001472
. . .
Client delta:
./perftest-24.07.0-0.44/ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.3.1.5 --report_gbits
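To cover every NIC/GPU pair without typing each command by hand, both sides can loop in the same order. A minimal server-side sketch; the NIC indices 0-9 and GPU indices 0-7 are assumptions for this setup (list the real devices with ibv_devices), and the client on delta must run the same loop with 10.3.1.5 appended:
#!/bin/bash
# Sweep all NIC/GPU pairs; ib_write_bw exits after each client connection,
# so the next iteration starts a fresh server for the next pair.
for NIC in $(seq 0 9); do
    for GPU in $(seq 0 7); do
        echo "=== mlx5_${NIC} / GPU ${GPU} ==="
        ./perftest-24.07.0-0.44/ib_write_bw -d mlx5_${NIC} --use_cuda=${GPU} -a --report_gbits
    done
done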
3. Benchmark nccl-tests
This benchmark needs OpenMPI compiled with UCX, and UCX must be compiled with CUDA and GDRCopy support. The Mellanox InfiniBand driver ships with UCX, which was already installed.
Check with these commands:
ucx_info -v | grep cuda
ucx_info -v | grep gdrcopy
Build OpenMPI
curl -O https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.5.tar.gz
tar xf openmpi-5.0.5.tar.gz
cd openmpi-5.0.5
./configure --prefix=/root/openmpi505 --with-cuda=/usr/local/cuda/ --with-ucx
make -j
make install
cd ..
export PATH=/root/openmpi505/bin:$PATH
export LD_LIBRARY_PATH=/root/openmpi505/lib:$LD_LIBRARY_PATH
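To confirm that the freshly built OpenMPI picked up UCX and CUDA support:
ompi_info | grep -i ucx
ompi_info | grep -i cuda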
Build NVIDIA NCCL library
git clone https://github.com/NVIDIA/nccl.git
cd nccl
make -j src.build
cd ..
export LD_LIBRARY_PATH=/root/nccl/build/lib/:$LD_LIBRARY_PATH
(Optional) Command to build RPMs: make pkg.redhat.build
Build nccl-tests:
curl -LJO https://github.com/NVIDIA/nccl-tests/archive/refs/tags/v2.13.11.tar.gz
tar xf nccl-tests-2.13.11.tar.gz
cd nccl-tests-2.13.11
make MPI=1 MPI_HOME=/root/openmpi505 CUDA_HOME=/usr/local/cuda/ NCCL_HOME=/root/nccl/build/
cd ..
Set environment variables on both servers. Add these lines to the /root/.bashrc file:
export PATH=/root/openmpi505/bin:$PATH
export LD_LIBRARY_PATH=/root/openmpi505/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/root/nccl/build/lib/:$LD_LIBRARY_PATH
Log out and log back in to reload the Bash environment.
Add both servers to the /etc/hosts file:
...
10.3.1.5 alpha
10.3.1.6 delta
Generate an SSH key on the alpha server and copy the public key to delta:
ssh-keygen
ssh-copy-id delta
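Verify that passwordless SSH works before starting MPI jobs (it should print delta without asking for a password):
ssh delta hostname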
Start on 2 servers with 16 GPUs:
mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 ./nccl-tests-2.13.11/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
Result:
...
# Avg bus bandwidth : 73.3687
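To double-check which path NCCL actually uses in the with/without GPUDirect RDMA runs, the test can be repeated with debug logging; a sketch (the exact log wording varies between NCCL versions; look for GDRDMA in the NET/IB lines):
mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 -x NCCL_DEBUG=INFO ./nccl-tests-2.13.11/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 2>&1 | grep -i gdr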
4. Benchmark NVIDIA HPCG
This benchmark uses the same /root/.bashrc and OpenMPI setup as the nccl-tests benchmark.
Download source code:
git clone https://github.com/NVIDIA/nvidia-hpcg
cd nvidia-hpcg/
Disable Grace CPU support in the build script:
# grep "^export USE_GRACE=" ./build_sample.sh
export USE_GRACE=0
Build:
NCCL_PATH=/root/nccl/build/ CUDA_PATH=/usr/local/cuda/ MPI_PATH=/root/openmpi505/ USE_CUDA=1 USE_GRACE=0 ./build_sample.sh
cd ..
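Before launching, it can be useful to confirm that the binary resolved the intended MPI, NCCL and CUDA libraries:
ldd ./nvidia-hpcg/bin/xhpcg | grep -E 'libmpi|libnccl|cuda'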
Start on 2 servers with 16 GPUs:
mpirun --allow-run-as-root -n 16 --host alpha:8,delta:8 ./nvidia-hpcg/bin/xhpcg 256 256 256 1900
Result:
Final Summary::HPCG result is VALID with a GFLOP/s rating of=7630.44
5. Benchmark PyTorch ResNet50
ResNet50 is an image classification model. The benchmark number is the training speed of ResNet50 on the ImageNet dataset. Training speed is measured in images/second, not counting data loading.
Download the ImageNet dataset here: https://image-net.org/challenges/LSVRC/2012/2012-downloads.php You need to create an account, and the download takes a long time (maybe 24 hours). Download these two files:
ILSVRC2012_img_train.tar (138G)
ILSVRC2012_img_val.tar (6.3G)
mkdir /localscratch/
mkdir /localscratch/results
mkdir /localscratch/train
cd /localscratch/train
tar xf /root/ILSVRC2012_img_train.tar
find . -name "*.tar" | while read NAME ; do mkdir -p "${NAME%.tar}"; tar -xvf "${NAME}" -C "${NAME%.tar}"; rm -f "${NAME}"; done
mkdir /localscratch/val
cd /localscratch/val
tar xf /root/ILSVRC2012_img_val.tar
curl -O https://raw.githubusercontent.com/soumith/imagenetloader.torch/master/valprep.sh
bash ./valprep.sh
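A quick sanity check that the extraction produced the expected layout (ImageNet-1k has 1000 class directories in both the training and the validation set):
ls /localscratch/train | wc -l
ls /localscratch/val | wc -l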
cd /root
dnf -y install python3-pip
pip install --user nvidia-dali-cuda120
pip install --user git+https://github.com/NVIDIA/dllogger#egg=dllogger
pip install --user pynvml
pip install --user torch torchvision torchaudio
git clone https://github.com/NVIDIA/DALI.git
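Before launching the distributed run, confirm on each server that PyTorch sees the GPUs (it should print True and 8):
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'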
Start on 2 servers with 16 GPUs.
Server alpha command:
torchrun --master_addr=10.3.1.5 --master_port=1234 --nproc_per_node=8 --nnodes=2 --node_rank=0 /root/DALI/docs/examples/use_cases/pytorch/efficientnet/main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch
Server delta command:
torchrun --master_addr=10.3.1.5 --master_port=1234 --nproc_per_node=8 --nnodes=2 --node_rank=1 /root/DALI/docs/examples/use_cases/pytorch/efficientnet/main.py --amp --static-loss-scale 128 --batch-size 128 --epochs 1 --prof 1000 --no-checkpoints --training-only --data-backend synthetic --workspace /localscratch/results --report-file bench_report_synthetic.json /localscratch
Result:
...
Experiment ended
DLL ..... - Summary: ...... train.compute_ips : 66772.19 images/s ......
....