Server "server-a" PowerEdge XE9680 is hanging on boot with the message:
"UEFI0031: PCIe downtrain is detected on PCIe Device Slot 36, (Bus:0x5B Dev:0x02 F:0x00). Expected link width: x16 and actual link width : x4"
Pressing F1 allows the boot to continue.
The hardware issue will be resolved with Dell support, but in the meantime it causes a problem with GPUDirect RDMA. The server has 8 x GPUs and 10 x InfiniBand cards.
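As a quick inventory check (a minimal sketch; it assumes nvidia-smi and the InfiniBand userspace tools are installed), the GPU and HCA counts can be confirmed from the OS:
# nvidia-smi -L | wc -l
# ibstat -l | wc -l
The first command should count 8 GPUs and the second should list 10 HCAs.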
nccl-test benchmark build guide: https://pavlokhmel.com/enable-gpudirect-rdma-and-benchmark-with-perftest-nccl-test-nvidia-hpcg-pytorch-resnet50-osu.html
nccl-test shows slow and inconsistent results across runs (with GPUDirect enabled):
mpirun -n 16 --host server-a:8,server-b:8 ./nccl-tests-2.16.7/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
# Avg bus bandwidth : 4.22649
# Avg bus bandwidth : 8.40644
# Avg bus bandwidth : 29.2344
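To see which HCAs NCCL actually selects, the same run can be repeated with NCCL debug output enabled (a sketch; NCCL_DEBUG=INFO is a standard NCCL environment variable and -x is the Open MPI option for exporting it to all ranks):
mpirun -x NCCL_DEBUG=INFO -n 16 --host server-a:8,server-b:8 ./nccl-tests-2.16.7/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
The NET/IB lines in the log show which mlx5 devices each rank is using.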
Find the bus address for "Slot 36":
# dmidecode -t slot | grep -B 2 -A 12 "Slot 36"
Handle 0x0904, DMI type 9, 24 bytes
System Slot Information
Designation: PCIe Slot 36
Type: PCI Express 5
Data Bus Width: 16x or x16
Current Usage: In Use
Length: Short
ID: 36
Characteristics:
3.3 V is provided
PME signal is supported
Bus Address: 0000:5e:00.0
Data Bus Width (Base): 0
Peer Devices: 0
Height: Not applicable
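To map every slot to its bus address at once, the same tool can be filtered (a sketch):
# dmidecode -t slot | grep -E 'Designation|Bus Address'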
Find the device with bus address 0000:5e:00.0:
# lspci -D | grep '0000:5e:00.0'
0000:5e:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
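The downtrain can also be confirmed from the OS (a sketch; on a degraded link, LnkSta usually reports Width x4, often marked "downgraded", while LnkCap reports x16):
# lspci -vv -s 0000:5e:00.0 | grep -E 'LnkCap:|LnkSta:'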
Find the Linux network interface name of the InfiniBand card:
# grep '0000:5e:00.0' /sys/class/net/*/device/uevent
/sys/class/net/ibp94s0/device/uevent:PCI_SLOT_NAME=0000:5e:00.0
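Alternatively, the RDMA device name can be read straight from sysfs instead of matching GUIDs in the next steps (a sketch; the /sys/class/infiniband entries are symlinks into the owning PCI device's path):
# ls -l /sys/class/infiniband | grep '0000:5e:00.0'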
Find the InfiniBand GUID (the InfiniBand equivalent of a MAC address) of ibp94s0:
# ip link show ibp94s0
10: ibp94s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 2044 qdisc mq state UP mode DEFAULT group default qlen 1000
link/infiniband 00:00:10:27:fe:80:00:00:00:00:00:00:a0:88:c2:03:00:ae:21:58 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
Find the InfiniBand device name with the same GUID (grep for the last bytes of the link address, 21:58, i.e. 2158):
# ibstatus | grep -B 1 -A 6 2158
Infiniband device 'mlx5_4' port 1 status:
default gid: fe80:0000:0000:0000:a088:c203:00ae:2158
base lid: 0x152
sm lid: 0x3
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 100 Gb/sec (2X HDR)
link_layer: InfiniBand
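With MLNX_OFED/DOCA installed, the ibdev2netdev utility gives the mapping between RDMA devices and network interfaces directly and can serve as a cross-check (a sketch):
# ibdev2netdev | grep mlx5_4
It should show mlx5_4 paired with ibp94s0.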
Find the mlx5_4 base LID (ibstat reports it in decimal; 0x152 above is 338) and the port number:
# ibstat mlx5_4
CA 'mlx5_4'
CA type: MT4123
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0xa088c20300ae2158
System image GUID: 0xa088c20300ae2158
Port 1:
State: Active
Physical state: LinkUp
Rate: 100
Base lid: 338
LMC: 0
SM lid: 3
Capability mask: 0xa651e848
Port GUID: 0xa088c20300ae2158
Link layer: InfiniBand
Disable the mlx5_4 port with ibportstate <base_lid> <port> disable:
# ibportstate 338 1 disable
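Before re-running the benchmark, confirm the port is actually down (a sketch; it should report State: Down and Physical state: Disabled):
# ibstat mlx5_4 1 | grep -E 'State|Physical state'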
Start nccl-test benchmark again:
$ mpirun -n 16 --host server-a:8,server-b:8 ./nccl-tests-2.16.7/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
. . .
# Avg bus bandwidth : 62.4008
Performance improved.
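As an alternative to taking the port down for the whole host, NCCL can be told to skip the degraded HCA for a specific job via the NCCL_IB_HCA environment variable (a sketch; the leading ^ excludes the listed devices):
$ mpirun -x NCCL_IB_HCA=^mlx5_4 -n 16 --host server-a:8,server-b:8 ./nccl-tests-2.16.7/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1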
Use this command to re-enable mlx5_4:
# mlxfwreset -d mlx5_4 -l 3 reset
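After the reset, check that the port is back (a sketch; it should again report state ACTIVE and phys state LinkUp):
# ibstatus mlx5_4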