PROXMOX (Debian 10, KVM) enabling SR-IOV for Mellanox Infiniband cards

Posted by Pavlo Khmel on Mon 10 May 2021

The configuration differs in two respects:

  • AMD or Intel based systems.
  • ConnectX-3 or ConnectX-6 Mellanox cards.

Update: I tested Proxmox VE 8.1, and it works with some changes:

  • Package install list:

apt install pve-headers chrpath pkg-config graphviz quilt swig libltdl-dev tk flex automake gfortran bison tcl m4 autotools-dev dkms autoconf debhelper ethtool

  • Mellanox InfiniBand package installation:

curl -O https://content.mellanox.com/ofed/MLNX_OFED-23.10-0.5.5.0/MLNX_OFED_LINUX-23.10-0.5.5.0-debian12.1-x86_64.iso
mount -o loop,ro MLNX_OFED_LINUX-23.10-0.5.5.0-debian12.1-x86_64.iso /media/
dpkg -i /media/DEBS/mlnx-tools_23.10.0-1.2310055_amd64.deb
dpkg -i /media/DEBS/mlnx-ofed-kernel-utils_23.10.OFED.23.10.0.5.5.1-1_amd64.deb
dpkg -i /media/DEBS/mlnx-ofed-kernel-dkms_23.10.OFED.23.10.0.5.5.1-1_all.deb
dpkg -i /media/DEBS/mft_4.26.0-93_amd64.deb

This example uses:

  • Proxmox VE 6.4 (iso file proxmox-ve_6.4-1.iso)
  • Mellanox driver v.4.9 (file MLNX_OFED_LINUX-4.9-3.1.5.0-debian10.0-x86_64.iso)

NOTE: I'm not using the currently available Mellanox driver v.5.3 because support for ConnectX-3 was deprecated.

Install Proxmox VE 6.4 on the server.

Enable SR-IOV in the BIOS. This option can have different names: AMD-Vi, VT-d, IOMMU, or SR-IOV.

Disable Commercial Repo

sed -i "s/^deb/\#deb/" /etc/apt/sources.list.d/pve-enterprise.list

Add PVE Community Repo and upgrade

echo "deb http://download.proxmox.com/debian/pve $(grep "VERSION=" /etc/os-release | sed -n 's/.*(\(.*\)).*/\1/p') pve-no-subscription" > /etc/apt/sources.list.d/pve-no-enterprise.list
apt update
apt upgrade
reboot
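
Before writing the repository file, it can help to see which codename the grep/sed pipeline above extracts. The sketch below wraps the same pipeline in a function; the name pve_codename and the optional file argument are my own additions for illustration, not part of Proxmox.

```shell
#!/bin/sh
# pve_codename (hypothetical helper): print the codename found in the
# parentheses of the VERSION= line of an os-release file.
# The file defaults to /etc/os-release, the file the repository line reads.
pve_codename() {
    grep "VERSION=" "${1:-/etc/os-release}" | sed -n 's/.*(\(.*\)).*/\1/p'
}
```

On Proxmox VE 6.4 (Debian 10) running pve_codename should print buster, so the repository line ends in "buster pve-no-subscription".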

Enable IOMMU (required for SR-IOV) in the Linux kernel

Modify file /etc/default/grub

for INTEL

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

for AMD

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt pcie_acs_override=downstream"

Update GRUB and reboot

update-grub
reboot
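
After the reboot, one way to confirm that the new parameters reached the kernel is to look for them in /proc/cmdline. This is a minimal sketch; has_kernel_param is a hypothetical helper name, and the second argument exists only so the function can be tried against a sample file.

```shell
#!/bin/sh
# has_kernel_param (hypothetical helper): succeed when the parameter
# appears as a whole word on the kernel command line.
# $1 = parameter, e.g. intel_iommu=on
# $2 = command line file, defaults to /proc/cmdline
has_kernel_param() {
    # One parameter per line, then an exact fixed-string match, so that
    # "iommu=pt" is not confused with "intel_iommu=on".
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}
```

For example: has_kernel_param intel_iommu=on && echo "parameter is active"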

Verify dmesg output:

dmesg | grep IOMMU
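
The dmesg text differs between Intel and AMD systems, so a second check I find useful is counting the entries in /sys/kernel/iommu_groups (the same directory the IOMMU group script at the end of this post reads). This is a sketch under the assumption that an empty directory means the IOMMU is not active; count_iommu_groups is a hypothetical helper name.

```shell
#!/bin/sh
# count_iommu_groups (hypothetical helper): print the number of IOMMU
# groups. $1 = groups directory, defaults to /sys/kernel/iommu_groups.
# Written as a subshell function so its variables stay local.
count_iommu_groups() (
    n=0
    for g in "${1:-/sys/kernel/iommu_groups}"/*/; do
        # If the glob matched nothing it stays literal and fails the -d test.
        if [ -d "$g" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
)
```

A result of 0 suggests going back to the BIOS and GRUB settings before continuing.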

Install packages needed by Mellanox package:

apt install pve-headers chrpath pkg-config graphviz quilt swig libltdl-dev tk flex automake dpatch gfortran libgfortran4 bison tcl m4 autotools-dev dkms autoconf debhelper ethtool

Download Mellanox driver and mount:

wget https://content.mellanox.com/ofed/MLNX_OFED-4.9-3.1.5.0/MLNX_OFED_LINUX-4.9-3.1.5.0-debian10.0-x86_64.iso
mount -o loop,ro MLNX_OFED_LINUX-4.9-3.1.5.0-debian10.0-x86_64.iso /media/

I will not install all Mellanox packages in this example, because that would require removing Proxmox packages and installing them again.

dpkg -i /media/DEBS/COMMON/mlnx-ofed-kernel-utils_4.9-OFED.4.9.3.1.5.1_amd64.deb
dpkg -i /media/DEBS/COMMON/mlnx-ofed-kernel-dkms_4.9-OFED.4.9.3.1.5.1_all.deb
dpkg -i /media/DEBS/COMMON/mft_4.15.1-9_amd64.deb
reboot

Update firmware settings

Enable 8 Virtual Functions for ConnectX-3 on Intel based system:

# lspci | grep -i mell
03:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

# mlxconfig -d 03:00.0 s SRIOV_EN=1
# mlxconfig -d 03:00.0 s NUM_OF_VFS=8
# mlxconfig -d 03:00.0 q | grep -e SRIOV_EN -e NUM_OF_VFS
         SRIOV_EN                            True(1)        
         NUM_OF_VFS                          8    

Enable 16 Virtual Functions for ConnectX-6 on AMD based system:

# lspci | grep -i mell
81:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]

# mlxconfig -d 81:00.0 s SRIOV_EN=1
# mlxconfig -d 81:00.0 s NUM_OF_VFS=16
# mlxconfig -d 81:00.0 q | grep -e SRIOV_EN -e NUM_OF_VFS
         NUM_OF_VFS                          16              
         SRIOV_EN                            True(1)  
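
If you want to use these values in a script instead of reading them by eye, the query output can be parsed by column. A minimal sketch, assuming the two-column "name value" layout shown above; mlxconfig_value is a hypothetical helper name and it reads the query output from standard input.

```shell
#!/bin/sh
# mlxconfig_value (hypothetical helper): print the value of one setting
# from "mlxconfig ... q" output read on stdin.
# $1 = setting name, e.g. NUM_OF_VFS
mlxconfig_value() {
    awk -v key="$1" '$1 == key { print $2 }'
}
```

For example: mlxconfig -d 81:00.0 q | mlxconfig_value NUM_OF_VFS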

Only for ConnectX-3: change the Mellanox driver options.

vi /etc/modprobe.d/mlx4_core.conf
options mlx4_core num_vfs=8 port_type_array=1,2 probe_vf=1

And reboot

Only for ConnectX-6: set the number of Virtual Functions through sysfs

echo 16 > /sys/class/infiniband/mlx5_0/device/mlx5_num_vfs
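
The write above should not exceed the limit the device reports in sriov_totalvfs (the file the script later in this post also reads). The sketch below checks that limit first; set_num_vfs and its optional third argument (an alternative sysfs root, used only for trying it out) are hypothetical.

```shell
#!/bin/sh
# set_num_vfs (hypothetical helper): write the VF count only if it does
# not exceed what the device reports in sriov_totalvfs.
# $1 = device (e.g. mlx5_0), $2 = number of VFs, $3 = sysfs root (/sys)
set_num_vfs() (
    dev="$1"; want="$2"; sys="${3:-/sys}"
    devdir="$sys/class/infiniband/$dev/device"
    total=$(cat "$devdir/sriov_totalvfs") || exit 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, but $dev reports a maximum of $total" >&2
        exit 1
    fi
    echo "$want" > "$devdir/mlx5_num_vfs"
)
```

For example: set_num_vfs mlx5_0 16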

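Manually assign unique GUIDs to each Virtual Function. The example below configures VF 0; replace 0000:3b:00.1 with the PCI address of your own Virtual Function: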

echo Follow > /sys/class/infiniband/mlx5_0/device/sriov/0/policy
echo 11:22:33:44:77:66:77:90 > /sys/class/infiniband/mlx5_0/device/sriov/0/node
echo 11:22:33:44:77:66:77:91 > /sys/class/infiniband/mlx5_0/device/sriov/0/port
echo 0000:3b:00.1 > /sys/bus/pci/drivers/mlx5_core/unbind
echo 0000:3b:00.1 > /sys/bus/pci/drivers/mlx5_core/bind
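
The PCI address used for unbind/bind differs per machine. It can be looked up from the virtfnN symlink of the physical device, which is the same lookup the automation script below performs. A minimal sketch; vf_pci_addr and the optional sysfs root argument are hypothetical.

```shell
#!/bin/sh
# vf_pci_addr (hypothetical helper): print the PCI address of a VF by
# resolving the virtfnN symlink of the physical device.
# $1 = device (e.g. mlx5_0), $2 = VF index, $3 = sysfs root (/sys)
vf_pci_addr() {
    basename "$(readlink -f "${3:-/sys}/class/infiniband/$1/device/virtfn$2")"
}
```

For example: vf_pci_addr mlx5_0 0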

Or automate it with a systemd service that runs after openibd:

Create file /etc/systemd/system/mlnx_sriov.service

[Unit]
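Description=Initialize 16 Virtual Functions on mlx5_0
After=openibd.service

[Service]
# The script runs once and exits; RemainAfterExit keeps the unit marked active
Type=oneshot
RemainAfterExit=yes
# systemd does not interpret shell redirection, so the echo is wrapped in sh -c
ExecStartPre=/bin/sh -c 'echo 16 > /sys/class/infiniband/mlx5_0/device/mlx5_num_vfs'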
ExecStart=/usr/local/sbin/mlnx_sriov.sh mlx5_0 16

[Install]
WantedBy=multi-user.target

Enable service:

systemctl enable mlnx_sriov

Create the script file /usr/local/sbin/mlnx_sriov.sh (contents below) and make it executable:

chmod u+x /usr/local/sbin/mlnx_sriov.sh

Script from: https://gist.github.com/koallen/32709a244d77a2c0f8e17ed79a4092ed

#!/bin/bash
# params
#   - device name (e.g. mlx5_0)
#   - number of virtual functions (e.g. 10)
configure_dev () {
    local num_of_vfs="$2"
    local devid=$(echo $1 | cut -d_ -f2)
    local max_id="0"
    local num_vfs_path="/sys/class/infiniband/$1/device/mlx5_num_vfs"
    if [[ "$(cat $num_vfs_path)" -lt "$num_of_vfs" ]]; then
        echo $num_of_vfs > /sys/class/infiniband/$1/device/mlx5_num_vfs
    fi
    let "max_id=$num_of_vfs-1"
    for vf in $(seq 0 $max_id); do
        echo ' ' ' ' Configuring virtual function $vf
        # enable the virtual function
        echo Follow > /sys/class/infiniband/$1/device/sriov/$vf/policy
        # assign GUID to virtual card and port
        let "first_part=$vf/100"
        let "second_part=$vf-$first_part*100"
        local ip_last_seg=$(hostname -i | cut -d. -f4)
        let "ip_last_seg_first=$ip_last_seg/100"
        let "ip_last_seg_second=$ip_last_seg-$ip_last_seg_first*100"
        local guid_prefix="$(printf "%02d" $devid):22:33:$(printf "%02d" $first_part):$(printf "%02d" $second_part):$(printf "%02d" $ip_last_seg_first):$(printf "%02d" $ip_last_seg_second)"
        echo "$guid_prefix:90" > /sys/class/infiniband/$1/device/sriov/$vf/node
        echo "$guid_prefix:91" > /sys/class/infiniband/$1/device/sriov/$vf/port
        # reload driver to make the change effective
        pcie_addr="$(readlink -f /sys/class/infiniband/$1/device/virtfn${vf} | awk -F/ '{print $NF}')"
        echo $pcie_addr > /sys/bus/pci/drivers/mlx5_core/unbind
        echo $pcie_addr > /sys/bus/pci/drivers/mlx5_core/bind
    done
}
# if specific devices are provided, only those will be configured
# otherwise, all devices supporting SR-IOV will be configured
if [[ "$#" -eq "0" ]]; then
    echo Configuring SR-IOV for all supported devices
    for dev in $(ls /sys/class/infiniband); do
        totalvfs_path="/sys/class/infiniband/$dev/device/sriov_totalvfs"
        if [[ -e "$totalvfs_path" && "$(cat $totalvfs_path)" -gt "0" ]]; then
            echo ' ' Configuring for $dev $(cat $totalvfs_path)
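            # The call below is commented out, so without arguments the script only lists devices.
            # Uncomment it to configure every device with its maximum number of VFs.
            #configure_dev $dev $(cat $totalvfs_path)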
        fi
    done
elif ! (( $# % 2 )); then
    echo Configuring SR-IOV for specified devices
    while (( "$#" )); do
        dev=$1
        num_of_vfs=$2
        echo ' ' Configuring for $dev
        configure_dev $dev $num_of_vfs
        shift 2
    done
else
    echo Please use the script in the following two ways:
    echo ' ' ./mlnx.sh
    echo ' ' ./mlnx.sh mlx5_0 10 mlx5_1 25
fi
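
To see which GUIDs the script above will assign, its arithmetic can be reproduced on its own. The sketch below follows the same formula (device index, VF number split into hundreds and remainder, last octet of the host IP split the same way); vf_guid_prefix is a hypothetical helper name and it expects plain integers without leading zeros.

```shell
#!/bin/sh
# vf_guid_prefix (hypothetical helper): print the 7-byte GUID prefix the
# script derives for one VF. The script appends :90 for the node GUID
# and :91 for the port GUID.
# $1 = device index (the 0 in mlx5_0), $2 = VF number,
# $3 = last octet of the host IP address
vf_guid_prefix() (
    devid="$1"; vf="$2"; ip_last="$3"
    printf '%02d:22:33:%02d:%02d:%02d:%02d\n' \
        "$devid" "$((vf / 100))" "$((vf % 100))" \
        "$((ip_last / 100))" "$((ip_last % 100))"
)
```

For example, vf_guid_prefix 0 5 123 prints 00:22:33:00:05:01:23. Because only the last octet of the IP address goes into the GUID, two hosts whose addresses end in the same octet would get identical GUIDs, which is worth checking on larger fabrics.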

Reboot

Final check:

Verify that 8 or 16 Virtual Functions are available after the reboot. Example output:

# lspci | grep Mellanox
03:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
03:00.1 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
03:00.2 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
03:00.3 Network controller: Mellanox Technologies MT27500/MT27520 Family [ConnectX-3/ConnectX-3 Pro Virtual Function]
. . .
. . .
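
Counting the lines by hand gets tedious with 16 Virtual Functions. A minimal sketch that counts them from lspci output on standard input and compares with the expected number; check_vf_count is a hypothetical helper name, and it relies on the "Virtual Function" text shown in the output above.

```shell
#!/bin/sh
# check_vf_count (hypothetical helper): count "Virtual Function" lines
# on stdin and succeed only if the count equals $1.
check_vf_count() (
    # grep -c prints 0 but exits non-zero when nothing matches
    found=$(grep -c 'Virtual Function' || true)
    echo "found $found Virtual Functions, expected $1"
    [ "$found" -eq "$1" ]
)
```

For example: lspci | grep Mellanox | check_vf_count 8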

Check that each Virtual Function is in its own IOMMU group with this script. Script from: https://github.com/drewmullen/pci-passthrough-ryzen/blob/master/iommu_groups.sh

# cat iommu_groups.sh
#!/bin/bash
shopt -s nullglob
for d in /sys/kernel/iommu_groups/*/devices/*; do
    n=${d#*/iommu_groups/*}; n=${n%%/*}
    printf 'IOMMU Group %s ' "$n"
    lspci -nns "${d##*/}"
done;

Output example:

# bash iommu_groups.sh | grep Mellanox
IOMMU Group 108 81:00.0 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:101b]
IOMMU Group 131 81:00.1 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]
IOMMU Group 132 81:00.2 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]
. . .
. . .
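
With many Virtual Functions it is easy to miss two of them sharing a group. The sketch below reads the iommu_groups.sh output from standard input and prints every group number that contains more than one Mellanox device; shared_iommu_groups is a hypothetical helper name, and it assumes the "IOMMU Group N ..." line format shown above.

```shell
#!/bin/sh
# shared_iommu_groups (hypothetical helper): print each IOMMU group
# number that holds more than one Mellanox device.
# Input lines look like: "IOMMU Group 131 81:00.1 Infiniband controller ..."
shared_iommu_groups() {
    awk '/Mellanox/ { count[$3]++ }
         END { for (g in count) if (count[g] > 1) print g }'
}
```

For example: bash iommu_groups.sh | shared_iommu_groups
No output means every Mellanox function has its own group.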

Now you can add a Virtual Function as a PCI device to your Virtual Machines.