-
Notifications
You must be signed in to change notification settings - Fork 51
Description
I am trying to deploy Kubernetes Plugin for RoCE NIC. For this, I have deployed Kubernetes using Kubespray Playbook and post deployment all pods are up and running.
Post deployment, I am trying to deploy the RoCE Plugin as part of which, k8s-rdma-shared-dev-plugin is being installed. However, after deployment, the rdma pods are stuck in a CrashLoopBackOff state and the roce pods are in pending state as shown in below screenshot:
I ran kubectl logs <podman> -n <namespace command to check the logs and found the below error:
I found a workaround for this and followed the below steps:
- Edit the rdma daemonset by add a volume mount for pci-id file
Add the following mountPath under the volumeMounts section:
- name: pci-ids
mountPath: /usr/share/misc/pci.ids
readOnly: true
Add the following under volumes section:
- name: pci-ids
hostPath:
path: /usr/share/misc/pci.ids
type: File
Post this, the CrashLoop issue was resolved and all pods including rdma and roce, came up as running.
Is this a known issue and any fix which is available for the same?
If a fix is not present currently, is the above workaround fine?
Environment Details:
Kubespray: v2.27.0
Kubernetes : v1.31.4
rdma plugin: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin.git (v1.5.2)


