Post deployment of k8s-rdma-shared-dev-plugin (v1.5.2), rdma pods are in CrashLoopBackOff state #151

@VrindaMarwah

Description

I am trying to deploy the Kubernetes plugin for RoCE NICs. I deployed Kubernetes using the Kubespray playbook, and after deployment all pods are up and running.

Image

Next, I deployed the RoCE plugin, which installs k8s-rdma-shared-dev-plugin. After this deployment, the rdma pods are stuck in the CrashLoopBackOff state and the roce pods are in the Pending state, as shown in the screenshot below:

Image

I ran kubectl logs <pod-name> -n <namespace> to check the logs and found the error below:

Image

I found a workaround and followed the steps below:

  1. Edit the rdma daemonset to add a volume mount for the pci.ids file

Add the following entry under the volumeMounts section:

  - name: pci-ids
    mountPath: /usr/share/misc/pci.ids
    readOnly: true

Add the following under the volumes section:

  - name: pci-ids
    hostPath:
      path: /usr/share/misc/pci.ids
      type: File
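
For anyone applying the same workaround, here is a rough sketch of where the two fragments above sit in the DaemonSet pod template once the edit is done. Only the pci-ids entries come from this issue; the container name is a placeholder and the elided entries stand for whatever your deployed manifest already contains, so treat this as an illustration rather than the plugin's actual manifest.

```yaml
# Illustrative fragment only: shows the placement of the pci-ids entries.
# Everything other than the pci-ids entries is a placeholder or elided.
spec:
  template:
    spec:
      containers:
        - name: "<plugin-container>"   # placeholder: keep the name already in your manifest
          volumeMounts:
            # ... existing mounts stay unchanged ...
            - name: pci-ids            # must match the volume name below
              mountPath: /usr/share/misc/pci.ids
              readOnly: true
      volumes:
        # ... existing volumes stay unchanged ...
        - name: pci-ids
          hostPath:
            path: /usr/share/misc/pci.ids
            type: File                 # the file must already exist on the node
```

Note that with hostPath type File, Kubernetes requires the file to exist at that path on every node where the pod is scheduled, so the workaround assumes /usr/share/misc/pci.ids is present on each host.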

After this change, the CrashLoopBackOff issue was resolved and all pods, including rdma and roce, came up in the Running state.

Is this a known issue, and is a fix available for it?
If no fix is currently available, is the above workaround acceptable?

Environment Details:

Kubespray: v2.27.0
Kubernetes: v1.31.4
rdma plugin: https://github.com/Mellanox/k8s-rdma-shared-dev-plugin.git (v1.5.2)
