Description
What happened:
Pods that request an InfiniBand resource are stuck in the ContainerCreating state:
Warning FailedCreatePodSandBox 16s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "de0d2174b993dc3373f1d255cb1870691f93df3bc005aacee1646c56524cbed3": plugin type="multus" name="multus-cni-network" failed (add): [test-nccl/nccltest-worker-1/cc41464d-185f-48ff-9f7a-48e88a380320:ibp192s0]: error adding container to network "ibp192s0": infiniBand SRI-OV CNI failed to configure VF "VF ibp192s0v10 GUID is not valid"
What you expected to happen:
pods to run
How to reproduce it (as minimally and precisely as possible):
We are using BCM-10 and DGX OS on H200 nodes. When the GUID of an IB VF is "00:00:00:00:00:00:00:00", sriov-network-config-daemon does not configure a new GUID for these uninitialized VFs.
Anything else we need to know?:
Logs:
- NicClusterPolicy CR spec and state:
```
root@bcm10-headnode:~# kubectl get nicclusterpolicies.mellanox.com nic-cluster-policy -o yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  annotations:
  name: nic-cluster-policy
spec:
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    imagePullSecrets: []
    repository: artifactory.coupang.net/ghcr-remote/mellanox
    version: v0.3.0
  secondaryNetwork:
    cniPlugins:
      image: plugins
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v1.5.0
    multus:
      image: multus-cni
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v4.1.0
status:
```
While debugging, we found that `consts.UninitializedNodeGUID` is "0000:0000:0000:0000", but the GUID read from the interfaces is "00:00:00:00:00:00:00:00". To prove this, we added a log line in api/v1/helper.go, which printed `@@samir --------------- vfStatus.GUID: 00:00:00:00:00:00:00:00 consts.UninitializedNodeGUID: 0000:0000:0000:0000`. Because the two strings never compare equal, new GUIDs were not assigned to the VFs, and the pods were stuck in the creating state with the error below.
Warning FailedCreatePodSandBox 16s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "de0d2174b993dc3373f1d255cb1870691f93df3bc005aacee1646c56524cbed3": plugin type="multus" name="multus-cni-network" failed (add): [test-nccl/nccltest-worker-1/cc41464d-185f-48ff-9f7a-48e88a380320:ibp192s0]: error adding container to network "ibp192s0": infiniBand SRI-OV CNI failed to configure VF "VF ibp192s0v10 GUID is not valid"
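To illustrate the mismatch described above, here is a minimal Go sketch. It is not the operator's actual code: `uninitializedNodeGUID` only mirrors the constant value from our log line, and `normalizeGUID` and `isUninitializedGUID` are hypothetical helpers showing one way the comparison could be made independent of the separator layout.

```go
package main

import (
	"fmt"
	"strings"
)

// uninitializedNodeGUID mirrors the constant value printed in our debug log.
// The name is local to this sketch.
const uninitializedNodeGUID = "0000:0000:0000:0000"

// normalizeGUID is a hypothetical helper: it lowercases the GUID and removes
// the ':' separators so that both textual layouts reduce to the same string.
func normalizeGUID(guid string) string {
	return strings.ToLower(strings.ReplaceAll(guid, ":", ""))
}

// isUninitializedGUID is a hypothetical helper: it reports whether the GUID
// consists of 16 zero hex digits, regardless of how it is grouped.
func isUninitializedGUID(guid string) bool {
	n := normalizeGUID(guid)
	return len(n) == 16 && strings.Trim(n, "0") == ""
}

func main() {
	vfGUID := "00:00:00:00:00:00:00:00" // value read from the VF in our logs

	// Plain string comparison: the layouts differ, so this is false.
	fmt.Println(vfGUID == uninitializedNodeGUID)

	// Separator-independent comparison: both forms count as uninitialized.
	fmt.Println(isUninitializedGUID(vfGUID))
	fmt.Println(isUninitializedGUID(uninitializedNodeGUID))

	// A non-zero GUID (made-up example value) is not treated as uninitialized.
	fmt.Println(isUninitializedGUID("04:3f:72:03:00:c0:ff:ee"))
}
```

With a check of this kind, a VF reporting "00:00:00:00:00:00:00:00" would be recognized as uninitialized. Whether the right fix is to normalize the comparison or to change the constant is for the maintainers to decide.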
Network-operator version: 25.1.0 (this code is the same in the latest version)
Kubernetes: 1.32.5