VF ibp192s0v10 GUID is not valid #1825

@samirdasiitr

Description

What happened:
Pods that request an InfiniBand resource are stuck in the ContainerCreating state:

  Warning  FailedCreatePodSandBox  16s  kubelet  Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "de0d2174b993dc3373f1d255cb1870691f93df3bc005aacee1646c56524cbed3": plugin type="multus" name="multus-cni-network" failed (add): [test-nccl/nccltest-worker-1/cc41464d-185f-48ff-9f7a-48e88a380320:ibp192s0]: error adding container to network "ibp192s0": infiniBand SRI-OV CNI failed to configure VF "VF ibp192s0v10 GUID is not valid"

What you expected to happen:
Pods start and run.

How to reproduce it (as minimally and precisely as possible):
We are using BCM-10 and DGX OS on H200 nodes. When the GUID of an IB VF is "00:00:00:00:00:00:00:00", sriov-network-config-daemon does not configure a new GUID for that uninitialized VF.

Anything else we need to know?:

Logs:

  • NicClusterPolicy CR spec and state:
root@bcm10-headnode:~# kubectl get nicclusterpolicies.mellanox.com nic-cluster-policy -o yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy   
metadata:        
  annotations:                 
  name: nic-cluster-policy
spec:                      
  nvIpam:       
    enableWebhook: false             
    image: nvidia-k8s-ipam
    imagePullSecrets: []              
    repository: artifactory.coupang.net/ghcr-remote/mellanox
    version: v0.3.0
  secondaryNetwork:    
    cniPlugins:
      image: plugins
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v1.5.0
    multus:
      image: multus-cni
      imagePullSecrets: []
      repository: ghcr.io/k8snetworkplumbingwg
      version: v4.1.0
status:


During debugging we found that `consts.UninitializedNodeGUID` is "0000:0000:0000:0000", but the GUID read from the interface is "00:00:00:00:00:00:00:00". To confirm this, we added a log line in api/v1/helper.go, which printed:
`@@samir --------------- vfStatus.GUID: 00:00:00:00:00:00:00:00 consts.UninitializedNodeGUID: 0000:0000:0000:0000`
Because of this mismatch, new GUIDs were not assigned to the VFs, and the pods were stuck in the creating state with the error below.

Warning FailedCreatePodSandBox 16s kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "de0d2174b993dc3373f1d255cb1870691f93df3bc005aacee1646c56524cbed3": plugin type="multus" name="multus-cni-network" failed (add): [test-nccl/nccltest-worker-1/cc41464d-185f-48ff-9f7a-48e88a380320:ibp192s0]: error adding container to network "ibp192s0": infiniBand SRI-OV CNI failed to configure VF "VF ibp192s0v10 GUID is not valid"
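
One possible direction for a fix, as a rough sketch only: compare the GUIDs after stripping the separators, so that both the "0000:0000:0000:0000" and the "00:00:00:00:00:00:00:00" spellings of the all-zero GUID are treated as uninitialized. The helper names `normalizeGUID` and `isUninitializedGUID` are hypothetical and do not exist in the operator. The constant is copied from the log line above rather than imported, and the real check in api/v1/helper.go may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// Value of consts.UninitializedNodeGUID as printed in the debug log above.
// Copied here so the sketch is self-contained; not imported from the operator.
const uninitializedNodeGUID = "0000:0000:0000:0000"

// normalizeGUID is a hypothetical helper. It trims whitespace, removes ':'
// separators, and lowercases the hex digits so that differently grouped
// spellings of the same GUID compare equal.
func normalizeGUID(guid string) string {
	return strings.ToLower(strings.ReplaceAll(strings.TrimSpace(guid), ":", ""))
}

// isUninitializedGUID is a hypothetical helper. It reports whether guid is
// the all-zero GUID, regardless of how the separators are placed.
func isUninitializedGUID(guid string) bool {
	return normalizeGUID(guid) == normalizeGUID(uninitializedNodeGUID)
}

func main() {
	guids := []string{
		"00:00:00:00:00:00:00:00", // format read from the interface in this issue
		"0000:0000:0000:0000",     // format of the constant
		"02:00:00:00:00:00:00:0a", // made-up non-zero GUID for contrast
	}
	for _, g := range guids {
		fmt.Printf("%-24s uninitialized=%t\n", g, isUninitializedGUID(g))
	}
}
```

With a comparison like this, the VF from the log above would be recognized as uninitialized and could be assigned a new GUID. Whether normalizing at comparison time or changing the constant is the better fix is a question for the maintainers.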


Network-operator version: 25.1.0 (the same code is present in the latest version)
Kubernetes: 1.32.5
