
nvidia-vfio-manager/sandbox/kata-manager Failed to be deployed #650

@seungsoo-lee

Description

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
  • Kernel Version: 5.19.0-rc6-snp-guest-c4daeffce56e
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: 23.9.1
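
For reference, the version details above can be collected with standard commands such as the following (a sketch; exact output formats vary by distribution):

cat /etc/os-release          # OS name and version
uname -r                     # kernel version
containerd --version         # container runtime version
kubectl version              # Kubernetes client/server versions
helm list -n gpu-operator    # installed GPU Operator chart version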

2. Issue or feature description

My machine's specs:

CPU: Dual AMD EPYC 9224 16-Core Processor
GPU: H100 10de:2331 (VBIOS: 96.00.5E.00.01, CUDA: 12.2, NVIDIA driver: 535.86.10)
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel
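
In the guest, the GPU and driver details listed above can be double-checked with something like the following (a sketch; the PCI vendor/device ID 10de:2331 is taken from the specs above):

lspci -nn -d 10de:2331                                              # confirm the H100 is visible with the expected device ID
nvidia-smi --query-gpu=driver_version,vbios_version --format=csv    # driver and VBIOS versions, if the NVIDIA driver is loaded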

Following the deployment document, I succeeded up to p. 39.

But in the guest VM, when I tried to deploy the GPU Operator-related pods, they failed with errors.

3. Steps to reproduce the issue

cclab@guest:~$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set sandboxWorkloads.enabled=true \
--set kataManager.enabled=true \
--set ccManager.enabled=true \
--set nfd.nodefeaturerules=true
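
For reference, whether the release came up can then be checked with something like this (a sketch; ClusterPolicy is the custom resource the GPU Operator creates from the chart values):

helm list -n gpu-operator          # confirm the release deployed
kubectl get clusterpolicy          # overall operator state
kubectl get pods -n gpu-operator   # per-component pod status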

But when I tried to install the NVIDIA GPU Operator ("3. Install Operator" on p. 40), I hit the following errors.

cclab@guest:~$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS        AGE
gpu-operator-1704448302-node-feature-discovery-gc-5785d845tnkb9   1/1     Running                 0               36m
gpu-operator-1704448302-node-feature-discovery-master-7464275cx   1/1     Running                 0               36m
gpu-operator-1704448302-node-feature-discovery-worker-bdbkv       1/1     Running                 0               36m
gpu-operator-d7467c67f-bfrxd                                      1/1     Running                 0               36m
nvidia-kata-manager-x74m4                                         0/1     Running                 0               36m
nvidia-sandbox-device-plugin-daemonset-wcj79                      0/1     Init:0/2                0               21m
nvidia-sandbox-validator-n9rcj                                    0/1     Init:CrashLoopBackOff   9 (96s ago)     22m
nvidia-vfio-manager-2mtfv                                         0/1     Init:CrashLoopBackOff   7 (3m44s ago)   22m

It seems that the nvidia-* pods cannot be deployed in the cluster.

Describing the failing pod with kubectl describe pod nvidia-sandbox-validator-n9rcj -n gpu-operator shows:

  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  23m                   default-scheduler  Successfully assigned gpu-operator/nvidia-sandbox-validator-n9rcj to guest
  Normal   Pulled     23m                   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    23m                   kubelet            Created container cc-manager-validation
  Normal   Started    23m                   kubelet            Started container cc-manager-validation
  Normal   Pulled     22m (x5 over 23m)     kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    22m (x5 over 23m)     kubelet            Created container vfio-pci-validation
  Normal   Started    22m (x5 over 23m)     kubelet            Started container vfio-pci-validation
  Warning  BackOff    3m36s (x94 over 23m)  kubelet            Back-off restarting failed container vfio-pci-validation in pod nvidia-sandbox-validator-n9rcj_gpu-operator(12ecb940-e6ae-4f5a-9235-6cb0afdfdd5d)
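
The events above only show the back-off; the actual failure reason should be in the log of the crashing init container, which can be pulled with something like this (pod and container names taken from the output above):

kubectl logs nvidia-sandbox-validator-n9rcj -n gpu-operator -c vfio-pci-validation
kubectl logs nvidia-sandbox-validator-n9rcj -n gpu-operator -c vfio-pci-validation --previous    # log of the last crashed attempt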

kubectl logs gpu-operator-d7467c67f-bfrxd -n gpu-operator shows:

{"level":"error","ts":1704525037.9049249,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy is not ready, states not ready: [state-sandbox-validation state-vfio-manager state-sandbox-device-plugin state-kata-manager]"}

4. Information to attach (optional if deemed irrelevant)

  • Kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • Kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log
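
Filled in with the namespace and pod names from this report, those commands look like the following (pod names taken from the listing above):

kubectl get pods -n gpu-operator
kubectl get ds -n gpu-operator
kubectl describe pod -n gpu-operator nvidia-vfio-manager-2mtfv
kubectl logs -n gpu-operator nvidia-vfio-manager-2mtfv --all-containers
journalctl -u containerd > containerd.log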

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]
