The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.
1. Quick Debug Information
- OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu 22.04
- Kernel Version: 5.19.0-rc6-snp-guest-c4daeffce56e
- Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
- K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
- GPU Operator Version: 23.9.1
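For reference, these values can be collected with standard commands; a minimal sketch (the gpu-operator namespace is an assumption for this cluster):
cat /etc/os-release          # OS/version
uname -r                     # kernel version
containerd --version         # container runtime version
kubectl version              # K8s version
helm list -n gpu-operator    # GPU Operator chart/release version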
2. Issue or feature description
Briefly explain the issue in terms of expected behavior and current behavior.
My machine's specs:
CPU: Dual AMD EPYC 9224 16-Core Processor
GPU: NVIDIA H100 (PCI ID 10de:2331, VBIOS 96.00.5E.00.01, CUDA 12.2, NVIDIA driver 535.86.10)
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel
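For reference, the PCI ID and SNP status above can be double-checked with something like the following (a sketch; note that nvidia-smi only works where the NVIDIA driver is loaded, which is not the case once the GPU has been bound to vfio-pci):
lspci -nn -d 10de:               # should list the H100 as 10de:2331
sudo dmesg | grep -iE 'sev|snp'  # confirm the SNP host/guest kernel reports SEV-SNP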
Following the deployment document, I succeeded up to p. 39.
However, in the guest VM, when I tried to deploy the GPU Operator-related pods, I ran into errors.
3. Steps to reproduce the issue
cclab@guest:~$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set sandboxWorkloads.enabled=true \
--set kataManager.enabled=true \
--set ccManager.enabled=true \
--set nfd.nodefeaturerules=true
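To confirm the chart actually picked up these flags, the applied values can be inspected after the install; a minimal sketch (the release name gpu-operator-1704448302 is inferred from the pod names below, substitute your own):
helm list -n gpu-operator                                 # find the generated release name
helm get values gpu-operator-1704448302 -n gpu-operator   # should show the sandboxWorkloads, kataManager, ccManager and nfd settings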
However, when I tried to install the NVIDIA GPU Operator (step 3, "Install Operator", on p. 40), I hit the following errors.
cclab@guest:~$ kubectl get pods -n gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-1704448302-node-feature-discovery-gc-5785d845tnkb9 1/1 Running 0 36m
gpu-operator-1704448302-node-feature-discovery-master-7464275cx 1/1 Running 0 36m
gpu-operator-1704448302-node-feature-discovery-worker-bdbkv 1/1 Running 0 36m
gpu-operator-d7467c67f-bfrxd 1/1 Running 0 36m
nvidia-kata-manager-x74m4 0/1 Running 0 36m
nvidia-sandbox-device-plugin-daemonset-wcj79 0/1 Init:0/2 0 21m
nvidia-sandbox-validator-n9rcj 0/1 Init:CrashLoopBackOff 9 (96s ago) 22m
nvidia-vfio-manager-2mtfv 0/1 Init:CrashLoopBackOff 7 (3m44s ago) 22m
It seems that the nvidia-* pods cannot be deployed in the cluster.
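The Init:CrashLoopBackOff status suggests an init container is failing rather than the main container; its name and logs can be pulled roughly like this (a sketch, using the vfio-manager pod as an example):
kubectl get pod nvidia-vfio-manager-2mtfv -n gpu-operator -o jsonpath='{.spec.initContainers[*].name}'
kubectl logs nvidia-vfio-manager-2mtfv -n gpu-operator -c <init-container-name>   # name taken from the previous output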
Describing the failing pod with
kubectl describe pod nvidia-sandbox-validator-n9rcj -n gpu-operator
shows the following events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 23m default-scheduler Successfully assigned gpu-operator/nvidia-sandbox-validator-n9rcj to guest
Normal Pulled 23m kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
Normal Created 23m kubelet Created container cc-manager-validation
Normal Started 23m kubelet Started container cc-manager-validation
Normal Pulled 22m (x5 over 23m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
Normal Created 22m (x5 over 23m) kubelet Created container vfio-pci-validation
Normal Started 22m (x5 over 23m) kubelet Started container vfio-pci-validation
Warning BackOff 3m36s (x94 over 23m) kubelet Back-off restarting failed container vfio-pci-validation in pod nvidia-sandbox-validator-n9rcj_gpu-operator(12ecb940-e6ae-4f5a-9235-6cb0afdfdd5d)
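Since the events point at the vfio-pci-validation init container, its logs should contain the actual failure; a sketch:
kubectl logs nvidia-sandbox-validator-n9rcj -n gpu-operator -c vfio-pci-validation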
kubectl logs gpu-operator-d7467c67f-bfrxd -n gpu-operator shows:
{"level":"error","ts":1704525037.9049249,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy is not ready, states not ready: [state-sandbox-validation state-vfio-manager state-sandbox-device-plugin state-kata-manager]"}
4. Information to attach (optional if deemed irrelevant)
- kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
- kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
- If a pod/ds is in an error state or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
- If a pod/ds is in an error state or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
- Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
- containerd logs: journalctl -u containerd > containerd.log
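For this particular cluster the placeholders above resolve to the gpu-operator namespace, so the concrete commands would be, for example (the nvidia-smi step appears not to apply here, since this sandbox/vfio configuration does not run a driver container on the node):
kubectl get pods -n gpu-operator
kubectl get ds -n gpu-operator
kubectl describe pod -n gpu-operator nvidia-sandbox-validator-n9rcj
kubectl logs -n gpu-operator nvidia-sandbox-validator-n9rcj --all-containers
journalctl -u containerd > containerd.log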
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]