
nvidia-vfio-manager/sandbox/kata-manager Failed to be deployed #650

@seungsoo-lee

Description

1. Quick Debug Information

  • OS/Version (e.g. RHEL 8.6, Ubuntu 22.04): Ubuntu 22.04
  • Kernel Version: 5.19.0-rc6-snp-guest-c4daeffce56e
  • Container Runtime Type/Version (e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version (e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: 23.9.1
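
For reference, the version details above can be collected with standard commands such as the following (a sketch; exact output formats vary by distribution):

cat /etc/os-release          # OS name and version
uname -r                     # kernel version
containerd --version         # container runtime version
kubectl version              # Kubernetes client/server versions
helm list -n gpu-operator    # installed GPU Operator chart version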

2. Issue or feature description

My machine's specs:

CPU: Dual AMD EPYC 9224 16-Core Processor
GPU: H100 10de:2331 (VBIOS: 96.00.5E.00.01, CUDA: 12.2, NVIDIA driver: 535.86.10)
Host OS: Ubuntu 22.04 with 5.19.0-rc6-snp-host-c4daeffce56e kernel
Guest OS: Ubuntu 22.04.2 with 5.19.0-rc6-snp-guest-c4daeffce56e kernel
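
In the guest, the GPU and driver details listed above can be double-checked with something like the following (a sketch; the PCI vendor/device ID 10de:2331 is taken from the specs above):

lspci -nn -d 10de:2331                                              # confirm the H100 is visible with the expected device ID
nvidia-smi --query-gpu=driver_version,vbios_version --format=csv    # driver and VBIOS versions, if the NVIDIA driver is loaded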

Following the deployment document, I succeeded up to p. 39.

But in the guest VM, when I tried to deploy the GPU Operator-related pods, they failed with errors.

3. Steps to reproduce the issue

cclab@guest:~$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--set sandboxWorkloads.enabled=true \
--set kataManager.enabled=true \
--set ccManager.enabled=true \
--set nfd.nodefeaturerules=true
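
For reference, whether the release came up can then be checked with something like this (a sketch; ClusterPolicy is the custom resource the GPU Operator creates from the chart values):

helm list -n gpu-operator          # confirm the release deployed
kubectl get clusterpolicy          # overall operator state
kubectl get pods -n gpu-operator   # per-component pod status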

But when I tried to install the NVIDIA GPU Operator ("3. Install Operator" on p. 40), I hit the following errors.

cclab@guest:~$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS                  RESTARTS        AGE
gpu-operator-1704448302-node-feature-discovery-gc-5785d845tnkb9   1/1     Running                 0               36m
gpu-operator-1704448302-node-feature-discovery-master-7464275cx   1/1     Running                 0               36m
gpu-operator-1704448302-node-feature-discovery-worker-bdbkv       1/1     Running                 0               36m
gpu-operator-d7467c67f-bfrxd                                      1/1     Running                 0               36m
nvidia-kata-manager-x74m4                                         0/1     Running                 0               36m
nvidia-sandbox-device-plugin-daemonset-wcj79                      0/1     Init:0/2                0               21m
nvidia-sandbox-validator-n9rcj                                    0/1     Init:CrashLoopBackOff   9 (96s ago)     22m
nvidia-vfio-manager-2mtfv                                         0/1     Init:CrashLoopBackOff   7 (3m44s ago)   22m

It seems that the nvidia-* pods cannot be deployed in the cluster.

Describing the failing pod with kubectl describe pod nvidia-sandbox-validator-n9rcj -n gpu-operator shows:

  Type     Reason     Age                   From               Message
  ----     ------     ----                  ----               -------
  Normal   Scheduled  23m                   default-scheduler  Successfully assigned gpu-operator/nvidia-sandbox-validator-n9rcj to guest
  Normal   Pulled     23m                   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    23m                   kubelet            Created container cc-manager-validation
  Normal   Started    23m                   kubelet            Started container cc-manager-validation
  Normal   Pulled     22m (x5 over 23m)     kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v23.9.1" already present on machine
  Normal   Created    22m (x5 over 23m)     kubelet            Created container vfio-pci-validation
  Normal   Started    22m (x5 over 23m)     kubelet            Started container vfio-pci-validation
  Warning  BackOff    3m36s (x94 over 23m)  kubelet            Back-off restarting failed container vfio-pci-validation in pod nvidia-sandbox-validator-n9rcj_gpu-operator(12ecb940-e6ae-4f5a-9235-6cb0afdfdd5d)
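
The events above only show the back-off; the actual failure reason should be in the log of the crashing init container, which can be pulled with something like this (pod and container names taken from the output above):

kubectl logs nvidia-sandbox-validator-n9rcj -n gpu-operator -c vfio-pci-validation
kubectl logs nvidia-sandbox-validator-n9rcj -n gpu-operator -c vfio-pci-validation --previous    # log of the last crashed attempt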

kubectl logs gpu-operator-d7467c67f-bfrxd -n gpu-operator shows:

{"level":"error","ts":1704525037.9049249,"logger":"controllers.ClusterPolicy","msg":"ClusterPolicy is not ready, states not ready: [state-sandbox-validation state-vfio-manager state-sandbox-device-plugin state-kata-manager]"}

4. Information to attach (optional if deemed irrelevant)

  • Kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • Kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error or pending state: kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error or pending state: kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs: journalctl -u containerd > containerd.log
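
Filled in with the namespace and pod names from this report, those commands look like the following (pod names taken from the listing above):

kubectl get pods -n gpu-operator
kubectl get ds -n gpu-operator
kubectl describe pod -n gpu-operator nvidia-vfio-manager-2mtfv
kubectl logs -n gpu-operator nvidia-vfio-manager-2mtfv --all-containers
journalctl -u containerd > containerd.log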

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/master/hack/must-gather.sh 
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]
