
kata-qemu-nvidia-gpu does not exist #1906

@ys928

Description

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug

I installed the GPU Operator with Kata Containers support according to the official documentation, and all of the operator pods are currently running. However, the RuntimeClass names do not match the names in the documentation: the qemu variant is missing. I therefore started the container with kata-nvidia-gpu instead, but it still does not start:

root@master-01:~# kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS    RESTARTS      AGE
gpu-operator-1763107084-node-feature-discovery-gc-5cbbf8546klcw   1/1     Running   5 (27h ago)   3d22h
gpu-operator-1763107084-node-feature-discovery-master-768dn9kpw   1/1     Running   5 (27h ago)   3d22h
gpu-operator-1763107084-node-feature-discovery-worker-7wk85       1/1     Running   5 (27h ago)   3d22h
gpu-operator-7b5fb5b8b-mg84b                                      1/1     Running   5 (27h ago)   3d22h
nvidia-kata-manager-5kvc9                                         1/1     Running   0             22h
nvidia-sandbox-device-plugin-daemonset-f7spt                      1/1     Running   0             22h
nvidia-sandbox-validator-n5lr6                                    1/1     Running   0             22h
nvidia-vfio-manager-glt2z                                         1/1     Running   0             22h

root@master-01:~# kubectl get runtimeclasses.node.k8s.io
NAME                  HANDLER               AGE
kata                  kata                  3d22h
kata-clh              kata-clh              3d22h
kata-clh-tdx          kata-clh-tdx          3d22h
kata-nvidia-gpu       kata-nvidia-gpu       3d22h
kata-nvidia-gpu-snp   kata-nvidia-gpu-snp   3d22h
kata-qemu             kata-qemu             3d22h
kata-qemu-sev         kata-qemu-sev         3d22h
kata-qemu-snp         kata-qemu-snp         3d22h
kata-qemu-tdx         kata-qemu-tdx         3d22h
nvidia                nvidia                3d22h

The RuntimeClass here is kata-nvidia-gpu, not kata-qemu-nvidia-gpu as described in the documentation.
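
For reference, the sketch below shows how to check which kata handlers are actually registered with containerd on the node, to see whether the documented kata-qemu-nvidia-gpu handler exists under a different RuntimeClass name. It assumes the default containerd config path; the kata-manager may also place its runtime entries in a drop-in file.

# List the kata runtime handlers registered in the containerd config
# (default config path assumed; adjust if the node uses a drop-in or custom path)
grep -n 'containerd.runtimes.kata' /etc/containerd/config.toml

# Check specifically for the handler name used in the documentation
grep -n 'kata-qemu-nvidia-gpu' /etc/containerd/config.toml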

I used the following YAML to run the container:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata
  annotations:
    cdi.k8s.io/gpu: "nvidia.com/pgpu=0"
    io.katacontainers.config.hypervisor.default_memory: "4096"
spec:
  runtimeClassName: kata-nvidia-gpu
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      requests:
        memory: 6Gi
      limits:
        "nvidia.com/GA102_GEFORCE_RTX_3090": 1
        memory: 6Gi

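Applying the manifest looks roughly like this; the filename cuda-vectoradd-kata.yaml is just my name for the file above:

# Apply the pod manifest and watch its status (filename assumed)
kubectl apply -f cuda-vectoradd-kata.yaml
kubectl get pod cuda-vectoradd-kata -w

# The exact sandbox-creation error shows up under Events
kubectl describe pod cuda-vectoradd-kata
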
At this point, the pod fails to start with an error message saying that default_memory is missing. Even with this option added, it still cannot start, so I modified the file /opt/nvidia-gpu-operator/artifacts/runtimeclasses/kata-nvidia-gpu/configuration-kata-qemu-nvidia-gpu.toml and added these three items:

[hypervisor.qemu]
enable_iommu = true
enable_vfio = true
enable_annotations = ["enable_iommu","default_memory"]

But the pod restarts repeatedly and never starts normally; the following errors occur:

Warning FailedCreatePodSandBox 28s (x18 over 26m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: exitting QMP loop, command cancelled: unknown

Warning FailedCreatePodSandBox 3m23s (x373 over 168m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: QMP command failed: vfio 0000:1b:00.1: group 36 used in multiple address spaces: unknown
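
The second error suggests that another PCI function in the same IOMMU group as the GPU is being attached to a different address space. The sketch below shows one way to check which devices share that group; the group number 36 and the PCI address come from the error message above:

# List all PCI functions that belong to IOMMU group 36
ls /sys/kernel/iommu_groups/36/devices/

# Identify those functions; on a GeForce card this is typically the GPU (.0)
# plus its audio function (.1), which have to be passed through together
lspci -nnk -s 1b:00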

To Reproduce
Detailed steps to reproduce the issue.

Expected behavior
A clear and concise description of what you expected to happen.

Environment (please provide the following information):

  • GPU Operator Version: [e.g. v25.3.0]
  • OS: [e.g. Ubuntu24.04]
  • Kernel Version: [e.g. 6.8.0-generic]
  • Container Runtime Version: [e.g. containerd 2.0.0]
  • Kubernetes Distro and Version: [e.g. K8s, OpenShift, Rancher, GKE, EKS]

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]
