
kata-qemu-nvidia-gpu does not exist #1906

@ys928

Description

Important Note: NVIDIA AI Enterprise customers can get support from NVIDIA Enterprise support. Please open a case here.

Describe the bug

I installed the GPU Operator with Kata Containers support according to the official documentation, and all of the operator pods are currently running. However, the RuntimeClass names do not match the names in the documentation: the qemu variant is missing. I therefore started the container with kata-nvidia-gpu instead, but it still does not start:

root@master-01:~# kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS    RESTARTS      AGE
gpu-operator-1763107084-node-feature-discovery-gc-5cbbf8546klcw   1/1     Running   5 (27h ago)   3d22h
gpu-operator-1763107084-node-feature-discovery-master-768dn9kpw   1/1     Running   5 (27h ago)   3d22h
gpu-operator-1763107084-node-feature-discovery-worker-7wk85       1/1     Running   5 (27h ago)   3d22h
gpu-operator-7b5fb5b8b-mg84b                                      1/1     Running   5 (27h ago)   3d22h
nvidia-kata-manager-5kvc9                                         1/1     Running   0             22h
nvidia-sandbox-device-plugin-daemonset-f7spt                      1/1     Running   0             22h
nvidia-sandbox-validator-n5lr6                                    1/1     Running   0             22h
nvidia-vfio-manager-glt2z                                         1/1     Running   0             22h

root@master-01:~# kubectl get runtimeclasses.node.k8s.io
NAME                  HANDLER               AGE
kata                  kata                  3d22h
kata-clh              kata-clh              3d22h
kata-clh-tdx          kata-clh-tdx          3d22h
kata-nvidia-gpu       kata-nvidia-gpu       3d22h
kata-nvidia-gpu-snp   kata-nvidia-gpu-snp   3d22h
kata-qemu             kata-qemu             3d22h
kata-qemu-sev         kata-qemu-sev         3d22h
kata-qemu-snp         kata-qemu-snp         3d22h
kata-qemu-tdx         kata-qemu-tdx         3d22h
nvidia                nvidia                3d22h

The RuntimeClass here is kata-nvidia-gpu, not kata-qemu-nvidia-gpu as described in the documentation.
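
For reference, the sketch below shows how to check which kata handlers are actually registered with containerd on the node, to see whether the documented kata-qemu-nvidia-gpu handler exists under a different RuntimeClass name. It assumes the default containerd config path; the kata-manager may also place its runtime entries in a drop-in file.

# List the kata runtime handlers registered in the containerd config
# (default config path assumed; adjust if the node uses a drop-in or custom path)
grep -n 'containerd.runtimes.kata' /etc/containerd/config.toml

# Check specifically for the handler name used in the documentation
grep -n 'kata-qemu-nvidia-gpu' /etc/containerd/config.toml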

I used the following YAML to run the container:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd-kata
  annotations:
    cdi.k8s.io/gpu: "nvidia.com/pgpu=0"
    io.katacontainers.config.hypervisor.default_memory: "4096"
spec:
  runtimeClassName: kata-nvidia-gpu
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      requests:
        memory: 6Gi
      limits:
        "nvidia.com/GA102_GEFORCE_RTX_3090": 1
        memory: 6Gi

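Applying the manifest looks roughly like this; the filename cuda-vectoradd-kata.yaml is just my name for the file above:

# Apply the pod manifest and watch its status (filename assumed)
kubectl apply -f cuda-vectoradd-kata.yaml
kubectl get pod cuda-vectoradd-kata -w

# The exact sandbox-creation error shows up under Events
kubectl describe pod cuda-vectoradd-kata
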
At this point, the pod fails to start with an error message saying that default_memory is missing. Even with this option added, it still cannot start, so I modified the file /opt/nvidia-gpu-operator/artifacts/runtimeclasses/kata-nvidia-gpu/configuration-kata-qemu-nvidia-gpu.toml and added these three items:

[hypervisor.qemu]
enable_iommu = true
enable_vfio = true
enable_annotations = ["enable_iommu","default_memory"]

But the pod restarts repeatedly and never starts normally; the following errors occur:

Warning FailedCreatePodSandBox 28s (x18 over 26m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: exitting QMP loop, command cancelled: unknown

Warning FailedCreatePodSandBox 3m23s (x373 over 168m) kubelet Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create containerd task: failed to create shim task: QMP command failed: vfio 0000:1b:00.1: group 36 used in multiple address spaces: unknown
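
The second error suggests that another PCI function in the same IOMMU group as the GPU is being attached to a different address space. The sketch below shows one way to check which devices share that group; the group number 36 and the PCI address come from the error message above:

# List all PCI functions that belong to IOMMU group 36
ls /sys/kernel/iommu_groups/36/devices/

# Identify those functions; on a GeForce card this is typically the GPU (.0)
# plus its audio function (.1), which have to be passed through together
lspci -nnk -s 1b:00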

To Reproduce
Detailed steps to reproduce the issue.

Expected behavior
A clear and concise description of what you expected to happen.

Environment (please provide the following information):

  • GPU Operator Version: [e.g. v25.3.0]
  • OS: [e.g. Ubuntu24.04]
  • Kernel Version: [e.g. 6.8.0-generic]
  • Container Runtime Version: [e.g. containerd 2.0.0]
  • Kubernetes Distro and Version: [e.g. K8s, OpenShift, Rancher, GKE, EKS]

Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: [email protected]
