GPU already used, showing up in multiple containers #1021

@astranero


I have an issue with the NVIDIA GPU Operator: when I set the limit "nvidia.com/gpu: 1", the pod gets scheduled with a GPU that is already allocated to another container.
Additionally, I previously had trouble with containers showing one additional GPU even though the limit was set to 1.

What I want: Only allocate a GPU that is not already in use by another pod.
What it does: Allocates a GPU that is already in use by another pod.
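
One way to confirm the symptom is to compare the GPU UUIDs visible inside each pod. The following is only a sketch: `find_shared_gpus` and `list_pod_gpus` are hypothetical helper names, the namespace and pod names are placeholders, and it assumes `nvidia-smi` is available inside the containers.

```shell
#!/usr/bin/env bash

# find_shared_gpus: reads "pod gpu-uuid" pairs on stdin and prints every
# UUID that appears under more than one pod, followed by those pods.
find_shared_gpus() {
  sort -u | awk '
    { pods[$2] = pods[$2] " " $1; count[$2]++ }
    END { for (u in count) if (count[u] > 1) print u ":" pods[u] }
  ' | sort
}

# list_pod_gpus (hypothetical helper): prints "pod uuid" for each GPU
# that nvidia-smi reports inside the given pod.
list_pod_gpus() {
  local ns="$1" pod="$2"
  kubectl exec -n "$ns" "$pod" -- \
    nvidia-smi --query-gpu=uuid --format=csv,noheader \
    | awk -v pod="$pod" '{ print pod, $1 }'
}

# Example (placeholder names):
#   { list_pod_gpus default pod-a; list_pod_gpus default pod-b; } | find_shared_gpus
```

Any line printed by `find_shared_gpus` names a GPU that more than one pod can see, which is the behaviour described above.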

Environment:
GPU model: H100, NVIDIA-SMI 550.90.12, Driver Version: 550.90.12, CUDA Version: 12.4

Installation steps:

  1. Installing gpu-operator resources
     microk8s.helm3 install gpu-operator -n gpu-operator-resources --create-namespace \
       nvidia/gpu-operator --version v24.6.1 \
       --set toolkit.env[0].name=CONTAINERD_CONFIG \
       --set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
       --set toolkit.env[1].name=CONTAINERD_SOCKET \
       --set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
       --set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
       --set toolkit.env[2].value=nvidia \
       --set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
       --set-string toolkit.env[3].value=true \
       --set cdi.default=false --set cdi.enabled=true \
       --set toolkit.enabled=true --set driver.enabled=false
  2. Patching CDI manually
     kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
       -p='[{"op": "replace", "path": "/spec/cdi/default", "value":true}]'
  3. Removing the default runtime to avoid being given two GPUs
     vi /var/snap/microk8s/current/args/containerd-template.toml
     default_runtime_name = "nvidia"   # REMOVED THIS
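
To check that the steps above took effect, something like the following could be used. This is a sketch, not an official procedure: both function names are hypothetical, and the file path and ClusterPolicy name are the ones used in the steps above.

```shell
#!/usr/bin/env bash

# has_nvidia_default_runtime (hypothetical helper): succeeds if the given
# containerd config still has an uncommented default_runtime_name = "nvidia".
has_nvidia_default_runtime() {
  grep -Eq '^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"nvidia"' "$1"
}

# show_cdi_settings (hypothetical helper): prints the cdi section of the
# ClusterPolicy so the patched "default" value can be inspected.
show_cdi_settings() {
  kubectl get clusterpolicies.nvidia.com/cluster-policy -o jsonpath='{.spec.cdi}'
}

# Example:
#   has_nvidia_default_runtime /var/snap/microk8s/current/args/containerd-template.toml \
#     && echo "nvidia is still the default runtime"
#   show_cdi_settings
```

If the toolkit container rewrites the containerd config after a restart, the first check may succeed again even though the line was removed by hand, so it may be worth re-running it after the operator pods settle.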

I would appreciate any help I can get. Thank you.

    Labels

    lifecycle/stale: Denotes an issue or PR that has remained open with no activity and has become stale.