Closed
Labels
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
Description
I have an issue with nvidia-gpu-operator: when I set a limit of "nvidia.com/gpu: 1", the pod gets scheduled with a GPU that is already allocated to another container.
Additionally, I previously had trouble with containers showing one additional GPU even though the limit was set to 1.
What I want: Only allocate a GPU that is not already in use by another pod.
What it does: Allocates a GPU that is already in use by another pod.
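One way to confirm the overlap is to compare the GPU UUIDs that each pod reports. The following is a minimal sketch, assuming the output of `nvidia-smi -L` has already been collected per pod (for example with `kubectl exec <pod> -- nvidia-smi -L`). The function name `find_shared_gpus`, the pod names, and the UUIDs are hypothetical placeholders, not values from this cluster.

```python
import re
from collections import defaultdict

# Matches the UUID field in `nvidia-smi -L` lines such as:
#   GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)
UUID_RE = re.compile(r"UUID:\s*(GPU-[0-9a-fA-F-]+)")


def find_shared_gpus(smi_output_by_pod):
    """Return {gpu_uuid: [pod names]} for every GPU visible in more than one pod.

    smi_output_by_pod maps a pod name to the raw text printed by
    `nvidia-smi -L` inside that pod.
    """
    pods_by_gpu = defaultdict(set)
    for pod, output in smi_output_by_pod.items():
        for uuid in UUID_RE.findall(output):
            pods_by_gpu[uuid].add(pod)
    # Keep only GPUs that appear in two or more pods.
    return {
        uuid: sorted(pods)
        for uuid, pods in pods_by_gpu.items()
        if len(pods) > 1
    }


if __name__ == "__main__":
    # Placeholder data for illustration only.
    sample = {
        "pod-a": "GPU 0: NVIDIA H100 (UUID: GPU-aaaaaaaa-0000-0000-0000-000000000001)\n",
        "pod-b": (
            "GPU 0: NVIDIA H100 (UUID: GPU-aaaaaaaa-0000-0000-0000-000000000001)\n"
            "GPU 1: NVIDIA H100 (UUID: GPU-aaaaaaaa-0000-0000-0000-000000000002)\n"
        ),
    }
    for uuid, pods in find_shared_gpus(sample).items():
        print(f"{uuid} is visible in: {', '.join(pods)}")
```

An empty result means no GPU UUID was seen by more than one pod. Note that this only checks which GPUs are visible inside each container, which may differ from what the device plugin recorded as allocated.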
Environment:
GPU model: H100, NVIDIA-SMI 550.90.12, Driver Version: 550.90.12, CUDA Version: 12.4
Installation steps:
- Installing gpu-operator resources
microk8s.helm3 install gpu-operator -n gpu-operator-resources --create-namespace nvidia/gpu-operator --version v24.6.1 \
--set toolkit.env[0].name=CONTAINERD_CONFIG \
--set toolkit.env[0].value=/var/snap/microk8s/current/args/containerd-template.toml \
--set toolkit.env[1].name=CONTAINERD_SOCKET \
--set toolkit.env[1].value=/var/snap/microk8s/common/run/containerd.sock \
--set toolkit.env[2].name=CONTAINERD_RUNTIME_CLASS \
--set toolkit.env[2].value=nvidia \
--set toolkit.env[3].name=CONTAINERD_SET_AS_DEFAULT \
--set-string toolkit.env[3].value=true \
--set cdi.default=false \
--set cdi.enabled=true \
--set toolkit.enabled=true \
--set driver.enabled=false
- Patching CDI manually
kubectl patch clusterpolicies.nvidia.com/cluster-policy --type='json' \
-p='[{"op": "replace", "path": "/spec/cdi/default", "value":true}]'
- Removing the default runtime to stop containers from being given two GPUs
vi /var/snap/microk8s/current/args/containerd-template.toml
default_runtime_name = "nvidia" # REMOVED THIS
I would appreciate any help I can get. Thank you.