Summary
When using NVIDIA MPS with the Kubernetes device plugin, the set_default_active_thread_percentage value is set incorrectly, which severely throttles workloads scheduled onto the same GPU.
This parameter is global to each MPS daemon, and since there is one MPS daemon per GPU, setting it incorrectly (e.g. based on the replica count) throttles all associated workloads to a fraction of the GPU's capacity.
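For reference, the daemon-wide value can be inspected directly. A minimal sketch, assuming you can exec into a pod that shares the MPS pipe directory (CUDA_MPS_PIPE_DIRECTORY) with the control daemon started by the device plugin:

```sh
# Daemon-wide default inherited by every new MPS server on this GPU.
echo "get_default_active_thread_percentage" | nvidia-cuda-mps-control

# List the running MPS servers and check the cap actually applied to one of them
# (<SERVER_PID> is a placeholder for a PID returned by get_server_list).
echo "get_server_list" | nvidia-cuda-mps-control
echo "get_active_thread_percentage <SERVER_PID>" | nvidia-cuda-mps-control
```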
Observed Behavior
- When running with the default active_thread_percentage applied by the device plugin, nvidia-smi (or any other GPU monitoring tool) shows GPU usage around 60% with 2 workloads; adding more workloads pushes GPU usage to 100% and the applications start to slow down.
- When applying set_active_thread_percentage 100 manually via nvidia-cuda-mps-control, the same workload drops to ~2–3% GPU usage, showing that resources are correctly shared and not artificially limited.
- This confirms that the device plugin is configuring MPS with the wrong active_thread_percentage during initialization.
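For completeness, the utilization figures above can be reproduced with standard tooling; any monitoring tool should show the same pattern. For example:

```sh
# Poll overall GPU utilization and memory use once per second.
nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1

# Or watch per-GPU SM utilization continuously.
nvidia-smi dmon -s u
```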
How to Reproduce
- Deploy a GPU workload using the Kubernetes NVIDIA device plugin with MPS enabled (a deployment sketch follows this list).
- Observe GPU utilization in DCGM / nvidia-smi.
- Exec into the workload pod and run:

  ```sh
  echo "get_server_list" | nvidia-cuda-mps-control          # note the returned MPS server PID ($SERVERID)
  echo "set_active_thread_percentage $SERVERID 100" | nvidia-cuda-mps-control
  ```
- Restart the workload to apply the new value.
- Observe that per-workload GPU usage drops, so more workloads can be scheduled on the GPU, and the application no longer slows down.
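A minimal deployment sketch for the first step; the Helm release/repo names, config file name, and container image below are illustrative, and the sharing.mps stanza follows the device plugin's documented config format:

```sh
# Device plugin config enabling MPS sharing (2 replicas per GPU); this is the
# stanza from which the plugin derives its MPS settings.
cat <<'EOF' > dp-mps-config.yaml
version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 2
EOF

# Install/upgrade the device plugin with that config (release/repo names illustrative).
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  --set-file config.map.config=dp-mps-config.yaml

# A workload requesting one shared GPU slice.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: mps-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["bash", "-c", "nvidia-smi; sleep 3600"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
```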
Root Cause
set_default_active_thread_percentage is applied based on replica count, but MPS only runs a single daemon per GPU, so this setting is shared across all workloads.
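To make the effect concrete (illustration only; the 100/replicas split is inferred from the observed behaviour, not from the plugin source):

```sh
# With N replicas configured for a GPU, the plugin appears to set the daemon-wide
# default to roughly 100/N for that GPU's MPS daemon.
REPLICAS=2
echo "set_default_active_thread_percentage $((100 / REPLICAS))" | nvidia-cuda-mps-control

# Because the daemon (and therefore this default) is per GPU, a single client is
# still capped at ~50% of the SMs even while the rest of the GPU sits idle.
echo "get_default_active_thread_percentage" | nvidia-cuda-mps-control
```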
Expected Behavior
The device plugin should not override active_thread_percentage unless explicitly configured by the user.
Per-GPU or per-pod resource tuning should not be attempted in this manner without accounting for the fact that the active thread percentage is global to the MPS daemon, and therefore to every workload on that GPU.
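Until this is addressed, a workaround sketch based on the manual commands in the reproduction steps above (assumes access to wherever the per-GPU MPS control daemon started by the plugin is running):

```sh
# Reset the daemon-wide default so MPS servers created from now on are not capped.
echo "set_default_active_thread_percentage 100" | nvidia-cuda-mps-control

# Lift the cap on servers that already exist; workloads pick the new value up
# after a restart, as noted above.
for pid in $(echo "get_server_list" | nvidia-cuda-mps-control); do
  echo "set_active_thread_percentage $pid 100" | nvidia-cuda-mps-control
done
```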
Environment Details
GPU: NVIDIA RTX 4000
Driver version: 580.82.07
Container image: nvcr.io/nvidia/k8s-device-plugin:v0.17.4
Kubernetes version: 1.32.6