
Bug: Incorrect set_default_active_thread_percentage Behavior in Kubernetes Device Plugin with MPS #1494


Description


Summary

When using NVIDIA MPS with the Kubernetes device plugin, the set_default_active_thread_percentage value is set incorrectly, so workloads scheduled to the same GPU are severely throttled and cannot use the GPU's full capacity.

This parameter is global per MPS daemon, and since there is one MPS daemon per GPU, incorrectly setting it (e.g. based on replica count) results in throttling all associated workloads to a fraction of the GPU capacity.
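The global nature of the setting can be checked directly against the control daemon for a GPU. The following is a minimal sketch using standard nvidia-cuda-mps-control queries; the pipe directory path is an assumption and must match whatever the device plugin actually configured.

# Point at the MPS pipe directory set up by the device plugin
# (/tmp/nvidia-mps is only an assumed default here).
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps

# Default SM fraction handed to every new client on this GPU:
echo "get_default_active_thread_percentage" | nvidia-cuda-mps-control

# List running MPS servers, then query the limit of a specific one:
echo "get_server_list" | nvidia-cuda-mps-control
echo "get_active_thread_percentage <SERVER_PID>" | nvidia-cuda-mps-control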

Observed Behavior

  • With the default active_thread_percentage applied by the device plugin, nvidia-smi (or any other GPU monitoring tool) reports roughly 60% GPU utilization with 2 workloads; adding more workloads pushes utilization to 100% and the applications start to slow down.
  • After applying set_active_thread_percentage 100 manually via nvidia-cuda-mps-control, the same workloads drop to ~2–3% GPU utilization, showing that resources are then shared correctly rather than artificially limited.
  • This confirms that the device plugin configures MPS with the wrong active_thread_percentage during initialization.

How to Reproduce

  • Deploy a GPU workload using the Kubernetes NVIDIA device plugin with MPS enabled (a sample sharing config is sketched after this list).
  • Observe GPU utilization in DCGM or nvidia-smi.
  • Exec into the workload pod and run:
# get_server_list prints one server PID per line; this assumes a single MPS server.
SERVERID=$(echo "get_server_list" | nvidia-cuda-mps-control)
echo "set_active_thread_percentage $SERVERID 100" | nvidia-cuda-mps-control
  • Restart the workload to apply the new value.
  • Observe that the workload's GPU usage decreases, so it can scale up further, and the application no longer slows down.
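For reference, MPS sharing in the device plugin is enabled through its sharing configuration. The snippet below is only a sketch of such a config, with the resource name and replica count chosen to match the two-workload scenario above.

version: v1
sharing:
  mps:
    resources:
    - name: nvidia.com/gpu
      replicas: 2

With replicas: 2 the single physical GPU is advertised as two schedulable nvidia.com/gpu resources, both backed by the same per-GPU MPS control daemon.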

Root Cause

set_default_active_thread_percentage is derived from the replica count, but MPS runs only a single control daemon per GPU, so the value is applied globally and shared by every workload on that GPU.
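As an illustration of the effect (the exact commands issued by the plugin are not reproduced here; this sketch only mirrors the 100/replicas behavior described above):

# Hypothetical sketch of what the per-GPU control daemon ends up with
# when the default is derived from the replica count:
REPLICAS=2
echo "set_default_active_thread_percentage $((100 / REPLICAS))" | nvidia-cuda-mps-control
# There is exactly one control daemon per GPU, so this 50% cap applies
# to every client on the GPU, not just to one replica's slice.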

Expected Behavior

The device plugin should not override active_thread_percentage unless explicitly configured by the user.

Per-GPU or per-pod resource tuning should not be attempted this way, because active_thread_percentage is a global, per-GPU MPS setting rather than a per-client one.

Environment Details

  • GPU: e.g. NVIDIA RTX 4000
  • Driver version: 580.82.07
  • Container image: nvcr.io/nvidia/k8s-device-plugin:v0.17.4
  • Kubernetes version: 1.32.6
