Description
We’ve encountered a recurring issue where the init container in the nvidia-device-plugin (and similarly in the mps-control-daemon) becomes stuck in the Running state and never completes. This prevents the main container from starting.
The last log lines observed before the hang (for the device-plugin):
nvidia-device-plugin-init W1127 22:37:28.837842 7 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
nvidia-device-plugin-init I1127 22:37:28.838748 7 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label
The last log lines observed before the hang (for the mps-control-daemon):
mps-control-daemon-mounts I1127 18:28:06.816180 7 main.go:81] "NVIDIA MPS Control Daemon" version=<
mps-control-daemon-mounts e0a461e1
mps-control-daemon-mounts commit: e0a461e1e7ad1d239d4708c954f08c3038e2654a
mps-control-daemon-mounts >
mps-control-daemon-mounts W1127 18:28:06.818457 7 mount_helper_common.go:34] Warning: mount cleanup skipped because path does not exist: /mps/shm
mps-control-daemon-init W1127 18:28:14.816030 20 client_config.go:659] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
mps-control-daemon-init I1127 18:28:14.816535 20 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label
Restarting the affected pod immediately resolves the issue.
Observed Behavior
- The init container remains stuck indefinitely.
- The main container does not start.
- After restarting the pod, it proceeds normally.
Environment Details
- Component: nvidia-device-plugin (and mps-control-daemon)
- Kubernetes version: v1.32.9 EKS
- Driver / Toolkit version: v0.17.3
- Config override: Not overriding config per node.
- Deployed via helm chart
- General config:
  config:
    map:
      default: |-
        version: v1
        flags:
          migStrategy: none
        sharing:
          mps:
            renameByDefault: false
            resources:
              - name: nvidia.com/gpu
                replicas: 5
Frequency
- Occurs on approximately 1–3% of nodes.
- Appears to be non-deterministic, potentially a race condition.
Code Reference
Code reference for where the hang occurs:
k8s-device-plugin/cmd/config-manager/main.go
Lines 229 to 255 in 624b771
func start(c *cli.Context, f *Flags) error {
	kubeconfig, err := clientcmd.BuildConfigFromFlags("", f.Kubeconfig)
	if err != nil {
		return fmt.Errorf("error building kubernetes clientcmd config: %s", err)
	}
	clientset, err := kubernetes.NewForConfig(kubeconfig)
	if err != nil {
		return fmt.Errorf("error building kubernetes clientset from config: %s", err)
	}
	config := NewSyncableConfig(f)
	stop := continuouslySyncConfigChanges(clientset, config, f)
	defer close(stop)
	for {
		klog.Infof("Waiting for change to '%s' label", f.NodeLabel)
		config := config.Get()
		klog.Infof("Label change detected: %s=%s", f.NodeLabel, config)
		err := updateConfig(config, f)
		if f.Oneshot || err != nil {
			return err
		}
	}
}
From an initial review, the hang appears to occur inside config.Get(), while waiting to acquire the lock: the last log line is the "Waiting for change ..." message printed immediately before that call, and the "Label change detected" message that follows it is never logged.
We considered setting ONESHOT=false, but since the hang appears to happen during lock acquisition, that may not resolve the issue.
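To make the suspicion concrete, here is a minimal, self-contained sketch of the blocking pattern a Cond-backed "syncable" value typically uses. This is not the actual plugin code; syncableValue and its fields are invented for illustration. The point is that Get parks in cond.Wait until a producer calls Set, so if the label-watch side never publishes a value, the consumer sits there forever with "Waiting for change ..." as its last log line.

package main

import (
	"fmt"
	"sync"
	"time"
)

// syncableValue is a hypothetical stand-in for the config-manager's
// NewSyncableConfig/Get pair: Get blocks on a condition variable until
// Set publishes a new value.
type syncableValue struct {
	cond    *sync.Cond
	current string
	set     bool
}

func newSyncableValue() *syncableValue {
	return &syncableValue{cond: sync.NewCond(&sync.Mutex{})}
}

// Set publishes a value and wakes any goroutine blocked in Get.
func (v *syncableValue) Set(s string) {
	v.cond.L.Lock()
	defer v.cond.L.Unlock()
	v.current = s
	v.set = true
	v.cond.Broadcast()
}

// Get parks on cond.Wait until a value has been published.
// If the producer never calls Set, this blocks indefinitely.
func (v *syncableValue) Get() string {
	v.cond.L.Lock()
	defer v.cond.L.Unlock()
	for !v.set {
		v.cond.Wait()
	}
	v.set = false
	return v.current
}

func main() {
	v := newSyncableValue()
	go func() {
		time.Sleep(time.Second) // simulate the label-watch callback firing late (or never)
		v.Set("default")
	}()
	fmt.Println("got config:", v.Get())
}

Whether the real NewSyncableConfig behaves exactly like this we have not verified; the sketch only shows that a Set that is never issued on the watch side (for example if the informer/watch fails to start) would reproduce the observed symptom.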