Init container hangs waiting for device-plugin.config label #1540

@uristernik

Description

We’ve encountered a recurring issue where the init container in the nvidia-device-plugin (and similarly in the mps-control-daemon) becomes stuck in the Running state and never completes. This prevents the main container from starting.

The last log lines observed before the hang (for the device-plugin):

nvidia-device-plugin-init W1127 22:37:28.837842       7 client_config.go:659] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
nvidia-device-plugin-init I1127 22:37:28.838748       7 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label

The last log lines observed before the hang (for the mps-control-daemon):

mps-control-daemon-mounts I1127 18:28:06.816180       7 main.go:81] "NVIDIA MPS Control Daemon" version=<
mps-control-daemon-mounts     e0a461e1
mps-control-daemon-mounts     commit: e0a461e1e7ad1d239d4708c954f08c3038e2654a
mps-control-daemon-mounts  >
mps-control-daemon-mounts W1127 18:28:06.818457       7 mount_helper_common.go:34] Warning: mount cleanup skipped because path does not exist: /mps/shm
mps-control-daemon-init W1127 18:28:14.816030      20 client_config.go:659] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
mps-control-daemon-init I1127 18:28:14.816535      20 main.go:246] Waiting for change to 'nvidia.com/device-plugin.config' label

Restarting the affected pod immediately resolves the issue.

Observed Behavior

  • The init container remains stuck indefinitely.
  • The main container does not start.
  • After restarting the pod, it proceeds normally.

Environment Details

  • Component: nvidia-device-plugin (and mps-control-daemon)
  • Kubernetes version: v1.32.9 EKS
  • Driver / Toolkit version: v0.17.3
  • Config override: none; we are not overriding the config per node.
  • Deployed via the Helm chart
  • General config:
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        mps:
          renameByDefault: false
          resources:
          - name: nvidia.com/gpu
            replicas: 5

Frequency

  • Occurs on approximately 1–3% of nodes.
  • Appears to be non-deterministic, potentially a race condition.

Code Reference
This is where the hang occurs:

func start(c *cli.Context, f *Flags) error {
	kubeconfig, err := clientcmd.BuildConfigFromFlags("", f.Kubeconfig)
	if err != nil {
		return fmt.Errorf("error building kubernetes clientcmd config: %s", err)
	}
	clientset, err := kubernetes.NewForConfig(kubeconfig)
	if err != nil {
		return fmt.Errorf("error building kubernetes clientset from config: %s", err)
	}
	config := NewSyncableConfig(f)
	stop := continuouslySyncConfigChanges(clientset, config, f)
	defer close(stop)
	for {
		klog.Infof("Waiting for change to '%s' label", f.NodeLabel)
		config := config.Get()
		klog.Infof("Label change detected: %s=%s", f.NodeLabel, config)
		err := updateConfig(config, f)
		if f.Oneshot || err != nil {
			return err
		}
	}
}

From an initial review, the hang seems to occur inside the loop, presumably in config.Get() while waiting to acquire the lock: the "Waiting for change" line is logged, but "Label change detected" never follows.
We considered setting ONESHOT=false, but since the hang happens before the oneshot check is ever reached, changing that flag is unlikely to resolve the issue.
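
To make the suspected failure mode concrete, here is a minimal, self-contained sketch of the kind of condition-variable wait we believe is involved. The type and function names below (syncableConfig, Set, Get) are our own illustration, not the actual SyncableConfig implementation in this repository. The point is that a Get() built on sync.Cond blocks until a corresponding Set() signals it; if the label watch never delivers a value (or the signal is missed), the wait never returns and the init container never exits.

package main

import (
	"fmt"
	"sync"
	"time"
)

// syncableConfig is a hypothetical stand-in for the plugin's SyncableConfig.
type syncableConfig struct {
	cond    *sync.Cond
	current string
	updated bool
}

func newSyncableConfig() *syncableConfig {
	var m sync.Mutex
	return &syncableConfig{cond: sync.NewCond(&m)}
}

// Set records a new config value and wakes any goroutine blocked in Get.
func (s *syncableConfig) Set(value string) {
	s.cond.L.Lock()
	defer s.cond.L.Unlock()
	s.current = value
	s.updated = true
	s.cond.Broadcast()
}

// Get blocks until Set has been called since the last Get. If no update ever
// arrives, this wait never returns -- which matches the symptom of the init
// container sitting on "Waiting for change to ... label" forever.
func (s *syncableConfig) Get() string {
	s.cond.L.Lock()
	defer s.cond.L.Unlock()
	for !s.updated {
		s.cond.Wait()
	}
	s.updated = false
	return s.current
}

func main() {
	cfg := newSyncableConfig()
	go func() {
		time.Sleep(time.Second)
		cfg.Set("default") // comment this out to reproduce an indefinite hang in Get()
	}()
	fmt.Println("got config:", cfg.Get())
}

In this sketch, removing the Set call reproduces an indefinite hang in Get(), which is consistent with the behavior we observe on the affected nodes.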
