@karthikvetrivel karthikvetrivel commented Nov 6, 2025

Solves #1661.

Problem

When the nodeSelector field of an NVIDIADriver CR is edited, the resource enters a permanent NotReady state if the change doesn't result in any pod updates (e.g., replacing a label with an equivalent one that selects the same nodes). This causes an infinite reconciliation loop.

Root Cause

The readiness check required UpdatedNumberScheduled == NumberAvailable, but with the OnDelete update strategy existing pods are never automatically updated, so UpdatedNumberScheduled stays at 0 even when the pods are already running on the correct nodes.
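
For reference, a minimal sketch of the old check as described above (illustrative only; the field names come from appsv1.DaemonSetStatus, but the function itself is not the operator's exact code):

```go
import appsv1 "k8s.io/api/apps/v1"

// Old check (illustrative). With OnDelete, existing pods keep their previous
// controller revision, so UpdatedNumberScheduled never reaches NumberAvailable
// after a nodeSelector-only change and this returns false forever.
func isDaemonSetReadyOld(ds *appsv1.DaemonSet) bool {
	st := ds.Status
	return st.DesiredNumberScheduled != 0 &&
		st.DesiredNumberScheduled == st.NumberAvailable &&
		st.UpdatedNumberScheduled == st.NumberAvailable
}
```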

New Logic Flow Diagram

isDaemonSetReady()
  │
  ├─ Check: DesiredNumberScheduled != 0?
  ├─ Check: NumberUnavailable == 0?
  ├─ Check: DesiredNumberScheduled == NumberAvailable?
  │
  ├─ If OnDelete strategy:
  │   │
  │   └─> isDaemonSetReadyOnDelete()
  │        │
  │        ├─ getOwnedPods() → Get pods for DaemonSet
  │        │
  │        ├─ arePodsHealthy() → All Running + Ready?
  │        │   └─ NO → Return NotReady ❌
  │        │
  │        ├─ getLatestRevisionHash() → Get DaemonSet revision
  │        │
  │        ├─ Check pod controller-revision-hash labels
  │        │   ├─ All match? → Return Ready ✅
  │        │   │
  │        │   └─ Some outdated? → verifyNodePlacement()
  │        │        │
  │        │        ├─ For each pod:
  │        │        │   └─ Get node, check labels match nodeSelector
  │        │        │
  │        │        ├─ All on correct nodes? → Return Ready ✅
  │        │        └─ Some on wrong nodes? → Return NotReady ❌
  │
  └─ If RollingUpdate strategy:
      │
      └─> Check: UpdatedNumberScheduled == NumberAvailable?
          ├─ YES → Return Ready ✅
          └─ NO  → Return NotReady ❌
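
A minimal Go sketch of this flow, assuming the helper names from the diagram (getOwnedPods, getLatestRevisionHash, verifyNodePlacement) and a hypothetical stateDriver receiver; only the call sites of those helpers are shown, their implementations and exact signatures are assumptions:

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// isDaemonSetReadySketch mirrors the diagram above. getOwnedPods,
// getLatestRevisionHash and verifyNodePlacement are the hypothetical helpers
// named in the diagram; their implementations are omitted here.
func (s *stateDriver) isDaemonSetReadySketch(ctx context.Context, ds *appsv1.DaemonSet) (bool, error) {
	st := ds.Status
	if st.DesiredNumberScheduled == 0 || st.NumberUnavailable != 0 ||
		st.DesiredNumberScheduled != st.NumberAvailable {
		return false, nil
	}

	// RollingUpdate keeps the original revision-based check.
	if ds.Spec.UpdateStrategy.Type != appsv1.OnDeleteDaemonSetStrategyType {
		return st.UpdatedNumberScheduled == st.NumberAvailable, nil
	}

	// OnDelete: existing pods are never auto-updated, so inspect them directly.
	pods, err := s.getOwnedPods(ctx, ds)
	if err != nil {
		return false, err
	}
	for _, pod := range pods {
		if !isPodRunningAndReady(pod) {
			return false, nil
		}
	}

	latestHash, err := s.getLatestRevisionHash(ctx, ds)
	if err != nil {
		return false, err
	}
	for _, pod := range pods {
		if pod.Labels["controller-revision-hash"] != latestHash {
			// An outdated revision is acceptable as long as every pod still
			// sits on a node that the current nodeSelector targets.
			return s.verifyNodePlacement(ctx, ds, pods)
		}
	}
	return true, nil
}

// isPodRunningAndReady is the "All Running + Ready?" check from the diagram.
func isPodRunningAndReady(pod corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodRunning {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```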

How It Fixes the Bug

Before (Buggy Behavior)

User changes: nodeSelector: {zone: us-east} → {region: us-east}
  ↓
DaemonSet spec updated (creates new ControllerRevision: def456)
  ↓
Kubernetes checks nodes:
  - node1 has {zone: us-east, region: us-east} ✓
  - node2 has {zone: us-east, region: us-east} ✓
  - Pods already on these nodes
  ↓
OnDelete strategy: Don't touch existing pods
  ↓
DaemonSet Status:
  - DesiredNumberScheduled: 2 ✓
  - NumberAvailable: 2 ✓
  - UpdatedNumberScheduled: 0 ✗ (pods have old revision abc123)
  ↓
Readiness Check:
  if (2 != 0 && 2 == 2 && 0 == 2) → FALSE
  ↓
Status: NotReady
  ↓
Reconcile loop: Wait 5s → Check again → Still NotReady
  ↓
♾️ INFINITE LOOP - NEVER BECOMES READY

After (Fixed Behavior)

User changes: nodeSelector: {zone: us-east} → {region: us-east}
  ↓
DaemonSet spec updated (creates new ControllerRevision: def456)
  ↓
Kubernetes checks nodes:
  - node1 has {zone: us-east, region: us-east} ✓
  - node2 has {zone: us-east, region: us-east} ✓
  - Pods already on these nodes
  ↓
OnDelete strategy: Don't touch existing pods
  ↓
DaemonSet Status:
  - DesiredNumberScheduled: 2 ✓
  - NumberAvailable: 2 ✓
  - UpdatedNumberScheduled: 0 ✗ (pods have old revision abc123)
  ↓
NEW Readiness Check:
  ├─ OnDelete strategy detected
  ├─ isDaemonSetReadyOnDelete() called
  │   ├─ getOwnedPods() → [pod1, pod2]
  │   ├─ arePodsHealthy() → TRUE ✓
  │   ├─ getLatestRevisionHash() → "def456"
  │   ├─ Check revisions:
  │   │   - pod1: abc123 ≠ def456 (outdated)
  │   │   - pod2: abc123 ≠ def456 (outdated)
  │   ├─ Pods outdated → verifyNodePlacement() called
  │   │   ├─ pod1 on node1 → node1 has region=us-east? YES ✓
  │   │   └─ pod2 on node2 → node2 has region=us-east? YES ✓
  │   └─ All pods on correct nodes → Return TRUE
  ↓
Status: Ready ✅
  ↓
No more reconciliation needed!

The fix recognizes that for OnDelete strategy, outdated pod revisions are acceptable if pods are already on nodes matching the current nodeSelector. This indicates the change was nodeSelector-only and doesn't require pod recreation.
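
For illustration, a node-placement check along these lines could look roughly like the sketch below. The function name, parameters, and the controller-runtime client are assumptions for the sketch, not the PR's exact code: for each owned pod, fetch the node it runs on and confirm the node's labels satisfy the DaemonSet's current spec.template.spec.nodeSelector.

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// verifyNodePlacementSketch: every owned pod must sit on a node whose labels
// satisfy the DaemonSet's current nodeSelector. The client parameter stands in
// for whatever client the state handler already holds.
func verifyNodePlacementSketch(ctx context.Context, c client.Client, ds *appsv1.DaemonSet, pods []corev1.Pod) (bool, error) {
	selector := ds.Spec.Template.Spec.NodeSelector
	for _, pod := range pods {
		var node corev1.Node
		if err := c.Get(ctx, types.NamespacedName{Name: pod.Spec.NodeName}, &node); err != nil {
			return false, err
		}
		for key, want := range selector {
			if node.Labels[key] != want {
				// This pod is on a node the new nodeSelector no longer targets.
				return false, nil
			}
		}
	}
	return true, nil
}
```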

…ith OnDelete strategy

Signed-off-by: Karthik Vetrivel <[email protected]>
karthikvetrivel force-pushed the fix/nvidia-driver-nodeselector-state branch from 3abada7 to c3b65e2 on November 7, 2025.
if hash, ok := pod.Labels["controller-revision-hash"]; !ok || hash != dsRevisionHash {
// Pods have outdated revision - verify they're on nodes matching current nodeSelector
reqLogger.V(consts.LogLevelInfo).Info("Pods have outdated revision, verifying node placement")
return s.verifyNodePlacement(ctx, ds, ownedPods, reqLogger)
Contributor commented on these lines:

Is node placement the actual thing we care about for "ready" status? If so, can we not just check node placement regardless of the revision hash on the pods?

I don't think we necessarily need to know whether a pod was updated in a level-triggered reconciliation. We just need to periodically check whether the final condition is true.

@karthikvetrivel (Member Author) replied:

This is a great point. I was originally looking at ways to tell whether a pod has been updated, but that isn't strictly required. I will look into updating this.

@rajathagasthya (Contributor) commented Nov 7, 2025:

I'm wondering if this also applies to the UpdatedNumberScheduled check. I'm thinking the entire status check can be reduced to:

  1. DesiredNumberScheduled == NumberAvailable, AND
  2. Pods are placed on the correct nodes (or, more precisely: each node selected by the nodeSelector has a pod scheduled on it)

cc @tariq1890 @cdesiniotis to validate if my assumptions are correct.
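
For illustration only, a sketch of the second, more precise formulation above (every node selected by the nodeSelector has a pod scheduled on it); the function name and parameters are hypothetical and this is not a committed design:

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// everySelectedNodeHasPod lists the nodes the current nodeSelector targets and
// confirms each one hosts one of the DaemonSet's pods.
func everySelectedNodeHasPod(ctx context.Context, c client.Client, ds *appsv1.DaemonSet, pods []corev1.Pod) (bool, error) {
	var nodes corev1.NodeList
	if err := c.List(ctx, &nodes, client.MatchingLabels(ds.Spec.Template.Spec.NodeSelector)); err != nil {
		return false, err
	}
	scheduled := make(map[string]bool, len(pods))
	for _, pod := range pods {
		scheduled[pod.Spec.NodeName] = true
	}
	for _, node := range nodes.Items {
		if !scheduled[node.Name] {
			return false, nil
		}
	}
	return true, nil
}
```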

@karthikvetrivel (Member Author) commented:

One issue with this change is that if the driver image spec is updated but the pods are still on the correct nodes, the DaemonSet is marked as Ready. Previously, it would have been marked as NotReady.

The only way I can think of to get around this is to distinguish between placement changes (i.e. nodeSelector, taints/tolerations) and workload changes (image, command/args, env variables). If a placement change was made but the resulting pod placement is equivalent to the prior policy, we keep the DaemonSet marked as Ready. For any workload change, we would mark the DaemonSet as NotReady.

@karthikvetrivel (Member Author) commented:

I'm closing this PR. After discussion, it seems this requires a larger change to how nodeSelector is handled in the NVIDIADriver CR.
