@karthikvetrivel karthikvetrivel commented Nov 6, 2025

Solves #1661.

Problem

When the nodeSelector field of an NVIDIADriver CR is edited, the resource enters a permanent NotReady state if the change doesn't result in any pod updates (e.g., replacing a label with an equivalent one that selects the same nodes). This causes an infinite reconciliation loop.

Root Cause

The readiness check required UpdatedNumberScheduled == NumberAvailable, but with the OnDelete update strategy existing pods are never automatically updated, so UpdatedNumberScheduled stays at 0 even when the pods are already running on the correct nodes.
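
For reference, a minimal sketch of the old check as described above (illustrative only; the field names come from appsv1.DaemonSetStatus, but the function itself is not the operator's exact code):

```go
import appsv1 "k8s.io/api/apps/v1"

// Old check (illustrative). With OnDelete, existing pods keep their previous
// controller revision, so UpdatedNumberScheduled never reaches NumberAvailable
// after a nodeSelector-only change and this returns false forever.
func isDaemonSetReadyOld(ds *appsv1.DaemonSet) bool {
	st := ds.Status
	return st.DesiredNumberScheduled != 0 &&
		st.DesiredNumberScheduled == st.NumberAvailable &&
		st.UpdatedNumberScheduled == st.NumberAvailable
}
```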

New Logic Flow Diagram

isDaemonSetReady()
  │
  ├─ Check: DesiredNumberScheduled != 0?
  ├─ Check: NumberUnavailable == 0?
  ├─ Check: DesiredNumberScheduled == NumberAvailable?
  │
  ├─ If OnDelete strategy:
  │   │
  │   └─> isDaemonSetReadyOnDelete()
  │        │
  │        ├─ getOwnedPods() → Get pods for DaemonSet
  │        │
  │        ├─ arePodsHealthy() → All Running + Ready?
  │        │   └─ NO → Return NotReady ❌
  │        │
  │        ├─ getLatestRevisionHash() → Get DaemonSet revision
  │        │
  │        ├─ Check pod controller-revision-hash labels
  │        │   ├─ All match? → Return Ready ✅
  │        │   │
  │        │   └─ Some outdated? → verifyNodePlacement()
  │        │        │
  │        │        ├─ For each pod:
  │        │        │   └─ Get node, check labels match nodeSelector
  │        │        │
  │        │        ├─ All on correct nodes? → Return Ready ✅
  │        │        └─ Some on wrong nodes? → Return NotReady ❌
  │
  └─ If RollingUpdate strategy:
      │
      └─> Check: UpdatedNumberScheduled == NumberAvailable?
          ├─ YES → Return Ready ✅
          └─ NO  → Return NotReady ❌
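
A minimal Go sketch of this flow, assuming the helper names from the diagram (getOwnedPods, getLatestRevisionHash, verifyNodePlacement) and a hypothetical stateDriver receiver; only the call sites of those helpers are shown, their implementations and exact signatures are assumptions:

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// isDaemonSetReadySketch mirrors the diagram above. getOwnedPods,
// getLatestRevisionHash and verifyNodePlacement are the hypothetical helpers
// named in the diagram; their implementations are omitted here.
func (s *stateDriver) isDaemonSetReadySketch(ctx context.Context, ds *appsv1.DaemonSet) (bool, error) {
	st := ds.Status
	if st.DesiredNumberScheduled == 0 || st.NumberUnavailable != 0 ||
		st.DesiredNumberScheduled != st.NumberAvailable {
		return false, nil
	}

	// RollingUpdate keeps the original revision-based check.
	if ds.Spec.UpdateStrategy.Type != appsv1.OnDeleteDaemonSetStrategyType {
		return st.UpdatedNumberScheduled == st.NumberAvailable, nil
	}

	// OnDelete: existing pods are never auto-updated, so inspect them directly.
	pods, err := s.getOwnedPods(ctx, ds)
	if err != nil {
		return false, err
	}
	for _, pod := range pods {
		if !isPodRunningAndReady(pod) {
			return false, nil
		}
	}

	latestHash, err := s.getLatestRevisionHash(ctx, ds)
	if err != nil {
		return false, err
	}
	for _, pod := range pods {
		if pod.Labels["controller-revision-hash"] != latestHash {
			// An outdated revision is acceptable as long as every pod still
			// sits on a node that the current nodeSelector targets.
			return s.verifyNodePlacement(ctx, ds, pods)
		}
	}
	return true, nil
}

// isPodRunningAndReady is the "All Running + Ready?" check from the diagram.
func isPodRunningAndReady(pod corev1.Pod) bool {
	if pod.Status.Phase != corev1.PodRunning {
		return false
	}
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```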

How It Fixes the Bug

Before (Buggy Behavior)

User changes: nodeSelector: {zone: us-east} → {region: us-east}
  ↓
DaemonSet spec updated (creates new ControllerRevision: def456)
  ↓
Kubernetes checks nodes:
  - node1 has {zone: us-east, region: us-east} ✓
  - node2 has {zone: us-east, region: us-east} ✓
  - Pods already on these nodes
  ↓
OnDelete strategy: Don't touch existing pods
  ↓
DaemonSet Status:
  - DesiredNumberScheduled: 2 ✓
  - NumberAvailable: 2 ✓
  - UpdatedNumberScheduled: 0 ✗ (pods have old revision abc123)
  ↓
Readiness Check:
  if (2 != 0 && 2 == 2 && 0 == 2) → FALSE
  ↓
Status: NotReady
  ↓
Reconcile loop: Wait 5s → Check again → Still NotReady
  ↓
♾️ INFINITE LOOP - NEVER BECOMES READY

After (Fixed Behavior)

User changes: nodeSelector: {zone: us-east} → {region: us-east}
  ↓
DaemonSet spec updated (creates new ControllerRevision: def456)
  ↓
Kubernetes checks nodes:
  - node1 has {zone: us-east, region: us-east} ✓
  - node2 has {zone: us-east, region: us-east} ✓
  - Pods already on these nodes
  ↓
OnDelete strategy: Don't touch existing pods
  ↓
DaemonSet Status:
  - DesiredNumberScheduled: 2 ✓
  - NumberAvailable: 2 ✓
  - UpdatedNumberScheduled: 0 ✗ (pods have old revision abc123)
  ↓
NEW Readiness Check:
  ├─ OnDelete strategy detected
  ├─ isDaemonSetReadyOnDelete() called
  │   ├─ getOwnedPods() → [pod1, pod2]
  │   ├─ arePodsHealthy() → TRUE ✓
  │   ├─ getLatestRevisionHash() → "def456"
  │   ├─ Check revisions:
  │   │   - pod1: abc123 ≠ def456 (outdated)
  │   │   - pod2: abc123 ≠ def456 (outdated)
  │   ├─ Pods outdated → verifyNodePlacement() called
  │   │   ├─ pod1 on node1 → node1 has region=us-east? YES ✓
  │   │   └─ pod2 on node2 → node2 has region=us-east? YES ✓
  │   └─ All pods on correct nodes → Return TRUE
  ↓
Status: Ready ✅
  ↓
No more reconciliation needed!

The fix recognizes that for OnDelete strategy, outdated pod revisions are acceptable if pods are already on nodes matching the current nodeSelector. This indicates the change was nodeSelector-only and doesn't require pod recreation.
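
For illustration, a node-placement check along these lines could look roughly like the sketch below. The function name, parameters, and the controller-runtime client are assumptions for the sketch, not the PR's exact code: for each owned pod, fetch the node it runs on and confirm the node's labels satisfy the DaemonSet's current spec.template.spec.nodeSelector.

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// verifyNodePlacementSketch: every owned pod must sit on a node whose labels
// satisfy the DaemonSet's current nodeSelector. The client parameter stands in
// for whatever client the state handler already holds.
func verifyNodePlacementSketch(ctx context.Context, c client.Client, ds *appsv1.DaemonSet, pods []corev1.Pod) (bool, error) {
	selector := ds.Spec.Template.Spec.NodeSelector
	for _, pod := range pods {
		var node corev1.Node
		if err := c.Get(ctx, types.NamespacedName{Name: pod.Spec.NodeName}, &node); err != nil {
			return false, err
		}
		for key, want := range selector {
			if node.Labels[key] != want {
				// This pod is on a node the new nodeSelector no longer targets.
				return false, nil
			}
		}
	}
	return true, nil
}
```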

…ith OnDelete strategy

Signed-off-by: Karthik Vetrivel <[email protected]>
karthikvetrivel force-pushed the fix/nvidia-driver-nodeselector-state branch from 3abada7 to c3b65e2 on November 7, 2025.
if hash, ok := pod.Labels["controller-revision-hash"]; !ok || hash != dsRevisionHash {
// Pods have outdated revision - verify they're on nodes matching current nodeSelector
reqLogger.V(consts.LogLevelInfo).Info("Pods have outdated revision, verifying node placement")
return s.verifyNodePlacement(ctx, ds, ownedPods, reqLogger)
Contributor commented on these lines:

Is node placement the actual thing we care about for "ready" status? If so, can we not just check node placement regardless of the revision hash on the pods?

I don't think we necessarily need to know whether a pod was updated in a level-triggered reconciliation. We just need to periodically check whether the final condition is true.

@karthikvetrivel (Member Author) replied:

This is a great point. I was originally looking at ways to tell whether a pod has been updated, but that isn't strictly required. I will look into updating this.

@rajathagasthya (Contributor) commented Nov 7, 2025:

I'm wondering if this also applies to the UpdatedNumberScheduled check. I'm thinking the entire status check can be reduced to:

  1. DesiredNumberScheduled == NumberAvailable, AND
  2. Pods are placed on the correct nodes (or, more precisely: each node selected by the nodeSelector has a pod scheduled on it)

cc @tariq1890 @cdesiniotis to validate if my assumptions are correct.
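
For illustration only, a sketch of the second, more precise formulation above (every node selected by the nodeSelector has a pod scheduled on it); the function name and parameters are hypothetical and this is not a committed design:

```go
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// everySelectedNodeHasPod lists the nodes the current nodeSelector targets and
// confirms each one hosts one of the DaemonSet's pods.
func everySelectedNodeHasPod(ctx context.Context, c client.Client, ds *appsv1.DaemonSet, pods []corev1.Pod) (bool, error) {
	var nodes corev1.NodeList
	if err := c.List(ctx, &nodes, client.MatchingLabels(ds.Spec.Template.Spec.NodeSelector)); err != nil {
		return false, err
	}
	scheduled := make(map[string]bool, len(pods))
	for _, pod := range pods {
		scheduled[pod.Spec.NodeName] = true
	}
	for _, node := range nodes.Items {
		if !scheduled[node.Name] {
			return false, nil
		}
	}
	return true, nil
}
```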

@karthikvetrivel (Member Author) commented:

One issue with this change is that if the driver image spec is updated but the pods are still on the correct nodes, the DaemonSet is marked as Ready. Previously, it would have been marked as NotReady.

The only way I can think of to get around this is to distinguish between placement changes (i.e. nodeSelector, taints/tolerations) and workload changes (image, command/args, env variables). If a placement change was made but the resulting pod placement is equivalent to the prior policy, we keep the DaemonSet marked as Ready. For any workload change, we would mark the DaemonSet as NotReady.

@karthikvetrivel (Member Author) commented:

I'm closing this PR. After discussion, it seems this requires a larger change to how nodeSelector is handled in the NVIDIADriver CR.
