Description
Prerequisites
- I searched existing issues
- I can reproduce this issue
Manual Node Uncordon Not Propagated to Node Drainer and Fault Remediation Modules
When an operator manually uncordons a node (using kubectl uncordon) that was previously cordoned by the Fault Quarantine (FQ) module, the uncordon event is only registered by FQ but not propagated to downstream modules (Node Drainer and Fault Remediation). This can lead to stale state and incorrect behavior in the remediation pipeline.
Current System Behavior
The normal event flow is:
- Health Monitor detects a fatal health event
- Platform Connector forwards event to MongoDB
- Modules consume events in order: Fault Quarantine → Node Drainer → Fault Remediation
- Fault Remediation creates a maintenance CR (e.g., RebootNode)
- After remediation, Health Monitor sends a healthy event
- Fault Quarantine uncordons the node upon receiving the healthy event
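The log output at the bottom of this issue suggests the downstream modules tail the HealthEvents collection through a MongoDB change stream. A minimal sketch of that consumption pattern is below; the database and collection names are taken from the log, everything else (connection URI, handling logic) is an assumption for illustration, not NVSentinel's actual code.

```go
package main

import (
	"context"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	coll := client.Database("HealthEventsDatabase").Collection("HealthEvents")

	// Watch inserts and updates so each module sees new health events and the
	// status transitions written by the modules ahead of it in the pipeline.
	stream, err := coll.Watch(ctx, mongo.Pipeline{},
		options.ChangeStream().SetFullDocument(options.UpdateLookup))
	if err != nil {
		log.Fatal(err)
	}
	defer stream.Close(ctx)

	for stream.Next(ctx) {
		var event bson.M
		if err := stream.Decode(&event); err != nil {
			log.Printf("decode error: %v", err)
			continue
		}
		log.Printf("Event received: %v", event)
		// Module-specific handling (cordon, drain, remediate) would go here.
	}
}
```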
Problem
The FQ module has a callback that detects manual uncordon operations and updates its own state (removes taint, clears quarantineHealthEvent and quarantineHealthEventIsCordoned annotations, adds quarantinedNodeUncordonedManually annotation). However, this information is not propagated to Node Drainer (ND) or Fault Remediation (FR) modules if the Health Monitor doesn't send a healthy event.
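For reference, a rough sketch of the kind of node-update callback described above: detect that an operator uncordoned a node the pipeline had cordoned, then clean up the quarantine markers. The annotation names come from this issue; the informer wiring and helper names are assumptions, not FQ's actual code. The short-term fix below would add a similar callback to ND and FR.

```go
package quarantine

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

const (
	annQuarantineHealthEvent = "quarantineHealthEvent"
	annQuarantineIsCordoned  = "quarantineHealthEventIsCordoned"
	annUncordonedManually    = "quarantinedNodeUncordonedManually"
)

// WatchManualUncordon registers an update handler that fires when a node
// previously cordoned by FQ becomes schedulable again without a healthy event.
func WatchManualUncordon(client kubernetes.Interface, factory informers.SharedInformerFactory) {
	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode := oldObj.(*corev1.Node)
			newNode := newObj.(*corev1.Node)

			// The node was cordoned by the quarantine pipeline...
			_, quarantined := oldNode.Annotations[annQuarantineHealthEvent]
			// ...and an operator flipped it back to schedulable.
			if quarantined && oldNode.Spec.Unschedulable && !newNode.Spec.Unschedulable {
				handleManualUncordon(client, newNode)
			}
		},
	})
}

// handleManualUncordon would remove the quarantine taint, clear the quarantine
// annotations, and add annUncordonedManually, mirroring the behavior described
// above. Implementation elided in this sketch.
func handleManualUncordon(client kubernetes.Interface, node *corev1.Node) {}
```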
Proposed Solutions
Short-term Fix
Add callbacks in Node Drainer and Fault Remediation modules (similar to FQ's callback) to detect and handle manual uncordon events independently.
Pros:
- Quick to implement
- Each module maintains its own state consistency
Cons:
- Duplicates uncordon detection logic across modules
- Doesn't address the root cause of missing event propagation
Long-term Fix
Modify the event stream architecture to naturally handle manual uncordon events:
- When a manual uncordon is detected, generate an unquarantine event in MongoDB (see the sketch below)
- Fault handling modules (FQ, ND, FR) consume this event and update their state accordingly
- Ensures all modules stay synchronized through the standard event pipeline
Pros:
- Centralized event handling
- Maintains architectural consistency
- Easier to maintain and extend
Cons:
- Requires changes to event schema and flow
- More complex implementation
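A minimal sketch of the long-term fix, assuming the unquarantine event is written to the same HealthEvents collection the pipeline already watches. The document shape here is illustrative only; the real event schema would have to be designed as part of this change.

```go
package quarantine

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// PublishUnquarantineEvent records that a node was manually uncordoned so that
// FQ, ND, and FR can all consume the event through the standard pipeline.
func PublishUnquarantineEvent(ctx context.Context, coll *mongo.Collection, nodeName string) error {
	_, err := coll.InsertOne(ctx, bson.M{
		"eventType": "unquarantine",       // hypothetical field
		"nodename":  nodeName,
		"reason":    "manual-uncordon",    // hypothetical field
		"createdAt": time.Now().UnixMilli(),
	})
	return err
}
```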
Impact
- Severity: Medium-High
- Frequency: Occurs whenever an operator manually uncordons a node that was cordoned by the pipeline
- Affected Components: Node Drainer, Fault Remediation
Component
Fault Management
Steps to Reproduce
Reproduction Scenarios
Scenario 1: Node Drainer Stuck Waiting
- Health Monitor sends a fatal health event
- FQ cordons the node
- Node Drainer starts waiting for pods to finish (assuming AllowCompletion mode)
- Operator manually uncordons the node using kubectl uncordon <node>
- FQ registers the uncordon and updates its state
- Problem: Node Drainer continues waiting indefinitely for pods to complete, unaware that the node was manually uncordoned
Scenario 2: Fault Remediation Annotation Not Cleared
- Health Monitor sends a fatal health event
- FQ cordons the node
- ND drains the node successfully
- FR creates a RebootNode CR and the node is remediated
- Health Monitor doesn't send a healthy event for some reason
- Operator manually uncordons the node using kubectl uncordon <node>
- FQ registers the manual uncordon, but FR doesn't receive an unquarantine event
- Problem: Annotation cleanup code is not executed, leaving a stale latestFaultRemediationState annotation on the node
- Later, Health Monitor sends another fatal event
- Node is cordoned and drained again
- When FR checks for existing CR status, it finds the stale annotation indicating remediation is already in progress
- Result: FR skips creating a new maintenance CR, and the node is never remediated
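A rough sketch of the kind of guard that produces the "Skipping event" log lines shown below: if the node still carries remediation state from a previous cycle, FR assumes a maintenance CR already covers the fault and skips creating a new one. Function and variable names are assumptions, not the actual reconciler code.

```go
package remediation

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

const annLatestFaultRemediationState = "latestFaultRemediationState"

// shouldSkipRemediation returns true when an annotation from an earlier
// remediation cycle is still present. This is the failure mode in Scenario 2:
// the annotation was never cleared after the manual uncordon, so a later
// fatal event is skipped and the node is never remediated.
func shouldSkipRemediation(node *corev1.Node) bool {
	if state, ok := node.Annotations[annLatestFaultRemediationState]; ok {
		klog.Infof("Skipping event for node %s - existing remediation state %q", node.Name, state)
		return true
	}
	return false
}
```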
Expected Behavior
When a node is manually uncordoned by an operator:
- FQ should detect the manual uncordon
- ND should stop waiting/draining operations for that node
- FR should clear remediation state annotations
- All modules should be consistent with the node's actual state
Actual Behavior
- Only FQ detects and handles manual uncordon
- ND and FR remain unaware, leading to stuck operations or stale state
- No mechanism exists to propagate manual uncordon events through the pipeline
Environment
- NVSentinel version: v0.1.0 / 1.167.1
- Kubernetes version: 1.31.1
- Deployment method: Argo
Logs/Output
I1023 11:41:02.381627 1 reconciler.go:93] Event received: map[_id:map[_data:82..04] clusterTime:{T:1761219662 I:1} documentKey:map[_id:ObjectID("68..c8")] fullDocument:map[_id:ObjectID("68..c8") createdAt:1761219482208 healthevent:map[agent:syslog-health-monitor checkname:SysLogsXIDError componentclass:GPU drainoverrides:<nil> entitiesimpacted:[map[entitytype:PCI entityvalue:0000:03:00] map[entitytype:GPUID entityvalue:GPU-455-ffc457bff834]] errorcode:[109] generatedtimestamp:map[nanos:207323127 seconds:1761219482] isfatal:true ishealthy:false message:ROBUST_CHANNEL_CTXSW_TIMEOUT_ERROR metadata:<nil> nodename:[redacted]-2206 quarantineoverrides:<nil> recommendedaction:20 version:1] healtheventstatus:map[faultremediated:<nil> nodequarantined:Quarantined userpodsevictionstatus:map[status:Succeeded]]] ns:map[coll:HealthEvents db:HealthEventsDatabase] operationType:update updateDescription:map[removedFields:[] truncatedArrays:[] updatedFields:map[healtheventstatus.userpodsevictionstatus:map[status:Succeeded]]] wallTime:1761219662379]
I1023 11:41:02.390753 1 reconciler.go:415] Found existing CR maintenance-[redacted]-w-2206-68f731bc07936fda07aceb08 for node [redacted]-w-2206 group restart with status Succeeded
I1023 11:41:02.390762 1 reconciler.go:339] Skipping event for node [redacted]-w-2206 - CR maintenance-[redacted]-w-2206-68f731bc07936fda07aceb08 is Succeeded
I1023 11:41:02.390766 1 reconciler.go:258] Skipping event for node [redacted]-w-2206 due to existing CR maintenance-[redacted]-w-2206-68f731bc07936fda07aceb08