[Bug]: Add ability to cancel breakfix pipeline manually #141

@XRFXLP

Description

Prerequisites

  • I searched existing issues
  • I can reproduce this issue

Manual Node Uncordon Not Propagated to Node Drainer and Fault Remediation Modules

When an operator manually uncordons a node (using kubectl uncordon) that was previously cordoned by the Fault Quarantine (FQ) module, the uncordon is registered only by FQ and is not propagated to the downstream modules, Node Drainer (ND) and Fault Remediation (FR). This can leave stale state in those modules and cause incorrect behavior in the remediation pipeline.

Current System Behavior

The normal event flow is:

  1. Health Monitor detects a fatal health event
  2. Platform Connector forwards event to MongoDB
  3. Modules consume events in order: Fault Quarantine → Node Drainer → Fault Remediation
  4. Fault Remediation creates a maintenance CR (e.g., RebootNode)
  5. After remediation, Health Monitor sends a healthy event
  6. Fault Quarantine uncordons the node upon receiving the healthy event

Problem

The FQ module has a callback that detects manual uncordon operations and updates its own state: it removes the taint, clears the quarantineHealthEvent and quarantineHealthEventIsCordoned annotations, and adds the quarantinedNodeUncordonedManually annotation. However, unless the Health Monitor also sends a healthy event, this information never reaches the Node Drainer (ND) or Fault Remediation (FR) modules.
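
To make the detection step concrete, here is a minimal Go sketch of how a callback like FQ's might recognize a manual uncordon and apply the state changes listed above. Only the three annotation keys come from this issue. The NodeState type, the function names, the annotation value, and the rule used to tell a manual uncordon from FQ's own uncordon are assumptions for illustration, not the actual NVSentinel code.

```go
package main

import "fmt"

// Annotation keys are taken from the issue text; everything else is hypothetical.
const (
	annQuarantineEvent    = "quarantineHealthEvent"
	annIsCordoned         = "quarantineHealthEventIsCordoned"
	annUncordonedManually = "quarantinedNodeUncordonedManually"
)

// NodeState is a simplified stand-in for the parts of a Kubernetes Node
// object that matter here (spec.unschedulable and metadata.annotations).
type NodeState struct {
	Unschedulable bool
	Annotations   map[string]string
}

// IsManualUncordon reports whether a node update looks like an operator
// uncordon: the node went from unschedulable to schedulable while the
// quarantine annotation is still present. Assumption: when FQ uncordons a
// node itself, it removes that annotation as part of the same update.
func IsManualUncordon(before, after NodeState) bool {
	_, quarantined := after.Annotations[annQuarantineEvent]
	return before.Unschedulable && !after.Unschedulable && quarantined
}

// MarkManuallyUncordoned applies the annotation changes the issue describes.
// Taint removal is omitted to keep the sketch short.
func MarkManuallyUncordoned(n *NodeState) {
	delete(n.Annotations, annQuarantineEvent)
	delete(n.Annotations, annIsCordoned)
	n.Annotations[annUncordonedManually] = "True" // value is an assumption
}

func main() {
	before := NodeState{
		Unschedulable: true,
		Annotations:   map[string]string{annQuarantineEvent: "{}", annIsCordoned: "True"},
	}
	after := NodeState{
		Unschedulable: false,
		Annotations:   map[string]string{annQuarantineEvent: "{}", annIsCordoned: "True"},
	}
	if IsManualUncordon(before, after) {
		MarkManuallyUncordoned(&after)
	}
	fmt.Println(after.Annotations)
}
```

Nothing in this callback notifies ND or FR, which is the gap this issue describes.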

Proposed Solutions

Short-term Fix

Add callbacks in Node Drainer and Fault Remediation modules (similar to FQ's callback) to detect and handle manual uncordon events independently.

Pros:

  • Quick to implement
  • Each module maintains its own state consistency

Cons:

  • Duplicates uncordon detection logic across modules
  • Doesn't address the root cause of missing event propagation

Long-term Fix

Modify the event stream architecture to naturally handle manual uncordon events:

  • When a manual uncordon is detected, generate an unquarantine event in MongoDB
  • Fault handling modules (FQ, ND, FR) consume this event and update their state accordingly
  • Ensures all modules stay synchronized through the standard event pipeline

Pros:

  • Centralized event handling
  • Maintains architectural consistency
  • Easier to maintain and extend

Cons:

  • Requires changes to event schema and flow
  • More complex implementation
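
As a rough illustration of the long-term proposal, the Go sketch below delivers a single unquarantine event to every module through one shared path, so that no module needs its own uncordon detection. The Event type, its fields, the Module interface, and the in-memory Publish function (standing in for the MongoDB-backed event stream) are all hypothetical names, not part of the existing schema.

```go
package main

import "fmt"

// EventType and Event are hypothetical; the real event schema would need
// to be extended as noted in the cons above.
type EventType string

const (
	EventQuarantine   EventType = "Quarantine"
	EventUnquarantine EventType = "Unquarantine"
)

type Event struct {
	Type   EventType
	Node   string
	Reason string // e.g. "ManualUncordon"
}

// Module is anything that consumes pipeline events (FQ, ND, FR).
type Module interface {
	Name() string
	Handle(e Event)
}

// tracker is a toy module that only records which nodes it is working on.
type tracker struct {
	name   string
	active map[string]bool
}

func (t *tracker) Name() string { return t.name }

func (t *tracker) Handle(e Event) {
	switch e.Type {
	case EventQuarantine:
		t.active[e.Node] = true
	case EventUnquarantine:
		delete(t.active, e.Node) // drop any in-flight state for the node
	}
}

// Publish delivers one event to every module in pipeline order. It stands
// in for the event stream that the modules consume.
func Publish(e Event, modules ...Module) {
	for _, m := range modules {
		m.Handle(e)
	}
}

func main() {
	fq := &tracker{name: "FQ", active: map[string]bool{}}
	nd := &tracker{name: "ND", active: map[string]bool{}}
	fr := &tracker{name: "FR", active: map[string]bool{}}

	Publish(Event{Type: EventQuarantine, Node: "node-a", Reason: "FatalHealthEvent"}, fq, nd, fr)
	Publish(Event{Type: EventUnquarantine, Node: "node-a", Reason: "ManualUncordon"}, fq, nd, fr)

	for _, m := range []*tracker{fq, nd, fr} {
		fmt.Printf("%s still active on node-a: %v\n", m.Name(), m.active["node-a"])
	}
}
```

With this shape, a manual uncordon and a healthy event could both end in the same unquarantine handling in each module.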

Impact

  • Severity: Medium-High
  • Frequency: Occurs whenever operators manually intervene with uncordon operations
  • Affected Components: Node Drainer, Fault Remediation

Component

Fault Management

Steps to Reproduce

Reproduction Scenarios

Scenario 1: Node Drainer Stuck Waiting

  1. Health Monitor sends a fatal health event
  2. FQ cordons the node
  3. Node Drainer starts waiting for pods to finish (assuming AllowCompletion mode)
  4. Operator manually uncordons the node using kubectl uncordon <node>
  5. FQ registers the uncordon and updates its state
  6. Problem: Node Drainer continues waiting indefinitely for pods to complete, unaware that the node was manually uncordoned
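
One way Node Drainer could avoid the indefinite wait in Scenario 1 is to make the wait cancellable and have an uncordon handler cancel it. The Go sketch below shows that pattern with the standard context package. DrainWait, its methods, and the podsRemaining callback are hypothetical names, and the real ND wait loop may be structured differently.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrDrainCancelled is returned when the wait is abandoned before the pods finish.
var ErrDrainCancelled = errors.New("drain wait cancelled: node was uncordoned")

// DrainWait is a hypothetical handle for one node's AllowCompletion wait.
type DrainWait struct {
	ctx    context.Context
	cancel context.CancelFunc
}

func NewDrainWait() *DrainWait {
	ctx, cancel := context.WithCancel(context.Background())
	return &DrainWait{ctx: ctx, cancel: cancel}
}

// Cancel would be called by the handler that observes the manual uncordon.
func (w *DrainWait) Cancel() { w.cancel() }

// Wait polls podsRemaining until it returns 0 or the wait is cancelled.
func (w *DrainWait) Wait(interval time.Duration, podsRemaining func() int) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if podsRemaining() == 0 {
			return nil
		}
		select {
		case <-w.ctx.Done():
			return ErrDrainCancelled
		case <-ticker.C:
			// poll again on the next iteration
		}
	}
}

func main() {
	w := NewDrainWait()
	// Simulate the operator running kubectl uncordon shortly after the wait starts.
	go func() {
		time.Sleep(30 * time.Millisecond)
		w.Cancel()
	}()
	// The pod count never reaches zero, so only cancellation ends the wait.
	err := w.Wait(10*time.Millisecond, func() int { return 1 })
	fmt.Println("wait ended:", err)
}
```

Without the Cancel call, the Wait in main would never return, which matches step 6 above.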

Scenario 2: Fault Remediation Annotation Not Cleared

  1. Health Monitor sends a fatal health event
  2. FQ cordons the node
  3. ND drains the node successfully
  4. FR creates a RebootNode CR and the node is remediated
  5. Health Monitor doesn't send a healthy event for some reason
  6. Operator manually uncordons the node using kubectl uncordon <node>
  7. FQ registers the manual uncordon, but FR doesn't receive an unquarantine event
  8. Problem: Annotation cleanup code is not executed, leaving stale latestFaultRemediationState annotation on the node
  9. Later, Health Monitor sends another fatal event
  10. Node is cordoned and drained again
  11. When FR checks for existing CR status, it finds the stale annotation, which still points to the CR from the earlier remediation (status Succeeded in the logs below)
  12. Result: FR skips creating a new maintenance CR, and the node is never remediated
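
The sequence in Scenario 2 can be reproduced with a small Go model. It is a deliberate simplification: the real FR check also looks up the status of the existing CR, as the logs below show, while this sketch treats the mere presence of the latestFaultRemediationState annotation as the reason to skip. The Node type and the function names are hypothetical.

```go
package main

import "fmt"

// Annotation key from the issue; the rest of this model is hypothetical.
const annRemediationState = "latestFaultRemediationState"

type Node struct {
	Name        string
	Annotations map[string]string
}

// Remediate creates a maintenance CR unless the node already carries the
// remediation annotation. It reports whether a CR was created.
func Remediate(n *Node, crName string) bool {
	if _, exists := n.Annotations[annRemediationState]; exists {
		return false // skipped due to existing CR
	}
	n.Annotations[annRemediationState] = crName
	return true
}

// OnUnquarantine is the cleanup that runs when FR receives an unquarantine
// event. In Scenario 2 it never runs, because a manual uncordon produces
// no such event.
func OnUnquarantine(n *Node) {
	delete(n.Annotations, annRemediationState)
}

func main() {
	n := &Node{Name: "node-a", Annotations: map[string]string{}}

	fmt.Println("first fault, CR created:", Remediate(n, "maintenance-1"))

	// Manual uncordon happens here, but cleanup is not triggered.
	fmt.Println("second fault, stale annotation, CR created:", Remediate(n, "maintenance-2"))

	// With cleanup wired to the manual uncordon, the next fault is handled.
	OnUnquarantine(n)
	fmt.Println("third fault, after cleanup, CR created:", Remediate(n, "maintenance-3"))
}
```

Either proposed fix amounts to making sure the cleanup runs when the operator uncordons the node.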

Expected Behavior

When a node is manually uncordoned by an operator:

  • FQ should detect the manual uncordon
  • ND should stop waiting/draining operations for that node
  • FR should clear remediation state annotations
  • All modules should be consistent with the node's actual state

Actual Behavior

  • Only FQ detects and handles manual uncordon
  • ND and FR remain unaware, leading to stuck operations or stale state
  • No mechanism exists to propagate manual uncordon events through the pipeline

Environment

  • NVSentinel version: v0.1.0 / 1.167.1
  • Kubernetes version: 1.31.1
  • Deployment method: Argo

Logs/Output

I1023 11:41:02.381627       1 reconciler.go:93] Event received: map[_id:map[_data:82..04] clusterTime:{T:1761219662 I:1} documentKey:map[_id:ObjectID("68..c8")] fullDocument:map[_id:ObjectID("68..c8") createdAt:1761219482208 healthevent:map[agent:syslog-health-monitor checkname:SysLogsXIDError componentclass:GPU drainoverrides:<nil> entitiesimpacted:[map[entitytype:PCI entityvalue:0000:03:00] map[entitytype:GPUID entityvalue:GPU-455-ffc457bff834]] errorcode:[109] generatedtimestamp:map[nanos:207323127 seconds:1761219482] isfatal:true ishealthy:false message:ROBUST_CHANNEL_CTXSW_TIMEOUT_ERROR metadata:<nil> nodename:[redacted]-2206 quarantineoverrides:<nil> recommendedaction:20 version:1] healtheventstatus:map[faultremediated:<nil> nodequarantined:Quarantined userpodsevictionstatus:map[status:Succeeded]]] ns:map[coll:HealthEvents db:HealthEventsDatabase] operationType:update updateDescription:map[removedFields:[] truncatedArrays:[] updatedFields:map[healtheventstatus.userpodsevictionstatus:map[status:Succeeded]]] wallTime:1761219662379]
I1023 11:41:02.390753       1 reconciler.go:415] Found existing CR maintenance-[redacted]-w-2206-68f731bc07936fda07aceb08 for node [redacted]-w-2206 group restart with status Succeeded
I1023 11:41:02.390762       1 reconciler.go:339] Skipping event for node [redacted]-w-2206 - CR maintenance-[redacted]-w-2206-68f731bc07936fda07aceb08 is Succeeded
I1023 11:41:02.390766       1 reconciler.go:258] Skipping event for node [redacted]-w-2206 due to existing CR maintenance-[redacted]-w-2206-68f731bc07936fda07aceb08

Metadata

Labels

bug: Something isn't working