Description
Prerequisites
- I searched existing issues
- I can reproduce this issue
Manual Node Uncordon Not Propagated to Node Drainer and Fault Remediation Modules
When an operator manually uncordons a node (using kubectl uncordon) that was previously cordoned by the Fault Quarantine (FQ) module, the uncordon event is only registered by FQ but not propagated to downstream modules (Node Drainer and Fault Remediation). This can lead to stale state and incorrect behavior in the remediation pipeline.
Current System Behavior
The normal event flow is:
- Health Monitor detects a fatal health event
- Platform Connector forwards event to MongoDB
- Modules consume events in order: Fault Quarantine → Node Drainer → Fault Remediation
- Fault Remediation creates a maintenance CR (e.g., RebootNode)
- After remediation, Health Monitor sends a healthy event
- Fault Quarantine uncordons the node upon receiving the healthy event
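The log output at the bottom of this issue suggests the downstream modules tail the HealthEvents collection through a MongoDB change stream. A minimal sketch of that consumption pattern is below; the database and collection names are taken from the log, everything else (connection URI, handling logic) is an assumption for illustration, not NVSentinel's actual code.

```go
package main

import (
	"context"
	"log"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()

	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	coll := client.Database("HealthEventsDatabase").Collection("HealthEvents")

	// Watch inserts and updates so each module sees new health events and the
	// status transitions written by the modules ahead of it in the pipeline.
	stream, err := coll.Watch(ctx, mongo.Pipeline{},
		options.ChangeStream().SetFullDocument(options.UpdateLookup))
	if err != nil {
		log.Fatal(err)
	}
	defer stream.Close(ctx)

	for stream.Next(ctx) {
		var event bson.M
		if err := stream.Decode(&event); err != nil {
			log.Printf("decode error: %v", err)
			continue
		}
		log.Printf("Event received: %v", event)
		// Module-specific handling (cordon, drain, remediate) would go here.
	}
}
```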
Problem
The FQ module has a callback that detects manual uncordon operations and updates its own state (removes taint, clears quarantineHealthEvent and quarantineHealthEventIsCordoned annotations, adds quarantinedNodeUncordonedManually annotation). However, this information is not propagated to Node Drainer (ND) or Fault Remediation (FR) modules if the Health Monitor doesn't send a healthy event.
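For reference, a rough sketch of the kind of node-update callback described above: detect that an operator uncordoned a node the pipeline had cordoned, then clean up the quarantine markers. The annotation names come from this issue; the informer wiring and helper names are assumptions, not FQ's actual code. The short-term fix below would add a similar callback to ND and FR.

```go
package quarantine

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

const (
	annQuarantineHealthEvent = "quarantineHealthEvent"
	annQuarantineIsCordoned  = "quarantineHealthEventIsCordoned"
	annUncordonedManually    = "quarantinedNodeUncordonedManually"
)

// WatchManualUncordon registers an update handler that fires when a node
// previously cordoned by FQ becomes schedulable again without a healthy event.
func WatchManualUncordon(client kubernetes.Interface, factory informers.SharedInformerFactory) {
	factory.Core().V1().Nodes().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode := oldObj.(*corev1.Node)
			newNode := newObj.(*corev1.Node)

			// The node was cordoned by the quarantine pipeline...
			_, quarantined := oldNode.Annotations[annQuarantineHealthEvent]
			// ...and an operator flipped it back to schedulable.
			if quarantined && oldNode.Spec.Unschedulable && !newNode.Spec.Unschedulable {
				handleManualUncordon(client, newNode)
			}
		},
	})
}

// handleManualUncordon would remove the quarantine taint, clear the quarantine
// annotations, and add annUncordonedManually, mirroring the behavior described
// above. Implementation elided in this sketch.
func handleManualUncordon(client kubernetes.Interface, node *corev1.Node) {}
```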
Proposed Solutions
Short-term Fix
Add callbacks in Node Drainer and Fault Remediation modules (similar to FQ's callback) to detect and handle manual uncordon events independently.
Pros:
- Quick to implement
- Each module maintains its own state consistency
Cons:
- Duplicates uncordon detection logic across modules
- Doesn't address the root cause of missing event propagation
Long-term Fix
Modify the event stream architecture to naturally handle manual uncordon events:
- When a manual uncordon is detected, generate an unquarantine event in MongoDB (see the sketch below)
- Fault handling modules (FQ, ND, FR) consume this event and update their state accordingly
- Ensures all modules stay synchronized through the standard event pipeline
Pros:
- Centralized event handling
- Maintains architectural consistency
- Easier to maintain and extend
Cons:
- Requires changes to event schema and flow
- More complex implementation
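A minimal sketch of the long-term fix, assuming the unquarantine event is written to the same HealthEvents collection the pipeline already watches. The document shape here is illustrative only; the real event schema would have to be designed as part of this change.

```go
package quarantine

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// PublishUnquarantineEvent records that a node was manually uncordoned so that
// FQ, ND, and FR can all consume the event through the standard pipeline.
func PublishUnquarantineEvent(ctx context.Context, coll *mongo.Collection, nodeName string) error {
	_, err := coll.InsertOne(ctx, bson.M{
		"eventType": "unquarantine",       // hypothetical field
		"nodename":  nodeName,
		"reason":    "manual-uncordon",    // hypothetical field
		"createdAt": time.Now().UnixMilli(),
	})
	return err
}
```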
Impact
- Severity: Medium-High
- Frequency: Occurs whenever an operator manually uncordons a node that was cordoned by the pipeline
- Affected Components: Node Drainer, Fault Remediation
Component
Fault Management
Steps to Reproduce
Reproduction Scenarios
Scenario 1: Node Drainer Stuck Waiting
- Health Monitor sends a fatal health event
- FQ cordons the node
- Node Drainer starts waiting for pods to finish (assuming AllowCompletion mode)
- Operator manually uncordons the node using kubectl uncordon <node>
- FQ registers the uncordon and updates its state
- Problem: Node Drainer continues waiting indefinitely for pods to complete, unaware that the node was manually uncordoned
Scenario 2: Fault Remediation Annotation Not Cleared
- Health Monitor sends a fatal health event
- FQ cordons the node
- ND drains the node successfully
- FR creates a RebootNode CR and the node is remediated
- Health Monitor doesn't send a healthy event for some reason
- Operator manually uncordons the node using kubectl uncordon <node>
- FQ registers the manual uncordon, but FR doesn't receive an unquarantine event
- Problem: Annotation cleanup code is not executed, leaving a stale latestFaultRemediationState annotation on the node
- Later, Health Monitor sends another fatal event
- Node is cordoned and drained again
- When FR checks for existing CR status, it finds the stale annotation indicating remediation is already in progress
- Result: FR skips creating a new maintenance CR, and the node is never remediated
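A rough sketch of the kind of guard that produces the "Skipping event" log lines shown below: if the node still carries remediation state from a previous cycle, FR assumes a maintenance CR already covers the fault and skips creating a new one. Function and variable names are assumptions, not the actual reconciler code.

```go
package remediation

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/klog/v2"
)

const annLatestFaultRemediationState = "latestFaultRemediationState"

// shouldSkipRemediation returns true when an annotation from an earlier
// remediation cycle is still present. This is the failure mode in Scenario 2:
// the annotation was never cleared after the manual uncordon, so a later
// fatal event is skipped and the node is never remediated.
func shouldSkipRemediation(node *corev1.Node) bool {
	if state, ok := node.Annotations[annLatestFaultRemediationState]; ok {
		klog.Infof("Skipping event for node %s - existing remediation state %q", node.Name, state)
		return true
	}
	return false
}
```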
Expected Behavior
When a node is manually uncordoned by an operator:
- FQ should detect the manual uncordon
- ND should stop waiting/draining operations for that node
- FR should clear remediation state annotations
- All modules should be consistent with the node's actual state
Actual Behavior
- Only FQ detects and handles manual uncordon
- ND and FR remain unaware, leading to stuck operations or stale state
- No mechanism exists to propagate manual uncordon events through the pipeline
Environment
- NVSentinel version: v0.1.0 / 1.167.1
- Kubernetes version: 1.31.1
- Deployment method: Argo
Logs/Output
I1023 11:41:02.381627 1 reconciler.go:93] Event received: map[_id:map[_data:82..04] clusterTime:{T:1761219662 I:1} documentKey:map[_id:ObjectID("68..c8")] fullDocument:map[_id:ObjectID("68..c8") createdAt:1761219482208 healthevent:map[agent:syslog-health-monitor checkname:SysLogsXIDError componentclass:GPU drainoverrides:<nil> entitiesimpacted:[map[entitytype:PCI entityvalue:0000:03:00] map[entitytype:GPUID entityvalue:GPU-455-ffc457bff834]] errorcode:[109] generatedtimestamp:map[nanos:207323127 seconds:1761219482] isfatal:true ishealthy:false message:ROBUST_CHANNEL_CTXSW_TIMEOUT_ERROR metadata:<nil> nodename:[redacted]-2206 quarantineoverrides:<nil> recommendedaction:20 version:1] healtheventstatus:map[faultremediated:<nil> nodequarantined:Quarantined userpodsevictionstatus:map[status:Succeeded]]] ns:map[coll:HealthEvents db:HealthEventsDatabase] operationType:update updateDescription:map[removedFields:[] truncatedArrays:[] updatedFields:map[healtheventstatus.userpodsevictionstatus:map[status:Succeeded]]] wallTime:1761219662379]
I1023 11:41:02.390753 1 reconciler.go:415] Found existing CR maintenance-[redacted]-w-2206-68f731bc07936fda07aceb08 for node [redacted]-w-2206 group restart with status Succeeded
I1023 11:41:02.390762 1 reconciler.go:339] Skipping event for node [redacted]-w-2206 - CR maintenance-[redacted]-w-2206-68f731bc07936fda07aceb08 is Succeeded
I1023 11:41:02.390766 1 reconciler.go:258] Skipping event for node [redacted]-w-2206 due to existing CR maintenance-[redacted]-w-2206-68f731bc07936fda07aceb08