| 1 | +# Interrupt Flow and Ordering |
| 2 | + |
| 3 | +This document explains how Skyhook handles packages that require interrupts and the specific ordering of operations to ensure safe and reliable execution. |
| 4 | + |
| 5 | +## Overview |
| 6 | + |
| 7 | +When a package requires an interrupt (such as a reboot or service restart), Skyhook follows a specific sequence to ensure that workloads are safely evacuated from the node before any potentially disruptive operations occur. |
| 8 | + |
| 9 | +## Interrupt Flow Sequence |
| 10 | + |
| 11 | +### For packages WITH interrupts: |
| 12 | + |
| 13 | +1. **Uninstall** (if downgrading) - Package uninstallation operations are executed |
| 14 | +2. **Cordon** - Node is marked as unschedulable to prevent new workloads from being scheduled |
| 15 | +3. **Wait** - System waits for any conflicting workloads to naturally complete or be rescheduled |
| 16 | +4. **Drain** - Remaining workloads are gracefully evicted from the node |
| 17 | +5. **Apply** / **Upgrade** (if upgrading) - Package installation/upgrade operations are executed |
| 18 | +6. **Config** - Configuration and setup operations are performed |
| 19 | +7. **Interrupt** - The actual interrupt operation (reboot, service restart, etc.) is executed |
| 20 | +8. **Post-Interrupt** - Any cleanup or verification operations after the interrupt |
| 21 | + |
| 22 | +### For packages WITHOUT interrupts: |
| 23 | + |
| 24 | +1. **Uninstall** (if downgrading) - Package uninstallation operations are executed |
| 25 | +2. **Apply** / **Upgrade** (if upgrading) - Package installation/upgrade operations are executed |
| 26 | +3. **Config** - Configuration and setup operations are performed |
| 27 | + |
| 28 | +## Why This Order Matters |
| 29 | + |
| 30 | +The **uninstall → cordon → wait → drain → apply/upgrade → config → interrupt** sequence is critical for several reasons: |
| 31 | + |
| 32 | +### Safety First |
| 33 | +- Workloads are safely removed before any potentially disruptive operations |
| 34 | +- Prevents data loss or service interruption for running applications |
| 35 | +- Ensures the node is in a clean state before package operations begin |
| 36 | + |
| 37 | +### Use Cases |
| 38 | +This ordering is particularly important for scenarios such as: |
| 39 | + |
| 40 | +- **Kernel module changes**: Unloading kernel modules while workloads are present could cause system instability |
| 41 | +- **GPU mode switching**: Changing a GPU from graphics to compute mode requires exclusive access to the device |
| 42 | +- **Driver updates**: Hardware driver changes need exclusive access to the hardware |
| 43 | +- **System reboots**: All workloads must be evacuated before the node restarts |
| 44 | + |
| 45 | +### Example Scenario |
| 46 | + |
| 47 | +Consider a package that needs to unload a kernel module, perform some operations, and then reboot: |
| 48 | + |
| 49 | +```yaml |
| 50 | +apiVersion: skyhook.nvidia.com/v1alpha1 |
| 51 | +kind: Skyhook |
| 52 | +metadata: |
| 53 | +  name: gpu-mode-switch |
| 54 | +spec: |
| 55 | +  packages: |
| 56 | +    gpu-driver: |
| 57 | +      version: "1.0.0" |
| 58 | +      image: "example/gpu-driver" |
| 59 | +      interrupt: |
| 60 | +        type: "reboot" |
| 61 | +``` |
| 62 | + |
| 63 | +**Flow:** |
| 64 | +1. **Cordon**: Node becomes unschedulable |
| 65 | +2. **Wait**: Conflicting workloads (those that must not be interrupted) are given time to complete |
| 66 | +3. **Drain**: Remaining workloads are evicted |
| 67 | +4. **Apply**: GPU driver package operations run (unload old module, install new) |
| 68 | +5. **Config**: Configuration files are updated |
| 69 | +6. **Interrupt**: System reboots to complete the driver change |
| 70 | +7. **Post-Interrupt**: Verification that the new driver is loaded correctly |
| 71 | + |
| 72 | +## Technical Implementation |
| 73 | + |
| 74 | +The interrupt flow is managed by the `ProcessInterrupt` and `EnsureNodeIsReadyForInterrupt` functions in the Skyhook controller, which: |
| 75 | + |
| 76 | +- Check for conflicting workloads using label selectors |
| 77 | +- Coordinate the cordon and drain operations |
| 78 | +- Ensure the node is ready before proceeding with package operations |
| 79 | +- Handle the timing and sequencing of all stages |
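
The first bullet, checking for conflicting workloads with label selectors, can be sketched in a few lines of Python. This is a simplified illustration only: it does not reproduce `ProcessInterrupt` or `EnsureNodeIsReadyForInterrupt`, it handles equality-based label matching only, and the helper names (`matches`, `blocking_pods`, `ready_for_interrupt`) and the pod dictionary shape are hypothetical.

```python
from typing import Dict, List


def matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True when every key/value pair in the selector appears in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def blocking_pods(pods: List[dict], selector: Dict[str, str]) -> List[str]:
    """Names of pods on the node whose labels match the selector."""
    # Assumption for this sketch: an empty selector means no workloads
    # were configured to block interrupts, so nothing matches.
    if not selector:
        return []
    return [pod["name"] for pod in pods if matches(selector, pod.get("labels", {}))]


def ready_for_interrupt(pods: List[dict], selector: Dict[str, str]) -> bool:
    """The node may proceed only once no matching pods remain."""
    return not blocking_pods(pods, selector)
```

With a node running a pod labelled `app: training` and a selector of `{"app": "training"}`, `ready_for_interrupt` returns `False` until that pod completes or is rescheduled, which corresponds to the wait stage described earlier.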
| 80 | + |
| 81 | +## Best Practices |
| 82 | + |
| 83 | +- Always test interrupt-enabled packages in non-production environments first |
| 84 | +- Use appropriate `podNonInterruptLabels` selectors to identify important workloads that should block interrupts |
| 85 | +- Consider the impact of node cordoning on cluster capacity |
| 86 | +- Monitor package logs during interrupt operations for troubleshooting |
| 87 | +- Use Grafana dashboards to monitor interrupt operations and track package state transitions across your cluster (see [docs/metrics/](metrics/) for dashboard setup and configuration) |