Commit fb0126b

feat(operator): changed interrupt order (#83)
1 parent ca0ff30 commit fb0126b

21 files changed: +531 -151 lines changed

.vscode/launch.json

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@
  "mode": "debug",
  "program": "${workspaceRoot}/operator/cmd/main.go",
  "cwd": "${workspaceRoot}/operator",
- "buildFlags": "--ldflags '-X github.com/NVIDIA/skyhook/internal/version.GIT_SHA=foobars -X github.com/NVIDIA/skyhook/internal/version.VERSION=v0.5.0'",
+ "buildFlags": "--ldflags '-X github.com/NVIDIA/skyhook/operator/internal/version.GIT_SHA=foobars -X github.com/NVIDIA/skyhook/operator/internal/version.VERSION=v0.5.0'",
  "env": {
  "ENABLE_WEBHOOKS": "false",
  "LOG_ENCODER": "console",

README.md

Lines changed: 10 additions & 0 deletions
@@ -186,9 +186,19 @@ The operator will apply steps in a package throughout different lifecycle stages
 
 The stages are applied in this order:
 
+**Without Interrupts:**
+- Uninstall -> Apply -> Config (No Upgrade)
+- Upgrade -> Config (With Upgrade)
+
+**With Interrupts:**
+For packages that require interrupts, the node is first cordoned and drained to ensure workloads are safely evacuated before package operations begin:
 - Uninstall -> Apply -> Config -> Interrupt -> Post-Interrupt (No Upgrade)
 - Upgrade -> Config -> Interrupt -> Post-Interrupt (With Upgrade)
 
+This ensures that when operations like kernel module unloading or system reboots are required, they happen after workloads have been safely removed and any necessary pre-interrupt package operations have completed.
+
+**NOTE**: If a package is removed from the SCR, then the uninstall stage for that package will solely be run.
+
 **Semantic versioning is strictly enforced in the operator** in order to support upgrade and uninstall. Semantic versioning allows the
 operator to know which way the package is going while also enforcing best versioning practices.
 

docs/README.md

Lines changed: 3 additions & 0 deletions
@@ -14,6 +14,9 @@ This directory contains user and operator documentation for Skyhook. Here you'll
 - [Runtime Required](runtime_required.md):
 How to use the runtime required taint and feature in Skyhook.
 
+- [Interrupt Flow and Ordering](interrupt_flow.md):
+Detailed explanation of how Skyhook handles packages with interrupts, including the interrupt sequence.
+
 - [Strict Ordering](ordering_of_skyhooks.md): How and why the operator applies each Skyhook Custom Resource in a deterministic sequential order.
 
 - **Resources**

docs/interrupt_flow.md

Lines changed: 87 additions & 0 deletions
@@ -0,0 +1,87 @@
+# Interrupt Flow and Ordering
+
+This document explains how Skyhook handles packages that require interrupts and the specific ordering of operations to ensure safe and reliable execution.
+
+## Overview
+
+When a package requires an interrupt (such as a reboot or service restart), Skyhook follows a specific sequence to ensure that workloads are safely evacuated from the node before any potentially disruptive operations occur.
+
+## Interrupt Flow Sequence
+
+### For packages WITH interrupts:
+
+1. **Uninstall** (if downgrading) - Package uninstallation operations are executed.
+2. **Cordon** - Node is marked as unschedulable to prevent new workloads from being scheduled
+3. **Wait** - System waits for any conflicting workloads to naturally complete or be rescheduled
+4. **Drain** - Remaining workloads are gracefully evicted from the node
+5. **Apply** / **Upgrade** (if upgrading) - Package installation/upgrade operations are executed
+6. **Config** - Configuration and setup operations are performed
+7. **Interrupt** - The actual interrupt operation (reboot, service restart, etc.) is executed
+8. **Post-Interrupt** - Any cleanup or verification operations after the interrupt
+
+### For packages WITHOUT interrupts:
+
+1. **Uninstall** (if downgrading) - Package uninstallation operations are executed.
+2. **Apply** / **Upgrade** (if upgrading) - Package installation/upgrade operations are executed
+3. **Config** - Configuration and setup operations are performed
+
+## Why This Order Matters
+
+The **uninstall → cordon → wait → drain → apply/upgrade → config → interrupt** sequence is critical for several reasons:
+
+### Safety First
+- Workloads are safely removed before any potentially disruptive operations
+- Prevents data loss or service interruption for running applications
+- Ensures the node is in a clean state before package operations begin
+
+### Use Cases
+This ordering is particularly important for scenarios such as:
+
+- **Kernel module changes**: Unloading kernel modules while workloads are present could cause system instability
+- **GPU mode switching**: Changing GPU from graphics to compute mode requires exclusive access
+- **Driver updates**: Hardware driver changes need exclusive access to the hardware
+- **System reboots**: Obviously require all workloads to be evacuated first
+
+### Example Scenario
+
+Consider a package that needs to unload a kernel module, perform some operations, and then reboot:
+
+```yaml
+apiVersion: skyhook.nvidia.com/v1alpha1
+kind: Skyhook
+metadata:
+  name: gpu-mode-switch
+spec:
+  packages:
+    gpu-driver:
+      version: "1.0.0"
+      image: "example/gpu-driver"
+      interrupt:
+        type: "reboot"
+```
+
+**Flow:**
+1. **Cordon**: Node becomes unschedulable
+2. **Wait**: Any non-interrupt workloads are given time to complete
+3. **Drain**: Remaining workloads are evicted
+4. **Apply**: GPU driver package operations run (unload old module, install new)
+5. **Config**: Configuration files are updated
+6. **Interrupt**: System reboots to complete the driver change
+7. **Post-Interrupt**: Verification that the new driver is loaded correctly
+
+## Technical Implementation
+
+The interrupt flow is managed by the `ProcessInterrupt` and `EnsureNodeIsReadyForInterrupt` functions in the Skyhook controller, which:
+
+- Check for conflicting workloads using label selectors
+- Coordinate the cordon and drain operations
+- Ensure the node is ready before proceeding with package operations
+- Handle the timing and sequencing of all stages
+
+## Best Practices
+
+- Always test interrupt-enabled packages in non-production environments first
+- Use appropriate `podNonInterruptLabels` selectors to identify important workloads that should block interrupts
+- Consider the impact of node cordoning on cluster capacity
+- Monitor package logs during interrupt operations for troubleshooting
+- Use Grafana dashboards to monitor interrupt operations and track package state transitions across your cluster (see [docs/metrics/](metrics/) for dashboard setup and configuration)

docs/operator-status-definitions.md

Lines changed: 6 additions & 3 deletions
@@ -75,7 +75,10 @@ upgrade → config
 ```
 
 ### With Interrupts:
+When a package requires an interrupt, the node is first cordoned and drained before package operations begin:
 ```
-uninstall → apply → config → interrupt → post-interrupt
-upgrade → config → interrupt → post-interrupt
-```
+uninstall (if downgrading) → cordon → wait → drain → apply → config → interrupt → post-interrupt
+cordon → wait → drain → upgrade (if upgrading) → config → interrupt → post-interrupt
+```
+
+**Note**: The cordon, wait, and drain phases ensure that workloads are safely removed from the node before any package operations that require interrupts (such as reboots or kernel module changes) are executed.

k8s-tests/chainsaw/skyhook/config-skyhook/assert.yaml

Lines changed: 19 additions & 15 deletions
@@ -21,7 +21,7 @@ metadata:
     skyhook.nvidia.com/test-node: skyhooke2e
     skyhook.nvidia.com/status_config-skyhook: in_progress
   annotations:
-    ("skyhook.nvidia.com/nodeState_config-skyhook" && parse_json("skyhook.nvidia.com/nodeState_config-skyhook")):
+    ("skyhook.nvidia.com/nodeState_config-skyhook" && parse_json("skyhook.nvidia.com/nodeState_config-skyhook")):
       {
         "baxter|3.2.1": {
           "name": "baxter",
@@ -30,22 +30,26 @@ metadata:
           "stage": "apply",
           "state": "in_progress"
         },
-        "dexter|1.2.3": {
-          "name": "dexter",
-          "version": "1.2.3",
+        "spencer|3.2.3": {
+          "name": "spencer",
+          "version": "3.2.3",
           "image": "ghcr.io/nvidia/skyhook/agentless",
           "stage": "apply",
           "state": "in_progress"
         },
-        "spencer|3.2.3": {
-          "name": "spencer",
-          "version": "3.2.3",
+        "dexter|1.2.3": {
+          "name": "dexter",
+          "version": "1.2.3",
           "image": "ghcr.io/nvidia/skyhook/agentless",
           "stage": "apply",
           "state": "in_progress"
-        }
+        },
       }
     skyhook.nvidia.com/status_config-skyhook: in_progress
+spec:
+  taints:
+    - effect: NoSchedule
+      key: node.kubernetes.io/unschedulable
 status:
   (conditions[?type == 'skyhook.nvidia.com/config-skyhook/NotReady']):
   - reason: "Incomplete"
@@ -62,13 +66,7 @@ status:
   status: in_progress
   nodeState:
     (values(@)):
-    - dexter|1.2.3:
-        name: dexter
-        state: in_progress
-        version: '1.2.3'
-        stage: apply
-        image: ghcr.io/nvidia/skyhook/agentless
-      baxter|3.2.1:
+    - baxter|3.2.1:
         name: baxter
         state: in_progress
         version: '3.2.1'
@@ -80,6 +78,12 @@ status:
         version: '3.2.3'
         stage: apply
         image: ghcr.io/nvidia/skyhook/agentless
+      dexter|1.2.3:
+        name: dexter
+        state: in_progress
+        version: '1.2.3'
+        stage: apply
+        image: ghcr.io/nvidia/skyhook/agentless
   nodeStatus:
     # grab values should be one and is complete
     (values(@)):

k8s-tests/chainsaw/skyhook/interrupt-grouping/assert.yaml

Lines changed: 21 additions & 1 deletion
@@ -13,7 +13,27 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
+---
+kind: Node
+apiVersion: v1
+metadata:
+  labels:
+    skyhook.nvidia.com/test-node: skyhooke2e
+    skyhook.nvidia.com/status_interrupt-grouping: in_progress
+  annotations:
+    skyhook.nvidia.com/status_interrupt-grouping: in_progress
+spec:
+  taints:
+    - effect: NoSchedule
+      key: node.kubernetes.io/unschedulable
+status:
+  (conditions[?type == 'skyhook.nvidia.com/interrupt-grouping/NotReady']):
+  - reason: "Incomplete"
+    status: "True"
+  (conditions[?type == 'skyhook.nvidia.com/interrupt-grouping/Erroring']):
+  - reason: "Not Erroring"
+    status: "False"
+---
 kind: Pod
 apiVersion: v1
 metadata:
Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+---
+kind: Pod
+apiVersion: v1
+metadata:
+  namespace: skyhook
+  labels:
+    skyhook.nvidia.com/name: interrupt
+    skyhook.nvidia.com/package: jason-1.3.2
+  annotations:
+    ("skyhook.nvidia.com/package" && parse_json("skyhook.nvidia.com/package")):
+      {
+        "name": "jason",
+        "version": "1.3.2",
+        "skyhook": "interrupt",
+        "stage": "apply",
+        "image": "ghcr.io/nvidia/skyhook/agentless"
+      }
+  ownerReferences:
+    - apiVersion: skyhook.nvidia.com/v1alpha1
+      kind: Skyhook
+      name: interrupt
+spec:
+  initContainers:
+    - name: jason-init
+      image: ghcr.io/nvidia/skyhook/agentless:1.3.2
+    - name: jason-apply
+      image: ghcr.io/nvidia/skyhook/agentless:3.2.3
+      args:
+        ([0]): apply
+        ([1]): /root
+        (length(@)): 3
+    - name: jason-applycheck
+      image: ghcr.io/nvidia/skyhook/agentless:3.2.3
+      args:
+        ([0]): apply-check
+        ([1]): /root
+        (length(@)): 3
+---
+apiVersion: v1
+kind: Node
+metadata:
+  labels:
+    skyhook.nvidia.com/test-node: skyhooke2e
+    skyhook.nvidia.com/status_interrupt: in_progress
+  annotations:
+    ("skyhook.nvidia.com/nodeState_interrupt" && parse_json("skyhook.nvidia.com/nodeState_interrupt")):
+      {
+        "jason|1.3.2": {
+          "name": "jason",
+          "version": "1.3.2",
+          "image": "ghcr.io/nvidia/skyhook/agentless",
+          "stage": "config",
+          "state": "complete"
+        }
+      }
+    skyhook.nvidia.com/status_interrupt: in_progress
+spec:
+  taints:
+    - effect: NoSchedule
+      key: node.kubernetes.io/unschedulable
+status:
+  (conditions[?type == 'skyhook.nvidia.com/interrupt/NotReady']):
+  - reason: "Incomplete"
+    status: "True"
+  (conditions[?type == 'skyhook.nvidia.com/interrupt/Erroring']):
+  - reason: "Not Erroring"
+    status: "False"
+---
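
Reviewer sketch: the Node assertions in this commit expect a cordoned node to carry the `node.kubernetes.io/unschedulable` taint with effect `NoSchedule` while package stages run. The check below restates that expectation in Go. It is a minimal sketch under stated assumptions: `Taint` is a simplified stand-in for the Kubernetes taint type, and `isCordoned` is a hypothetical helper, not a function from the operator or from client-go.

```go
package main

import "fmt"

// Taint is a simplified stand-in holding only the two fields
// that the chainsaw assertions check (assumption for this sketch).
type Taint struct {
	Key    string
	Effect string
}

const unschedulableKey = "node.kubernetes.io/unschedulable"

// isCordoned reports whether the taint list contains the
// unschedulable taint with the NoSchedule effect.
// Hypothetical helper for illustration only.
func isCordoned(taints []Taint) bool {
	for _, t := range taints {
		if t.Key == unschedulableKey && t.Effect == "NoSchedule" {
			return true
		}
	}
	return false
}

func main() {
	cordoned := []Taint{{Key: unschedulableKey, Effect: "NoSchedule"}}
	fmt.Println(isCordoned(cordoned))
	fmt.Println(isCordoned(nil))
}
```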
