11 changes: 5 additions & 6 deletions chart/templates/skyhook-crd.yaml
@@ -421,12 +421,6 @@ spec:
type: object
description: Packages are the DAG of packages to be applied to nodes.
type: object
pause:
default: false
description: |-
Pause halt the operator from proceeding. THIS is for admin use to stop skyhook if there is an issue or
concert without needing to delete to ad in discovery of the issue.
type: boolean
podNonInterruptLabels:
description: PodNonInterruptLabels are a set of labels we want to
monitor pods for whether they Interruptible
@@ -479,6 +473,11 @@ spec:
description: This skyhook is required to have been completed before
any workloads can start
type: boolean
priority:
description: Priority determines the order in which skyhooks are applied. Lower values are applied first.
type: integer
minimum: 0
default: 200
serial:
default: false
description: Serial tells skyhook if it allowed to run in parallel or
24 changes: 24 additions & 0 deletions docs/ordering_of_skyhooks.md
@@ -0,0 +1,24 @@
# Ordering of Skyhooks
## What
As of v0.8.0, Skyhooks are always applied in a repeatable, specific order. This also means that Skyhooks are now processed sequentially, though packages within a Skyhook can still run in parallel. Each custom resource now supports a `priority` field, which is a non-negative integer (minimum 0). Skyhooks are processed in ascending `priority` order starting from 0; Skyhooks with the same `priority` are ordered by their `metadata.name` field.

**NOTE**: Any Skyhook which does NOT provide a `priority` field will be assigned a priority value of 200.
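
The ordering rule above can be sketched as a sort on the pair (`priority`, `metadata.name`). This is a minimal illustration and not the operator's actual code: the `Skyhook` dataclass, `DEFAULT_PRIORITY`, and `processing_order` are hypothetical names, and breaking ties with plain string comparison on the name is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Default taken from the NOTE above; the constant name itself is hypothetical.
DEFAULT_PRIORITY = 200


@dataclass
class Skyhook:
    name: str                       # stands in for metadata.name
    priority: Optional[int] = None  # None models an omitted `priority` field


def processing_order(skyhooks: List[Skyhook]) -> List[Skyhook]:
    """Sort by priority (lowest first), then by name for equal priorities."""
    def key(s: Skyhook):
        effective = DEFAULT_PRIORITY if s.priority is None else s.priority
        return (effective, s.name)
    return sorted(skyhooks, key=key)


skyhooks = [
    Skyhook("tuning"),            # no priority given, so 200 is assumed
    Skyhook("monitoring", 10),
    Skyhook("ssh", 100),
    Skyhook("audit", 10),         # ties with "monitoring", ordered by name
]
ordered_names = [s.name for s in processing_order(skyhooks)]
print(ordered_names)
```

Here `audit` and `monitoring` share priority 10 and are ordered by name, `ssh` follows at 100, and `tuning` runs last because it falls back to the default of 200.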

Two additional flow control features have been added alongside this change. Both are set as annotations on each Skyhook:
* `skyhook.nvidia.com/disable`: bool. When `true`, this Skyhook is skipped and processing continues with the Skyhooks further down the priority order.
* `skyhook.nvidia.com/pause`: bool. When `true`, this Skyhook is NOT processed and processing does NOT continue to any Skyhooks after it. This effectively stops all application of Skyhooks starting with this one. NOTE: This ability used to live on the Skyhook spec itself as the `pause` field. It has been moved to an annotation to be consistent with `disable` and to avoid incrementing the generation of a Skyhook Custom Resource instance when it changes.
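
The difference between the two annotations can be sketched as a loop over the Skyhooks in priority order: `disable` skips one entry, while `pause` stops the walk. This is an illustrative model only. The dictionary layout, the helper names `is_true` and `skyhooks_to_process`, the string comparison against `"true"`, and checking `pause` before `disable` when both are set are all assumptions, not behavior confirmed by the operator.

```python
PAUSE = "skyhook.nvidia.com/pause"
DISABLE = "skyhook.nvidia.com/disable"
DEFAULT_PRIORITY = 200  # default when `priority` is omitted


def is_true(annotations, key):
    # Annotation values are strings; treating only "true" as enabled is an assumption.
    return annotations.get(key, "false").lower() == "true"


def skyhooks_to_process(skyhooks):
    """Return the names that would be processed, in order.

    Each item is a dict with "name", optional "priority", optional "annotations".
    """
    ordered = sorted(
        skyhooks,
        key=lambda s: (s.get("priority", DEFAULT_PRIORITY), s["name"]),
    )
    result = []
    for s in ordered:
        annotations = s.get("annotations", {})
        if is_true(annotations, PAUSE):
            break      # paused: stop here, nothing after this one runs
        if is_true(annotations, DISABLE):
            continue   # disabled: skip this one, keep going
        result.append(s["name"])
    return result


example = [
    {"name": "a", "priority": 10},
    {"name": "b", "priority": 20, "annotations": {DISABLE: "true"}},
    {"name": "c", "priority": 30},
    {"name": "d", "priority": 40, "annotations": {PAUSE: "true"}},
    {"name": "e", "priority": 50},
]
processed = skyhooks_to_process(example)
print(processed)
```

In this example `b` is skipped but `c` still runs, while pausing `d` also prevents `e` from being processed.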

## Why
This solves a few problems:

The first is to better support debugging. Previously it was impossible to know the order in which Skyhooks would be applied to nodes, because they all ran in parallel. This can lead, and has led, to problems when debugging, because the behavior isn't deterministic. Now every node always receives updates in the same order as every other node. Additionally, this removes the possibility of conflicts between Skyhooks by having each one run in order.

The second is to provide the ability for complex tasks to be sequenced. This comes up when needing to apply different sets of work to different node groups in a particular order.

The third is to provide the community a way to bucket Skyhooks according to where they might live in a stream of updates and therefore better coordinate work without explicit communication. We propose the following buckets:
* 1 - 99 for initialization and infrastructure work
    * install security or monitoring tools
* 100 - 199 for configuration work
    * configuring ssh access
* 200+ for final user level configuration
    * applying tuning for workloads
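
As a rough illustration of the proposed buckets, the helper below maps a `priority` value to its bucket. The function name `bucket_for` and the returned labels are hypothetical. Priority 0 is allowed by the CRD (`minimum: 0`) but is not covered by any proposed bucket, so the sketch reports it as unassigned.

```python
def bucket_for(priority: int) -> str:
    """Map a Skyhook priority to one of the proposed buckets (illustrative only)."""
    if priority < 0:
        raise ValueError("priority must be >= 0")
    if 1 <= priority <= 99:
        return "initialization and infrastructure"
    if 100 <= priority <= 199:
        return "configuration"
    if priority >= 200:
        return "final user level configuration"
    return "unassigned"  # priority 0: valid, but outside the proposed buckets


for p in (0, 50, 150, 200):
    print(p, "->", bucket_for(p))
```

Note that the default priority of 200 lands in the final user level configuration bucket.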
3 changes: 3 additions & 0 deletions examples/interrupt-wait-for-pod/scr.yaml
@@ -25,6 +25,9 @@ metadata:
app.kubernetes.io/part-of: skyhook-operator
app.kubernetes.io/created-by: skyhook-operator
name: demo
annotations:
skyhook.nvidia.com/pause: "false"
skyhook.nvidia.com/disable: "false"
spec:
nodeSelectors:
matchLabels:
10 changes: 8 additions & 2 deletions examples/simple/scr.yaml
@@ -25,17 +25,23 @@ metadata:
app.kubernetes.io/part-of: skyhook-operator
app.kubernetes.io/created-by: skyhook-operator
name: demo
annotations:
skyhook.nvidia.com/pause: "false"
skyhook.nvidia.com/disable: "false"
spec:
nodeSelectors:
matchLabels:
skyhook.nvidia.com/test-node: skyhooke2e
packages:
baz:
version: 1.1.0
image: ghcr.io/nvidia/skyhook-packages/shellscript
configMap:
config.yaml: |-
#!/bin/bash
sleep 30
sleep 1
echo "Hello, config!"
config_check.yaml: |-
#!/bin/bash
sleep 30
sleep 1
echo "Hello, config check!"
@@ -24,7 +24,6 @@ kind: Test
metadata:
name: cleanup-pods
spec:
concurrent: true
description: |
This test runs a simple skyhook with dependsOn. We wait tell completed, then trigger update to force config cycle on package B. Once config
is complete, we update again to make the package error, and at the same clear out the node annotation to trigger cleanup.
@@ -31,7 +31,6 @@ spec:
accordingly but the package with the config changes may be skipeed and hang, and this asserts that it doesn't hang in that condition. Then once that completes the same two
packages have the key with a config interrupt changed and it's asserted that the config, interrupt, and post-interrupt runs for both those packages. Then once that completes
again it does one more update on a key for the same two packages which doesn't have a config interrupt defined and makes sure that the config steps are ran for that.
concurrent: true
timeouts:
assert: 360s
catch: ## if errors, print the most important info
9 changes: 9 additions & 0 deletions k8s-tests/chainsaw/skyhook/config-skyhook/skyhook.yaml
@@ -38,6 +38,9 @@ spec:
interrupt:
type: service
services: [cron]
env:
- name: SLEEP_LEN
value: "1"
dexter:
version: "1.2.3"
image: ghcr.io/nvidia/skyhook/agentless
@@ -47,6 +50,9 @@ spec:
game.properties:
type: service
services: [rsyslog]
env:
- name: SLEEP_LEN
value: "1"
configMap:
game.properties: |
enemies=aliens
@@ -65,6 +71,9 @@ spec:
game.properties:
type: service
services: [rsyslog, cron]
env:
- name: SLEEP_LEN
value: "1"
configMap:
game.properties: |
enemies=aliens
@@ -38,6 +38,9 @@ spec:
interrupt:
type: service
services: [cron]
env:
- name: SLEEP_LEN
value: "1"
dexter:
version: "1.2.3"
image: ghcr.io/nvidia/skyhook/agentless
@@ -47,6 +50,9 @@ spec:
game.properties:
type: service
services: [rsyslog]
env:
- name: SLEEP_LEN
value: "1"
configMap:
game.properties: |
changed
@@ -59,6 +65,9 @@ spec:
game.properties:
type: service
services: [rsyslog, cron]
env:
- name: SLEEP_LEN
value: "1"
configMap:
game.properties: |
changed again
@@ -38,6 +38,9 @@ spec:
interrupt:
type: service
services: [cron]
env:
- name: SLEEP_LEN
value: "1"
dexter:
version: "1.2.3"
image: ghcr.io/nvidia/skyhook/agentless
@@ -47,6 +50,9 @@ spec:
game.properties:
type: service
services: [rsyslog]
env:
- name: SLEEP_LEN
value: "1"
configMap:
game.properties: |
enemies=aliens
@@ -61,6 +67,9 @@ spec:
baxter:
version: "3.2.1"
image: ghcr.io/nvidia/skyhook/agentless
env:
- name: SLEEP_LEN
value: "1"
configInterrupts:
game.properties:
type: service
9 changes: 9 additions & 0 deletions k8s-tests/chainsaw/skyhook/config-skyhook/update.yaml
@@ -38,6 +38,9 @@ spec:
interrupt:
type: service
services: [cron]
env:
- name: SLEEP_LEN
value: "1"
dexter:
version: "1.2.3"
image: ghcr.io/nvidia/skyhook/agentless
@@ -47,6 +50,9 @@ spec:
game.properties:
type: service
services: [rsyslog]
env:
- name: SLEEP_LEN
value: "1"
configMap:
game.properties: |
changed
@@ -62,6 +68,9 @@ spec:
game.properties:
type: service
services: [rsyslog, cron]
env:
- name: SLEEP_LEN
value: "1"
configMap:
game.properties: |
changed again
1 change: 0 additions & 1 deletion k8s-tests/chainsaw/skyhook/depends-on/chainsaw-test.yaml
@@ -24,7 +24,6 @@ kind: Test
metadata:
name: depends-on
spec:
concurrent: true
description: |
Test makes sure depends-on works as expected. c depends on a, and b. Make sure a and b complete before c starts.
timeouts:
@@ -24,7 +24,6 @@ kind: Test
metadata:
name: failure-skyhook
spec:
concurrent: true
timeouts:
assert: 240s
catch: ## if errors, print the most important info
2 changes: 2 additions & 0 deletions k8s-tests/chainsaw/skyhook/failure-skyhook/skyhook.yaml
@@ -38,6 +38,8 @@ spec:
env:
- name: EXIT_CODE
value: "2"
- name: SLEEP_LEN
value: "1"
dependsOn:
dexter: "1.2.3"
dexter:
1 change: 0 additions & 1 deletion k8s-tests/chainsaw/skyhook/interrupt/chainsaw-test.yaml
@@ -31,7 +31,6 @@ spec:
Additionally this test tests the agentImageOverride, interruption budgets, and dependsOn fields.
## this can't run concurrently because it can cause a race condition where the other skyhooks make the node unschedulable
## and the pods won't come up
concurrent: false
timeouts:
assert: 240s
catch: ## if errors, print the most important info
@@ -24,7 +24,6 @@ kind: Test
metadata:
name: pod-finalization
spec:
concurrent: true
timeouts:
assert: 180s
steps:
@@ -24,7 +24,6 @@ kind: Test
metadata:
name: runtime-required
spec:
concurrent: false
timeouts:
assert: 240s
catch: ## if errors, print the most important info
3 changes: 3 additions & 0 deletions k8s-tests/chainsaw/skyhook/runtime-required/skyhook.yaml
@@ -34,4 +34,7 @@ spec:
spencer:
version: "3.2.3"
image: ghcr.io/nvidia/skyhook/agentless
env:
- name: SLEEP_LEN
value: "1"

@@ -24,7 +24,7 @@ kind: Test
metadata:
name: simple-skyhook
spec:
concurrent: false ## because this test creates a limit range in the namespace, it must be run serially
# skip: false ## this test doesn't seem to be useful, just slows things down, leaving it for, should delete at some point if still skipped
timeouts:
assert: 240s
steps:
8 changes: 7 additions & 1 deletion k8s-tests/chainsaw/skyhook/simple-skyhook/skyhook.yaml
@@ -40,14 +40,17 @@ spec:
image: ghcr.io/nvidia/skyhook/agentless
dependsOn:
dexter: "1.2.3"
env:
- name: SLEEP_LEN
value: "1"
foobar:
version: "1.2"
image: ghcr.io/nvidia/skyhook/agentless
dependsOn:
dexter: "1.2.3"
env:
- name: SLEEP_LEN
value: "3" ## making faster so the test works for asserting node condition
value: "1" ## making faster so the test works for asserting node condition
resources:
cpuLimit: 50m
cpuRequest: 50m
@@ -67,3 +70,6 @@ spec:
color.bad=yellow
allow.textmode=true
how.nice.to.look=fairlyNice
env:
- name: SLEEP_LEN
value: "1"
@@ -24,7 +24,6 @@ kind: Test
metadata:
name: simple-update-skyhook
spec:
concurrent: true
timeouts:
assert: 240s
catch: ## if errors, print the most important info
10 changes: 8 additions & 2 deletions k8s-tests/chainsaw/skyhook/simple-update-skyhook/skyhook.yaml
@@ -37,12 +37,15 @@ spec:
image: ghcr.io/nvidia/skyhook/agentless
dependsOn:
dexter: "1.2.3"
env:
- name: SLEEP_LEN
value: "1"
foobar:
version: "1.2"
image: ghcr.io/nvidia/skyhook/agentless
env:
- name: SLEEP_LEN
value: "3" ## making faster so the test works for asserting node condition
value: "1" ## making faster so the test works for asserting node condition
dependsOn:
dexter: "1.2.3"
dexter:
@@ -58,4 +61,7 @@ spec:
color.good=purple
color.bad=yellow
allow.textmode=true
how.nice.to.look=fairlyNice
how.nice.to.look=fairlyNice
env:
- name: SLEEP_LEN
value: "1"
@@ -46,9 +46,12 @@ spec:
color.bad=yellow
allow.textmode=true
how.nice.to.look=fairlyNice
env:
- name: SLEEP_LEN
value: "1"
jackie-chan: ## TODO: need to handle bad naming, because this are pod names, so can have issues depending on the chars
version: "2024.7.7-test"
image: ghcr.io/nvidia/skyhook/agentless
env:
- name: SLEEP_LEN
value: "3" ## making faster so the test works for asserting node condition
value: "1" ## making faster so the test works for asserting node condition
@@ -24,7 +24,6 @@ kind: Test
metadata:
name: skyhook-upgrade
spec:
concurrent: true
skip: true ## skipping because this test current requires manual updating of the version
description: |
This test is skipped because it is because its not automated to change versions while its running.
2 changes: 1 addition & 1 deletion k8s-tests/chainsaw/skyhook/skyhook-upgrade/skyhook.yaml
@@ -38,4 +38,4 @@ spec:
image: ghcr.io/nvidia/skyhook/agentless
env:
- name: SLEEP_LEN
value: "3" ## making faster so the test works for asserting node condition
value: "1" ## making faster so the test works for asserting node condition