Commit 9c4d961

docs: add/update docs for deployment policy & compartments (#113)
* docs: add/update docs for deployment policy & compartments
1 parent cfb05df commit 9c4d961

5 files changed: +431 -0 lines changed

docs/README.md

Lines changed: 3 additions & 0 deletions
@@ -19,6 +19,9 @@ This directory contains user and operator documentation for Skyhook. Here you'll

  - [Strict Ordering](ordering_of_skyhooks.md): How and why the operator applies each Skyhook Custom Resource in a deterministic sequential order.

  - [Deployment Policy and Compartments](deployment_policy.md):
    Fine-grained rollout control with compartments, budgets, and strategies. Includes overlap resolution, safety mechanisms, and migration from interruptionBudget.

- **Resources**
  - [Resource Management](resource_management.md):
    How Skyhook manages CPU/memory resources using LimitRange, per-package overrides, and validation rules.

docs/deployment_policy.md

Lines changed: 314 additions & 0 deletions
@@ -0,0 +1,314 @@
# Deployment Policy and Compartments

Deployment Policy provides fine-grained control over how Skyhook rolls out updates across your cluster. You define **compartments** (groups of nodes selected by labels), each with its own rollout strategy and budget.

---

## Overview

A **DeploymentPolicy** is a Kubernetes Custom Resource that separates rollout configuration from the Skyhook Custom Resource, allowing you to:
- Reuse the same policy across multiple Skyhooks
- Apply different strategies to different node groups (e.g., production vs. test)
- Control rollout speed and safety with configurable thresholds

---

## Basic Structure

```yaml
apiVersion: skyhook.nvidia.com/v1alpha1
kind: DeploymentPolicy
metadata:
  name: my-policy
  namespace: skyhook
spec:
  # Default applies to nodes that don't match any compartment
  default:
    budget:
      percent: 100  # or count: N
    strategy:
      fixed:  # or linear, exponential
        initialBatch: 1
        batchThreshold: 100
        failureThreshold: 3
        safetyLimit: 50

  # Compartments define specific node groups
  compartments:
    - name: production
      selector:
        matchLabels:
          env: production
      budget:
        percent: 25  # Scales with cluster size
      strategy:
        exponential:
          initialBatch: 1
          growthFactor: 2
          batchThreshold: 100
          failureThreshold: 1
          safetyLimit: 50
```

---

## Core Concepts

### Compartments
A compartment is a named group of nodes selected by labels. Each compartment has:
- **Selector**: Kubernetes `LabelSelector` to match nodes
- **Budget**: Maximum nodes in progress at once (count or percent)
- **Strategy**: Rollout pattern (fixed, linear, or exponential)

### Budgets
A budget defines the ceiling on how many nodes can be in progress at once:
- **Count**: Fixed number (e.g., `count: 3`)
- **Percent**: Percentage of matched nodes (e.g., `percent: 25`)

**Rounding for Percent**: `ceiling = max(1, int(matched_nodes × percent / 100))`
- Always rounds **down**
- Minimum is **1** (unless 0 nodes match)

**Examples**:
| Matched Nodes | Percent | Ceiling |
|---------------|---------|---------|
| 10 | 25% | 2 |
| 10 | 30% | 3 |
| 5 | 10% | 1 (0.5 rounds down to 0, then `max(1, 0)` = 1) |
| 100 | 1% | 1 |

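
The rounding rule can be checked with a few lines of Python. This is a minimal sketch, not the operator's implementation: `percent_ceiling` is a hypothetical helper name, and returning 0 when no nodes match is an assumption based on the "unless 0 nodes match" note. Floor division is used in place of `int(...)` so the result never depends on floating-point rounding.

```python
def percent_ceiling(matched_nodes: int, percent: int) -> int:
    """Return the concurrent-node ceiling for a percent budget."""
    if matched_nodes == 0:
        # Assumption: an empty compartment has nothing to roll out.
        return 0
    # Floor division rounds down; max() enforces the minimum of 1.
    return max(1, matched_nodes * percent // 100)


# Rows from the examples table: (matched nodes, percent)
for nodes, pct in [(10, 25), (10, 30), (5, 10), (100, 1)]:
    print(f"{nodes} nodes at {pct}% -> ceiling {percent_ceiling(nodes, pct)}")
```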

---

## Rollout Strategies

### Fixed Strategy
Constant batch size throughout the rollout.

```yaml
strategy:
  fixed:
    initialBatch: 5        # Always process 5 nodes
    batchThreshold: 100    # Require 100% success
    failureThreshold: 3    # Stop after 3 consecutive failures
    safetyLimit: 50        # Apply failure threshold only below 50% progress
```

**Use when**: You want predictable, safe rollouts.


---

### Linear Strategy
The batch size increases by `delta` after a successful batch and decreases after a failed one.

```yaml
strategy:
  linear:
    initialBatch: 1
    delta: 1  # Increase by 1 each success
    batchThreshold: 100
    failureThreshold: 3
    safetyLimit: 50
```

**Progression** (delta=1, every batch succeeding): `1 → 2 → 3 → 4 → 5`

**Use when**: You want gradual ramp-up with slowdown on failures.


---

### Exponential Strategy
The batch size is multiplied by `growthFactor` after a successful batch and divided after a failed one.

```yaml
strategy:
  exponential:
    initialBatch: 1
    growthFactor: 2  # Double on success
    batchThreshold: 100
    failureThreshold: 2
    safetyLimit: 50
```

**Progression** (factor=2, every batch succeeding): `1 → 2 → 4 → 8 → 16`

**Use when**: You want fast rollouts in large clusters with high confidence.

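
The linear and exponential progressions can be reproduced with a small simulation. This is a sketch under stated assumptions rather than the operator's code: `next_batch` and `progression` are hypothetical names, the batch size is assumed never to drop below 1, a failed linear batch is assumed to shrink by the same `delta`, and a failed exponential batch is assumed to use integer division. The budget ceiling, which presumably also caps the batch size, is not modeled here.

```python
def next_batch(strategy: str, current: int, succeeded: bool,
               delta: int = 1, growth_factor: int = 2) -> int:
    """Return the next batch size given the outcome of the current batch."""
    if strategy == "fixed":
        return current
    if strategy == "linear":
        step = current + delta if succeeded else current - delta
    elif strategy == "exponential":
        step = current * growth_factor if succeeded else current // growth_factor
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    # Assumption: a batch always contains at least one node.
    return max(1, step)


def progression(strategy: str, initial_batch: int, outcomes, **params) -> list:
    """Batch sizes for a sequence of batch outcomes (True = batch succeeded)."""
    sizes = [initial_batch]
    for ok in outcomes:
        sizes.append(next_batch(strategy, sizes[-1], ok, **params))
    return sizes


print(progression("linear", 1, [True] * 4, delta=1))
print(progression("exponential", 1, [True] * 4, growth_factor=2))
print(progression("exponential", 1, [True, True, False], growth_factor=2))
```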

---

## Strategy Parameters

All strategies share these parameters:

- **`initialBatch`** (≥1): Starting number of nodes (default: 1)
- **`batchThreshold`** (1-100): Minimum success percentage a batch needs for the rollout to continue (default: 100)
- **`failureThreshold`** (≥1): Max consecutive failures before stopping (default: 3)
- **`safetyLimit`** (1-100): Progress percentage below which failure handling applies (default: 50)

### Safety Limit Behavior

**Before safetyLimit** (e.g., < 50% progress):
- Failures count toward `failureThreshold`
- Batch sizes shrink after a failed batch (linear and exponential only)
- Reaching `failureThreshold` stops the rollout

**After safetyLimit** (e.g., ≥ 50% progress):
- Rollout continues despite failures
- Batch sizes don't slow down
- Assumes rollout is "safe enough" to complete

**Rationale**: Early failures indicate a problem. Late failures are less critical since most nodes are updated.

---
162+
163+
## Selectors and Node Matching
164+
165+
Compartments use standard Kubernetes label selectors:
166+
167+
### Match Labels
168+
```yaml
169+
selector:
170+
matchLabels:
171+
env: production
172+
tier: frontend
173+
```
174+
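
In a Kubernetes `matchLabels` selector, every listed key/value pair must be present on the node (the requirements are ANDed). The following sketch illustrates that rule with plain dictionaries; `matches` is a hypothetical helper, and the operator itself relies on the Kubernetes selector machinery rather than code like this.

```python
def matches(node_labels: dict, match_labels: dict) -> bool:
    """True when the node carries every key/value pair in match_labels."""
    return all(node_labels.get(key) == value
               for key, value in match_labels.items())


selector = {"env": "production", "tier": "frontend"}

# Extra labels on the node are fine; missing or different values are not.
print(matches({"env": "production", "tier": "frontend", "zone": "a"}, selector))
print(matches({"env": "production"}, selector))
```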

---

## Overlapping Selectors

When a node matches **multiple compartments**, the operator uses a **safety heuristic** to choose the safest one.

### Tie-Breaking Algorithm (3 levels)

1. **Strategy Safety**: Prefer safer strategies
   - **Fixed** (safest) > **Linear** > **Exponential** (least safe)

2. **Effective Ceiling**: If strategies are the same, prefer smaller ceiling
   - Smaller ceiling = fewer nodes at risk

3. **Lexicographic**: If still tied, alphabetically by compartment name
   - Ensures deterministic behavior

### Example

```yaml
compartments:
  - name: us-west
    selector:
      matchLabels:
        region: us-west
    budget:
      count: 20  # Ceiling = 20
    strategy:
      exponential: {}

  - name: production
    selector:
      matchLabels:
        env: production
    budget:
      count: 10  # Ceiling = 10 (smaller)
    strategy:
      linear: {}

  - name: critical
    selector:
      matchLabels:
        priority: critical
    budget:
      count: 3
    strategy:
      fixed: {}  # Fixed (safest)
```

**Node with labels** `region=us-west, env=production, priority=critical`:
- Matches all three compartments
- **Winner**: `critical` (fixed strategy is safest)

**Node with labels** `region=us-west, env=production`:
- Matches `us-west` (exponential) and `production` (linear)
- **Winner**: `production` (linear is safer than exponential)

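
The three tie-breaking levels map naturally onto a sort key. The sketch below is illustrative only: the dictionary layout, `STRATEGY_RANK`, and `assign_compartment` are hypothetical, and returning `None` stands in for falling back to the policy's `default` settings. The compartments mirror the YAML example.

```python
# Lower rank = safer strategy (level 1 of the tie-break).
STRATEGY_RANK = {"fixed": 0, "linear": 1, "exponential": 2}

COMPARTMENTS = [
    {"name": "us-west", "match_labels": {"region": "us-west"},
     "strategy": "exponential", "ceiling": 20},
    {"name": "production", "match_labels": {"env": "production"},
     "strategy": "linear", "ceiling": 10},
    {"name": "critical", "match_labels": {"priority": "critical"},
     "strategy": "fixed", "ceiling": 3},
]


def assign_compartment(node_labels: dict, compartments: list):
    """Return the name of the winning compartment, or None if none match."""
    matching = [
        c for c in compartments
        if all(node_labels.get(k) == v for k, v in c["match_labels"].items())
    ]
    if not matching:
        return None  # caller would use the policy's `default` settings
    # Tuple ordering applies the levels in turn: strategy, ceiling, name.
    winner = min(matching, key=lambda c: (STRATEGY_RANK[c["strategy"]],
                                          c["ceiling"], c["name"]))
    return winner["name"]


print(assign_compartment(
    {"region": "us-west", "env": "production", "priority": "critical"},
    COMPARTMENTS))
print(assign_compartment({"region": "us-west", "env": "production"}, COMPARTMENTS))
```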

---

## Using with Skyhooks

Reference a policy by name:

```yaml
apiVersion: skyhook.nvidia.com/v1alpha1
kind: Skyhook
metadata:
  name: my-skyhook
spec:
  deploymentPolicy: my-policy  # References DeploymentPolicy
  nodeSelectors:
    matchLabels:
      workload: gpu
  packages:
    # ...
```

**Behavior**:
- DeploymentPolicy must be in the **same namespace**
- Each node is assigned to a compartment based on selectors
- Nodes not matching any compartment use the `default` settings


---

## Migration from InterruptionBudget

The legacy `interruptionBudget` field is still supported but **DeploymentPolicy is recommended**.

### Before
```yaml
spec:
  interruptionBudget:
    percent: 25
```

### After
```yaml
# 1. Create DeploymentPolicy
apiVersion: skyhook.nvidia.com/v1alpha1
kind: DeploymentPolicy
metadata:
  name: legacy-equivalent
  namespace: skyhook
spec:
  default:
    budget:
      percent: 25
    strategy:
      fixed:
        initialBatch: 1
        batchThreshold: 100
        failureThreshold: 3
        safetyLimit: 50
```

```yaml
# 2. Update Skyhook
spec:
  deploymentPolicy: legacy-equivalent
  # Remove interruptionBudget field
```


---

## Monitoring

Deployment Policy rollout behavior is exposed via Prometheus metrics. See [Metrics documentation](metrics/README.md) for details.

---

## Examples

See `/operator/config/samples/deploymentpolicy_v1alpha1_deploymentpolicy.yaml` for a complete sample showing:
- Critical nodes (count=1, fixed strategy)
- Production nodes (count=3, linear strategy)
- Staging nodes (percent=33, exponential strategy)
- Test nodes (percent=50, fast exponential)

---


docs/metrics/README.md

Lines changed: 44 additions & 0 deletions
@@ -29,6 +29,50 @@ The current metrics supplied by the Operator are intended to be sufficient to de
  * `package_name` : The name of the package
  * `package_version`: The version of the package

## Rollout Metrics (Deployment Policy)
These metrics track the rollout progress and health of compartments defined in a DeploymentPolicy. See [Deployment Policy documentation](../deployment_policy.md) for details on compartments and strategies.

* `skyhook_rollout_matched_nodes` : Number of nodes matched by this compartment's selector. Tags:
  * `skyhook_name` : The name of the Skyhook Custom Resource
  * `policy_name` : The name of the DeploymentPolicy (or "legacy" if using interruptionBudget)
  * `compartment_name` : The name of the compartment (or "__default__" for unmatched nodes)
  * `strategy` : The rollout strategy type (fixed, linear, exponential, or unknown)
* `skyhook_rollout_ceiling` : Maximum number of nodes that can be in progress at once in this compartment. Tags:
  * `skyhook_name` : The name of the Skyhook Custom Resource
  * `policy_name` : The name of the DeploymentPolicy
  * `compartment_name` : The name of the compartment
  * `strategy` : The rollout strategy type
* `skyhook_rollout_in_progress` : Number of nodes currently in progress in this compartment. Tags:
  * `skyhook_name` : The name of the Skyhook Custom Resource
  * `policy_name` : The name of the DeploymentPolicy
  * `compartment_name` : The name of the compartment
  * `strategy` : The rollout strategy type
* `skyhook_rollout_completed` : Number of nodes completed in this compartment. Tags:
  * `skyhook_name` : The name of the Skyhook Custom Resource
  * `policy_name` : The name of the DeploymentPolicy
  * `compartment_name` : The name of the compartment
  * `strategy` : The rollout strategy type
* `skyhook_rollout_progress_percent` : Percentage of nodes completed in this compartment (0-100). Tags:
  * `skyhook_name` : The name of the Skyhook Custom Resource
  * `policy_name` : The name of the DeploymentPolicy
  * `compartment_name` : The name of the compartment
  * `strategy` : The rollout strategy type
* `skyhook_rollout_current_batch` : Current batch number in the rollout strategy (0 if no batch processing). Tags:
  * `skyhook_name` : The name of the Skyhook Custom Resource
  * `policy_name` : The name of the DeploymentPolicy
  * `compartment_name` : The name of the compartment
  * `strategy` : The rollout strategy type
* `skyhook_rollout_consecutive_failures` : Number of consecutive batch failures in this compartment. Tags:
  * `skyhook_name` : The name of the Skyhook Custom Resource
  * `policy_name` : The name of the DeploymentPolicy
  * `compartment_name` : The name of the compartment
  * `strategy` : The rollout strategy type
* `skyhook_rollout_should_stop` : Binary metric indicating if rollout should be stopped due to failures (1 = stopped, 0 = continuing). Tags:
  * `skyhook_name` : The name of the Skyhook Custom Resource
  * `policy_name` : The name of the DeploymentPolicy
  * `compartment_name` : The name of the compartment
  * `strategy` : The rollout strategy type

Note: When a Skyhook is deleted all metrics for that Skyhook are no longer reported.

# Testing
