
Commit f1279b8

fix(scheduling shards): docs and default
1 parent e4c9d64

2 files changed: +64, -35 lines


deployments/kai-scheduler/templates/kai-config.yaml

Lines changed: 0 additions & 3 deletions
@@ -12,9 +12,6 @@ metadata:
 spec:
   namespace: {{ .Release.Namespace }}
   global:
-    schedulerName: "kai-scheduler"
-    queueLabelKey: "kai.scheduler/queue"
-    nodePoolLabelKey: "kai.scheduler/nodepool"
     {{- if .Values.global.namespaceLabelSelector }}
     namespaceLabelSelector:
       {{- toYaml .Values.global.namespaceLabelSelector | nindent 6 }}
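If an installation still needs to customize these keys after this change, a Helm values override is the likely mechanism. Below is a minimal sketch; the `global.*` value paths are hypothetical, since this commit only removes the hardcoded template entries — check the chart's values.yaml before relying on them:

```bash
# Hypothetical override of the removed defaults at install time.
# The global.* value paths are assumptions, not confirmed by this commit.
helm upgrade --install kai-scheduler ./deployments/kai-scheduler \
  --namespace kai-system \
  --set global.schedulerName=kai-scheduler \
  --set global.queueLabelKey=kai.scheduler/queue \
  --set global.nodePoolLabelKey=kai.scheduler/node-pool
```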

docs/operator/scheduling-shards.md

Lines changed: 64 additions & 32 deletions
@@ -85,75 +85,107 @@ kubectl label nodes node-5 kai.scheduler/node-pool=high-memory-nodes
 Create queues that target specific shards:
 
 ```yaml
-apiVersion: kai.scheduler/v1
+apiVersion: scheduling.run.ai/v2
+kind: Queue
+metadata:
+  name: gpu-queue-parent
+  labels:
+    kai.scheduler/node-pool: gpu-nodes  # Targets GPU shard
+spec:
+  priority: 100
+  resources:
+    gpu:
+      quota: -1
+---
+apiVersion: scheduling.run.ai/v2
 kind: Queue
 metadata:
   name: gpu-queue
   labels:
     kai.scheduler/node-pool: gpu-nodes  # Targets GPU shard
 spec:
+  parentQueue: gpu-queue-parent
   priority: 100
-  resourceQuota:
-    gpu: 10
-    cpu: 100
-    memory: 500Gi
+  resources:
+    gpu:
+      quota: 10
 ```
 
 ```yaml
-apiVersion: kai.scheduler/v1
+apiVersion: scheduling.run.ai/v2
 kind: Queue
 metadata:
   name: cpu-queue
   labels:
     kai.scheduler/node-pool: cpu-nodes  # Targets CPU shard
 spec:
   priority: 50
+  parentQueue: cpu-queue-parent
   resourceQuota:
-    cpu: 200
-    memory: 1Ti
+    cpu:
+      quota: 200
+    memory:
+      quota: -1
 ```
 
-## Pod Group Configuration
+## Job Submission
 
-### Shard-Specific Pod Groups
+### Direct Shard Targeting
 
-Pod groups will automatically inherit shard targeting from their queue:
+Jobs should be directly submitted to a specific shard:
 
 ```yaml
-apiVersion: kai.scheduler/v1
-kind: PodGroup
+apiVersion: v1
+kind: Pod
 metadata:
-  name: gpu-training-job
+  name: gpu-pod
+  namespace: test
   labels:
-    kai.scheduler/queue: gpu-queue  # Inherits shard from queue
+    kai.scheduler/queue: foo-queue-test
+    kai.scheduler/node-pool: foo
 spec:
-  minMember: 1
-  priority: 100
-  resourceQuota:
-    gpu: 2
-    cpu: 8
-    memory: 32Gi
+  schedulerName: kai-scheduler
+  containers:
+    - name: main
+      image: ubuntu
+      command: ["bash", "-c"]
+      args: ["nvidia-smi; trap 'exit 0' TERM; sleep infinity & wait"]
+      resources:
+        limits:
+          nvidia.com/gpu: "1"
 ```
 
-### Direct Shard Targeting
-
-Pod groups can also directly target shards:
+The created pod group will have the same labels as the top owner of the pod, which will then include the node-pool label
 
 ```yaml
-apiVersion: kai.scheduler/v1
+apiVersion: scheduling.run.ai/v2alpha2
 kind: PodGroup
 metadata:
-  name: memory-intensive-job
+  annotations:
+    kai.scheduler/top-owner-metadata: |
+      name: gpu-pod
+      uid:
+      group: ""
+      version: v1
+      kind: Pod
   labels:
-    kai.scheduler/node-pool: high-memory-nodes
+    kai.scheduler/queue: foo-queue-test
+  name: pg-gpu-pod-d81e6f2c-8da7-4e61-8758-d8a2c38d2bfb
+  namespace: test
+  ownerReferences:
+    - apiVersion: v1
+      kind: Pod
+      name: gpu-pod
+      uid:
+  uid:
 spec:
   minMember: 1
-  priority: 75
-  resourceQuota:
-    cpu: 16
-    memory: 128Gi
+  priorityClassName: train
+  queue: foo-queue-test
 ```
 
+The PodGroup's label can later be updated manually to direct the job to a different shard.
+
 ## Monitoring and Observability
 
 ### Shard Status
@@ -176,4 +208,4 @@ View logs for specific shards:
 ```bash
 # View shard scheduler logs
 kubectl logs -n kai-system deployment/kai-scheduler-gpu-shard
-```
\ No newline at end of file
+```
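Regarding the manual relabeling note added at the end of the docs diff, a minimal sketch of that kubectl step — assuming the PodGroup CRD's plural resource name is podgroups under the scheduling.run.ai group shown in the example, and using the cpu-nodes shard from the docs as the new target:

```bash
# Relabel the example pod group to direct it to a different shard.
# "podgroups.scheduling.run.ai" is the assumed resource name for the CRD;
# --overwrite replaces the existing kai.scheduler/node-pool label value.
kubectl label podgroups.scheduling.run.ai \
  pg-gpu-pod-d81e6f2c-8da7-4e61-8758-d8a2c38d2bfb \
  kai.scheduler/node-pool=cpu-nodes \
  --namespace test --overwrite
```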
