3 changes: 2 additions & 1 deletion docs/examples/jobset/jobset.md
@@ -10,7 +10,8 @@

 Install [JobSet API](https://github.com/kubernetes-sigs/jobset) in your cluster:
 ```shell
-kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/v0.5.2/manifests.yaml
+JOBSET_VERSION=v0.8.1
+kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/${JOBSET_VERSION}/manifests.yaml
 ```

 Run a jobset with workers:
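
Reviewer note: the install command in this hunk now pins the release through `JOBSET_VERSION`. A minimal sketch of the same URL construction as a reusable helper, assuming bash; the function name `jobset_manifest_url` and its loose tag check are illustrative and not part of this PR:

```shell
# Hypothetical helper (not part of the PR): print the JobSet release
# manifest URL for a version tag such as v0.8.1.
jobset_manifest_url() {
  local version="$1"
  # Loose sanity check: the tag must start with "v" and contain three
  # dot-separated parts that each begin with a digit.
  case "${version}" in
    v[0-9]*.[0-9]*.[0-9]*) ;;
    *) echo "invalid version tag: ${version}" >&2; return 1 ;;
  esac
  printf 'https://github.com/kubernetes-sigs/jobset/releases/download/%s/manifests.yaml\n' "${version}"
}
```

It could then be used as `kubectl apply --server-side -f "$(jobset_manifest_url "${JOBSET_VERSION}")"`, which keeps the version in one place when the docs are bumped again.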
9 changes: 7 additions & 2 deletions docs/examples/kai/kai.md
@@ -1,10 +1,15 @@
 ## Example of running `KAI` with `knavigator`

-### Running workflows with `MPI job`
+### Running workflows with `MPI job` and `Job`

 Install [KAI scheduler](https://github.com/NVIDIA/KAI-Scheduler/blob/main/README.md) in your cluster.

-Run an MPI job:
+Run an MPI job:
 ```shell
 ./bin/knavigator -workflow resources/workflows/kai/test-mpijob.yaml
 ```
+
+Run a multi-replica Job:
+```shell
+./bin/knavigator -workflow resources/workflows/kai/test-job.yaml
+```
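
Reviewer note: with this change the doc lists two KAI workflows. A minimal sketch, assuming bash, a `knavigator` binary at `./bin/knavigator` as in the commands above, and that `knavigator` exits non-zero when a workflow fails (an assumption, not confirmed by this PR). The function name `run_kai_workflows` is illustrative:

```shell
# Hypothetical wrapper (not part of the PR): run both KAI example workflows
# in sequence, stopping at the first one that fails.
run_kai_workflows() {
  # The command to invoke; defaults to the path used in the docs.
  local knavigator_bin="${1:-./bin/knavigator}"
  local workflow
  for workflow in test-mpijob.yaml test-job.yaml; do
    # Propagate the failing exit status to the caller.
    "${knavigator_bin}" -workflow "resources/workflows/kai/${workflow}" || return
  done
}
```

Calling `run_kai_workflows` with no argument uses the documented binary path; passing another command as the first argument makes the wrapper easy to dry-run.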
2 changes: 1 addition & 1 deletion docs/examples/kueue/kueue.md
@@ -3,7 +3,7 @@
 Install `kueue` by following these [instructions](https://kueue.sigs.k8s.io/docs/installation/):

 ```bash
-KUEUE_VERSION=v0.9.0
+KUEUE_VERSION=v0.11.4
 kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/${KUEUE_VERSION}/manifests.yaml

 kubectl apply -f charts/overrides/kueue/priority.yaml
43 changes: 43 additions & 0 deletions resources/benchmarks/gang-scheduling/workflows/config-kai.yaml
@@ -0,0 +1,43 @@
+# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: test-kai-job
+description: register, deploy and configure kai custom resources
+tasks:
+- id: register-queue
+  type: RegisterObj
+  params:
+    template: "resources/templates/kai/queue.yaml"
+- id: register
+  type: RegisterObj
+  params:
+    template: "resources/benchmarks/templates/kai/job.yaml"
+    nameFormat: "job{{._ENUM_}}"
+    podNameFormat: "{{._NAME_}}-[a-z0-9]+"
+    podCount: "{{.replicas}}"
+- id: default-queue
+  type: SubmitObj
+  params:
+    refTaskId: register-queue
+    canExist: true
+    params:
+      name: default
+- id: test-queue
+  type: SubmitObj
+  params:
+    refTaskId: register-queue
+    canExist: true
+    params:
+      name: test
+      parentQueue: default
@@ -118,7 +118,6 @@ tasks:
 - "ray.io/rayjob"
 - "ray.io/raycluster"
 - "jobset.x-k8s.io/jobset"
-- "kubeflow.org/mxjob"
 - "kubeflow.org/paddlejob"
 - "kubeflow.org/pytorchjob"
 - "kubeflow.org/tfjob"
43 changes: 43 additions & 0 deletions resources/benchmarks/scaling/workflows/config-kai.yaml
@@ -0,0 +1,43 @@
+# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: test-kai-job
+description: register, deploy and configure kai custom resources
+tasks:
+- id: register-queue
+  type: RegisterObj
+  params:
+    template: "resources/templates/kai/queue.yaml"
+- id: register
+  type: RegisterObj
+  params:
+    template: "resources/benchmarks/templates/kai/job.yaml"
+    nameFormat: "job{{._ENUM_}}"
+    podNameFormat: "{{._NAME_}}-[a-z0-9]+"
+    podCount: "{{.replicas}}"
+- id: default-queue
+  type: SubmitObj
+  params:
+    refTaskId: register-queue
+    canExist: true
+    params:
+      name: default
+- id: test-queue
+  type: SubmitObj
+  params:
+    refTaskId: register-queue
+    canExist: true
+    params:
+      name: test
+      parentQueue: default
1 change: 0 additions & 1 deletion resources/benchmarks/scaling/workflows/config-kueue.yaml
@@ -118,7 +118,6 @@ tasks:
 - "ray.io/rayjob"
 - "ray.io/raycluster"
 - "jobset.x-k8s.io/jobset"
-- "kubeflow.org/mxjob"
 - "kubeflow.org/paddlejob"
 - "kubeflow.org/pytorchjob"
 - "kubeflow.org/tfjob"
45 changes: 45 additions & 0 deletions resources/benchmarks/templates/kai/job.yaml
@@ -0,0 +1,45 @@
+# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: "{{._NAME_}}"
+  namespace: "default"
+spec:
+  completions: {{.replicas}}
+  parallelism: {{.replicas}}
+  template:
+    metadata:
+      labels:
+        runai/queue: "test"
+      annotations:
+        pod-complete.stage.kwok.x-k8s.io/delay: {{.ttl}}
+        pod-complete.stage.kwok.x-k8s.io/jitter-delay: {{.ttl}}
+    spec:
+      schedulerName: kai-scheduler
+      containers:
+      - name: test
+        image: busybox
+        imagePullPolicy: IfNotPresent
+        resources:
+          limits:
+            cpu: 100m
+            memory: 250M
+            nvidia.com/gpu: "8"
+          requests:
+            cpu: 100m
+            memory: 250M
+            nvidia.com/gpu: "8"
+      restartPolicy: Never
35 changes: 35 additions & 0 deletions resources/benchmarks/templates/kai/queue.yaml
@@ -0,0 +1,35 @@
+# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: scheduling.run.ai/v2
+kind: Queue
+metadata:
+  name: "{{.name}}"
+spec:
+  {{- if .parentQueue }}
+  parentQueue: "{{.parentQueue}}"
+  {{- end }}
+  resources:
+    cpu:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+    gpu:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
+    memory:
+      quota: -1
+      limit: -1
+      overQuotaWeight: 1
45 changes: 45 additions & 0 deletions resources/templates/kai/job.yaml
@@ -0,0 +1,45 @@
+# Copyright (c) 2024-2025, NVIDIA CORPORATION. All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: "{{._NAME_}}"
+  namespace: "{{.namespace}}"
+spec:
+  completions: {{.replicas}}
+  parallelism: {{.replicas}}
+  template:
+    metadata:
+      labels:
+        runai/queue: "{{.queue}}"
+      annotations:
+        pod-complete.stage.kwok.x-k8s.io/delay: {{.ttl}}
+        pod-complete.stage.kwok.x-k8s.io/jitter-delay: {{.ttl}}
+    spec:
+      schedulerName: kai-scheduler
+      containers:
+      - name: test
+        image: {{.image}}
+        imagePullPolicy: IfNotPresent
+        resources:
+          limits:
+            cpu: "{{.cpu}}"
+            memory: {{.memory}}
+            nvidia.com/gpu: "{{.gpu}}"
+          requests:
+            cpu: "{{.cpu}}"
+            memory: {{.memory}}
+            nvidia.com/gpu: "{{.gpu}}"
+      restartPolicy: Never
14 changes: 0 additions & 14 deletions resources/templates/kueue/cluster-queue.yaml
@@ -1,17 +1,3 @@
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
 apiVersion: kueue.x-k8s.io/v1beta1
 kind: ClusterQueue
 metadata:
14 changes: 0 additions & 14 deletions resources/templates/kueue/job.yaml
@@ -1,17 +1,3 @@
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
 apiVersion: batch/v1
 kind: Job
 metadata:
14 changes: 0 additions & 14 deletions resources/templates/kueue/local-queue.yaml
@@ -1,17 +1,3 @@
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
 apiVersion: kueue.x-k8s.io/v1beta1
 kind: LocalQueue
 metadata:
14 changes: 0 additions & 14 deletions resources/templates/kueue/resource-flavor.yaml
@@ -1,17 +1,3 @@
-# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
 apiVersion: kueue.x-k8s.io/v1beta1
 kind: ResourceFlavor
 metadata: