Commit 6282e04

feat: Specify fraction container name (#654)
* Allow user to specify container name and type for fractions
1 parent 2492508 commit 6282e04

20 files changed: +784 -23 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -15,6 +15,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 - Added a preferred podAntiAffinity term by default for all services, can be set to required instead by setting `global.requireDefaultPodAffinityTerm`
 - Added support for service-level affinities
 - Added [time aware scheduling](docs/timeaware/README.md) capabilities
+- Added option to specify container name and type for fraction containers
 
 ### Fixed
 - (Openshift only) - High CPU usage for the operator pod due to continues reconciles

docs/batch/batch-job.yaml

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ spec:
   template:
     metadata:
       labels:
-        kai.scheduler/queue: test
+        kai.scheduler/queue: default-queue
     spec:
       schedulerName: kai-scheduler
       restartPolicy: OnFailure

docs/batch/pytorch-job.yaml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ kind: "PyTorchJob"
 metadata:
   name: "pytorch-dist-mnist-nccl"
   labels:
-    kai.scheduler/queue: test
+    kai.scheduler/queue: default-queue
 spec:
   pytorchReplicaSpecs:
     Master:

docs/dra/gpu-imex-pod.yaml

Lines changed: 1 addition & 1 deletion
@@ -16,7 +16,7 @@ kind: Pod
 metadata:
   name: gpu-imex-pod
   labels:
-    kai.scheduler/queue: test
+    kai.scheduler/queue: default-queue
 spec:
   schedulerName: kai-scheduler
   containers:

docs/elastic/pytorch-elastic.yaml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ kind: PyTorchJob
 metadata:
   name: elastic-example-imagenet
   labels:
-    kai.scheduler/queue: test
+    kai.scheduler/queue: default-queue
 spec:
   elasticPolicy:
     rdzvBackend: c10d

docs/gpu-sharing/README.md

Lines changed: 28 additions & 0 deletions
@@ -42,3 +42,31 @@ kubectl apply -f gpu-memory.yaml
 In the gpu-memory.yaml file, the pod includes a `gpu-memory` annotation with a value of 2000 (in Mib), meaning:
 * The pod is allowed to consume up to 2000 Mib of a GPU device memory
 * The remaining GPU device memory can be shared with other pods in the cluster
+
+### GPU Fraction with Non-Default Container
+By default, GPU fraction allocation is applied to the first container (index 0) in the pod. However, you can specify a different container to receive the GPU allocation using the `gpu-fraction-container-name` annotation.
+
+#### Specific Container
+To allocate GPU fraction to a specific container in a multi-container pod:
+```
+kubectl apply -f gpu-sharing-non-default-container.yaml
+```
+
+In the gpu-sharing-non-default-container.yaml file, the pod includes:
+* `gpu-fraction: "0.5"` - Requests half of a GPU device memory
+* `gpu-fraction-container-name: "gpu-workload"` - Specifies that the container named "gpu-workload" should receive the GPU allocation instead of the default first container
+
+This is useful for pods with sidecar containers where only one specific container needs GPU access.
+
+#### Init Container
+To allocate GPU fraction to an init container:
+```
+kubectl apply -f gpu-sharing-init-container.yaml
+```
+
+In the gpu-sharing-init-container.yaml file, the pod includes:
+* `gpu-fraction: "0.5"` - Requests half of a GPU device memory
+* `gpu-fraction-container-name: "gpu-init"` - Specifies the init container name. If not defined, will default to the first container.
+* `gpu-fraction-container-type: "InitContainer"` - Indicates the container is an init container
+
+This is useful for workloads that need GPU access during initialization (e.g., model loading, dataset preprocessing) before the main application container starts.
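
The selection rule described in the README additions above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the scheduler's actual implementation: the function name `resolve_fraction_container` is hypothetical, the pod is modelled as a plain dictionary, and two details are assumptions that the README does not state (that the "first container" default applies within the list chosen by the type annotation, and that an unknown container name is treated as an error).

```python
from typing import Any

# The annotation keys and the "InitContainer" value come from the README text.
# Everything else in this sketch is an illustrative assumption.
NAME_KEY = "gpu-fraction-container-name"
TYPE_KEY = "gpu-fraction-container-type"


def resolve_fraction_container(pod: dict[str, Any]) -> tuple[str, str]:
    """Return (spec field, container name) of the container that gets the GPU fraction."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    spec = pod.get("spec", {})

    # "InitContainer" selects spec.initContainers; otherwise regular containers are used.
    is_init = annotations.get(TYPE_KEY) == "InitContainer"
    field = "initContainers" if is_init else "containers"
    candidates = spec.get(field, [])
    if not candidates:
        raise ValueError(f"pod has no entries under spec.{field}")

    wanted = annotations.get(NAME_KEY)
    if wanted is None:
        # No name given: fall back to index 0 (assumed to mean index 0 of the chosen list).
        return field, candidates[0]["name"]

    for container in candidates:
        if container["name"] == wanted:
            return field, wanted
    # Assumption: an unknown name is an error; the README does not say what happens.
    raise ValueError(f"no container named {wanted!r} under spec.{field}")


# Pods mirroring the two example manifests added in this commit (trimmed to the relevant fields).
non_default_pod = {
    "metadata": {"annotations": {"gpu-fraction": "0.5", NAME_KEY: "gpu-workload"}},
    "spec": {"containers": [{"name": "sidecar"}, {"name": "gpu-workload"}, {"name": "another-sidecar"}]},
}
init_pod = {
    "metadata": {"annotations": {"gpu-fraction": "0.5", NAME_KEY: "gpu-init", TYPE_KEY: "InitContainer"}},
    "spec": {"initContainers": [{"name": "gpu-init"}], "containers": [{"name": "main-app"}]},
}

print(resolve_fraction_container(non_default_pod))
print(resolve_fraction_container(init_pod))
```

Returning the spec field alongside the name keeps the two container lists distinct, since Kubernetes stores init containers and regular containers under separate fields of the pod spec.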

docs/gpu-sharing/gpu-memory.yaml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ kind: Pod
 metadata:
   name: gpu-sharing
   labels:
-    kai.scheduler/queue: test
+    kai.scheduler/queue: default-queue
   annotations:
     gpu-memory: "2000" # in Mib
 spec:

docs/gpu-sharing/gpu-sharing-init-container.yaml

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
+# Copyright 2025 NVIDIA CORPORATION
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-sharing-init-container
+  labels:
+    kai.scheduler/queue: default-queue
+  annotations:
+    gpu-fraction: "0.5"
+    # Specify an init container to receive the GPU fraction allocation
+    gpu-fraction-container-name: "gpu-init"
+    gpu-fraction-container-type: "InitContainer"
+spec:
+  schedulerName: kai-scheduler
+  initContainers:
+    - name: gpu-init
+      image: nvidia/cuda:11.0-base
+      command: ["nvidia-smi"]
+      args: ["-L"]
+  containers:
+    - name: main-app
+      image: ubuntu
+      args: ["sleep", "infinity"]
+

docs/gpu-sharing/gpu-sharing-non-default-container.yaml

Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+# Copyright 2025 NVIDIA CORPORATION
+# SPDX-License-Identifier: Apache-2.0
+
+apiVersion: v1
+kind: Pod
+metadata:
+  name: gpu-sharing-non-default
+  labels:
+    kai.scheduler/queue: default-queue
+  annotations:
+    gpu-fraction: "0.5"
+    # Specify which container should receive the GPU fraction allocation
+    # By default, the first container (index 0) receives the GPU allocation
+    # Use this annotation to specify a different container by name
+    gpu-fraction-container-name: "gpu-workload"
+spec:
+  schedulerName: kai-scheduler
+  containers:
+    - name: sidecar
+      image: busybox
+      args: ["sleep", "infinity"]
+    - name: gpu-workload
+      image: nvidia/cuda:11.0-base
+      command: ["nvidia-smi"]
+      args: ["-L"]
+    - name: another-sidecar
+      image: busybox
+      args: ["sleep", "infinity"]
+

docs/gpu-sharing/gpu-sharing.yaml

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ kind: Pod
 metadata:
   name: gpu-sharing
   labels:
-    kai.scheduler/queue: test
+    kai.scheduler/queue: default-queue
   annotations:
     gpu-fraction: "0.5"
 spec:
