Commit c8a8a5b

Merge branch 'main' into patch-1
2 parents 0c64b42 + 489b9f0

29 files changed: +1052 −46 lines


CHANGELOG.md

Lines changed: 4 additions & 1 deletion
```diff
@@ -6,6 +6,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 
 ## [Unreleased]
 
+## [v0.10.0] - 2025-11-18
+
 ### Added
 - Added parent reference to SubGroup struct in PodGroup CRD to create a hierarchical SubGroup structure
 - Added the option to configure the names of the webhook configuration resources.
@@ -14,7 +16,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 - Added enforcement of the `nvidia` runtime class for GPU pods, with the option to enforce a custom runtime class, or disable enforcement entirely.
 - Added a preferred podAntiAffinity term by default for all services, can be set to required instead by setting `global.requireDefaultPodAffinityTerm`
 - Added support for service-level affinities
-- Added time aware scheduling configurations in scheduling shard
+- Added [time aware scheduling](docs/timeaware/README.md) capabilities
+- Added option to specify container name and type for fraction containers
 
 ### Fixed
 - (Openshift only) - High CPU usage for the operator pod due to continuous reconciles
```

cmd/time-aware-simulator/examples/plot_simple.py

Lines changed: 8 additions & 1 deletion
```diff
@@ -9,6 +9,8 @@
 parser = argparse.ArgumentParser(description='Plot simulation results from CSV file')
 parser.add_argument('input', nargs='?', default='simulation_results.csv',
                     help='Path to the CSV file (default: simulation_results.csv)')
+parser.add_argument('--output', '-o', type=str, default=None,
+                    help='Save plot to PNG file instead of displaying it')
 args = parser.parse_args()
 
 df = pd.read_csv(args.input)
@@ -38,5 +40,10 @@
 ax2.grid(True, alpha=0.3)
 
 plt.tight_layout()
-plt.show()
+
+if args.output:
+    plt.savefig(args.output, dpi=300, bbox_inches='tight')
+    print(f"Plot saved to {args.output}")
+else:
+    plt.show()
 
```
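With this change, passing `--output` saves the figure to disk at 300 dpi instead of opening an interactive window. A typical invocation, using the script's own default CSV name:

```
python plot_simple.py simulation_results.csv --output plot.png
```

Omitting `--output` keeps the previous interactive `plt.show()` behavior.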

deployments/kai-scheduler/templates/rbac/operator.yaml

Lines changed: 22 additions & 4 deletions
```diff
@@ -43,24 +43,42 @@ rules:
   - validatingwebhookconfigurations
   verbs:
   - create
-  - delete
   - get
   - list
+  - watch
+- apiGroups:
+  - admissionregistration.k8s.io
+  resourceNames:
+  - kai-podgroup-validation-v2alpha2
+  - kai-queue-validation-v2
+  - mutating-kai-admission
+  - validating-kai-admission
+  resources:
+  - mutatingwebhookconfigurations
+  - validatingwebhookconfigurations
+  verbs:
+  - delete
   - patch
   - update
-  - watch
 - apiGroups:
   - apiextensions.k8s.io
   resources:
   - customresourcedefinitions
   verbs:
   - create
-  - delete
   - get
   - list
+  - watch
+- apiGroups:
+  - apiextensions.k8s.io
+  resourceNames:
+  - queues.scheduling.run.ai
+  resources:
+  - customresourcedefinitions
+  verbs:
+  - delete
   - patch
   - update
-  - watch
 - apiGroups:
   - apps
   resources:
```
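The net effect is least-privilege scoping: `create`/`get`/`list`/`watch` stay cluster-wide, while `delete`/`patch`/`update` are now limited to the four named KAI webhook configurations and the `queues.scheduling.run.ai` CRD. One way to sanity-check the resulting permissions (the service account and namespace below are illustrative; they are not shown in this diff):

```
kubectl auth can-i delete validatingwebhookconfigurations/validating-kai-admission \
  --as=system:serviceaccount:kai-scheduler:kai-operator   # expected: yes
kubectl auth can-i delete validatingwebhookconfigurations \
  --as=system:serviceaccount:kai-scheduler:kai-operator   # expected: no (delete is no longer unscoped)
```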

docs/batch/batch-job.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -11,7 +11,7 @@ spec:
   template:
     metadata:
       labels:
-        kai.scheduler/queue: test
+        kai.scheduler/queue: default-queue
     spec:
       schedulerName: kai-scheduler
       restartPolicy: OnFailure
```

docs/batch/pytorch-job.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,7 +6,7 @@ kind: "PyTorchJob"
 metadata:
   name: "pytorch-dist-mnist-nccl"
   labels:
-    kai.scheduler/queue: test
+    kai.scheduler/queue: default-queue
 spec:
   pytorchReplicaSpecs:
     Master:
```

docs/developer/designs/time-aware-fairness/time-aware-fairness.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -116,7 +116,7 @@ Where:
 - **$C$** is the remaining capacity (max amount to give in current round)
 - **$P'_i$** is the normalized portion for queue i, defined as:
 
-$$P_i = \max{\{W'_i - k \cdot (W'_i - U'_i), 0\}}$$
+$$P_i = \max{\{W'_i + k \cdot (W'_i - U'_i), 0\}}$$
 
 $$P'_i = \frac{P_i}{\sum{P}}$$
 
```
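To see why the sign flip matters, take a queue with normalized weight $W'_i = 0.5$, normalized historical usage $U'_i = 0.3$, and $k = 0.5$ (illustrative numbers, not from the design doc). A queue that has used less than its share should be boosted in the current round:

$$P_i = \max{\{0.5 + 0.5 \cdot (0.5 - 0.3), 0\}} = 0.6$$

The old minus sign would have yielded $0.4$, penalizing exactly the queues the mechanism is meant to compensate.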

docs/dra/gpu-imex-pod.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -16,7 +16,7 @@ kind: Pod
 metadata:
   name: gpu-imex-pod
   labels:
-    kai.scheduler/queue: test
+    kai.scheduler/queue: default-queue
 spec:
   schedulerName: kai-scheduler
   containers:
```

docs/elastic/pytorch-elastic.yaml

Lines changed: 1 addition & 1 deletion
```diff
@@ -6,7 +6,7 @@ kind: PyTorchJob
 metadata:
   name: elastic-example-imagenet
   labels:
-    kai.scheduler/queue: test
+    kai.scheduler/queue: default-queue
 spec:
   elasticPolicy:
     rdzvBackend: c10d
```

docs/gpu-sharing/README.md

Lines changed: 28 additions & 0 deletions
````diff
@@ -42,3 +42,31 @@ kubectl apply -f gpu-memory.yaml
 In the gpu-memory.yaml file, the pod includes a `gpu-memory` annotation with a value of 2000 (in MiB), meaning:
 * The pod is allowed to consume up to 2000 MiB of a GPU device memory
 * The remaining GPU device memory can be shared with other pods in the cluster
+
+### GPU Fraction with Non-Default Container
+By default, GPU fraction allocation is applied to the first container (index 0) in the pod. However, you can specify a different container to receive the GPU allocation using the `gpu-fraction-container-name` annotation.
+
+#### Specific Container
+To allocate GPU fraction to a specific container in a multi-container pod:
+```
+kubectl apply -f gpu-sharing-non-default-container.yaml
+```
+
+In the gpu-sharing-non-default-container.yaml file, the pod includes:
+* `gpu-fraction: "0.5"` - Requests half of a GPU device memory
+* `gpu-fraction-container-name: "gpu-workload"` - Specifies that the container named "gpu-workload" should receive the GPU allocation instead of the default first container
+
+This is useful for pods with sidecar containers where only one specific container needs GPU access.
+
+#### Init Container
+To allocate GPU fraction to an init container:
+```
+kubectl apply -f gpu-sharing-init-container.yaml
+```
+
+In the gpu-sharing-init-container.yaml file, the pod includes:
+* `gpu-fraction: "0.5"` - Requests half of a GPU device memory
+* `gpu-fraction-container-name: "gpu-init"` - Specifies the init container name. If not defined, this defaults to the first container.
+* `gpu-fraction-container-type: "InitContainer"` - Indicates the container is an init container
+
+This is useful for workloads that need GPU access during initialization (e.g., model loading, dataset preprocessing) before the main application container starts.
````
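The referenced gpu-sharing-non-default-container.yaml is not included in this commit view; the following is a minimal sketch of what such a pod could look like, assuming only the annotations documented above plus the scheduler name, image, and `default-queue` label used throughout these examples:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-sharing-non-default-container
  labels:
    kai.scheduler/queue: default-queue
  annotations:
    gpu-fraction: "0.5"                          # half of a GPU device memory
    gpu-fraction-container-name: "gpu-workload"  # route the GPU to this container, not index 0
spec:
  schedulerName: kai-scheduler
  containers:
  - name: sidecar                # needs no GPU access
    image: busybox
    args: ["sleep", "infinity"]
  - name: gpu-workload           # receives the fractional GPU
    image: nvidia/cuda:13.0.2-base-ubi8
    command: ["nvidia-smi"]
    args: ["-L"]
```

For the init-container variant, the same sketch would add `gpu-fraction-container-type: "InitContainer"` and name an entry under `initContainers` instead.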

docs/gpu-sharing/gpu-memory.yaml

Lines changed: 5 additions & 4 deletions
```diff
@@ -6,12 +6,13 @@ kind: Pod
 metadata:
   name: gpu-sharing
   labels:
-    kai.scheduler/queue: test
+    kai.scheduler/queue: default-queue
   annotations:
     gpu-memory: "2000" # in MiB
 spec:
   schedulerName: kai-scheduler
   containers:
-  - name: ubuntu
-    image: ubuntu
-    args: ["sleep", "infinity"]
+  - name: gpu-workload
+    image: nvidia/cuda:13.0.2-base-ubi8
+    command: ["nvidia-smi"]
+    args: ["-L"]
```
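`nvidia-smi -L` lists the GPU devices visible inside the container and exits, so the updated example doubles as a quick self-check that the pod actually sees a GPU; the previous `sleep infinity` ubuntu container exercised nothing GPU-related.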
