Releases · NVIDIA/KAI-Scheduler

18 Nov 11:26

itsomri

v0.10.0

ad3ac0e

v0.10.0 Latest


What's Changed

Added

Added a parent reference to the SubGroup struct in the PodGroup CRD to allow a hierarchical SubGroup structure
Added time-aware scheduling capabilities
Added a tool to run time-aware fairness simulations over multiple cycles (see Time-Aware Fairness Simulator)
Added the option to configure the names of the webhook configuration resources
Added an option to configure the runtime class of reservation pods
Added enforcement of the nvidia runtime class for GPU pods, with the option to enforce a custom runtime class or to disable enforcement entirely
Added a preferred podAntiAffinity term by default for all KAI system services; it can be made required instead by setting global.requireDefaultPodAffinityTerm
Added support for service-level affinities
Added an option to specify the container name and type for fraction containers
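
The difference between the default preferred term and the required variant can be illustrated with the standard Kubernetes affinity fields. The sketch below is an illustration only, not the chart's actual template: the function name, the "app" label key, the weight of 100, and the hostname topology key are all assumptions; only global.requireDefaultPodAffinityTerm comes from the notes above.

```python
def default_anti_affinity(require_term: bool, app_label: str) -> dict:
    """Build a podAntiAffinity stanza that keeps replicas on different nodes.

    require_term mirrors the idea behind global.requireDefaultPodAffinityTerm.
    The label key, weight, and topology key are hypothetical choices.
    """
    term = {
        "labelSelector": {"matchLabels": {"app": app_label}},
        "topologyKey": "kubernetes.io/hostname",
    }
    if require_term:
        # Hard rule: a pod stays Pending if no node satisfies the term.
        return {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [term]
            }
        }
    # Soft rule: the scheduler prefers spreading but may still co-locate pods.
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "podAffinityTerm": term}
            ]
        }
    }
```

With the preferred form, a single-node cluster can still run every replica; with the required form, extra replicas wait until another node is available.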

Fixed

(OpenShift only) Fixed high CPU usage in the operator pod caused by continuous reconciles
Fixed a bug where the scheduler would not retry updating the PodGroup status after a failure
Fixed a bug where gang scheduling of Ray workloads would ignore minReplicas if autoscaling was not set
Fixed wrong status when the Prometheus operand is enabled in the KAI Config
GPU-Operator v25.10.0 support for CDI-enabled environments
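
One of the fixes above concerns retrying a status update after it fails. As a generic illustration of that pattern, a retry loop with exponential backoff can be sketched as follows. This is not KAI Scheduler's actual code; the function name, its parameters, and the backoff policy are hypothetical.

```python
import time


def retry_with_backoff(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are exhausted.

    The delay doubles after every failure (base_delay, 2*base_delay, ...).
    The last failure is re-raised so the caller can still see it.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # no attempts left: surface the error
            sleep(base_delay * 2 ** attempt)
```

The sleep parameter is injectable so that the loop can be exercised in tests without waiting.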


16 Nov 12:12

davidLif

v0.10.0-rc6

2297956

v0.10.0-rc6 Pre-release


What's Changed

ci: Extend the amount of ci nodes of the kind cluster used in the "Validate & test" step by @davidLif in #618
fix: scheduling shards docs and defaults by @enoodle in #622
fix(chart): scope CRD manager permissions to specific resource names by @lokielse in #631
fix(chart): Protect resource rendering when resources value is null by @lokielse in #630
feat(chart): Add flexible image tag configuration with priority-based overrides by @lokielse in #628
refactor: Fix serialization of conf object by @itsomri in #633
fix: Prometheus operand by @itsomri in #634
refactor(operator): prometheus operand by @enoodle in #629
feat(binder): specify CPU and memory requests and limits for GPU reservation pod by @lokielse in #626
fix(operator): idempotent sa image pull secrets by @enoodle in #637
feat: Time aware configs in scheduling shard by @itsomri in #635
feat(podgrouper): Publish the pod-grouper DefaultPluginsHub by @davidLif in #632
fix(operator): support latest gpu operator cdi detection by @enoodle in #641
ci: add support for custom GOPROXY and GOSUMDB in Docker environment by @lokielse in #643
docs: fix quickstart and queue docs by @enoodle in #642
docs: add missing default queues example by @enoodle in #648
ci: add auto-generated comments to RBAC and CRD YAML files by @lokielse in #644

New Contributors

@lokielse made their first contribution in #631

Full Changelog: v0.10.0-rc5...v0.10.0-rc6

Contributors

lokielse, enoodle, and 2 other contributors


14 Nov 11:35

enoodle

v0.9.8

96fa84d

v0.9.8

What's Changed

fix: openshift operator sa reconcile by @enoodle in #646
fix: 0.9 - support gpu operator 25.10.0 better by @enoodle in #647

Full Changelog: v0.9.7...v0.9.8

Contributors

enoodle


10 Nov 13:02

enoodle

v0.9.7

6ffb425

v0.9.7

What's Changed

feat(admission): Explicitly apply 'nvidia' runtimeClass to GPU pods (v0.9) by @omeryahud in #625

Full Changelog: v0.9.6...v0.9.7

Contributors

omeryahud


05 Nov 11:09

enoodle

v0.10.0-rc5

e4c9d64

v0.10.0-rc5 Pre-release


What's Changed

test: Remove explicit queue v2 storage in env tests by @itsomri in #607
chore(ci,docs): add conventional PR title guidelines and validation + pull request template by @gshaibi in #581
fix: queue version in docs to v2 by @enoodle in #608
refactor: move PR title validation to a separate workflow file by @gshaibi in #610
feat: Time aware simulator runner by @itsomri in #609
feat: Better topology allocation for non homogeneous jobs by @davidLif in #604
fix: Convert Ephemeral-Storage in MaxNodePoolResources from Bytes to GB by @lakshyaj02 in #566
fix: Fix ray grouper by @itsomri in #617
feat(admission): Explicitly apply 'nvidia' runtimeClass to GPU pods by @omeryahud in #602
fix(chart): Fix templating indentation for service resource configuration by @omeryahud in #620
refactor: add configurable resource names in operator by @enoodle in #613
feat(operator): Add default podAntiAffinity and service-level Affinity support by @omeryahud in #619

New Contributors

@lakshyaj02 made their first contribution in #566
@omeryahud made their first contribution in #602

Full Changelog: v0.10.0-rc3...v0.10.0-rc5

Contributors

enoodle, omeryahud, and 4 other contributors


29 Oct 08:04

romanbaron

v0.10.0-rc3

d7fde66

v0.10.0-rc3 Pre-release


What's Changed

feat: Configure kValue for time aware fairness by @itsomri in #583
feat: add stale issue action by @SiorMeir in #592
fix: topology plugin crash with no requested resources by @enoodle in #595
refactor: Time aware fairness refactor by @itsomri in #600
Topology plugin node sorting performance improvement by @gshaibi in #588
test: Time aware fairness burst simulation by @itsomri in #598
docs: expand topology scheduling strategy section by @gshaibi in #603
Added multilevel topology doc by @romanbaron in #596
Bump github.com/argoproj/argo-workflows/v3 from 3.6.4 to 3.6.12 by @dependabot[bot] in #568
Topology domain sort based on resource ratio by @davidLif in #601
feat: add externalURL config by @SiorMeir in #563
Topology plugin goes into infinite loop on empty tasks list by @romanbaron in #606

Full Changelog: v0.10.0-rc2...v0.10.0-rc3

Contributors

enoodle, davidLif, and 5 other contributors


22 Oct 21:31

enoodle

v0.10.0-rc2

39b7ce2

v0.10.0-rc2 Pre-release


What's Changed

always install default scheduling shard by @enoodle in #589
Time aware fairness simulation env tests by @itsomri in #554
fix: scheduler dra feature gate auto detect by @enoodle in #591

Full Changelog: v0.10.0-rc1...v0.10.0-rc2

Contributors

enoodle and itsomri


21 Oct 17:43

enoodle

v0.10.0-rc1

96b4d22

v0.10.0-rc1 Pre-release


What's Changed

Added PodGroup Validating webhook that is served by the PodGroup Controller by @romanbaron in #515
Refactor topology constraint PodGroupInfo by @omer-dayan in #513
Bugfix - TopologyConstraintInfo clone by @omer-dayan in #520
Hierarchical subgroup structure by @romanbaron in #518
Topology plugin implemented as SubsetNodes by @omer-dayan in #503
Binder env tests by @itsomri in #517
Priority-Preemptibility Separation Design by @gshaibi in #521
Refactor topology IDE warnings by @omer-dayan in #514
Pod-group-controller env tests by @itsomri in #524
Topology scheduling - on require single relevant level by @omer-dayan in #527
Queuecontroller env tests by @itsomri in #530
add pod affinity tests by @enoodle in #532
Topology scheduling plugin - Multi domain decision by @omer-dayan in #529
CI E2E - Deploy image registry by @omer-dayan in #542
Moved SubGroupInfo into a separate package by @romanbaron in #538
Topology plugin - Filter out worse case domains by @omer-dayan in #531
Moved TopologyConstraintInfo to a separate package by @romanbaron in #539
Set job fit error for topology job misconfiguration by @omer-dayan in #545
Topology consolidation test by @omer-dayan in #547
Introducing PodSet struct for subgroups by @romanbaron in #540
Add default queue creation and configuration by @singh1203 in #499
Removed SetDefaultMinAvailable from PodGroupInfo by @romanbaron in #541
feat: add TSDB PVC by @SiorMeir in #511
support k8s 1.34 DRA by @enoodle in #533
Priority-Preemptibility Separation P0 Implementation by @gshaibi in #526
Topology Plugin Small Refactor by @gshaibi in #553
TAS: Normalize usage to cluster capacity in prometheus by @itsomri in #555
E2E Flakiness fix by @gshaibi in #557
Prepare infra for time aware env tests by @itsomri in #534
enable scheduler deployment by operator & SchedulingShards by @enoodle in #551
add delay in tests to allow cache wrappers to update by @enoodle in #559
Fix SCC for OCP by @itsomri in #544
Topology aware subGroupSet by @omer-dayan in #556
Changed SubSetNodes signature to use SubGroupInfo instead of SubGroupSet by @romanbaron in #561
configure webhook names by @enoodle in #564
Topology constraint at any subgroup hierarchy level by @romanbaron in #560
fix(deployments/kai-scheduler): respect helm values set under nodescaleadjuster.scalingPodImage by @BradenM in #572
Fix: Preserve default SchedulingShard on Helm upgrades by @gshaibi in #573
configurable reservation runtime class by @enoodle in #569
Simplifying subgroup tests by @romanbaron in #567
Scheduler logger enhancements by @itsomri in #579
feat: set up service monitor by @SiorMeir in #552
Renamed SubGroupOrderFn to PodSetOrderFn by @romanbaron in #578
Extending hierarchical podgroup structure to support multiple levels o… by @romanbaron in #427
Added SubGroupSetOrderFn by @romanbaron in #580
Fixed bug in setSubGroups method by @romanbaron in #585
fix: update default scaling pod image name by @avi-airis in #582
fix: improve docs and READMEs by @SiorMeir in #525
Topology Plugin - Domain Packing + Node Sorting by @gshaibi in #558
Added topology docs by @romanbaron in #584

New Contributors

@BradenM made their first contribution in #572
@avi-airis made their first contribution in #582

Full Changelog: v0.9.3...v0.10.0-rc1

Contributors

enoodle, BradenM, and 7 other contributors


15 Oct 19:26

enoodle

v0.9.6

57b7c03

v0.9.6

What's Changed

configurable reservation runtime class by @enoodle in #576

Full Changelog: v0.9.5...v0.9.6

Contributors

enoodle

Assets 3

09 Oct 10:44

itsomri

v0.9.5

df48505

v0.9.5

What's Changed

Fix SCC for OCP v0.9 by @itsomri in #562

Full Changelog: v0.9.4...v0.9.5

Contributors

itsomri

