Releases: NVIDIA/KAI-Scheduler
v0.10.0
What's Changed
Added
- Added parent reference to SubGroup struct in PodGroup CRD to allow a hierarchical SubGroup structure
- Added time aware scheduling capabilities
- Added a tool to run time-aware fairness simulations over multiple cycles (see Time-Aware Fairness Simulator)
- Added the option to configure the names of the webhook configuration resources
- Added an option to configure reservation pods runtime class
- Added enforcement of the nvidia runtime class for GPU pods, with the option to enforce a custom runtime class, or disable enforcement entirely
- Added a preferred podAntiAffinity term by default for all KAI system services, can be set to required instead by setting global.requireDefaultPodAffinityTerm
- Added support for service-level affinities
- Added option to specify container name and type for fraction containers
Fixed
- (OpenShift only) Fixed high CPU usage for the operator pod due to continuous reconciles
- Fixed a bug where the scheduler would not re-try updating podgroup status after failure
- Fixed a bug where ray workloads gang scheduling would ignore minReplicas if autoscaling was not set
- Fixed wrong status when prometheus operand is enabled in KAI Config
- GPU-Operator v25.10.0 support for CDI enabled environments
v0.10.0-rc6
What's Changed
- ci: Extend the amount of ci nodes of the kind cluster used in the "Validate & test" step by @davidLif in #618
- fix: scheduling shards docs and defaults by @enoodle in #622
- fix(chart): scope CRD manager permissions to specific resource names by @lokielse in #631
- fix(chart): Protect resource rendering when resources value is null by @lokielse in #630
- feat(chart): Add flexible image tag configuration with priority-based overrides by @lokielse in #628
- refactor: Fix serialization of conf object by @itsomri in #633
- fix: Prometheus operand by @itsomri in #634
- refactor(operator): prometheus operand by @enoodle in #629
- feat(binder): specify CPU and memory requests and limits for GPU reservation pod by @lokielse in #626
- fix(operator): idempotent sa image pull secrets by @enoodle in #637
- feat: Time aware configs in scheduling shard by @itsomri in #635
- feat(podgrouper): Publish the pod-grouper DefaultPluginsHub by @davidLif in #632
- fix(operator): support latest gpu operator cdi detection by @enoodle in #641
- ci: add support for custom GOPROXY and GOSUMDB in Docker environment by @lokielse in #643
- docs: fix quickstart and queue docs by @enoodle in #642
- docs: add missing default queues example by @enoodle in #648
- ci: add auto-generated comments to RBAC and CRD YAML files by @lokielse in #644
Full Changelog: v0.10.0-rc5...v0.10.0-rc6
v0.9.8
v0.9.7
What's Changed
- feat(admission): Explicitly apply 'nvidia' runtimeClass to GPU pods (v0.9) by @omeryahud in #625
Full Changelog: v0.9.6...v0.9.7
v0.10.0-rc5
What's Changed
- test: Remove explicit queue v2 storage in env tests by @itsomri in #607
- chore(ci,docs): add conventional PR title guidelines and validation + pull request template by @gshaibi in #581
- fix: queue version in docs to v2 by @enoodle in #608
- refactor: move PR title validation to a separate workflow file by @gshaibi in #610
- feat: Time aware simulator runner by @itsomri in #609
- feat: Better topology allocation for non homogeneous jobs by @davidLif in #604
- fix: Convert Ephemeral-Storage in MaxNodePoolResources from Bytes to GB by @lakshyaj02 in #566
- fix: Fix ray grouper by @itsomri in #617
- feat(admission): Explicitly apply 'nvidia' runtimeClass to GPU pods by @omeryahud in #602
- fix(chart): Fix templating indentation for service resource configuration by @omeryahud in #620
- refactor: add configurable resource names in operator by @enoodle in #613
- feat(operator): Add default podAntiAffinity and service-level Affinity support by @omeryahud in #619
New Contributors
- @lakshyaj02 made their first contribution in #566
- @omeryahud made their first contribution in #602
Full Changelog: v0.10.0-rc3...v0.10.0-rc5
v0.10.0-rc3
What's Changed
- feat: Configure kValue for time aware fairness by @itsomri in #583
- feat: add stale issue action by @SiorMeir in #592
- fix: topology plugin crash with no requested resources by @enoodle in #595
- refactor: Time aware fairness refactor by @itsomri in #600
- Topology plugin node sorting performance improvement by @gshaibi in #588
- test: Time aware fairness burst simulation by @itsomri in #598
- docs: expand topology scheduling strategy section by @gshaibi in #603
- Added multilevel topology doc by @romanbaron in #596
- Bump github.com/argoproj/argo-workflows/v3 from 3.6.4 to 3.6.12 by @dependabot[bot] in #568
- Topology domain sort based on resource ratio by @davidLif in #601
- feat: add externalURL config by @SiorMeir in #563
- Topology plugin goes into infinite loop on empty tasks list by @romanbaron in #606
Full Changelog: v0.10.0-rc2...v0.10.0-rc3
v0.10.0-rc2
v0.10.0-rc1
What's Changed
- Added PodGroup Validating webhook that is served by the PodGroup Controller by @romanbaron in #515
- Refactor topology constraint PodGroupInfo by @omer-dayan in #513
- Bugfix - TopologyConstraintInfo clone by @omer-dayan in #520
- Hierarchical subgroup structure by @romanbaron in #518
- Topology plugin implemented as SubsetNodes by @omer-dayan in #503
- Binder env tests by @itsomri in #517
- Priority-Preemptibility Separation Design by @gshaibi in #521
- Refactor topology IDE warnings by @omer-dayan in #514
- Pod-group-controller env tests by @itsomri in #524
- Topology scheduling - on require single relevant level by @omer-dayan in #527
- Queuecontroller env tests by @itsomri in #530
- add pod affinity tests by @enoodle in #532
- Topology scheduling plugin - Multi domain decision by @omer-dayan in #529
- CI E2E - Deploy image registry by @omer-dayan in #542
- Moved SubGroupInfo into a separate package by @romanbaron in #538
- Topology plugin - Filter out worse case domains by @omer-dayan in #531
- Moved TopologyConstraintInfo to a separate package by @romanbaron in #539
- Set job fit error for topology job misconfiguration by @omer-dayan in #545
- Topology consolidation test by @omer-dayan in #547
- Introducing PodSet struct for subgroups by @romanbaron in #540
- Add default queue creation and configuration by @singh1203 in #499
- Removed SetDefaultMinAvailable from PodGroupInfo by @romanbaron in #541
- feat: add TSDB PVC by @SiorMeir in #511
- support k8s 1.34 DRA by @enoodle in #533
- Priority-Preemptability Separation P0 Implementation by @gshaibi in #526
- Topology Plugin Small Refactor by @gshaibi in #553
- TAS: Normalize usage to cluster capacity in prometheus by @itsomri in #555
- E2E Flakiness fix by @gshaibi in #557
- Prepare infra for time aware env tests by @itsomri in #534
- enable scheduler deployment by operator & SchedulingShards by @enoodle in #551
- add delay in tests to allow cache wrappers to update by @enoodle in #559
- Fix SCC for OCP by @itsomri in #544
- Topology aware subGroupSet by @omer-dayan in #556
- Changed SubSetNodes signature to use SubGroupInfo instead of SubGroupSet by @romanbaron in #561
- configure webhook names by @enoodle in #564
- Topology constraint at any subgroup hierarchy level by @romanbaron in #560
- fix(deployments/kai-scheduler): respect helm values set under nodescaleadjuster.scalingPodImage by @BradenM in #572
- Fix: Preserve default SchedulingShard on Helm upgrades by @gshaibi in #573
- configurable reservation runtime class by @enoodle in #569
- Simplifying subgroup tests by @romanbaron in #567
- Scheduler logger enhancements by @itsomri in #579
- feat: set up service monitor by @SiorMeir in #552
- Renamed SubGroupOrderFn to PodSetOrderFn by @romanbaron in #578
- Extending hierarchical podgroup structure to support multiple levels o… by @romanbaron in #427
- Added SubGroupSetOrderFn by @romanbaron in #580
- Fixed bug in setSubGroups method by @romanbaron in #585
- fix: update default scaling pod image name by @avi-airis in #582
- fix: improve docs and READMEs by @SiorMeir in #525
- Topology Plugin - Domain Packing + Node Sorting by @gshaibi in #558
- Added topology docs by @romanbaron in #584
New Contributors
- @BradenM made their first contribution in #572
- @avi-airis made their first contribution in #582
Full Changelog: v0.9.3...v0.10.0-rc1