Releases: NVIDIA/KAI-Scheduler

v0.10.0

18 Nov 11:26
ad3ac0e

What's Changed

Added

  • Added parent reference to SubGroup struct in PodGroup CRD to allow a hierarchical SubGroup structure
  • Added time-aware scheduling capabilities
  • Added a tool to run time-aware fairness simulations over multiple cycles (see Time-Aware Fairness Simulator)
  • Added the option to configure the names of the webhook configuration resources
  • Added an option to configure the runtime class of reservation pods
  • Added enforcement of the nvidia runtime class for GPU pods, with options to enforce a custom runtime class instead or to disable enforcement entirely
  • Added a default preferred podAntiAffinity term for all KAI system services; it can be made required instead by setting global.requireDefaultPodAffinityTerm
  • Added support for service-level affinities
  • Added an option to specify the container name and type for fraction containers
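
To illustrate what a parent reference on a SubGroup makes possible, here is a minimal Go sketch that walks parent links to recover a sub-group's position in the hierarchy. The SubGroup type, its Name and Parent fields, and the PathTo helper are hypothetical and simplified for illustration; they are not the actual PodGroup CRD schema or KAI Scheduler code.

```go
package main

import (
	"fmt"
	"strings"
)

// SubGroup is a hypothetical, simplified stand-in for a PodGroup sub-group.
// The field names are assumptions, not the real CRD schema.
type SubGroup struct {
	Name   string
	Parent string // name of the parent SubGroup; empty for a root
}

// PathTo returns the chain of names from the root down to the named SubGroup.
// It returns an error if a name is unknown or the parent links form a cycle.
func PathTo(groups []SubGroup, name string) ([]string, error) {
	byName := make(map[string]SubGroup, len(groups))
	for _, g := range groups {
		byName[g.Name] = g
	}
	var path []string
	seen := make(map[string]bool)
	for cur := name; cur != ""; {
		g, ok := byName[cur]
		if !ok {
			return nil, fmt.Errorf("unknown subgroup %q", cur)
		}
		if seen[cur] {
			return nil, fmt.Errorf("cycle detected at subgroup %q", cur)
		}
		seen[cur] = true
		// Prepend so the final slice reads root-first.
		path = append([]string{cur}, path...)
		cur = g.Parent
	}
	return path, nil
}

func main() {
	// Illustrative three-level hierarchy; the names are made up.
	groups := []SubGroup{
		{Name: "workers"},
		{Name: "decode", Parent: "workers"},
		{Name: "decode-leaders", Parent: "decode"},
	}
	path, err := PathTo(groups, "decode-leaders")
	if err != nil {
		panic(err)
	}
	fmt.Println(strings.Join(path, "/"))
}
```

Storing only a parent name keeps the list of sub-groups flat while still encoding a tree, which is why the sketch checks for unknown parents and cycles before trusting the links.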

Fixed

  • (OpenShift only) Fixed high CPU usage in the operator pod caused by continuous reconciles
  • Fixed a bug where the scheduler would not re-try updating podgroup status after failure
  • Fixed a bug where gang scheduling of Ray workloads would ignore minReplicas if autoscaling was not set
  • Fixed an incorrect status being reported when the Prometheus operand is enabled in KAI Config
  • Added support for GPU Operator v25.10.0 in CDI-enabled environments

v0.10.0-rc6

16 Nov 12:12
2297956

Pre-release

What's Changed

  • ci: Extend the amount of ci nodes of the kind cluster used in the "Validate & test" step by @davidLif in #618
  • fix: scheduling shards docs and defaults by @enoodle in #622
  • fix(chart): scope CRD manager permissions to specific resource names by @lokielse in #631
  • fix(chart): Protect resource rendering when resources value is null by @lokielse in #630
  • feat(chart): Add flexible image tag configuration with priority-based overrides by @lokielse in #628
  • refactor: Fix serialization of conf object by @itsomri in #633
  • fix: Prometheus operand by @itsomri in #634
  • refactor(operator): prometheus operand by @enoodle in #629
  • feat(binder): specify CPU and memory requests and limits for GPU reservation pod by @lokielse in #626
  • fix(operator): idempotent sa image pull secrets by @enoodle in #637
  • feat: Time aware configs in scheduling shard by @itsomri in #635
  • feat(podgrouper): Publish the pod-grouper DefaultPluginsHub by @davidLif in #632
  • fix(operator): support latest gpu operator cdi detection by @enoodle in #641
  • ci: add support for custom GOPROXY and GOSUMDB in Docker environment by @lokielse in #643
  • docs: fix quickstart and queue docs by @enoodle in #642
  • docs: add missing default queues example by @enoodle in #648
  • ci: add auto-generated comments to RBAC and CRD YAML files by @lokielse in #644

Full Changelog: v0.10.0-rc5...v0.10.0-rc6

v0.9.8

14 Nov 11:35
96fa84d

What's Changed

  • fix: openshift operator sa reconcile by @enoodle in #646
  • fix: 0.9 - support gpu operator 25.10.0 better by @enoodle in #647

Full Changelog: v0.9.7...v0.9.8

v0.9.7

10 Nov 13:02
6ffb425

What's Changed

  • feat(admission): Explicitly apply 'nvidia' runtimeClass to GPU pods (v0.9) by @omeryahud in #625

Full Changelog: v0.9.6...v0.9.7

v0.10.0-rc5

05 Nov 11:09
e4c9d64

Pre-release

What's Changed

  • test: Remove explicit queue v2 storage in env tests by @itsomri in #607
  • chore(ci,docs): add conventional PR title guidelines and validation + pull request template by @gshaibi in #581
  • fix: queue version in docs to v2 by @enoodle in #608
  • refactor: move PR title validation to a separate workflow file by @gshaibi in #610
  • feat: Time aware simulator runner by @itsomri in #609
  • feat: Better topology allocation for non homogeneous jobs by @davidLif in #604
  • fix: Convert Ephemeral-Storage in MaxNodePoolResources from Bytes to GB by @lakshyaj02 in #566
  • fix: Fix ray grouper by @itsomri in #617
  • feat(admission): Explicitly apply 'nvidia' runtimeClass to GPU pods by @omeryahud in #602
  • fix(chart): Fix templating indentation for service resource configuration by @omeryahud in #620
  • refactor: add configurable resource names in operator by @enoodle in #613
  • feat(operator): Add default podAntiAffinity and service-level Affinity support by @omeryahud in #619

Full Changelog: v0.10.0-rc3...v0.10.0-rc5

v0.10.0-rc3

29 Oct 08:04
d7fde66

Pre-release

What's Changed

  • feat: Configure kValue for time aware fairness by @itsomri in #583
  • feat: add stale issue action by @SiorMeir in #592
  • fix: topology plugin crash with no requested resources by @enoodle in #595
  • refactor: Time aware fairness refactor by @itsomri in #600
  • Topology plugin node sorting performance improvement by @gshaibi in #588
  • test: Time aware fairness burst simulation by @itsomri in #598
  • docs: expand topology scheduling strategy section by @gshaibi in #603
  • Added multilevel topology doc by @romanbaron in #596
  • Bump github.com/argoproj/argo-workflows/v3 from 3.6.4 to 3.6.12 by @dependabot[bot] in #568
  • Topology domain sort based on resource ratio by @davidLif in #601
  • feat: add externalURL config by @SiorMeir in #563
  • Topology plugin goes into infinite loop on empty tasks list by @romanbaron in #606

Full Changelog: v0.10.0-rc2...v0.10.0-rc3

v0.10.0-rc2

22 Oct 21:31
39b7ce2

Pre-release

What's Changed

  • always install default scheduling shard by @enoodle in #589
  • Time aware fairness simulation env tests by @itsomri in #554
  • fix: scheduler dra feature gate auto detect by @enoodle in #591

Full Changelog: v0.10.0-rc1...v0.10.0-rc2

v0.10.0-rc1

21 Oct 17:43
96b4d22

Pre-release

What's Changed

Full Changelog: v0.9.3...v0.10.0-rc1

v0.9.6

15 Oct 19:26
57b7c03

What's Changed

  • configurable reservation runtime class by @enoodle in #576

Full Changelog: v0.9.5...v0.9.6

v0.9.5

09 Oct 10:44
df48505

What's Changed

Full Changelog: v0.9.4...v0.9.5