feat(binder): specify CPU and memory requests and limits for GPU reservation pod #626

lokielse · 2025-11-10T12:05:48Z

Description

Our enterprise Kubernetes cluster runs an admission webhook that requires every pod to specify resource limits; pods without them are rejected at creation.

This PR adds support for configuring CPU and memory resource requests and limits for GPU reservation pods created by the binder. Previously, GPU reservation pods only had GPU resource specifications without explicit CPU/Memory limits or requests, relying on Kubernetes defaults.

Screenshots

Before

After

What Changed

Added new PodResources field to the ResourceReservation configuration in the binder API
Introduced --resource-reservation-pod-resources command-line flag accepting JSON-serialized resource requirements
Updated Helm chart values to support optional resourceReservationPodResources configuration
Modified GPU reservation pod creation logic to merge configured CPU/Memory resources while preserving GPU resource specifications
Added comprehensive unit tests for the new functionality

Key Features

Optional configuration: When not specified, the system maintains backward compatibility by using Kubernetes defaults (no explicit CPU/Memory resources)
Flexible resource specifications: Supports partial configuration (e.g., CPU-only or Memory-only)
GPU resource protection: Ensures that GPU resources are never overridden by the podResources configuration
Helm integration: Configuration can be specified through Helm values with clear documentation and examples
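The merge and GPU-protection behavior listed above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the code from this PR: the `ResourceList` and `Resources` types, the `mergeReservationResources` helper, and the `nvidia.com/gpu` key used as the protected resource are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// ResourceList is a simplified, hypothetical stand-in for a Kubernetes
// resource list: resource name -> quantity string.
type ResourceList map[string]string

// Resources mirrors the requests/limits shape of the podResources setting.
type Resources struct {
	Requests ResourceList
	Limits   ResourceList
}

// gpuResourceName is an assumption; the PR text does not name the protected key.
const gpuResourceName = "nvidia.com/gpu"

// mergeList copies base, then adds the configured entries, skipping the GPU
// key so the reservation pod's GPU specification is never overridden.
func mergeList(base, configured ResourceList) ResourceList {
	out := ResourceList{}
	for name, qty := range base {
		out[name] = qty
	}
	for name, qty := range configured {
		if name == gpuResourceName {
			continue
		}
		out[name] = qty
	}
	return out
}

// mergeReservationResources applies an optional configuration on top of the
// GPU-only resources. A nil configuration yields an unchanged copy, which
// corresponds to the backward-compatible default.
func mergeReservationResources(gpuOnly Resources, configured *Resources) Resources {
	if configured == nil {
		configured = &Resources{}
	}
	return Resources{
		Requests: mergeList(gpuOnly.Requests, configured.Requests),
		Limits:   mergeList(gpuOnly.Limits, configured.Limits),
	}
}

func main() {
	gpuOnly := Resources{
		Requests: ResourceList{gpuResourceName: "1"},
		Limits:   ResourceList{gpuResourceName: "1"},
	}
	// Partial configuration: CPU only, plus a GPU entry that must be ignored.
	cfg := &Resources{
		Limits: ResourceList{"cpu": "50m", gpuResourceName: "8"},
	}
	merged := mergeReservationResources(gpuOnly, cfg)

	names := make([]string, 0, len(merged.Limits))
	for name := range merged.Limits {
		names = append(names, name)
	}
	sort.Strings(names)
	for _, name := range names {
		fmt.Printf("limits[%s] = %s\n", name, merged.Limits[name])
	}
}
```

Copying into a fresh map, rather than mutating the GPU-only lists in place, keeps the default pod spec untouched when no configuration is supplied.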

Implementation Details

The implementation flows through multiple layers:

Helm values → JSON configuration in the KAI config
Operator → Serializes to JSON and passes to binder via command-line flag
Binder → Deserializes and passes to resource reservation service
Service → Merges with GPU resources when creating reservation pods
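As a rough illustration of the operator-to-binder hand-off in the flow above, the sketch below serializes a configuration to JSON, passes it as a command-line argument, and parses it back. Only the flag name `--resource-reservation-pod-resources` comes from this PR; the `PodResources` struct, its JSON field names, and the `buildBinderArgs` and `parseBinderArgs` helpers are hypothetical and use the standard library only.

```go
package main

import (
	"encoding/json"
	"flag"
	"fmt"
)

// PodResources is a hypothetical, simplified shape for the configuration.
// The real field presumably uses Kubernetes resource types instead.
type PodResources struct {
	Requests map[string]string `json:"requests,omitempty"`
	Limits   map[string]string `json:"limits,omitempty"`
}

// flagName is the flag introduced by this PR.
const flagName = "resource-reservation-pod-resources"

// buildBinderArgs plays the operator's role: it serializes the optional
// configuration to JSON and emits no argument at all when it is unset.
func buildBinderArgs(res *PodResources) ([]string, error) {
	if res == nil {
		return nil, nil
	}
	raw, err := json.Marshal(res)
	if err != nil {
		return nil, err
	}
	return []string{"--" + flagName + "=" + string(raw)}, nil
}

// parseBinderArgs plays the binder's role: it reads the flag and decodes the
// JSON. An absent or empty flag yields nil, meaning "keep the defaults".
func parseBinderArgs(args []string) (*PodResources, error) {
	fs := flag.NewFlagSet("binder", flag.ContinueOnError)
	raw := fs.String(flagName, "", "JSON-serialized resource requirements")
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	if *raw == "" {
		return nil, nil
	}
	var res PodResources
	if err := json.Unmarshal([]byte(*raw), &res); err != nil {
		return nil, fmt.Errorf("invalid --%s value: %w", flagName, err)
	}
	return &res, nil
}

func main() {
	cfg := &PodResources{
		Requests: map[string]string{"cpu": "1m", "memory": "10Mi"},
		Limits:   map[string]string{"cpu": "50m", "memory": "100Mi"},
	}
	args, err := buildBinderArgs(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(args)

	parsed, err := parseBinderArgs(args)
	if err != nil {
		panic(err)
	}
	fmt.Println(parsed.Limits["memory"])
}
```

Omitting the argument entirely when nothing is configured is one way to keep the binder's default behavior identical to earlier releases.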

Related Issues

Fixes #

Checklist

Self-reviewed
Added/updated tests (if needed)
Updated CHANGELOG.md (if needed)
Updated documentation (if needed)

Breaking Changes

None. This is a backward-compatible change. When the new configuration is not specified, the system behaves exactly as before.

Additional Notes

Example Configuration

Users can configure resource limits in their Helm values:

binder:
  resourceReservationPodResources:
    requests:
      cpu: 1m
      memory: 10Mi
    limits:
      cpu: 50m
      memory: 100Mi

Testing Coverage

Unit tests for API configuration preservation and defaults
Unit tests for resource merging logic in reservation pod creation
Tests verifying GPU resources cannot be overridden
Tests for partial configurations (CPU-only, Memory-only)
Tests for nil/empty configuration maintaining backward compatibility

enoodle · 2025-11-10T18:37:48Z

As long as this is hard-coded, it fits your specific system but not another one with a different admission controller. This has to be configurable, and the default should preferably stay as minimal as it is today.

enoodle · 2025-11-11T13:01:54Z

Looks good. I think you will need to run make validate to regenerate some CRD files.

github-actions · 2025-11-12T08:38:08Z

Merging this branch changes the coverage (2 decrease, 1 increase)

Impacted Packages	Coverage Δ	🤖
github.com/NVIDIA/KAI-scheduler/cmd/binder/app	0.00% (ø)
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/binder	28.42% (-1.25%)	👎
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation	88.78% (+0.52%)	👍
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers	47.52% (ø)
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers/integration_tests	0.00% (ø)
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/binder	66.67% (-2.27%)	👎

Coverage by file

Changed files (no unit tests)

Changed File	Coverage Δ	Total	Covered	Missed	🤖
github.com/NVIDIA/KAI-scheduler/cmd/binder/app/app.go	0.00% (ø)	68 (+6)	0	68 (+6)
github.com/NVIDIA/KAI-scheduler/cmd/binder/app/options.go	0.00% (ø)	25 (+1)	0	25 (+1)
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/binder/binder.go	100.00% (ø)	27	27	0
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/binder/zz_generated.deepcopy.go	0.00% (ø)	68 (+4)	0	68 (+4)
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation/resource_reservation.go	88.78% (+0.52%)	205 (+9)	182 (+9)	23	👍
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/binder/resources.go	66.29% (-2.76%)	89 (+5)	59 (+1)	30 (+4)	👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/binder/binder_test.go
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation/resource_reservation_test.go
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers/bindrequest_controller_test.go
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers/integration_tests/suite_test.go

enoodle · 2025-11-12T08:43:51Z

Thanks @lokielse

feat(resource reservation): specify CPU and memory requests and limit…

0c17c1d

…s for GPU pod reservation

lokielse changed the title ~~feat(binder): specify CPU and memory requests and limits for GPU pod reservation~~ feat(binder): specify CPU and memory requests and limits for GPU reservation pod Nov 10, 2025

lokielse mentioned this pull request Nov 11, 2025

feat(chart): Add flexible image tag configuration with priority-based overrides #628

Merged

4 tasks

lokielse added 2 commits November 11, 2025 09:49

feat(resource reservation): add configurable CPU and memory requests …

a46a1c8

…and limits for GPU reservation pods

test(resource reservation): add unit tests for CPU and memory resourc…

6114d85

…e configuration in GPU reservation pods

enoodle reviewed Nov 11, 2025

pkg/apis/kai/v1/binder/binder.go (outdated)

cmd/binder/app/app.go (outdated)

pkg/operator/operands/binder/resources.go (outdated)

lokielse added 2 commits November 11, 2025 17:46

feat(resource reservation): add JSON configuration for CPU and memory…

02344bf

… resources in GPU reservation pods

Merge branch 'main' into specify-cpu-memory-requests-and-limits-for-g…

eafbb4b

…pu-pod-reservation

lokielse added 2 commits November 12, 2025 09:27

Merge branch 'main' into specify-cpu-memory-requests-and-limits-for-g…

b1aa642

…pu-pod-reservation # Conflicts: # deployments/kai-scheduler/values.yaml

regenerate CRDs and fix UT

f17fb55

enoodle approved these changes Nov 12, 2025

enoodle merged commit 70a3f03 into NVIDIA:main Nov 12, 2025
4 checks passed
