Skip to content

Conversation

@lokielse
Copy link
Contributor

@lokielse lokielse commented Nov 10, 2025

Description

There is a admission web-hook in our enterprise k8s cluster that checks and requires resources limits to be specified in a pod, otherwise the pod is not allowed to be created.

This PR adds support for configuring CPU and memory resource requests and limits for GPU reservation pods created by the binder. Previously, GPU reservation pods only had GPU resource specifications without explicit CPU/Memory limits or requests, relying on Kubernetes defaults.

Screenshots

Before
image

After
image

What Changed

  • Added new PodResources field to the ResourceReservation configuration in the binder API
  • Introduced --resource-reservation-pod-resources command-line flag accepting JSON-serialized resource requirements
  • Updated Helm chart values to support optional resourceReservationPodResources configuration
  • Modified GPU reservation pod creation logic to merge configured CPU/Memory resources while preserving GPU resource specifications
  • Added comprehensive unit tests for the new functionality

Key Features

  • Optional configuration: When not specified, the system maintains backward compatibility by using Kubernetes defaults (no explicit CPU/Memory resources)
  • Flexible resource specifications: Supports partial configuration (e.g., CPU-only or Memory-only)
  • GPU resource protection: Ensures that GPU resources are never overridden by the podResources configuration
  • Helm integration: Configuration can be specified through Helm values with clear documentation and examples

Implementation Details

The implementation flows through multiple layers:

  1. Helm values → JSON configuration in the KAI config
  2. Operator → Serializes to JSON and passes to binder via command-line flag
  3. Binder → Deserializes and passes to resource reservation service
  4. Service → Merges with GPU resources when creating reservation pods

Related Issues

Fixes #

Checklist

  • Self-reviewed
  • Added/updated tests (if needed)
  • Updated CHANGELOG.md (if needed)
  • Updated documentation (if needed)

Breaking Changes

None. This is a backward-compatible change. When the new configuration is not specified, the system behaves exactly as before.

Additional Notes

Example Configuration

Users can configure resource limits in their Helm values:

binder:
  resourceReservationPodResources:
    requests:
      cpu: 1m
      memory: 10Mi
    limits:
      cpu: 50m
      memory: 100Mi

Testing Coverage

  • Unit tests for API configuration preservation and defaults
  • Unit tests for resource merging logic in reservation pod creation
  • Tests verifying GPU resources cannot be overridden
  • Tests for partial configurations (CPU-only, Memory-only)
  • Tests for nil/empty configuration maintaining backward compatibility

@lokielse lokielse changed the title feat(binder): specify CPU and memory requests and limits for GPU pod reservation feat(binder): specify CPU and memory requests and limits for GPU reservation pod Nov 10, 2025
@enoodle
Copy link
Collaborator

enoodle commented Nov 10, 2025

As long as this is hard coded like that it can fit your specific system but not another that has some other weird admission controller - this has to be configurable, and preferably leave the default as minimal as possible like today.

@enoodle
Copy link
Collaborator

enoodle commented Nov 11, 2025

Look good, I think that you will need to run make validate to re-generate some CRD files

…pu-pod-reservation

# Conflicts:
#	deployments/kai-scheduler/values.yaml
@github-actions
Copy link

Merging this branch changes the coverage (2 decrease, 1 increase)

Impacted Packages Coverage Δ 🤖
github.com/NVIDIA/KAI-scheduler/cmd/binder/app 0.00% (ø)
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/binder 28.42% (-1.25%) 👎
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation 88.78% (+0.52%) 👍
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers 47.52% (ø)
github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers/integration_tests 0.00% (ø)
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/binder 66.67% (-2.27%) 👎

Coverage by file

Changed files (no unit tests)

Changed File Coverage Δ Total Covered Missed 🤖
github.com/NVIDIA/KAI-scheduler/cmd/binder/app/app.go 0.00% (ø) 68 (+6) 0 68 (+6)
github.com/NVIDIA/KAI-scheduler/cmd/binder/app/options.go 0.00% (ø) 25 (+1) 0 25 (+1)
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/binder/binder.go 100.00% (ø) 27 27 0
github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/binder/zz_generated.deepcopy.go 0.00% (ø) 68 (+4) 0 68 (+4)
github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation/resource_reservation.go 88.78% (+0.52%) 205 (+9) 182 (+9) 23 👍
github.com/NVIDIA/KAI-scheduler/pkg/operator/operands/binder/resources.go 66.29% (-2.76%) 89 (+5) 59 (+1) 30 (+4) 👎

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/NVIDIA/KAI-scheduler/pkg/apis/kai/v1/binder/binder_test.go
  • github.com/NVIDIA/KAI-scheduler/pkg/binder/binding/resourcereservation/resource_reservation_test.go
  • github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers/bindrequest_controller_test.go
  • github.com/NVIDIA/KAI-scheduler/pkg/binder/controllers/integration_tests/suite_test.go

@enoodle enoodle merged commit 70a3f03 into NVIDIA:main Nov 12, 2025
4 checks passed
@enoodle
Copy link
Collaborator

enoodle commented Nov 12, 2025

Thanks @lokielse

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants