Merged
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -20,6 +20,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- Fixed GPU memory pods Fair Share and Queue Order calculations
- Interpret negative or zero half-life value as disabled [#818](https://github.com/NVIDIA/KAI-Scheduler/pull/818) [itsomri](https://github.com/itsomri)
- Handle invalid CSI StorageCapacities gracefully [#817](https://github.com/NVIDIA/KAI-Scheduler/pull/817) [rich7420](https://github.com/rich7420)
- Embed CRD definitions in binary for env-test and time-aware-simulations to allow binary portability [#818](https://github.com/NVIDIA/KAI-Scheduler/pull/818) [itsomri](https://github.com/itsomri)

### Changed
- Removed the constraint that prohibited direct nesting of subgroups alongside podsets within the same subgroupset.
2 changes: 2 additions & 0 deletions deployments/kai-scheduler/.helmignore
@@ -36,3 +36,5 @@ stable-index/*
.github/
tests/*

# Go source files (used for embedding CRDs in Go binaries)
*.go
45 changes: 45 additions & 0 deletions deployments/kai-scheduler/crds/embed.go
@@ -0,0 +1,45 @@
// Copyright 2025 NVIDIA CORPORATION
// SPDX-License-Identifier: Apache-2.0

package crds

import (
	"embed"
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/apimachinery/pkg/util/yaml"
)

//go:embed *.yaml
var embeddedCRDs embed.FS

// LoadEmbeddedCRDs parses all embedded CRD YAML files and returns them as CRD objects.
// This allows the CRDs to be bundled into binaries without depending on file paths.
func LoadEmbeddedCRDs() ([]*apiextensionsv1.CustomResourceDefinition, error) {
	entries, err := embeddedCRDs.ReadDir(".")
	if err != nil {
		return nil, fmt.Errorf("failed to read embedded crds directory: %w", err)
	}

	var crds []*apiextensionsv1.CustomResourceDefinition
	for _, entry := range entries {
		if entry.IsDir() || entry.Name() == "embed.go" {
			continue
		}

		content, err := embeddedCRDs.ReadFile(entry.Name())
		if err != nil {
			return nil, fmt.Errorf("failed to read embedded CRD file %s: %w", entry.Name(), err)
		}

		crd := &apiextensionsv1.CustomResourceDefinition{}
		if err := yaml.Unmarshal(content, crd); err != nil {
			return nil, fmt.Errorf("failed to unmarshal CRD %s: %w", entry.Name(), err)
		}

		crds = append(crds, crd)
	}

	return crds, nil
}
9 changes: 9 additions & 0 deletions examples/README.md
@@ -0,0 +1,9 @@
# KAI Scheduler Examples

This directory contains example configurations and YAML files to help you get started with KAI Scheduler.

## Quick Links

- [Quickstart Examples](quickstart/README.md) - Get started with basic queue and pod setup
- [Time-Aware Fairness](time-aware-fairness/README.md) - Configure historical usage-based fair scheduling

71 changes: 71 additions & 0 deletions examples/quickstart/README.md
@@ -0,0 +1,71 @@
# Quick Start Examples

This directory contains basic examples to get you started with KAI Scheduler.

## Scheduling Queues

A queue represents a job queue in the cluster. Queues are an essential scheduling primitive and can reflect different scheduling guarantees, such as resource quota and priority. Queues are typically assigned to different consumers in the cluster (users, groups, or initiatives). A workload must belong to a queue in order to be scheduled.

KAI Scheduler operates with a two-level hierarchical scheduling queue system.

### Default Queues

After installing KAI Scheduler, a two-level queue hierarchy is automatically created:
- `default-parent-queue` – Top-level (parent) queue. By default, this queue has no reserved resource quotas, allowing governance of resource distribution for its leaf queues.
- `default-queue` – Leaf (child) queue under the `default-parent-queue` top-level queue. Workloads should reference this queue.

The default queues are defined in [default-queues.yaml](default-queues.yaml).

No manual queue setup is required. Both queues will exist immediately after installation, allowing you to start submitting workloads right away.

### Creating Additional Queues

To add custom queues, apply your queue configuration:

```bash
kubectl apply -f queues.yaml
```

For detailed configuration options, refer to the [Scheduling Queues documentation](../../docs/queues/README.md).
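
For illustration, a leaf queue for a team under the default parent might look like this (the queue name and quota values are hypothetical; the field layout follows [default-queues.yaml](default-queues.yaml)):

```yaml
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: default-parent-queue
  resources:
    cpu:
      quota: 8           # Guaranteed CPUs (illustrative value)
      limit: -1          # No upper limit
      overQuotaWeight: 1
    gpu:
      quota: 2
      limit: -1
      overQuotaWeight: 1
    memory:
      quota: 0
      limit: -1
      overQuotaWeight: 1
```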

## Assigning Pods to Queues

To schedule a pod using KAI Scheduler, ensure the following:

1. Specify the queue name using the `kai.scheduler/queue: default-queue` label on the pod/workload.
2. Set the scheduler name in the pod specification as `kai-scheduler`.

This ensures the pod is placed in the correct scheduling queue and managed by KAI Scheduler.
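
A minimal manifest combining both requirements (the pod name is arbitrary):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
  labels:
    kai.scheduler/queue: default-queue # 1. queue assignment
spec:
  schedulerName: kai-scheduler         # 2. scheduler selection
  containers:
    - name: main
      image: ubuntu
      args: ["sleep", "infinity"]
```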

> **⚠️ Workload Namespaces**
>
> When submitting workloads, make sure to use a dedicated namespace. Do not use the `kai-scheduler` namespace for workload submission.

## Submitting Example Pods

### CPU-Only Pods

To submit a simple pod that requests CPU and memory resources:

```bash
kubectl apply -f pods/cpu-only-pod.yaml
```

### GPU Pods

Before running GPU workloads, ensure the [NVIDIA GPU-Operator](https://github.com/NVIDIA/gpu-operator) is installed in the cluster.

To submit a pod that requests a GPU resource:

```bash
kubectl apply -f pods/gpu-pod.yaml
```

## Files

| File | Description |
|------|-------------|
| [default-queues.yaml](default-queues.yaml) | Default parent and leaf queue configuration |
| [pods/cpu-only-pod.yaml](pods/cpu-only-pod.yaml) | Example CPU-only pod |
| [pods/gpu-pod.yaml](pods/gpu-pod.yaml) | Example GPU pod |

45 changes: 45 additions & 0 deletions examples/quickstart/default-queues.yaml
@@ -0,0 +1,45 @@
# Copyright 2025 NVIDIA CORPORATION
# SPDX-License-Identifier: Apache-2.0

# Default queue hierarchy created by KAI Scheduler on installation
# Top-level parent queue - manages resource distribution for its children
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default-parent-queue
spec:
  resources:
    cpu:
      limit: -1          # No limit
      overQuotaWeight: 1 # Equal weight for over-quota resources
      quota: 0           # No guaranteed quota
    gpu:
      limit: -1
      overQuotaWeight: 1
      quota: 0
    memory:
      limit: -1
      overQuotaWeight: 1
      quota: 0
---
# Leaf queue - workloads should reference this queue
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default-queue
spec:
  parentQueue: default-parent-queue
  resources:
    cpu:
      limit: -1
      overQuotaWeight: 1
      quota: 0
    gpu:
      limit: -1
      overQuotaWeight: 1
      quota: 0
    memory:
      limit: -1
      overQuotaWeight: 1
      quota: 0

21 changes: 21 additions & 0 deletions examples/quickstart/pods/cpu-only-pod.yaml
@@ -0,0 +1,21 @@
# Copyright 2025 NVIDIA CORPORATION
# SPDX-License-Identifier: Apache-2.0

# Example: Simple CPU-only pod scheduled by KAI Scheduler
apiVersion: v1
kind: Pod
metadata:
  name: cpu-only-pod
  labels:
    kai.scheduler/queue: default-queue # Required: assigns pod to a queue
spec:
  schedulerName: kai-scheduler # Required: use KAI Scheduler
  containers:
    - name: main
      image: ubuntu
      args: ["sleep", "infinity"]
      resources:
        requests:
          cpu: 100m
          memory: 250M

22 changes: 22 additions & 0 deletions examples/quickstart/pods/gpu-pod.yaml
@@ -0,0 +1,22 @@
# Copyright 2025 NVIDIA CORPORATION
# SPDX-License-Identifier: Apache-2.0

# Example: GPU pod scheduled by KAI Scheduler
# Prerequisites: NVIDIA GPU-Operator must be installed in the cluster
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  labels:
    kai.scheduler/queue: default-queue # Required: assigns pod to a queue
spec:
  schedulerName: kai-scheduler # Required: use KAI Scheduler
  containers:
    - name: main
      image: ubuntu
      command: ["bash", "-c"]
      args: ["nvidia-smi; trap 'exit 0' TERM; sleep infinity & wait"]
      resources:
        limits:
          nvidia.com/gpu: "1" # Request 1 GPU

156 changes: 156 additions & 0 deletions examples/time-aware-fairness/README.md
@@ -0,0 +1,156 @@
# Time-Aware Fairness Examples

Time-aware fairness is a feature in KAI Scheduler that factors queues' historical resource usage into allocation and reclaim decisions.

## Key Features

1. **Historical Usage Consideration**: All else being equal, queues with higher past usage will get to run jobs after queues with lower usage.
2. **Usage-Based Reclaim**: Queues that are starved over time will reclaim resources from queues that used a lot of resources.
> Note: This does not affect in-quota allocation; deserved quota still takes precedence over time-aware fairness.

## How It Works

Resource usage data is collected and persisted in Prometheus. The scheduler uses this data in its fairness calculations: the more resources a queue has consumed, the fewer over-quota resources it will receive compared to other queues.

### Time Decay (Optional)

If configured, the scheduler applies an [exponential time decay](https://en.wikipedia.org/wiki/Exponential_decay) formula controlled by a half-life period. For example, with a half-life of one hour, a GPU-second consumed an hour ago will be considered half as significant as a GPU-second consumed just now.

## Examples in This Directory

| File | Description |
|------|-------------|
| [scheduling-shard-minimal.yaml](scheduling-shard-minimal.yaml) | Minimal configuration to enable time-aware fairness |
| [scheduling-shard-managed-prometheus.yaml](scheduling-shard-managed-prometheus.yaml) | Full configuration using KAI-managed Prometheus |
| [scheduling-shard-external-prometheus.yaml](scheduling-shard-external-prometheus.yaml) | Configuration for using an external Prometheus instance |
| [two-queue-oscillation/](two-queue-oscillation/) | Complete example demonstrating fair resource oscillation between two queues |

## Quick Setup

### Step 0: Install Prometheus (Optional)

> **Note**: If you already have Prometheus and kube-state-metrics installed, skip to Step 1.

If you don't already have Prometheus installed in your cluster, you can install it using the [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) Helm chart. This chart includes the Prometheus Operator and kube-state-metrics.

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
```

Wait for the pods to be ready:

```bash
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s
```

### Step 1: Enable Prometheus

First, enable Prometheus via the KAI operator:

```bash
kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
```

Wait for the Prometheus pod to be ready:

```bash
watch kubectl get pod -n kai-scheduler prometheus-prometheus-0
```

### Step 2: Configure the Scheduler

Apply the minimal scheduling shard configuration:

```bash
kubectl apply -f scheduling-shard-minimal.yaml
```

Or patch the existing shard:

```bash
kubectl patch schedulingshard default --type merge -p '{"spec":{"usageDBConfig":{"clientType":"prometheus"}}}'
```

The scheduler will restart and connect to Prometheus.

## Configuration Options

### Usage Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `windowSize` | `1w` (1 week) | Time period considered for fairness calculations |
| `windowType` | `sliding` | Window type: `sliding`, `tumbling`, or `cron` |
| `halfLifePeriod` | disabled | Half-life for exponential decay (e.g., `10m`, `1h`) |
| `fetchInterval` | `1m` | How often to fetch usage data from Prometheus |
| `stalenessPeriod` | `5m` | Maximum age of usage data before it is considered stale |
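
As a hedged sketch, these parameters would sit in the SchedulingShard spec roughly as follows. Only `usageDBConfig.clientType` appears in the patch commands in this document; the `apiVersion` and the nesting of the remaining fields are assumptions, so verify them against the SchedulingShard CRD and the example YAML files in this directory:

```yaml
apiVersion: kai.scheduler/v1   # assumed; check your installed CRD
kind: SchedulingShard
metadata:
  name: default
spec:
  usageDBConfig:
    clientType: prometheus
    # Field placement below is illustrative; verify against the CRD schema
    usageParams:
      windowSize: 1w
      windowType: sliding
      halfLifePeriod: 1h
      fetchInterval: 1m
      stalenessPeriod: 5m
```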

### kValue

The `kValue` parameter controls the impact of historical usage on fairness calculations:
- Higher values = more aggressive correction based on historical usage
- Lower values = more weight on over-quota weights, less on history
- Default: `1.0`

### Window Types

- **Sliding**: Considers usage from the last `windowSize` duration (rolling window)
- **Tumbling**: Non-overlapping fixed windows that reset at `tumblingWindowStartTime`
- **Cron**: Windows defined by a cron expression

## Using External Prometheus

If you have an existing Prometheus instance, configure it in the KAI config:

```bash
kubectl patch config kai-config --type merge -p '{
"spec": {
"prometheus": {
"enabled": true,
"externalPrometheusUrl": "http://prometheus.monitoring.svc.cluster.local:9090"
}
}
}'
```

See [scheduling-shard-external-prometheus.yaml](scheduling-shard-external-prometheus.yaml) for a complete example.

## Troubleshooting

### Prerequisites

Ensure the [Prometheus Operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed:

```bash
kubectl get crd prometheuses.monitoring.coreos.com
```

For cluster capacity metrics, [kube-state-metrics](https://artifacthub.io/packages/helm/prometheus-community/kube-state-metrics/) must also be installed.

### Check Scheduler Logs

If the scheduler cannot fetch usage metrics:

```bash
kubectl logs -n kai-scheduler deployment/kai-scheduler-default | grep -i usage
```

### Verify Prometheus Connection

Check if the scheduler can reach Prometheus:

```bash
kubectl exec -n kai-scheduler deployment/kai-scheduler-default -- wget -q -O- http://prometheus-operated.kai-scheduler.svc.cluster.local:9090/api/v1/status/config
```

## Further Reading

- [Time-Aware Fairness Documentation](../../docs/timeaware/README.md)
- [Fairness Concepts](../../docs/fairness/README.md)
- [Time-Aware Design Document](../../docs/developer/designs/time-aware-fairness/time-aware-fairness.md)
- [Time-Aware Simulator](../../cmd/time-aware-simulator/README.md)
