docs: Time aware scheduling setup examples #843
Merged
Modified file (glob pattern list; the filename is not shown in this view):

```diff
@@ -36,3 +36,5 @@ stable-index/*
 .github/
 tests/*
+# Go source files (used for embedding CRDs in Go binaries)
+*.go
```
New file (Go CRD embedding helper, `package crds`):

```go
// Copyright 2025 NVIDIA CORPORATION
// SPDX-License-Identifier: Apache-2.0

package crds

import (
	"embed"
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	"k8s.io/apimachinery/pkg/util/yaml"
)

//go:embed *.yaml
var embeddedCRDs embed.FS

// LoadEmbeddedCRDs parses all embedded CRD YAML files and returns them as CRD objects.
// This allows the CRDs to be bundled into binaries without depending on file paths.
func LoadEmbeddedCRDs() ([]*apiextensionsv1.CustomResourceDefinition, error) {
	entries, err := embeddedCRDs.ReadDir(".")
	if err != nil {
		return nil, fmt.Errorf("failed to read embedded crds directory: %w", err)
	}

	var crds []*apiextensionsv1.CustomResourceDefinition
	for _, entry := range entries {
		if entry.IsDir() || entry.Name() == "embed.go" {
			continue
		}

		content, err := embeddedCRDs.ReadFile(entry.Name())
		if err != nil {
			return nil, fmt.Errorf("failed to read embedded CRD file %s: %w", entry.Name(), err)
		}

		crd := &apiextensionsv1.CustomResourceDefinition{}
		if err := yaml.Unmarshal(content, crd); err != nil {
			return nil, fmt.Errorf("failed to unmarshal CRD %s: %w", entry.Name(), err)
		}

		crds = append(crds, crd)
	}

	return crds, nil
}
```
New file (examples index `README.md`):

# KAI Scheduler Examples

This directory contains example configurations and YAML files to help you get started with KAI Scheduler.

## Quick Links

- [Quickstart Examples](quickstart/README.md) - Get started with basic queue and pod setup
- [Time-Aware Fairness](time-aware-fairness/README.md) - Configure historical usage-based fair scheduling
New file (quickstart `README.md`):

# Quick Start Examples

This directory contains basic examples to get you started with KAI Scheduler.

## Scheduling Queues

A queue represents a job queue in the cluster. Queues are an essential scheduling primitive and can reflect different scheduling guarantees, such as resource quota and priority. Queues are typically assigned to different consumers in the cluster (users, groups, or initiatives). A workload must belong to a queue in order to be scheduled.

KAI Scheduler operates with a two-level hierarchical scheduling queue system.

### Default Queues

After installing KAI Scheduler, a two-level queue hierarchy is automatically created:

- `default-parent-queue` – Top-level (parent) queue. By default, this queue has no reserved resource quotas, allowing governance of resource distribution for its leaf queues.
- `default-queue` – Leaf (child) queue under the `default-parent-queue` top-level queue. Workloads should reference this queue.

The default queues are defined in [default-queues.yaml](default-queues.yaml).

No manual queue setup is required. Both queues will exist immediately after installation, allowing you to start submitting workloads right away.

### Creating Additional Queues

To add custom queues, apply your queue configuration:

```bash
kubectl apply -f queues.yaml
```

For detailed configuration options, refer to the [Scheduling Queues documentation](../../docs/queues/README.md).
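As a sketch, a hypothetical `queues.yaml` could add a team-specific parent/leaf pair alongside the defaults. The queue names and quota values below are illustrative only; the schema follows [default-queues.yaml](default-queues.yaml):

```yaml
# Illustrative team queues -- names and quota values are examples only
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a-parent
spec:
  resources:
    cpu: {quota: 0, limit: -1, overQuotaWeight: 1}
    gpu: {quota: 8, limit: -1, overQuotaWeight: 1}  # guarantee 8 GPUs to this subtree
    memory: {quota: 0, limit: -1, overQuotaWeight: 1}
---
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: team-a
spec:
  parentQueue: team-a-parent
  resources:
    cpu: {quota: 0, limit: -1, overQuotaWeight: 1}
    gpu: {quota: 8, limit: -1, overQuotaWeight: 1}
    memory: {quota: 0, limit: -1, overQuotaWeight: 1}
```

Workloads would then use the `kai.scheduler/queue: team-a` label instead of `default-queue`.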
## Assigning Pods to Queues

To schedule a pod using KAI Scheduler, ensure the following:

1. Specify the queue name using the `kai.scheduler/queue: default-queue` label on the pod/workload.
2. Set the scheduler name in the pod specification to `kai-scheduler`.

This ensures the pod is placed in the correct scheduling queue and managed by KAI Scheduler.

> **⚠️ Workload Namespaces**
>
> When submitting workloads, make sure to use a dedicated namespace. Do not use the `kai-scheduler` namespace for workload submission.
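Combined, the two requirements look like this in a minimal pod manifest (a sketch; the pod name is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-workload                      # illustrative name
  labels:
    kai.scheduler/queue: default-queue   # 1. queue assignment
spec:
  schedulerName: kai-scheduler           # 2. scheduler selection
  containers:
    - name: main
      image: ubuntu
      args: ["sleep", "infinity"]
```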
## Submitting Example Pods

### CPU-Only Pods

To submit a simple pod that requests CPU and memory resources:

```bash
kubectl apply -f pods/cpu-only-pod.yaml
```

### GPU Pods

Before running GPU workloads, ensure the [NVIDIA GPU-Operator](https://github.com/NVIDIA/gpu-operator) is installed in the cluster.

To submit a pod that requests a GPU resource:

```bash
kubectl apply -f pods/gpu-pod.yaml
```

## Files

| File | Description |
|------|-------------|
| [default-queues.yaml](default-queues.yaml) | Default parent and leaf queue configuration |
| [pods/cpu-only-pod.yaml](pods/cpu-only-pod.yaml) | Example CPU-only pod |
| [pods/gpu-pod.yaml](pods/gpu-pod.yaml) | Example GPU pod |
New file (`default-queues.yaml`):

```yaml
# Copyright 2025 NVIDIA CORPORATION
# SPDX-License-Identifier: Apache-2.0

# Default queue hierarchy created by KAI Scheduler on installation
# Top-level parent queue - manages resource distribution for its children
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default-parent-queue
spec:
  resources:
    cpu:
      limit: -1          # No limit
      overQuotaWeight: 1 # Equal weight for over-quota resources
      quota: 0           # No guaranteed quota
    gpu:
      limit: -1
      overQuotaWeight: 1
      quota: 0
    memory:
      limit: -1
      overQuotaWeight: 1
      quota: 0
---
# Leaf queue - workloads should reference this queue
apiVersion: scheduling.run.ai/v2
kind: Queue
metadata:
  name: default-queue
spec:
  parentQueue: default-parent-queue
  resources:
    cpu:
      limit: -1
      overQuotaWeight: 1
      quota: 0
    gpu:
      limit: -1
      overQuotaWeight: 1
      quota: 0
    memory:
      limit: -1
      overQuotaWeight: 1
      quota: 0
```
New file (`pods/cpu-only-pod.yaml`):

```yaml
# Copyright 2025 NVIDIA CORPORATION
# SPDX-License-Identifier: Apache-2.0

# Example: Simple CPU-only pod scheduled by KAI Scheduler
apiVersion: v1
kind: Pod
metadata:
  name: cpu-only-pod
  labels:
    kai.scheduler/queue: default-queue # Required: assigns pod to a queue
spec:
  schedulerName: kai-scheduler # Required: use KAI Scheduler
  containers:
    - name: main
      image: ubuntu
      args: ["sleep", "infinity"]
      resources:
        requests:
          cpu: 100m
          memory: 250M
```
New file (`pods/gpu-pod.yaml`):

```yaml
# Copyright 2025 NVIDIA CORPORATION
# SPDX-License-Identifier: Apache-2.0

# Example: GPU pod scheduled by KAI Scheduler
# Prerequisites: NVIDIA GPU-Operator must be installed in the cluster
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
  labels:
    kai.scheduler/queue: default-queue # Required: assigns pod to a queue
spec:
  schedulerName: kai-scheduler # Required: use KAI Scheduler
  containers:
    - name: main
      image: ubuntu
      command: ["bash", "-c"]
      args: ["nvidia-smi; trap 'exit 0' TERM; sleep infinity & wait"]
      resources:
        limits:
          nvidia.com/gpu: "1" # Request 1 GPU
```
New file (time-aware-fairness `README.md`):

# Time-Aware Fairness Examples

Time-aware fairness is a KAI Scheduler feature that takes queues' historical resource usage into account when making allocation and reclaim decisions.

## Key Features

1. **Historical Usage Consideration**: All else being equal, queues with higher past usage will get to run jobs after queues with lower usage.
2. **Usage-Based Reclaim**: Queues that are starved over time will reclaim resources from queues that have used a lot of resources.
   > Note: This does not affect in-quota allocation; deserved quota still takes precedence over time-aware fairness.

## How It Works

Resource usage data is collected and persisted in Prometheus. The scheduler uses this data in its fairness calculations: the more resources a queue has consumed, the fewer over-quota resources it will receive compared to other queues.

### Time Decay (Optional)

If configured, the scheduler applies an [exponential time decay](https://en.wikipedia.org/wiki/Exponential_decay) formula controlled by a half-life period. For example, with a half-life of one hour, a GPU-second consumed an hour ago will be considered half as significant as a GPU-second consumed just now.
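Written out, the standard exponential-decay weighting with half-life $t_{1/2}$ gives usage observed $\Delta t$ ago the weight

$$w(\Delta t) = 2^{-\Delta t / t_{1/2}}$$

so with $t_{1/2} = 1\,\text{h}$, a GPU-second from one hour ago is weighted $2^{-1} = 0.5$, and one from two hours ago $2^{-2} = 0.25$. (This is the generic formula; the scheduler's exact normalization is not shown in this diff.)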
## Examples in This Directory

| File | Description |
|------|-------------|
| [scheduling-shard-minimal.yaml](scheduling-shard-minimal.yaml) | Minimal configuration to enable time-aware fairness |
| [scheduling-shard-managed-prometheus.yaml](scheduling-shard-managed-prometheus.yaml) | Full configuration using KAI-managed Prometheus |
| [scheduling-shard-external-prometheus.yaml](scheduling-shard-external-prometheus.yaml) | Configuration for using an external Prometheus instance |
| [two-queue-oscillation/](two-queue-oscillation/) | Complete example demonstrating fair resource oscillation between two queues |

## Quick Setup

### Step 0: Install Prometheus (Optional)

> **Note**: If you already have Prometheus and kube-state-metrics installed, skip to Step 1.

If you don't already have Prometheus installed in your cluster, you can install it using the [kube-prometheus-stack](https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack) Helm chart. This chart includes the Prometheus Operator and kube-state-metrics.

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace
```

Wait for the pods to be ready:

```bash
kubectl wait --for=condition=Ready pods --all -n monitoring --timeout=300s
```

### Step 1: Enable Prometheus

First, enable Prometheus via the KAI operator:

```bash
kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
```

Wait for the Prometheus pod to be ready:

```bash
watch kubectl get pod -n kai-scheduler prometheus-prometheus-0
```

### Step 2: Configure the Scheduler

Apply the minimal scheduling shard configuration:

```bash
kubectl apply -f scheduling-shard-minimal.yaml
```

Or patch the existing shard:

```bash
kubectl patch schedulingshard default --type merge -p '{"spec":{"usageDBConfig":{"clientType":"prometheus"}}}'
```

The scheduler will restart and connect to Prometheus.

## Configuration Options

### Usage Parameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `windowSize` | `1w` (1 week) | Time period considered for fairness calculations |
| `windowType` | `sliding` | Window type: `sliding`, `tumbling`, or `cron` |
| `halfLifePeriod` | disabled | Half-life for exponential decay (e.g., `10m`, `1h`) |
| `fetchInterval` | `1m` | How often usage data is fetched from Prometheus |
| `stalenessPeriod` | `5m` | Maximum age of usage data before it is considered stale |

### kValue

The `kValue` parameter controls the impact of historical usage on fairness calculations:

- Higher values = more aggressive correction based on historical usage
- Lower values = more weight on over-quota weights, less on history
- Default: `1.0`

### Window Types

- **Sliding**: Considers usage from the last `windowSize` duration (rolling window)
- **Tumbling**: Non-overlapping fixed windows that reset at `tumblingWindowStartTime`
- **Cron**: Windows defined by a cron expression
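Putting these options together, a shard spec fragment might look like the sketch below. The parameter names come from the table above, but their exact nesting (the `usageParams` block, and the placement of `kValue`) is an assumption; see [scheduling-shard-managed-prometheus.yaml](scheduling-shard-managed-prometheus.yaml) for the authoritative layout:

```yaml
# Sketch only: the nesting of these fields is assumed, not taken from this diff
spec:
  usageDBConfig:
    clientType: prometheus   # as in the Step 2 patch above
    usageParams:             # hypothetical grouping of the parameters above
      windowSize: 1w
      windowType: sliding
      halfLifePeriod: 1h     # enables exponential decay
      fetchInterval: 1m
      stalenessPeriod: 5m
  kValue: 1.0                # placement assumed; default value
```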
## Using External Prometheus

If you have an existing Prometheus instance, configure it in the KAI config:

```bash
kubectl patch config kai-config --type merge -p '{
  "spec": {
    "prometheus": {
      "enabled": true,
      "externalPrometheusUrl": "http://prometheus.monitoring.svc.cluster.local:9090"
    }
  }
}'
```

See [scheduling-shard-external-prometheus.yaml](scheduling-shard-external-prometheus.yaml) for a complete example.

## Troubleshooting

### Prerequisites

Ensure the [Prometheus Operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed:

```bash
kubectl get crd prometheuses.monitoring.coreos.com
```

For cluster capacity metrics, [kube-state-metrics](https://artifacthub.io/packages/helm/prometheus-community/kube-state-metrics/) must also be installed.

### Check Scheduler Logs

If the scheduler cannot fetch usage metrics:

```bash
kubectl logs -n kai-scheduler deployment/kai-scheduler-default | grep -i usage
```

### Verify Prometheus Connection

Check whether the scheduler can reach Prometheus:

```bash
kubectl exec -n kai-scheduler deployment/kai-scheduler-default -- wget -q -O- http://prometheus-operated.kai-scheduler.svc.cluster.local:9090/api/v1/status/config
```

## Further Reading

- [Time-Aware Fairness Documentation](../../docs/timeaware/README.md)
- [Fairness Concepts](../../docs/fairness/README.md)
- [Time-Aware Design Document](../../docs/developer/designs/time-aware-fairness/time-aware-fairness.md)
- [Time-Aware Simulator](../../cmd/time-aware-simulator/README.md)
|
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.