diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0a435b009..de5e09011 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -14,7 +14,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
 - Added enforcement of the `nvidia` runtime class for GPU pods, with the option to enforce a custom runtime class, or disable enforcement entirely.
 - Added a preferred podAntiAffinity term by default for all services, can be set to required instead by setting `global.requireDefaultPodAffinityTerm`
 - Added support for service-level affinities
-- Added time aware scheduling configurations in scheduling shard
+- Added [time aware scheduling](docs/timeaware/README.md) capabilities
 
 ### Fixed
 - (Openshift only) - High CPU usage for the operator pod due to continues reconciles
diff --git a/cmd/time-aware-simulator/examples/plot_simple.py b/cmd/time-aware-simulator/examples/plot_simple.py
index 0d15a4808..8a9534602 100644
--- a/cmd/time-aware-simulator/examples/plot_simple.py
+++ b/cmd/time-aware-simulator/examples/plot_simple.py
@@ -9,6 +9,8 @@
 parser = argparse.ArgumentParser(description='Plot simulation results from CSV file')
 parser.add_argument('input', nargs='?', default='simulation_results.csv',
                     help='Path to the CSV file (default: simulation_results.csv)')
+parser.add_argument('--output', '-o', type=str, default=None,
+                    help='Save plot to PNG file instead of displaying it')
 args = parser.parse_args()
 
 df = pd.read_csv(args.input)
@@ -38,5 +40,10 @@
 ax2.grid(True, alpha=0.3)
 
 plt.tight_layout()
 
-plt.show()
+
+if args.output:
+    plt.savefig(args.output, dpi=300, bbox_inches='tight')
+    print(f"Plot saved to {args.output}")
+else:
+    plt.show()
diff --git a/docs/timeaware/README.md b/docs/timeaware/README.md
new file mode 100644
index 000000000..e914dfad0
--- /dev/null
+++ b/docs/timeaware/README.md
@@ -0,0 +1,189 @@
+# Time Aware Fairness
+
+Time-aware fairness is a feature in KAI-Scheduler that uses queues' historical resource usage when making allocation and reclaim decisions. Key features are:
+
+1. All else being equal, queues with higher past usage will get to run jobs after queues with lower usage
+2. Reclaim based on usage: queues that are starved over time will reclaim resources from queues that used a lot of resources.
+   1. Note: this does not affect in-quota allocation: deserved quota still takes precedence over time-aware fairness
+
+> **Prerequisites**: Familiarity with [fairness](../fairness/README.md)
+
+## How it works
+
+At a high level: resource usage data in the cluster is collected and persisted in Prometheus. The scheduler then uses it in its fairness calculations: the more resources a queue has consumed, the fewer over-quota resources it will get compared to other queues. Eventually, the queue's over-quota resources are reclaimed by more starved queues, achieving a fairer allocation of resources over time.
+
+### Resource usage data
+
+Queue historical resource usage data is collected in a Prometheus instance in the cluster *(external Prometheus instances can be used - see [Prometheus configurations](#prometheus-configurations))*. The scheduler configuration determines the time period that is considered, and also allows configuring time decay, which, if enabled, gives more weight to recent usage than to past usage.
+
+The metrics are collected continuously: the pod-group-controller publishes resource usage for individual pod-groups on their status, the queue-controller aggregates these per queue and publishes them as a metric, and Prometheus scrapes and persists that metric.
+
+If configured, the scheduler applies an [exponential time decay](https://en.wikipedia.org/wiki/Exponential_decay) formula, parameterized by a half-life period. An example makes this more intuitive: with a half-life of one hour, a unit of usage (for example, 1 GPU-second) that occurred an hour ago is considered half as significant as a GPU-second consumed just now.
+
+Mathematically, the following formula is applied to historical usage:
+
+$$U = 0.5^{\frac{\Delta{t}}{t_{1/2}}}*A$$
+
+Where:
+
+- $U$ is the decayed usage
+- $t_{1/2}$ is the half-life constant set by the user
+- $\Delta{t}$ is the time elapsed since that usage
+- $A$ is the allocated resource
+
+#### Normalization to cluster capacity
+
+The aggregated usage of each queue is then normalized to the **cluster capacity** over the relevant time period: the scheduler looks at the resources available in the cluster during that period and normalizes all resource usage against it. For example, in a cluster with 10 GPUs, considering a time period of 10 hours, a queue that consumed 24 GPU-hours (whether 8 GPUs for 3 hours or 3 GPUs for 8 hours) gets a normalized usage score of 0.24 (24 GPU-hours used out of a potential 100). This normalization ensures that a small amount of resource usage relative to the cluster size does not result in a heavy penalty.
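+
+To make the arithmetic concrete, here is a small illustrative Python sketch (not the scheduler's code) that applies the decay formula above and then normalizes usage to cluster capacity, reproducing the numbers from the examples in this section:
+
+```python
+def decayed_usage(allocated: float, elapsed_seconds: float, half_life_seconds: float) -> float:
+    """Apply the decay formula above: U = 0.5^(dt / t_half) * A."""
+    return (0.5 ** (elapsed_seconds / half_life_seconds)) * allocated
+
+def normalized_usage(gpu_hours_used: float, cluster_gpus: int, window_hours: float) -> float:
+    """Normalize usage to the cluster's total GPU capacity over the window."""
+    return gpu_hours_used / (cluster_gpus * window_hours)
+
+# A GPU-second consumed one hour ago, with a one-hour half-life, counts as half a GPU-second.
+print(decayed_usage(allocated=1.0, elapsed_seconds=3600, half_life_seconds=3600))  # 0.5
+
+# 24 GPU-hours in a 10-GPU cluster over a 10-hour window -> usage score of 0.24.
+print(normalized_usage(gpu_hours_used=24, cluster_gpus=10, window_hours=10))       # 0.24
+```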
+
+### Effect on fair share
+
+Normally, over-quota resources are assigned to each queue proportionally to its Over Quota Weight. With time-aware fairness, queues with historical usage get relatively fewer over-quota resources. How strongly the historical usage affects this calculation is controlled by a parameter called "kValue": the larger it is, the more impact (or weight) the historical usage has on the calculated fair share, i.e. the more it decreases the fair share of a queue with past usage.
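+
+As a purely illustrative toy model (this is **not** the formula used by the proportion plugin), the sketch below shows the qualitative effect: penalizing each queue's effective over-quota weight by its normalized historical usage gives queues with more past usage a smaller share, and a larger kValue makes the penalty stronger:
+
+```python
+def toy_over_quota_share(weights: dict, usage: dict, k_value: float) -> dict:
+    """Hypothetical illustration only: divide each queue's over-quota weight by a
+    usage-dependent penalty, then split the over-quota pool proportionally."""
+    effective = {q: w / (1.0 + k_value * usage.get(q, 0.0)) for q, w in weights.items()}
+    total = sum(effective.values())
+    return {q: w / total for q, w in effective.items()}
+
+# Two queues with equal Over Quota Weight; queue-a recently used 0.24 of the cluster, queue-b used nothing.
+print(toy_over_quota_share({"queue-a": 1.0, "queue-b": 1.0}, {"queue-a": 0.24}, k_value=0.5))
+print(toy_over_quota_share({"queue-a": 1.0, "queue-b": 1.0}, {"queue-a": 0.24}, k_value=2.0))
+# With k_value=2.0, queue-a's share drops further than with k_value=0.5.
+```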
+
+Check out the [time aware simulator](../../cmd/time-aware-simulator/README.md) to better understand scheduling behavior over time.
+
+### Example
+
+The following plot shows GPU allocation over time in a 16-GPU cluster with two queues, each having a deserved quota of 0 and an Over Quota Weight of 1 for GPUs, and each trying to run 16-GPU, single-pod jobs.
+
+![Time-aware fairness GPU allocation over time](./results.png)
+
+*Time units are intentionally omitted*
+
+## Setup and Configurations
+
+### Quick setup
+
+Enable Prometheus in the KAI operator:
+
+```sh
+kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
+```
+
+It's recommended to wait for the Prometheus pod to be available. Look for it in the `kai-scheduler` namespace:
+
+```sh
+watch kubectl get pod -n kai-scheduler prometheus-prometheus-0
+```
+
+Then configure the scheduler to connect to it by patching the scheduling shard:
+
+```sh
+kubectl patch schedulingshard -n kai-scheduler default --type merge -p '{"spec":{"usageDBConfig":{"clientType":"prometheus"}}}'
+```
+
+The scheduler should now restart and attempt to connect to Prometheus.
+
+### Scheduler configurations
+
+You can further configure the scheduler by editing the scheduling shard:
+
+```sh
+kubectl edit schedulingshard default
+```
+*Replace `default` with the shard name if relevant*
+
+Add the following section under `spec`:
+```yaml
+  usageDBConfig:
+    clientType: prometheus
+    connectionString: http://prometheus-operated.kai-scheduler.svc.cluster.local:9090 # Optional: if not configured, the kai config will populate it
+    usageParams:
+      windowSize: 1w # The time period considered for fairness calculations. One week is the default
+      windowType: sliding # Change to the desired value (sliding/tumbling). Sliding is the default
+      halfLifePeriod: 10m # Leave empty to disable time decay. Off by default
+```
+
+#### kValue
+
+`kValue` is a parameter used by the proportion plugin to determine the impact of historical usage on fairness calculations - higher values mean a more aggressive effect on fairness. To set it, add it to the scheduling shard spec:
+```sh
+kubectl edit schedulingshard default
+```
+
+```yaml
+spec:
+  ... # Other configurations
+  kValue: 0.5
+  usageDBConfig:
+    ... # Other configurations
+```
+
+#### Advanced: overriding metrics
+
+> *This configuration should not be changed under normal conditions*
+
+In some cases, the admin might want to configure the scheduler to query different metrics for the usage and capacity of certain resources. This can be done with the following config:
+
+```sh
+kubectl edit schedulingshard default
+```
+
+```yaml
+  usageDBConfig:
+    extraParams:
+      gpuAllocationMetric: kai_queue_allocated_gpus
+      cpuAllocationMetric: kai_queue_allocated_cpu_cores
+      memoryAllocationMetric: kai_queue_allocated_memory_bytes
+      gpuCapacityMetric: sum(kube_node_status_capacity{resource=\"nvidia_com_gpu\"})
+      cpuCapacityMetric: sum(kube_node_status_capacity{resource=\"cpu\"})
+      memoryCapacityMetric: sum(kube_node_status_capacity{resource=\"memory\"})
+```
+
+### Prometheus configurations
+
+> Using a KAI-operated Prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
+
+To enable Prometheus via the kai-operator, apply the following patch:
+```sh
+kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
+```
+
+You can also customize the following configurations:
+
+```
+ externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults
+ externalPrometheusUrl         # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
+ retentionPeriod               # defines how long to retain data (e.g., "2w", "1d", "30d")
+ sampleInterval                # defines the sampling interval (e.g., "1m", "30s", "5m")
+ serviceMonitor                # defines the ServiceMonitor configuration for KAI services
+ storageClassName              # defines the name of the storageClass that will be used to store the TSDB data. Defaults to "standard"
+ storageSize                   # defines the size of the storage (e.g., "20Gi", "30Gi")
+```
+
+## Troubleshooting
+
+### Dependencies
+
+Before enabling Prometheus in the kai config, make sure that the prometheus operator is installed. If it's not, you will see the following condition in the kai config:
+
+```sh
+kubectl describe config kai-config
+```
+```
+Status:
+  Conditions:
+    ...
+    Last Transition Time:  2025-11-10T11:25:48Z
+    Message:               KAI-prometheus: no matches for kind "Prometheus" in version "monitoring.coreos.com/v1"
+                           KAI-prometheus: not available
+    Observed Generation:   2
+    Reason:                available
+    Status:                False
+    Type:                  Available
+```
+
+Simply follow the [prometheus installation instructions](https://prometheus-operator.dev/docs/getting-started/installation/).
+
+In order to collect cluster capacity metrics, [kube-state-metrics](https://artifacthub.io/packages/helm/prometheus-community/kube-state-metrics/) needs to be installed as well. By default, the kai operator creates a ServiceMonitor for it, assuming it's installed in the `monitoring` or `default` namespace.
+
+### Missing metrics
+
+If the scheduler is unable to collect the usage metrics from Prometheus, you will see a message in the logs, similar to this:
+
+```
+2025-11-10T12:33:07.318Z ERROR usagedb/usagedb.go:142 failed to fetch usage data: error querying nvidia.com/gpu and capacity: error querying cluster capacity metric ((sum(kube_node_status_capacity{resource="nvidia_com_gpu"})) * (0.5^((1762777987 - time()) / 600.000000))): bad_data: invalid parameter "query": 1:124: parse error: unexpected character in duration expression: '&'
+```
+
+In that case, check the following (a quick way to verify both is sketched below):
+
+- Prometheus connectivity
+- Metrics availability
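+
+A hedged example of checking both from your workstation: it assumes you have port-forwarded the Prometheus service (for example `kubectl port-forward -n kai-scheduler svc/prometheus-operated 9090:9090`) and uses the default metric names listed in the "overriding metrics" section above:
+
+```python
+import json
+import urllib.parse
+import urllib.request
+
+PROMETHEUS_URL = "http://localhost:9090"  # assumes a local port-forward to the Prometheus service
+QUERIES = {
+    "queue GPU allocation": "kai_queue_allocated_gpus",
+    "cluster GPU capacity": 'sum(kube_node_status_capacity{resource="nvidia_com_gpu"})',
+}
+
+for name, query in QUERIES.items():
+    # Prometheus HTTP API: GET /api/v1/query?query=<PromQL>
+    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": query})
+    with urllib.request.urlopen(url) as resp:
+        body = json.load(resp)
+    samples = body.get("data", {}).get("result", [])
+    # An empty result means Prometheus is reachable but is not scraping this metric yet.
+    print(f"{name}: status={body.get('status')}, series returned={len(samples)}")
+```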
diff --git a/docs/timeaware/results.png b/docs/timeaware/results.png
new file mode 100644
index 000000000..f11116ee2
Binary files /dev/null and b/docs/timeaware/results.png differ