# Time Aware fairness

Time aware fairness is a feature in KAI-Scheduler that uses queues' historical resource usage when making allocation and reclaim decisions. Key features are:

1. All else being equal, queues with higher past usage will get to run jobs after queues with lower usage
2. Reclaim based on usage: queues that are starved over time will reclaim resources from queues that have used a lot of resources.
   1. Note: this does not affect in-quota allocation: deserved quota still takes precedence over time-aware fairness


> **Prerequisites**: Familiarity with [fairness](../fairness/README.md)

## How it works

At a high level: resource usage data in the cluster is collected and persisted in Prometheus. The scheduler then uses it in fairness calculations: the more resources a queue has consumed, the fewer over-quota resources it will get compared to other queues. Over time, this causes over-quota resources to be reclaimed from heavy users by more starved queues, resulting in a fairer allocation of resources.

### Resource usage data

Queue historical resource usage data is collected in a Prometheus instance in the cluster *(external Prometheus instances can be used - see [External prometheus](#external-prometheus))*. The scheduler configuration determines the time period that is considered, and optionally a time decay that gives more weight to recent usage than to older usage.

The metrics are collected continuously: the pod-group-controller publishes resource usage for individual pod-groups on their status; the queue-controller aggregates these per queue and publishes them as metrics, which Prometheus scrapes and persists.

If configured, the scheduler applies an [exponential time decay](https://en.wikipedia.org/wiki/Exponential_decay) formula controlled by a half-life period. An example makes this intuitive: with a half-life of one hour, a unit of usage (for example, 1 GPU-second) that occurred an hour ago is considered half as significant as a GPU-second consumed just now.

Mathematically, the following formula is applied to historical usage:

$$U = 0.5^{\frac{\Delta t}{t_{1/2}}} \cdot A$$

Where:

- $U$ is the decayed usage
- $t_{1/2}$ is the half-life constant set by the user
- $\Delta t$ is the time elapsed since that usage
- $A$ is the allocated resource
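
As a worked example of the formula above: with a half-life of one hour, one GPU that was allocated two hours ago contributes

$$U = 0.5^{\frac{2\,h}{1\,h}} \cdot 1\ \text{GPU} = 0.25\ \text{GPU}$$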

#### Normalization to cluster capacity

The aggregated usage for each queue is then normalized to the **cluster capacity** over the relevant time period: the scheduler looks at the resources available in the cluster during that period and normalizes all resource usage against them. For example, in a cluster with 10 GPUs and a time period of 10 hours, a queue that consumed 24 GPU hours (whether it's 8 GPUs for 3 hours or 12 GPUs for 2 hours) will get a normalized usage score of 0.24 (24 GPU hours used out of a potential 100). This normalization ensures that a small amount of resource usage relative to the cluster size does not result in a heavy penalty.
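
Written out for the example above (ignoring time decay for simplicity), the normalized score is the consumed GPU hours divided by the cluster's GPU capacity over the window:

$$U_{norm} = \frac{24\ \text{GPU·h}}{10\ \text{GPUs} \times 10\ \text{h}} = 0.24$$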

### Effect on fair share

Normally, over-quota resources are assigned to each queue in proportion to its Over Quota Weight. With time-aware fairness, queues with historical usage get relatively fewer over-quota resources. How much the historical usage influences this calculation is controlled by a parameter called `kValue`: the larger it is, the more impact (or weight) the historical usage has on the calculated fair share, i.e. it will decrease the fair share of that queue.

Check out the [time aware simulator](../../cmd/time-aware-simulator/README.md) to better understand scheduling behavior over time.

### Example

The following plot demonstrates GPU allocation over time in a 16-GPU cluster with two queues, each having 0 deserved quota and an Over Quota Weight of 1 for GPUs, and each trying to run 16-GPU, single-pod jobs.



*Time units are intentionally omitted*

## Setup and Configurations

### Quick setup

Enable Prometheus in the KAI operator:

```sh
kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
```

It's recommended to wait for the Prometheus pod to be available. Look for it in the `kai-scheduler` namespace:

```sh
watch kubectl get pod -n kai-scheduler prometheus-prometheus-0
```

Then configure the scheduler to connect to it by patching the scheduling shard:

```sh
kubectl patch schedulingshard -n kai-scheduler default --type merge -p '{"spec":{"usageDBConfig":{"clientType":"prometheus"}}}'
```

The scheduler should now restart and attempt to connect to Prometheus.
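
To verify, you can watch the scheduler pod restart and then look for Prometheus-related messages in its logs. The pod name below is a placeholder; substitute the actual scheduler pod in your cluster:

```sh
# Watch the scheduler pod get recreated after the shard patch
kubectl get pods -n kai-scheduler --watch

# Look for Prometheus connection messages in the scheduler logs (replace the placeholder)
kubectl logs -n kai-scheduler <scheduler-pod-name> | grep -i prometheus
```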

### Scheduler configurations

You can further configure the scheduler by editing the scheduling shard:

```sh
kubectl edit schedulingshard default
```
*Replace `default` with the shard name if relevant*

Add the following section under `spec`:
```yaml
  usageDBConfig:
    clientType: prometheus
    connectionString: http://prometheus-operated.kai-scheduler.svc.cluster.local:9090 # Optional: if not configured, the kai config will populate it
    usageParams:
      windowSize: 1w # The time period considered for fairness calculations. One week is the default
      windowType: sliding # Change to the desired value (sliding/tumbling). Sliding is the default
      halfLifePeriod: 10m # Leave empty to not use time decay. Off by default
```
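
Alternatively, the same settings can be applied non-interactively with a merge patch. This is only a sketch using the fields shown above; adjust the values and the shard name to your environment:

```sh
kubectl patch schedulingshard default --type merge -p '{
  "spec": {
    "usageDBConfig": {
      "clientType": "prometheus",
      "usageParams": {
        "windowSize": "1w",
        "windowType": "sliding",
        "halfLifePeriod": "10m"
      }
    }
  }
}'
```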

#### kValue

`kValue` is a parameter used by the proportion plugin to determine the impact of historical usage on fairness calculations: higher values mean more aggressive effects on fairness. To set it, add it to the scheduling shard spec:
```sh
kubectl edit schedulingshard default
```

```yaml
spec:
  ... # Other configurations
  kValue: 0.5
  usageDBConfig:
    ... # Other configurations
```

#### Advanced: overriding metrics

> *This configuration should not be changed under normal conditions*

In some cases, the admin might want to configure the scheduler to query different metrics for usage and capacity of certain resources. This can be done with the following config:

```sh
kubectl edit schedulingshard default
```

```yaml
  usageDBConfig:
    extraParams:
      gpuAllocationMetric: kai_queue_allocated_gpus
      cpuAllocationMetric: kai_queue_allocated_cpu_cores
      memoryAllocationMetric: kai_queue_allocated_memory_bytes
      gpuCapacityMetric: sum(kube_node_status_capacity{resource="nvidia_com_gpu"})
      cpuCapacityMetric: sum(kube_node_status_capacity{resource="cpu"})
      memoryCapacityMetric: sum(kube_node_status_capacity{resource="memory"})
```

### Prometheus configurations

> Using a KAI-operated Prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster

To enable Prometheus via the kai-operator, apply the following patch:
```sh
kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
```

You can also customize the following configurations:

```
  externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults
  externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
  retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
  sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
  serviceMonitor # defines ServiceMonitor configuration for KAI services
  storageClassName # defines the name of the storageClass that will be used to store the TSDB data. Defaults to "standard"
  storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")
```
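
For illustration, a customized setup might look like the following. Field names are taken from the list above, their placement under `spec.prometheus` follows the enable patch shown earlier, and all values are examples only:

```yaml
spec:
  prometheus:
    enabled: true
    retentionPeriod: 2w
    sampleInterval: 1m
    storageClassName: standard
    storageSize: 20Gi
    # To use an existing Prometheus instead of a KAI-operated one (URL is a placeholder):
    # externalPrometheusUrl: http://prometheus.monitoring.svc.cluster.local:9090
```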

## Troubleshooting

### Dependencies

Before enabling Prometheus in the kai config, make sure that the Prometheus operator is installed. If it's not, you will see the following condition in the kai config:

```sh
kubectl describe config kai-config
```
```
Status:
  Conditions:
    ...
    Last Transition Time:  2025-11-10T11:25:48Z
    Message:               KAI-prometheus: no matches for kind "Prometheus" in version "monitoring.coreos.com/v1"
KAI-prometheus: not available
    Observed Generation:   2
    Reason:                available
    Status:                False
    Type:                  Available
```


Simply follow the [prometheus operator installation instructions](https://prometheus-operator.dev/docs/getting-started/installation/).

In order to collect cluster capacity metrics, [kube-state-metrics](https://artifacthub.io/packages/helm/prometheus-community/kube-state-metrics/) needs to be installed as well. By default, the kai operator creates a ServiceMonitor for it, assuming it's installed in the `monitoring` or `default` namespace.
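
To check that the expected ServiceMonitors exist (names and namespaces vary by installation):

```sh
# List all ServiceMonitors in the cluster and look for the KAI and kube-state-metrics entries
kubectl get servicemonitors -A
```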

### Missing metrics

If the scheduler is unable to collect the usage metrics from Prometheus, you will see a message similar to this in the scheduler logs:

```
2025-11-10T12:33:07.318Z ERROR usagedb/usagedb.go:142 failed to fetch usage data: error querying nvidia.com/gpu and capacity: error querying cluster capacity metric ((sum(kube_node_status_capacity{resource="nvidia_com_gpu"})) * (0.5^((1762777987 - time()) / 600.000000))): bad_data: invalid parameter "query": 1:124: parse error: unexpected character in duration expression: '&'
```

In this case, check the following:

- Prometheus connectivity: the scheduler can reach the address configured in `connectionString`
- Metrics availability: the usage and capacity metrics exist in Prometheus (see the example query below)
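
For example, to check metrics availability you can query Prometheus directly. The service name below is taken from the connection string shown earlier and the metric name from the default metrics listed above; adjust both if your setup differs:

```sh
# Forward the in-cluster Prometheus to localhost
kubectl port-forward -n kai-scheduler svc/prometheus-operated 9090 &

# Query one of the usage metrics; an empty result means it is not being collected
curl -s 'http://localhost:9090/api/v1/query?query=kai_queue_allocated_gpus'
```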