docs/timeaware/README.md
Lines changed: 42 additions & 43 deletions
@@ -52,49 +52,31 @@ The following plot demonstrates the GPU allocation over time in a 16 GPU cluster
52
52
53
53
## Setup and Configurations
54
54
55
-
> Note: this section is not finalized and is expected to change in an upcoming KAI release
55
+
### Quick setup
56
56
57
-
### Enabling prometheus
57
+
Enable prometheus in the KAI operator:
58
58
59
-
> Using a kai-operated prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
60
-
61
-
To enable prometheus via kai-operator, apply the following patch:
You can also customize the following configurations:
63
+
It's recommended to wait for the prometheus pod to be available. Look for it in the `kai-scheduler` namespace:
67
64
68
-
```
69
-
externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults.
70
-
externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
71
-
retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
72
-
sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
73
-
serviceMonitor # defines ServiceMonitor configuration for KAI services
74
-
storageClassName # defines the name of the storageClass that will be used to store the TSDB data. defaults to "standard".
75
-
storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")
65
+
```sh
66
+
watch kubectl get pod -n kai-scheduler prometheus-prometheus-0
76
67
```
77
68
78
-
If you choose to use your own prometheus, make sure that it's configured to watch the relevant service monitors with `accounting: kai` labels. For example:
79
-
```yaml
80
-
apiVersion: monitoring.coreos.com/v1
81
-
kind: Prometheus
82
-
metadata:
83
-
name: external-prometheus
84
-
namespace: other-namespace
85
-
spec:
86
-
... # Other prometheus configurations..
87
-
serviceMonitorSelector:
88
-
matchLabels:
89
-
accounting: kai
90
-
...
69
+
And configure the scheduler to connect to it by patching the scheduling shard:
The scheduler should now restart and attempt to connect to prometheus.
94
76
95
-
In order to use time-aware fairness, you need to configure the scheduler to connect to prometheus. If using more than one scheduling shards in the cluster, each shard can be configured independently.
77
+
### Scheduler configurations
96
78
97
-
To edit the default scheduling shard:
79
+
You can further configure the scheduler by editing the scheduling shard:
98
80
99
81
```sh
100
82
kubectl edit schedulingshard default
@@ -105,21 +87,14 @@ Add the following section under `spec`:
connectionString: http://prometheus-operated.kai-scheduler.svc.cluster.local:9090 # Optional: if not configured, the kai config will populate it
109
91
usageParams:
110
-
halfLifePeriod: 10m #Change to the desired value
111
-
windowSize: 10m # Change to the desired value
112
-
windowType: sliding #Change to the desired value (sliding/tumbling)
92
+
windowSize: 1w #The time period considered for fairness calculations. One week is the default
93
+
windowType: sliding # Change to the desired value (sliding/tumbling). Sliding is the default
94
+
halfLifePeriod: 10m #Leave empty to not use time decay
113
95
```
114
-
*This configuration assumes using the kai operated prometheus. Change connectionString if relevant.*
115
-
116
-
Configure windowSize and halfLifePeriod to desired values.
117
-
118
-
### External prometheus
119
-
120
-
You can configure kai-scheduler to connect to any external DB that's compatible with the prometheus API - simply edit the connectionString accordingly. Note that it has to be accessible from the scheduler pod, and have access to queue controller and kube-state metrics.
121
96
122
-
### kValue
97
+
#### kValue
123
98
124
99
KValue is a parameter used by the proportion plugin to determine the significance of historical usage in fairness calculations - higher values mean more aggressive effects on fairness. To set it, add it to the scheduling shard spec:
> Using a kai-operated prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
136
+
137
+
To enable prometheus via kai-operator, apply the following patch:
You can also customize the following configurations:
143
+
144
+
```
145
+
externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults.
146
+
externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
147
+
retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
148
+
sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
149
+
serviceMonitor # defines ServiceMonitor configuration for KAI services
150
+
storageClassName # defines the name of the storageClass that will be used to store the TSDB data. defaults to "standard".
151
+
storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")
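
The options listed above could be combined into a single fragment along the following lines. This is a hedged sketch only: the field names and sample values are taken from the option list, but nesting them under a `spec.prometheus` key of the KAI config is an assumption that this diff does not confirm, so check it against the installed CRD before applying anything.

```yaml
# Illustrative fragment only. The nesting under spec.prometheus is an
# assumption; the field names and sample values come from the option list.
spec:
  prometheus:
    retentionPeriod: 2w         # retain two weeks of data
    sampleInterval: 1m          # sample once per minute
    storageClassName: standard  # the stated default
    storageSize: 20Gi           # size of the TSDB volume
```

Per the description of `externalPrometheusUrl`, setting that field instead would make KAI skip deploying its own Prometheus and only configure ServiceMonitors for the external instance.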