docs/timeaware/README.md
Lines changed: 44 additions & 43 deletions
@@ -52,49 +52,33 @@ The following plot demonstrates the GPU allocation over time in a 16 GPU cluster

 ## Setup and Configurations

-> Note: this section is not finalized and is expected to change in an upcoming KAI release
+### Quick setup

-### Enabling prometheus
+Enable prometheus in KAI operator:

-> Using a kai-operated prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
-
-To enable prometheus via kai-operator, apply the following patch:
You can also customize the following configurations:
+It's recommended to wait for the prometheus pod to be available. Look for it in `kai-scheduler` namespace:

-```
-externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults.
-externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
-retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
-sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
-serviceMonitor # defines ServiceMonitor configuration for KAI services
-storageClassName # defines the name of the storageClass that will be used to store the TSDB data. defaults to "standard".
-storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")
+```sh
+watch kubectl get pod -n kai-scheduler prometheus-prometheus-0
 ```

-If you choose to use your own prometheus, make sure that it's configured to watch the relevant service monitors with `accounting: kai` labels. For example:
-```yaml
-apiVersion: monitoring.coreos.com/v1
-kind: Prometheus
-metadata:
-  name: external-prometheus
-  namespace: other-namespace
-spec:
-  ... # Other prometheus configurations..
-  serviceMonitorSelector:
-    matchLabels:
-      accounting: kai
-  ...
+And configure the scheduler to connect to it by patching the scheduling shard:
The scheduler should now restart and attempt to connect to prometheus.

-In order to use time-aware fairness, you need to configure the scheduler to connect to prometheus. If using more than one scheduling shards in the cluster, each shard can be configured independently.
+Continue reading for more configuration options.

-To edit the default scheduling shard:
+### Scheduler configurations
+
+You can further configure the scheduler by editing the scheduling shard:

 ```sh
 kubectl edit schedulingshard default
@@ -105,21 +89,14 @@ Add the following section under `spec`:
 connectionString: http://prometheus-operated.kai-scheduler.svc.cluster.local:9090 # Optional: if not configured, the kai config will populate it
 usageParams:
-  halfLifePeriod: 10m # Change to the desired value
-  windowSize: 10m # Change to the desired value
-  windowType: sliding # Change to the desired value (sliding/tumbling)
+  windowSize: 1w # The time period considered for fairness calculations. One week is the default
+  windowType: sliding # Change to the desired value (sliding/tumbling). Sliding is the default
+  halfLifePeriod: 10m # Leave empty to not use time decay
 ```
-*This configuration assumes using the kai operated prometheus. Change connectionString if relevant.*
-
-Configure windowSize and halfLifePeriod to desired values.
-
-### External prometheus
-
-You can configure kai-scheduler to connect to any external DB that's compatible with the prometheus API - simply edit the connectionString accordingly. Note that it has to be accessible from the scheduler pod, and have access to queue controller and kube-state metrics.

-### kValue
+#### kValue

 KValue is a parameter used by the proportion plugin to determine the significance of historical usage in fairness calculations - higher values mean more aggressive effects on fairness. To set it, add it to the scheduling shard spec:
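
To illustrate how `windowSize`, `windowType`, and `halfLifePeriod` from the `usageParams` block might interact, here is a minimal Python sketch. It is not KAI's implementation. It assumes single-unit Prometheus-style durations, a sliding window that always covers the last `windowSize`, a tumbling window aligned to multiples of `windowSize` since the Unix epoch, and standard exponential half-life decay. The function names `parse_duration`, `window_start`, and `decay_weight` are hypothetical.

```python
import re
from typing import Optional

# Seconds per unit for single-unit duration strings such as "10m" or "1w".
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(text: str) -> int:
    """Convert a single-unit duration like "10m" or "1w" to seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def window_start(now: float, window_size: str, window_type: str = "sliding") -> float:
    """Return the assumed start (in epoch seconds) of the usage window ending at `now`."""
    size = parse_duration(window_size)
    if window_type == "sliding":
        # Assumption: the window always covers the most recent `size` seconds.
        return now - size
    if window_type == "tumbling":
        # Assumption: windows are aligned to multiples of `size` since the epoch.
        return now - (now % size)
    raise ValueError(f"unknown window type: {window_type!r}")


def decay_weight(age_seconds: float, half_life: Optional[str]) -> float:
    """Weight of a usage sample that is `age_seconds` old.

    An empty or missing half-life means no time decay, so every sample
    inside the window counts fully.
    """
    if not half_life:
        return 1.0
    return 0.5 ** (age_seconds / parse_duration(half_life))
```

Under these assumptions, a usage sample that is 10 minutes old with `halfLifePeriod: 10m` counts half as much as a fresh one, while leaving `halfLifePeriod` empty weights all samples in the window equally.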
> Using a kai-operated prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
+
+To enable prometheus via kai-operator, apply the following patch:
You can also customize the following configurations:
+
+```
+externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults.
+externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
+retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
+sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
+serviceMonitor # defines ServiceMonitor configuration for KAI services
+storageClassName # defines the name of the storageClass that will be used to store the TSDB data. defaults to "standard".
+storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")