Commit a03253d

Fix docs
1 parent 3129f1d commit a03253d

1 file changed

+42
-43
lines changed

docs/timeaware/README.md

Lines changed: 42 additions & 43 deletions
@@ -52,49 +52,31 @@ The following plot demonstrates the GPU allocation over time in a 16 GPU cluster
5252

5353
## Setup and Configurations
5454

55-
> Note: this section is not finalized and is expected to change in an upcoming KAI release
55+
### Quick setup
5656

57-
### Enabling prometheus
57+
Enable Prometheus in the KAI operator:
5858

59-
> Using a kai-operated prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
60-
61-
To enable prometheus via kai-operator, apply the following patch:
6259
```sh
6360
kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
6461
```
6562

66-
You can also customize the following configurations:
63+
Wait for the Prometheus pod to become available before continuing. It runs in the `kai-scheduler` namespace:
6764

68-
```
69-
externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults.
70-
externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
71-
retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
72-
sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
73-
serviceMonitor # defines ServiceMonitor configuration for KAI services
74-
storageClassName # defines the name of the storageClass that will be used to store the TSDB data. defaults to "standard".
75-
storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")
65+
```sh
66+
watch kubectl get pod -n kai-scheduler prometheus-prometheus-0
7667
```
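For scripted setups, `kubectl wait` can block until the pod reports Ready instead of watching by hand. The following is a minimal sketch, assuming the pod name and namespace shown above; the `wait_for_prometheus` helper name and the 300s default timeout are illustrative choices, not part of KAI:

```sh
# Hypothetical helper: block until the KAI-operated Prometheus pod is Ready.
# The default timeout (300s) is an arbitrary assumption; tune it for your cluster.
wait_for_prometheus() {
  wait_timeout="${1:-300s}"
  kubectl wait pod/prometheus-prometheus-0 \
    --namespace kai-scheduler \
    --for=condition=Ready \
    --timeout="$wait_timeout"
}
```

Call it as `wait_for_prometheus` or `wait_for_prometheus 10m`. Note that `kubectl wait` returns an error if the pod has not been created yet, so a retry may be needed right after enabling Prometheus.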
7768

78-
If you choose to use your own prometheus, make sure that it's configured to watch the relevant service monitors with `accounting: kai` labels. For example:
79-
``` yaml
80-
apiVersion: monitoring.coreos.com/v1
81-
kind: Prometheus
82-
metadata:
83-
name: external-prometheus
84-
namespace: other-namespace
85-
spec:
86-
... # Other prometheus configurations..
87-
serviceMonitorSelector:
88-
matchLabels:
89-
accounting: kai
90-
...
69+
Then configure the scheduler to connect to it by patching the scheduling shard:
70+
71+
```sh
72+
kubectl patch schedulingshard -n kai-scheduler default --type merge -p '{"spec":{"usageDBConfig":{"clientType":"prometheus"}}}'
9173
```
9274

93-
### Scheduler configurations
75+
The scheduler should now restart and attempt to connect to prometheus.
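To confirm that the patch landed, the shard's configured client type can be read back. This is a sketch under the assumption that the shard is addressed in the `kai-scheduler` namespace, as in the patch command; the `shard_client_type` helper name is hypothetical:

```sh
# Hypothetical helper: print the usage DB client type configured on a shard.
# Assumes the shard is addressed in the kai-scheduler namespace, as in the patch command.
shard_client_type() {
  shard="${1:-default}"
  kubectl get schedulingshard "$shard" \
    --namespace kai-scheduler \
    --output jsonpath='{.spec.usageDBConfig.clientType}'
}
```

After the patch, `shard_client_type` should print `prometheus`; empty output suggests the field was not set on that shard.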
9476

95-
In order to use time-aware fairness, you need to configure the scheduler to connect to prometheus. If using more than one scheduling shards in the cluster, each shard can be configured independently.
77+
### Scheduler configurations
9678

97-
To edit the default scheduling shard:
79+
You can further configure the scheduler by editing the scheduling shard:
9880

9981
```sh
10082
kubectl edit schedulingshard default
@@ -105,21 +87,14 @@ Add the following section under `spec`:
10587
```yaml
10688
usageDBConfig:
10789
  clientType: prometheus
108-
  connectionString: http://prometheus-operated.kai-scheduler.svc.cluster.local:9090
90+
  connectionString: http://prometheus-operated.kai-scheduler.svc.cluster.local:9090 # Optional: if not configured, the kai config will populate it
10991
  usageParams:
110-
    halfLifePeriod: 10m # Change to the desired value
111-
    windowSize: 10m # Change to the desired value
112-
    windowType: sliding # Change to the desired value (sliding/tumbling)
92+
    windowSize: 1w # The time period considered for fairness calculations. One week is the default
93+
    windowType: sliding # Change to the desired value (sliding/tumbling). Sliding is the default
94+
    halfLifePeriod: 10m # Leave empty to not use time decay
11395
```
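The same settings can be applied without an interactive edit by using a merge patch. This is a sketch only: it assumes `usageParams` nests under `spec.usageDBConfig` and that the shard is addressed in the `kai-scheduler` namespace; the `set_usage_params` helper name is hypothetical:

```sh
# Hypothetical helper: set usage window parameters on the default scheduling shard.
# Assumes usageParams nests under spec.usageDBConfig.
set_usage_params() {
  window_size="$1"   # e.g. 1w
  window_type="$2"   # sliding or tumbling
  half_life="$3"     # e.g. 10m
  # Build the JSON merge patch from the three arguments.
  patch=$(printf '{"spec":{"usageDBConfig":{"usageParams":{"windowSize":"%s","windowType":"%s","halfLifePeriod":"%s"}}}}' \
    "$window_size" "$window_type" "$half_life")
  kubectl patch schedulingshard default --namespace kai-scheduler --type merge -p "$patch"
}
```

For example, `set_usage_params 1w sliding 10m` sends the values from the YAML snippet. A merge patch only touches the listed fields, so `clientType` and `connectionString` stay as they are.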
114-
*This configuration assumes using the kai operated prometheus. Change connectionString if relevant.*
115-
116-
Configure windowSize and halfLifePeriod to desired values.
117-
118-
### External prometheus
119-
120-
You can configure kai-scheduler to connect to any external DB that's compatible with the prometheus API - simply edit the connectionString accordingly. Note that it has to be accessible from the scheduler pod, and have access to queue controller and kube-state metrics.
12196
122-
### kValue
97+
#### kValue
12398
12499
KValue is a parameter used by the proportion plugin to determine the significance of historical usage in fairness calculations - higher values mean more aggressive effects on fairness. To set it, add it to the scheduling shard spec:
125100
```sh
@@ -128,10 +103,13 @@ kubectl edit schedulingshard default
128103

129104
```yaml
130105
spec:
106+
  ... # Other configurations
131107
  kValue: 0.5
108+
  usageDBConfig:
109+
    ... # Other configurations
132110
```
133111
134-
### Advanced: overriding metrics
112+
#### Advanced: overriding metrics
135113
136114
> *This configuration should not be changed under normal conditions*
137115
@@ -152,6 +130,27 @@ kubectl edit schedulingshard default
152130
memoryCapacityMetric: sum(kube_node_status_capacity{resource=\"memory\"})
153131
```
154132
133+
### Prometheus configurations
134+
135+
> Using a kai-operated prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
136+
137+
To enable prometheus via kai-operator, apply the following patch:
138+
```sh
139+
kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
140+
```
141+
142+
You can also customize the following configurations:
143+
144+
```
145+
externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults.
146+
externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
147+
retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
148+
sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
149+
serviceMonitor # defines ServiceMonitor configuration for KAI services
150+
storageClassName # defines the name of the storageClass that will be used to store the TSDB data. defaults to "standard".
151+
storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")
152+
```
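As an illustration of setting these fields, the enabling patch can carry them alongside `enabled`. This sketch assumes the listed fields sit under `spec.prometheus` in `kai-config`; the `configure_kai_prometheus` helper name is hypothetical, and the example values are taken from the field descriptions:

```sh
# Hypothetical helper: enable the KAI-operated Prometheus with custom retention and sampling.
# Assumes retentionPeriod and sampleInterval sit under spec.prometheus, next to `enabled`.
configure_kai_prometheus() {
  retention="$1"   # e.g. 2w
  interval="$2"    # e.g. 1m
  # Build the JSON merge patch from the two arguments.
  patch=$(printf '{"spec":{"prometheus":{"enabled":true,"retentionPeriod":"%s","sampleInterval":"%s"}}}' \
    "$retention" "$interval")
  kubectl patch config kai-config --type merge -p "$patch"
}
```

For example, `configure_kai_prometheus 2w 1m` keeps two weeks of data sampled every minute.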
153+
155154
## Troubleshooting
156155

157156
### Dependencies
