Commit c2bc3cd

Fix docs
1 parent ae5ea34 commit c2bc3cd

1 file changed

+44
-43
lines changed

docs/timeaware/README.md

Lines changed: 44 additions & 43 deletions
@@ -52,49 +52,33 @@ The following plot demonstrates the GPU allocation over time in a 16 GPU cluster
5252

5353
## Setup and Configurations
5454

55-
> Note: this section is not finalized and is expected to change in an upcoming KAI release
55+
### Quick setup
5656

57-
### Enabling prometheus
57+
Enable prometheus in KAI operator:
5858

59-
> Using a kai-operated prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
60-
61-
To enable prometheus via kai-operator, apply the following patch:
6259
```sh
6360
kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
6461
```
6562

66-
You can also customize the following configurations:
63+
Wait for the prometheus pod to become available before continuing. Look for it in the `kai-scheduler` namespace:
6764

68-
```
69-
externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults.
70-
externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
71-
retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
72-
sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
73-
serviceMonitor # defines ServiceMonitor configuration for KAI services
74-
storageClassName # defines the name of the storageClass that will be used to store the TSDB data. defaults to "standard".
75-
storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")
65+
```sh
66+
watch kubectl get pod -n kai-scheduler prometheus-prometheus-0
7667
```
7768

78-
If you choose to use your own prometheus, make sure that it's configured to watch the relevant service monitors with `accounting: kai` labels. For example:
79-
``` yaml
80-
apiVersion: monitoring.coreos.com/v1
81-
kind: Prometheus
82-
metadata:
83-
name: external-prometheus
84-
namespace: other-namespace
85-
spec:
86-
... # Other prometheus configurations..
87-
serviceMonitorSelector:
88-
matchLabels:
89-
accounting: kai
90-
...
69+
Then configure the scheduler to connect to it by patching the scheduling shard:
70+
71+
```sh
72+
kubectl patch schedulingshard -n kai-scheduler default --type merge -p '{"spec":{"usageDBConfig":{"clientType":"prometheus"}}}'
9173
```
9274

93-
### Scheduler configurations
75+
The scheduler should now restart and attempt to connect to prometheus.
9476

95-
In order to use time-aware fairness, you need to configure the scheduler to connect to prometheus. If using more than one scheduling shards in the cluster, each shard can be configured independently.
77+
Continue reading for more configuration options.
9678

97-
To edit the default scheduling shard:
79+
### Scheduler configurations
80+
81+
You can further configure the scheduler by editing the scheduling shard:
9882

9983
```sh
10084
kubectl edit schedulingshard default
@@ -105,21 +89,14 @@ Add the following section under `spec`:
10589
```yaml
10690
usageDBConfig:
10791
clientType: prometheus
108-
connectionString: http://prometheus-operated.kai-scheduler.svc.cluster.local:9090
92+
connectionString: http://prometheus-operated.kai-scheduler.svc.cluster.local:9090 # Optional: if not configured, the kai config will populate it
10993
usageParams:
110-
halfLifePeriod: 10m # Change to the desired value
111-
windowSize: 10m # Change to the desired value
112-
windowType: sliding # Change to the desired value (sliding/tumbling)
94+
windowSize: 1w # The time period considered for fairness calculations. One week is the default
95+
windowType: sliding # Change to the desired value (sliding/tumbling). Sliding is the default
96+
halfLifePeriod: 10m # Leave empty to not use time decay
11397
```
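
To get a feel for `halfLifePeriod`, the sketch below assumes plain exponential decay, where a usage sample loses half of its weight every half-life. This is an assumption for illustration only: the README does not state the exact formula the scheduler applies, and `decay_weight` is a hypothetical helper, not part of KAI.

```sh
# Hypothetical helper, not part of KAI: weight given to a usage sample of a
# given age, ASSUMING plain exponential decay with the configured half-life.
# Both arguments are in minutes.
decay_weight() {
  awk -v age="$1" -v half_life="$2" \
    'BEGIN { printf "%.4f\n", 0.5 ^ (age / half_life) }'
}

decay_weight 0 10    # sample from right now    -> 1.0000
decay_weight 10 10   # one half-life old        -> 0.5000
decay_weight 30 10   # three half-lives old     -> 0.1250
```

Under this assumption, with `halfLifePeriod: 10m`, usage from half an hour ago would count for about an eighth of current usage, so a short half-life makes the scheduler forget history quickly even when `windowSize` is long.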
114-
*This configuration assumes using the kai operated prometheus. Change connectionString if relevant.*
115-
116-
Configure windowSize and halfLifePeriod to desired values.
117-
118-
### External prometheus
119-
120-
You can configure kai-scheduler to connect to any external DB that's compatible with the prometheus API - simply edit the connectionString accordingly. Note that it has to be accessible from the scheduler pod, and have access to queue controller and kube-state metrics.
12198
122-
### kValue
99+
#### kValue
123100
124101
KValue is a parameter used by the proportion plugin to determine the significance of historical usage in fairness calculations - higher values mean more aggressive effects on fairness. To set it, add it to the scheduling shard spec:
125102
```sh
@@ -128,10 +105,13 @@ kubectl edit schedulingshard default
128105

129106
```yaml
130107
spec:
108+
... # Other configurations
131109
kValue: 0.5
110+
usageDBConfig:
111+
... # Other configurations
132112
```
133113
134-
### Advanced: overriding metrics
114+
#### Advanced: overriding metrics
135115
136116
> *This configuration should not be changed under normal conditions*
137117
@@ -152,6 +132,27 @@ kubectl edit schedulingshard default
152132
memoryCapacityMetric: sum(kube_node_status_capacity{resource=\"memory\"})
153133
```
154134
135+
### Prometheus configurations
136+
137+
> Using a kai-operated prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
138+
139+
To enable prometheus via kai-operator, apply the following patch:
140+
```sh
141+
kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
142+
```
143+
144+
You can also customize the following configurations:
145+
146+
```
147+
externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults.
148+
externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
149+
retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
150+
sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
151+
serviceMonitor # defines ServiceMonitor configuration for KAI services
152+
storageClassName # defines the name of the storageClass that will be used to store the TSDB data. defaults to "standard".
153+
storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")
154+
```
155+
155156
## Troubleshooting
156157

157158
### Dependencies
