Commit 63d98c0

Docs fixes
1 parent 4174533 commit 63d98c0

1 file changed

+33
-53
lines changed

docs/timeaware/README.md

Lines changed: 33 additions & 53 deletions
@@ -37,76 +37,52 @@ The following plot demonstrates the GPU allocation over time in a 16 GPU cluster
 
 *Time units are intentionally omitted*
 
-## Configuration
+## Setup and Configurations
 
-> Note: this is not finalized and is expected to change in an upcoming KAI release
+> Note: this section is not finalized and is expected to change in an upcoming KAI release
 
 ### Enabling prometheus
 
+> Using a kai-operated prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
+
 To enable prometheus via kai-operator, apply the following patch:
 ```sh
-kubectl patch config kai-scheduler --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
+kubectl patch config kai-config --type merge -p '{"spec":{"prometheus":{"enabled":true}}}'
 ```
 
 You can also customize the following configurations:
 
 ```
-externalPrometheusHealthProbe <Object>
-ExternalPrometheusPingConfig defines the configuration for external
-Prometheus connectivity validation, with defaults.
-
-externalPrometheusUrl <string>
-ExternalPrometheusUrl defines the URL of an external Prometheus instance to
-use
-When set, KAI will not deploy its own Prometheus but will configure
-ServiceMonitors
-for the external instance and validate connectivity
-
-retentionPeriod <string>
-RetentionPeriod defines how long to retain data (e.g., "2w", "1d", "30d")
-
-sampleInterval <string>
-SampleInterval defines the interval of sampling (e.g., "1m", "30s", "5m")
-
-serviceMonitor <Object>
-ServiceMonitor defines ServiceMonitor configuration for KAI services
-
-storageClassName <string>
-StorageClassName defines the name of the storageClass that will be used to
-store the TSDB data. defaults to "standard".
-
-storageSize <string>
-StorageSize defines the size of the storage (e.g., "20Gi", "30Gi")
+externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults.
+externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
+retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
+sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
+serviceMonitor # defines ServiceMonitor configuration for KAI services
+storageClassName # defines the name of the storageClass that will be used to store the TSDB data. defaults to "standard".
+storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")
 ```
 
-Alternatively, you can use your own prometheus. Make sure that it's configured to collect metrics from the queue controller via a service monitor. For example:
-
-```yaml
+If you choose to use your own prometheus, make sure that it's configured to watch the relevant service monitors with `accounting: kai` labels. For example:
+``` yaml
 apiVersion: monitoring.coreos.com/v1
-kind: ServiceMonitor
+kind: Prometheus
 metadata:
-  labels:
-    app: queuecontroller
-  name: queuecontroller
-  namespace: kai-scheduler
+  name: external-prometheus
+  namespace: other-namespace
 spec:
-  endpoints:
-  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
-    interval: 30s
-    port: metrics
-    scrapeTimeout: 10s
-  jobLabel: queuecontroller
-  namespaceSelector:
-    matchNames:
-    - kai-scheduler
-  selector:
+  ... # Other prometheus configurations..
+  serviceMonitorSelector:
     matchLabels:
-      app: queuecontroller
+      accounting: kai
+  ...
 ```
 
-### Usage Database
+### Scheduler configurations
+
+In order to use time-aware fairness, you need to configure the scheduler to connect to prometheus. If using more than one scheduling shards in the cluster, each shard can be configured independently.
+
+To edit the default scheduling shard:
 
-To configure the scheduler to connect to prometheus, the usageDBConfig section of the scheduling shard needs to be edited:
 ```sh
 kubectl edit schedulingshard default
 ```
@@ -118,9 +94,9 @@ Add the following section under `spec`:
     clientType: prometheus
     connectionString: http://prometheus-operated.kai-scheduler.svc.cluster.local:9090
     usageParams:
-      halfLifePeriod: 10m
-      windowSize: 10m
-      windowType: sliding
+      halfLifePeriod: 10m # Change to the desired value
+      windowSize: 10m # Change to the desired value
+      windowType: sliding # Change to the desired value (sliding/tumbling)
 ```
 *This configuration assumes using the kai operated prometheus. Change connectionString if relevant.*
 
@@ -165,5 +141,9 @@ kubectl edit schedulingshard default
 
 ## Troubleshooting
 
+### Dependencies
+
+If trying
+
 Prometheus connectivity
 Metrics availability
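
For readers of this change, the customizable Prometheus options listed in the updated README might be combined as in the sketch below. Only the field names and the example value formats come from the README text. The placement of these fields next to `enabled` under `spec.prometheus` is an assumption inferred from the `kubectl patch` command in the diff, and the specific values are hypothetical.

```yaml
# Illustrative only: the values are hypothetical, and the nesting under
# spec.prometheus is assumed from the patch command shown in the README.
spec:
  prometheus:
    enabled: true
    retentionPeriod: 2w          # how long to retain data
    sampleInterval: 1m           # interval of sampling
    storageClassName: standard   # the README states this defaults to "standard"
    storageSize: 20Gi            # size of the TSDB storage
```

Because this section of the README is marked as not finalized, check the field names against the installed KAI release before applying anything like this.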