docs/timeaware/README.md
33 additions & 53 deletions
@@ -37,76 +37,52 @@ The following plot demonstrates the GPU allocation over time in a 16 GPU cluster
*Time units are intentionally omitted*

-## Configuration
+## Setup and Configurations

-> Note: this is not finalized and is expected to change in an upcoming KAI release
+> Note: this section is not finalized and is expected to change in an upcoming KAI release

### Enabling prometheus

+> Using a kai-operated prometheus assumes that the [prometheus operator](https://prometheus-operator.dev/docs/getting-started/installation/) is installed in the cluster
+
To enable prometheus via kai-operator, apply the following patch:
You can also customize the following configurations:
```
-externalPrometheusHealthProbe <Object>
-ExternalPrometheusPingConfig defines the configuration for external
-Prometheus connectivity validation, with defaults.
-
-externalPrometheusUrl <string>
-ExternalPrometheusUrl defines the URL of an external Prometheus instance to
-use
-When set, KAI will not deploy its own Prometheus but will configure
-ServiceMonitors
-for the external instance and validate connectivity
-
-retentionPeriod <string>
-RetentionPeriod defines how long to retain data (e.g., "2w", "1d", "30d")
-
-sampleInterval <string>
-SampleInterval defines the interval of sampling (e.g., "1m", "30s", "5m")
-
-serviceMonitor <Object>
-ServiceMonitor defines ServiceMonitor configuration for KAI services
-
-storageClassName <string>
-StorageClassName defines the name of the storageClass that will be used to
-store the TSDB data. defaults to "standard".
-
-storageSize <string>
-StorageSize defines the size of the storage (e.g., "20Gi", "30Gi")
+externalPrometheusHealthProbe # defines the configuration for external Prometheus connectivity validation, with defaults.
+externalPrometheusUrl # defines the URL of an external Prometheus instance to use. When set, KAI will not deploy its own Prometheus but will configure ServiceMonitors for the external instance and validate connectivity
+retentionPeriod # defines how long to retain data (e.g., "2w", "1d", "30d")
+sampleInterval # defines the interval of sampling (e.g., "1m", "30s", "5m")
+serviceMonitor # defines ServiceMonitor configuration for KAI services
+storageClassName # defines the name of the storageClass that will be used to store the TSDB data. defaults to "standard".
+storageSize # defines the size of the storage (e.g., "20Gi", "30Gi")
```
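When choosing `retentionPeriod` and `sampleInterval`, it can help to estimate how many samples each time series will hold over the retention window. The sketch below does that arithmetic. It assumes both values are single-unit duration strings like the examples above (`"2w"`, `"30s"`); compound forms such as `"1h30m"` are not handled. The helper names `parse_duration` and `samples_per_series` are hypothetical and are not part of KAI or Prometheus.

```python
import re
from datetime import timedelta

# Assumption: a single integer followed by one unit, as in the examples
# above ("2w", "1d", "30s"). Compound durations are rejected by this sketch.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}
_DURATION_RE = re.compile(r"(\d+)([smhdw])")


def parse_duration(value: str) -> timedelta:
    """Convert a single-unit duration string such as "30d" to a timedelta."""
    match = _DURATION_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"unsupported duration: {value!r}")
    amount, unit = match.groups()
    return timedelta(seconds=int(amount) * _UNIT_SECONDS[unit])


def samples_per_series(retention_period: str, sample_interval: str) -> int:
    """Estimate how many samples one series holds over the retention window."""
    interval = parse_duration(sample_interval)
    if interval <= timedelta(0):
        raise ValueError("sampleInterval must be positive")
    # Floor division of two timedeltas yields an int.
    return parse_duration(retention_period) // interval


if __name__ == "__main__":
    # Two weeks of data sampled once a minute: 14 * 24 * 60 samples.
    print(samples_per_series("2w", "1m"))
```

This is only a per-series count; actual storage use also depends on the number of series and on TSDB compression, which this sketch does not model.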
-Alternatively, you can use your own prometheus. Make sure that it's configured to collect metrics from the queue controller via a service monitor. For example:
-
-```yaml
+If you choose to use your own prometheus, make sure that it's configured to watch the relevant service monitors with `accounting: kai` labels. For example:
To use time-aware fairness, configure the scheduler to connect to Prometheus. If the cluster uses more than one scheduling shard, each shard can be configured independently.
+
+To edit the default scheduling shard:

-To configure the scheduler to connect to prometheus, the usageDBConfig section of the scheduling shard needs to be edited:
```sh
kubectl edit schedulingshard default
```
@@ -118,9 +94,9 @@ Add the following section under `spec`: