What happened?
Hello,
The introduction of very small CPU limits (250m) in "feat: update helm chart to work with kong as a gateway" #8735 (March 2024) has starved the dashboard components of CPU, making them take several times longer to respond and hit timeouts.
The combination of very short timeouts and automatic retries (auto-refresh) is causing repeated cascading failures. This is a long-standing issue: the dashboard is non-functional in (not-so) large clusters because queries can never complete before they time out.
A workaround to get the dashboard working is to turn off auto-refresh, but it's not ideal.
Could you adjust the settings?
The CPU limits need to be raised to 1000m (1 CPU) immediately.
I have multiple clusters of various sizes, and I noticed that all the pods (scraper, api, web) in all of them are being heavily throttled.
CPU limits should never be set below 1 CPU, as that prevents the application from getting enough CPU time to run. You may want to review other Kubernetes projects for the same mistake.
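For anyone affected in the meantime, the limits can be raised per release through Helm values without waiting for a chart change. A minimal sketch, assuming the resources blocks sit under each component's containers key as in the values.yaml touched by the diff below (the file name and exact paths are mine, adjust them to your chart version):

# override-resources.yaml (hypothetical file name), applied with something like:
#   helm upgrade kubernetes-dashboard kubernetes-dashboard/kubernetes-dashboard -f override-resources.yaml
# Helm deep-merges these values over the chart defaults, so only the CPU limits need to be listed.
api:
  containers:
    resources:
      limits:
        cpu: 1000m   # chart default is 250m
web:
  containers:
    resources:
      limits:
        cpu: 1000m
auth:
  containers:
    resources:
      limits:
        cpu: 1000m
metricsScraper:
  containers:
    resources:
      limits:
        cpu: 1000m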
While we're at it, could you also make the auto-refresh settings less aggressive?
- Logs auto-refresh every 5 seconds, which is too short.
- Resources auto-refresh every 10 seconds, which is too short. (It was previously 5 seconds and was raised slightly because it was literally failing 100% of the time.)
Going to 10 and 20 seconds respectively would make the tool more stable without rocking the boat.
Realistically, the resource refresh might need to be 30s, 60s, or more for active clusters.
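Until the defaults change, the intervals can also be overridden from the same values file. A sketch, assuming the commented settings in the diff below are exposed under app.settings.global (that path is my reading of the chart, adjust it to your values.yaml); per the chart's own comment, setting resourceAutoRefreshTimeInterval to 0 disables resource auto-refresh, which matches the workaround above:

# override-settings.yaml (hypothetical file name), merged into the same Helm release.
app:
  settings:
    global:
      logsAutoRefreshTimeInterval: 10      # chart default is 5
      resourceAutoRefreshTimeInterval: 30  # chart default is 10; 0 disables auto-refresh entirely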
If somebody has clusters with 1000+ and 10000+ pods, it would be nice to check how long the API takes to return (the pod/deployment views, with and without a namespace filter, should be pretty slow).
I had a look at your contributing guidelines; I could raise a PR, but it might be difficult for me to get the CLA signed. I'd appreciate it if you could raise the PR yourselves. It's merely a few values to adjust.
diff --git a/charts/kubernetes-dashboard/values.yaml b/charts/kubernetes-dashboard/values.yaml
index 67f028ffe..3ad7c6737 100644
--- a/charts/kubernetes-dashboard/values.yaml
+++ b/charts/kubernetes-dashboard/values.yaml
@@ -79,9 +79,9 @@ app:
# # Max number of labels that are displayed by default on most views.
# labelsLimit: 3
# # Number of seconds between every auto-refresh of logs
- # logsAutoRefreshTimeInterval: 5
+ # logsAutoRefreshTimeInterval: 10
# # Number of seconds between every auto-refresh of every resource. Set 0 to disable
- # resourceAutoRefreshTimeInterval: 10
+ # resourceAutoRefreshTimeInterval: 20
# # Hide all access denied warnings in the notification panel
# disableAccessDeniedNotifications: false
# # Hide all namespaces option in namespace selection dropdown to avoid accidental selection in large clusters thus preventing OOM errors
@@ -168,7 +168,7 @@ auth:
cpu: 100m
memory: 200Mi
limits:
- cpu: 250m
+ cpu: 1000m
memory: 400Mi
automountServiceAccountToken: true
volumes:
@@ -223,7 +223,7 @@ api:
cpu: 100m
memory: 200Mi
limits:
- cpu: 250m
+ cpu: 1000m
memory: 400Mi
automountServiceAccountToken: true
# Additional volumes
@@ -283,7 +283,7 @@ web:
cpu: 100m
memory: 200Mi
limits:
- cpu: 250m
+ cpu: 1000m
memory: 400Mi
automountServiceAccountToken: true
# Additional volumes
@@ -341,7 +341,7 @@ metricsScraper:
cpu: 100m
memory: 200Mi
limits:
- cpu: 250m
+ cpu: 1000m
memory: 400Mi
livenessProbe:
httpGet:
diff --git a/hack/test-resources/env-variables-pod.yaml b/hack/test-resources/env-variables-pod.yaml
index 5ba59b58d..344d313f7 100644
--- a/hack/test-resources/env-variables-pod.yaml
+++ b/hack/test-resources/env-variables-pod.yaml
@@ -60,7 +60,7 @@ spec:
cpu: "125m"
limits:
memory: "64Mi"
- cpu: "250m"
+ cpu: "1000m"
env:
- name: MY_POD_NAME
valueFrom:
diff --git a/modules/web/pkg/settings/settings.go b/modules/web/pkg/settings/settings.go
index b8a0421ad..0f3829e9a 100644
--- a/modules/web/pkg/settings/settings.go
+++ b/modules/web/pkg/settings/settings.go
@@ -25,8 +25,8 @@ var defaultSettings = Settings{
ClusterName: lo.ToPtr(""),
ItemsPerPage: lo.ToPtr(10),
LabelsLimit: lo.ToPtr(3),
- LogsAutoRefreshTimeInterval: lo.ToPtr(5),
- ResourceAutoRefreshTimeInterval: lo.ToPtr(10),
+ LogsAutoRefreshTimeInterval: lo.ToPtr(10),
+ ResourceAutoRefreshTimeInterval: lo.ToPtr(20),
DisableAccessDeniedNotifications: lo.ToPtr(false),
HideAllNamespaces: lo.ToPtr(false),
DefaultNamespace: lo.ToPtr("default"),
diff --git a/modules/web/src/common/services/global/globalsettings.ts b/modules/web/src/common/services/global/globalsettings.ts
index 2ec8b960e..9d1341b12 100644
--- a/modules/web/src/common/services/global/globalsettings.ts
+++ b/modules/web/src/common/services/global/globalsettings.ts
@@ -23,11 +23,11 @@ import {catchError, switchMap, takeUntil, tap} from 'rxjs/operators';
import {AuthorizerService} from './authorizer';
export const DEFAULT_SETTINGS: GlobalSettings = {
- itemsPerPage: 10,
+ itemsPerPage: 20,
clusterName: '',
labelsLimit: 3,
- logsAutoRefreshTimeInterval: 5,
- resourceAutoRefreshTimeInterval: 5,
+ logsAutoRefreshTimeInterval: 10,
+ resourceAutoRefreshTimeInterval: 20,
disableAccessDeniedNotifications: false,
hideAllNamespaces: false,
defaultNamespace: 'default',
diff --git a/modules/web/src/settings/global/template.html b/modules/web/src/settings/global/template.html
index eb40a4423..907ad2c64 100644
--- a/modules/web/src/settings/global/template.html
+++ b/modules/web/src/settings/global/template.html
@@ -107,7 +107,7 @@ limitations under the License.
[formControlName]="Controls.LogsAutorefreshInterval"
color="primary"
min="1"
- max="10"
+ max="60"
step="1"
fxFlex
>
What did you expect to happen?
The dashboard doesn't fail with 50X timeouts.
How can we reproduce it (as minimally and precisely as possible)?
Use the dashboard in a real-world deployment (a non-negligible number of pods and hosts).
Kubernetes Dashboard version
master is affected; the bad configuration has been merged since March 2024.
Kubernetes version
1.3x