Fix failed to do cluster health check when member cluster apiserver configured with --shutdown-delay-duration #6277

yanfeng1992 · 2025-04-08T11:54:23Z

Signed-off-by: huangyanfeng [email protected]

What type of PR is this?
/kind bug

What this PR does / why we need it:

When the server is configured with --shutdown-delay-duration, during that time it keeps serving requests normally. The endpoints /healthz and /livez will return success, but /readyz immediately returns a failure.

https://github.com/kubernetes/kubernetes/blob/ab3e83f73424a18f298a0050440af92d2d7c4720/staging/src/k8s.io/apiserver/pkg/server/options/server_run_options.go#L386-L389

kube-apiserver --shutdown-delay-duration duration

The karmada-controller log when a problem occurs

`E0408 09:42:37.339849 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding

E0408 09:42:47.345929 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding

...(the following 10 entries of the same format have been omitted)...

E0408 09:43:37.435542 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
`

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

codecov-commenter · 2025-04-08T12:22:44Z

⚠️ Please install the to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.34%. Comparing base (ba5ffba) to head (fb0f5cc).
Report is 110 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6277      +/-   ##
==========================================
+ Coverage   47.95%   49.34%   +1.39%     
==========================================
  Files         676      678       +2     
  Lines       55964    55125     -839     
==========================================
+ Hits        26837    27203     +366     
+ Misses      27355    26153    -1202     
+ Partials     1772     1769       -3

Flag	Coverage Δ
unittests	`49.34% <100.00%> (+1.39%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

whitewindmills · 2025-04-09T02:33:32Z

/lgtm
/cc @XiShanYongYe-Chang @RainbowMango

XiShanYongYe-Chang · 2025-04-09T02:34:33Z

Thanks~
/assign

XiShanYongYe-Chang · 2025-04-09T09:48:49Z

Hi @yanfeng1992, thanks for your feedback.

I'd like to know more about this subject.

Which businesses will be affected by this issue, and what will be the specific impact?
What is the status of kube-apiserver in the member cluster after the shutdown-delay-duration period? Can kube-apiserver still provide services?

I'm wondering if we're using the readyz and healthz interfaces to indicate the ready condition of the cluster is enough.

yanfeng1992 · 2025-04-09T11:54:36Z

Which businesses will be affected by this issue, and what will be the specific impact?

Causing the cluster to be set offline, but in reality, the cluster is healthy

What is the status of kube-apiserver in the member cluster after the shutdown-delay-duration period? Can kube-apiserver still provide services?

During this period, it continues to process requests normally. After the shutdown delay period, the kube-apiserver still provide services because it is deployed with multi-replica rolling updates.

shutdown-delay-duration

@XiShanYongYe-Chang

karmada-bot · 2025-04-11T06:47:15Z

New changes are detected. LGTM label has been removed.

karmada-bot · 2025-04-11T06:47:15Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from whitewindmills. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

pkg/controllers/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

yanfeng1992 · 2025-04-11T08:35:05Z

/retest

RainbowMango · 2025-04-15T12:24:30Z

Causing the cluster to be set offline, but in reality, the cluster is healthy

Then, what would happen? Note that, before the cluster is set offline, it requires at least 3 consecutive detection failures. Do you mean that the --shutdown-delay-duration is configured with a larger period?

RainbowMango · 2025-04-15T12:26:42Z

pkg/controllers/status/cluster_status_controller.go

+		// 1. StatusInternalServerError(500): When the server is configured with --shutdown-delay-duration,
+		//    /readyz returns failure but /healthz still serves success
+		// 2. StatusNotFound(404): When the readyz endpoint is not installed in member cluster
 		healthStatus, err = healthEndpointCheck(clusterClient.KubeClient, "/healthz")


My concern is that the /healthz has been deprecated since v1.16, it can only be used as a backoff.

My concern is that the /healthz has been deprecated since v1.16, it can only be used as a backoff.

health-checks

The Kubernetes API server provides 3 API endpoints (healthz, livez and readyz) to indicate the current status of the API server. The healthz endpoint is deprecated (since Kubernetes v1.16), and we should use the more specific livez and readyz endpoints instead.

Yeah, that's exactly my concern. We don't know when /healthz will be removed, and then will break this solution.

yanfeng1992 · 2025-04-21T09:12:12Z

Then, what would happen? Note that, before the cluster is set offline, it requires at least 3 consecutive detection failures. Do you mean that the --shutdown-delay-duration is configured with a larger period?

--shutdown-delay-duration is configured with more than 60s

In our environment, some high-level warnings are generated when the cluster goes offline.

Will changes in cluster status also affect scheduling and cause rescheduling?

RainbowMango · 2025-04-21T11:42:27Z

Will changes in cluster status also affect scheduling and cause rescheduling?

No, no rescheduling.

RainbowMango · 2025-04-22T09:26:39Z

@yanfeng1992
We discussed this PR at today's community meeting with @whitewindmills, and we are agree to move this forward, given this patch rely on the deprecated healthz endpoint which might be removed in following Kubernetes releases, we hope to add an test(probably a E2E test) to protect this case. Once Kubernetes removed the endpoint, when Karmada started to test against the version of Kubernetes, we could notice this immediately.

…le shutdown delay or missing readyz endpoint Signed-off-by: huangyanfeng <[email protected]>

karmada-bot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 8, 2025

karmada-bot requested review from chaunceyjiang and mrlihanbo April 8, 2025 11:54

karmada-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 8, 2025

yanfeng1992 force-pushed the improve_check_healthy branch 2 times, most recently from cd5bd18 to f9625f6 Compare April 8, 2025 12:03

yanfeng1992 changed the title ~~Fix failed to do cluster health check for cluster when member apiserver configured with --shutdown-delay-duration~~ Fix failed to do cluster health check when member cluster apiserver configured with --shutdown-delay-duration Apr 8, 2025

karmada-bot requested review from RainbowMango and XiShanYongYe-Chang April 9, 2025 02:33

karmada-bot assigned whitewindmills Apr 9, 2025

karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2025

karmada-bot assigned XiShanYongYe-Chang Apr 9, 2025

RainbowMango added this to the v1.14 milestone Apr 10, 2025

yanfeng1992 force-pushed the improve_check_healthy branch from f9625f6 to be5ff73 Compare April 11, 2025 06:47

karmada-bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 11, 2025

yanfeng1992 force-pushed the improve_check_healthy branch from be5ff73 to 8663c29 Compare April 11, 2025 08:10

yanfeng1992 force-pushed the improve_check_healthy branch from 8663c29 to 18e018f Compare April 11, 2025 09:10

RainbowMango reviewed Apr 15, 2025

View reviewed changes

yanfeng1992 force-pushed the improve_check_healthy branch from 18e018f to 717aed6 Compare May 13, 2025 07:06

karmada-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 13, 2025

yanfeng1992 force-pushed the improve_check_healthy branch from 717aed6 to 1407978 Compare May 13, 2025 08:35

Improve health check logic in getClusterHealthStatus function to hand…

fb0f5cc

…le shutdown delay or missing readyz endpoint Signed-off-by: huangyanfeng <[email protected]>

yanfeng1992 force-pushed the improve_check_healthy branch from 1407978 to fb0f5cc Compare May 13, 2025 08:43

RainbowMango modified the milestones: v1.14, v1.15 Jul 8, 2025

RainbowMango modified the milestones: v1.15, v1.16 Oct 15, 2025

RainbowMango modified the milestones: v1.16, v1.17 Dec 4, 2025

Fix failed to do cluster health check when member cluster apiserver configured with --shutdown-delay-duration #6277

Are you sure you want to change the base?

Fix failed to do cluster health check when member cluster apiserver configured with --shutdown-delay-duration #6277

Conversation

yanfeng1992 commented Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented Apr 8, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

whitewindmills commented Apr 9, 2025

Uh oh!

XiShanYongYe-Chang commented Apr 9, 2025

Uh oh!

XiShanYongYe-Chang commented Apr 9, 2025

Uh oh!

yanfeng1992 commented Apr 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

karmada-bot commented Apr 11, 2025

Uh oh!

karmada-bot commented Apr 11, 2025

Uh oh!

yanfeng1992 commented Apr 11, 2025

Uh oh!

RainbowMango commented Apr 15, 2025

Uh oh!

RainbowMango Apr 15, 2025

Choose a reason for hiding this comment

Uh oh!

yanfeng1992 Apr 21, 2025

Choose a reason for hiding this comment

Uh oh!

RainbowMango Apr 21, 2025

Choose a reason for hiding this comment

Uh oh!

yanfeng1992 commented Apr 21, 2025

Uh oh!

RainbowMango commented Apr 21, 2025

Uh oh!

RainbowMango commented Apr 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

yanfeng1992 commented Apr 8, 2025 •

edited

Loading

codecov-commenter commented Apr 8, 2025 •

edited

Loading

yanfeng1992 commented Apr 9, 2025 •

edited

Loading

RainbowMango commented Apr 22, 2025 •

edited

Loading