Skip to content

Conversation

@yanfeng1992
Copy link
Member

@yanfeng1992 yanfeng1992 commented Apr 8, 2025

Signed-off-by: huangyanfeng [email protected]

What type of PR is this?
/kind bug

What this PR does / why we need it:

When the server is configured with --shutdown-delay-duration, during that time it keeps serving requests normally. The endpoints /healthz and /livez will return success, but /readyz immediately returns a failure.

https://github.com/kubernetes/kubernetes/blob/ab3e83f73424a18f298a0050440af92d2d7c4720/staging/src/k8s.io/apiserver/pkg/server/options/server_run_options.go#L386-L389

kube-apiserver --shutdown-delay-duration duration

image

The karmada-controller log when a problem occurs

`E0408 09:42:37.339849 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding

E0408 09:42:47.345929 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding

...(the following 10 entries of the same format have been omitted)...

E0408 09:43:37.435542 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
`

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


@karmada-bot karmada-bot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 8, 2025
@karmada-bot karmada-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Apr 8, 2025
@yanfeng1992 yanfeng1992 force-pushed the improve_check_healthy branch 2 times, most recently from cd5bd18 to f9625f6 Compare April 8, 2025 12:03
@codecov-commenter
Copy link

codecov-commenter commented Apr 8, 2025

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 49.34%. Comparing base (ba5ffba) to head (fb0f5cc).
Report is 110 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6277      +/-   ##
==========================================
+ Coverage   47.95%   49.34%   +1.39%     
==========================================
  Files         676      678       +2     
  Lines       55964    55125     -839     
==========================================
+ Hits        26837    27203     +366     
+ Misses      27355    26153    -1202     
+ Partials     1772     1769       -3     
Flag Coverage Δ
unittests 49.34% <100.00%> (+1.39%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@yanfeng1992 yanfeng1992 changed the title Fix failed to do cluster health check for cluster when member apiserver configured with --shutdown-delay-duration Fix failed to do cluster health check when member cluster apiserver configured with --shutdown-delay-duration Apr 8, 2025
@whitewindmills
Copy link
Member

/lgtm
/cc @XiShanYongYe-Chang @RainbowMango

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2025
@XiShanYongYe-Chang
Copy link
Member

Thanks~
/assign

@XiShanYongYe-Chang
Copy link
Member

Hi @yanfeng1992, thanks for your feedback.

I'd like to know more about this subject.

  1. Which businesses will be affected by this issue, and what will be the specific impact?
  2. What is the status of kube-apiserver in the member cluster after the shutdown-delay-duration period? Can kube-apiserver still provide services?

I'm wondering if we're using the readyz and healthz interfaces to indicate the ready condition of the cluster is enough.

@yanfeng1992
Copy link
Member Author

yanfeng1992 commented Apr 9, 2025

  1. Which businesses will be affected by this issue, and what will be the specific impact?

Causing the cluster to be set offline, but in reality, the cluster is healthy

  1. What is the status of kube-apiserver in the member cluster after the shutdown-delay-duration period? Can kube-apiserver still provide services?

During this period, it continues to process requests normally. After the shutdown delay period, the kube-apiserver still provide services because it is deployed with multi-replica rolling updates.

shutdown-delay-duration

image

@XiShanYongYe-Chang

@RainbowMango RainbowMango added this to the v1.14 milestone Apr 10, 2025
@yanfeng1992 yanfeng1992 force-pushed the improve_check_healthy branch from f9625f6 to be5ff73 Compare April 11, 2025 06:47
@karmada-bot karmada-bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 11, 2025
@karmada-bot
Copy link
Collaborator

New changes are detected. LGTM label has been removed.

@karmada-bot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from whitewindmills. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@yanfeng1992 yanfeng1992 force-pushed the improve_check_healthy branch from be5ff73 to 8663c29 Compare April 11, 2025 08:10
@yanfeng1992
Copy link
Member Author

/retest

@yanfeng1992 yanfeng1992 force-pushed the improve_check_healthy branch from 8663c29 to 18e018f Compare April 11, 2025 09:10
@RainbowMango
Copy link
Member

Causing the cluster to be set offline, but in reality, the cluster is healthy

Then, what would happen? Note that, before the cluster is set offline, it requires at least 3 consecutive detection failures. Do you mean that the --shutdown-delay-duration is configured with a larger period?

// 1. StatusInternalServerError(500): When the server is configured with --shutdown-delay-duration,
// /readyz returns failure but /healthz still serves success
// 2. StatusNotFound(404): When the readyz endpoint is not installed in member cluster
healthStatus, err = healthEndpointCheck(clusterClient.KubeClient, "/healthz")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that the /healthz has been deprecated since v1.16, it can only be used as a backoff.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My concern is that the /healthz has been deprecated since v1.16, it can only be used as a backoff.

health-checks

The Kubernetes API server provides 3 API endpoints (healthz, livez and readyz) to indicate the current status of the API server. The healthz endpoint is deprecated (since Kubernetes v1.16), and we should use the more specific livez and readyz endpoints instead.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that's exactly my concern. We don't know when /healthz will be removed, and then will break this solution.

@yanfeng1992
Copy link
Member Author

Then, what would happen? Note that, before the cluster is set offline, it requires at least 3 consecutive detection failures. Do you mean that the --shutdown-delay-duration is configured with a larger period?

--shutdown-delay-duration is configured with more than 60s

In our environment, some high-level warnings are generated when the cluster goes offline.

Will changes in cluster status also affect scheduling and cause rescheduling?

@RainbowMango
Copy link
Member

Will changes in cluster status also affect scheduling and cause rescheduling?

No, no rescheduling.

@RainbowMango
Copy link
Member

RainbowMango commented Apr 22, 2025

@yanfeng1992
We discussed this PR at today's community meeting with @whitewindmills, and we are agree to move this forward, given this patch rely on the deprecated healthz endpoint which might be removed in following Kubernetes releases, we hope to add an test(probably a E2E test) to protect this case. Once Kubernetes removed the endpoint, when Karmada started to test against the version of Kubernetes, we could notice this immediately.

@yanfeng1992 yanfeng1992 force-pushed the improve_check_healthy branch from 18e018f to 717aed6 Compare May 13, 2025 07:06
@karmada-bot karmada-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels May 13, 2025
@yanfeng1992 yanfeng1992 force-pushed the improve_check_healthy branch from 717aed6 to 1407978 Compare May 13, 2025 08:35
…le shutdown delay or missing readyz endpoint

Signed-off-by: huangyanfeng <[email protected]>
@yanfeng1992 yanfeng1992 force-pushed the improve_check_healthy branch from 1407978 to fb0f5cc Compare May 13, 2025 08:43
@RainbowMango RainbowMango modified the milestones: v1.14, v1.15 Jul 8, 2025
@RainbowMango RainbowMango modified the milestones: v1.15, v1.16 Oct 15, 2025
@RainbowMango RainbowMango modified the milestones: v1.16, v1.17 Dec 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Categorizes issue or PR as related to a bug. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants