Fix failed to do cluster health check when member cluster apiserver configured with --shutdown-delay-duration #6277
base: master
Conversation
Force-pushed from cd5bd18 to f9625f6.
Codecov Report: All modified and coverable lines are covered by tests ✅

@@            Coverage Diff             @@
##           master   #6277      +/-   ##
==========================================
+ Coverage   47.95%   49.34%   +1.39%
==========================================
  Files         676      678       +2
  Lines       55964    55125     -839
==========================================
+ Hits        26837    27203     +366
+ Misses      27355    26153    -1202
+ Partials     1772     1769       -3
/lgtm

Thanks~
Hi @yanfeng1992, thanks for your feedback. I'd like to know more about this subject.
I'm wondering whether using the readyz and healthz endpoints to indicate the cluster's ready condition is enough.
This causes the cluster to be set offline, but in reality the cluster is healthy. During this period it continues to process requests normally, and after the shutdown delay period the kube-apiserver still provides service because it is deployed with multi-replica rolling updates.
Force-pushed from f9625f6 to be5ff73.
New changes are detected. LGTM label has been removed.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
Force-pushed from be5ff73 to 8663c29.
/retest
Force-pushed from 8663c29 to 18e018f.
Then, what would happen? Note that it takes at least 3 consecutive detection failures before the cluster is set offline. Do you mean that the
// 1. StatusInternalServerError(500): When the server is configured with --shutdown-delay-duration,
//    /readyz returns failure but /healthz still serves success
// 2. StatusNotFound(404): When the readyz endpoint is not installed in member cluster
healthStatus, err = healthEndpointCheck(clusterClient.KubeClient, "/healthz")
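For context, here is a minimal sketch of a readyz-first check with a /healthz fallback for those two status codes, using client-go's discovery REST client. This is not the code in this PR; the function names and signatures below are hypothetical.

```go
package health

import (
	"context"
	"net/http"

	"k8s.io/client-go/kubernetes"
)

// probeEndpoint calls one of the apiserver health endpoints (e.g. /readyz or
// /healthz) and reports the HTTP status code the server answered with.
func probeEndpoint(client kubernetes.Interface, path string) (int, error) {
	var statusCode int
	err := client.Discovery().RESTClient().
		Get().
		AbsPath(path).
		Do(context.TODO()).
		StatusCode(&statusCode).
		Error()
	return statusCode, err
}

// clusterHealthy prefers /readyz, but falls back to /healthz for the two cases
// called out in the comment above: 500 (shutdown delay in progress) and
// 404 (readyz not installed on the member cluster).
func clusterHealthy(client kubernetes.Interface) (bool, error) {
	code, err := probeEndpoint(client, "/readyz")
	if err == nil && code == http.StatusOK {
		return true, nil
	}
	if code == http.StatusInternalServerError || code == http.StatusNotFound {
		code, err = probeEndpoint(client, "/healthz")
		if err == nil && code == http.StatusOK {
			return true, nil
		}
	}
	return false, err
}
```

The key point is that the fallback only triggers on the two codes called out in the comment, so any other /readyz failure still marks the cluster unhealthy.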
My concern is that /healthz has been deprecated since v1.16; it can only be used as a fallback.
> My concern is that /healthz has been deprecated since v1.16; it can only be used as a fallback.
The Kubernetes API server provides 3 API endpoints (healthz, livez and readyz) to indicate the current status of the API server. The healthz endpoint is deprecated (since Kubernetes v1.16), and we should use the more specific livez and readyz endpoints instead.
Yeah, that's exactly my concern. We don't know when /healthz will be removed, and its removal would break this solution.
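As an illustration of the three endpoints discussed above (not part of this PR; the kubeconfig location and program structure are just assumptions for a local experiment), a standalone probe makes the divergence visible during a shutdown delay:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the default kubeconfig; this is only a local probe.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Probe the three status endpoints and print the HTTP code each returns.
	// During a shutdown delay, /healthz and /livez keep returning 200 while
	// /readyz starts returning 500.
	for _, path := range []string{"/healthz", "/livez", "/readyz"} {
		var code int
		err := client.Discovery().RESTClient().
			Get().
			AbsPath(path).
			Do(context.TODO()).
			StatusCode(&code).
			Error()
		fmt.Printf("%-8s -> %d (err: %v)\n", path, code, err)
	}
}
```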
--shutdown-delay-duration is configured with more than 60s. In our environment, some high-level warnings are generated when the cluster goes offline. Will changes in cluster status also affect scheduling and cause rescheduling?

No, no rescheduling.
@yanfeng1992 |
Force-pushed from 18e018f to 717aed6.
Force-pushed from 717aed6 to 1407978.
…le shutdown delay or missing readyz endpoint Signed-off-by: huangyanfeng <[email protected]>
Force-pushed from 1407978 to fb0f5cc.

Signed-off-by: huangyanfeng [email protected]
What type of PR is this?
/kind bug
What this PR does / why we need it:
When the server is configured with --shutdown-delay-duration, during that time it keeps serving requests normally. The endpoints /healthz and /livez will return success, but /readyz immediately returns a failure.
https://github.com/kubernetes/kubernetes/blob/ab3e83f73424a18f298a0050440af92d2d7c4720/staging/src/k8s.io/apiserver/pkg/server/options/server_run_options.go#L386-L389
kube-apiserver --shutdown-delay-duration duration
The karmada-controller log when the problem occurs:

```
E0408 09:42:37.339849 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
E0408 09:42:47.345929 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
...(the following 10 entries of the same format have been omitted)...
E0408 09:43:37.435542 1 cluster_status_controller.go:394] Failed to do cluster health check for cluster arm942, err is : an error on the server ("[+]ping ok\n[+]log ok\n[+]etcd ok\n...\n[-]shutdown failed: reason withheld\nreadyz check failed") has prevented the request from succeeding
```
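As an aside (not part of this PR), the per-check listing embedded in those errors is the body returned by the readyz handler. A hypothetical helper like the one below could fetch it directly for debugging, assuming a client-go kubernetes.Interface for the member cluster:

```go
package probe

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
)

// dumpReadyz prints the readyz report for the given cluster client. The
// "verbose" query parameter asks the apiserver to list every individual check
// (ping, etcd, shutdown, ...) even when the overall result is healthy.
func dumpReadyz(client kubernetes.Interface) {
	body, err := client.Discovery().RESTClient().
		Get().
		AbsPath("/readyz").
		Param("verbose", "true").
		Do(context.TODO()).
		Raw()
	fmt.Printf("%s\nerr: %v\n", body, err)
}
```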
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: