feat(observability): add /healthz, /readyz, reconcile metrics, and webhook admission metrics#354
…bhook admission metrics

Closes the K8s-facing observability gaps in sandbox-manager. This is the first of two PRs for issue openkruise#353. Subsequent PR will cover memberlist peer state, key-cache hit/miss counters, and structured-field logging on hot paths.

(a) Health probes: enables controller-runtime's built-in /healthz and /readyz via a new --health-probe-bind-address flag (default :8081, kubebuilder convention; empty disables for backward compatibility).

(b) Reconcile metrics: adds sandbox_reconcile_duration_seconds{namespace,result} histogram and drops a hot-path utils.DumpJson allocation from the status-update path.

(c) Webhook admission metrics: wraps every registered admission handler in a Prometheus-instrumented decorator, adding sandbox_admission_duration_seconds and sandbox_admission_total, both labelled by webhook path, operation, and allowed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
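The decorator idea in (c) can be illustrated with a minimal, stdlib-only sketch. Everything here is an assumption made for illustration: `Request`, `Response`, and `Handler` are simplified stand-ins for controller-runtime's admission types, `recorder` is an in-memory substitute for the Prometheus counter and histogram, and the webhook path `/validate-sandbox` is hypothetical. It is not the code from this PR.

```go
package main

import (
	"fmt"
	"strconv"
	"sync"
	"time"
)

// Request and Response are hypothetical, reduced stand-ins for
// controller-runtime's admission.Request and admission.Response.
type Request struct {
	Operation string // e.g. "CREATE" or "UPDATE"
}

type Response struct {
	Allowed bool
}

// Handler mirrors the general shape of an admission handler (assumed).
type Handler interface {
	Handle(req Request) Response
}

// HandlerFunc adapts a plain function to the Handler interface.
type HandlerFunc func(Request) Response

func (f HandlerFunc) Handle(req Request) Response { return f(req) }

// labels holds the label set named in the commit message:
// webhook path, operation, and allowed.
type labels struct {
	Webhook, Operation, Allowed string
}

// recorder is an in-memory substitute for a counter plus a histogram.
type recorder struct {
	mu        sync.Mutex
	total     map[labels]int
	durations map[labels][]time.Duration
}

func newRecorder() *recorder {
	return &recorder{
		total:     map[labels]int{},
		durations: map[labels][]time.Duration{},
	}
}

// observe records one admission call under the given label set.
func (r *recorder) observe(l labels, d time.Duration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.total[l]++
	r.durations[l] = append(r.durations[l], d)
}

// instrumentedHandler wraps an inner handler and records one observation
// per call. The inner handler's response is returned unchanged.
type instrumentedHandler struct {
	path  string
	inner Handler
	rec   *recorder
}

func (h instrumentedHandler) Handle(req Request) Response {
	start := time.Now()
	resp := h.inner.Handle(req)
	h.rec.observe(
		labels{h.path, req.Operation, strconv.FormatBool(resp.Allowed)},
		time.Since(start),
	)
	return resp
}

func main() {
	rec := newRecorder()
	deny := HandlerFunc(func(Request) Response { return Response{Allowed: false} })
	h := instrumentedHandler{path: "/validate-sandbox", inner: deny, rec: rec}

	resp := h.Handle(Request{Operation: "CREATE"})
	// prints: false 1
	fmt.Println(resp.Allowed, rec.total[labels{"/validate-sandbox", "CREATE", "false"}])
}
```

Because the wrapper only measures and forwards, it can be applied at registration time without touching any handler's logic.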
[APPROVALNOTIFIER] This PR is NOT APPROVED
Codecov Report
❌ Patch coverage is …
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #354      +/-   ##
==========================================
+ Coverage   75.31%   75.54%   +0.22%
==========================================
  Files         143      144       +1
  Lines       10235    10275      +40
==========================================
+ Hits         7709     7762      +53
+ Misses       2193     2180      -13
  Partials      333      333
…ntroller plumbing

Addresses CI feedback on openkruise#354:
- gofmt: realign Controller struct fields after the healthProbeBindAddress addition disturbed tab alignment in pkg/servers/e2b/core.go.
- coverage: extract the /healthz + /readyz registration logic in NewControllerManager into a small registerProbeChecks helper. The helper can now be unit tested with a tiny mock manager, lifting the six previously uncovered lines without needing envtest.
- coverage: add TestNewController_FieldPlumbing and TestNewController_DisabledHealthProbe to exercise NewController and sandboxManagerOptions for both the enabled and disabled probe-address paths.

Net coverage gain on the patch: ~16 of the 17 previously uncovered lines are now exercised by unit tests. The single line in pkg/webhook/server.go remains uncovered because it sits inside SetupWithManager, which would need a real manager.Manager instance; the wrapper itself (newInstrumentedHandler) is already at 100% via pkg/webhook/metrics_test.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
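The "small helper plus tiny mock manager" approach described in this commit might look roughly like the sketch below. Only the name registerProbeChecks comes from the commit message; its signature, the skip-when-empty behaviour, the check names, the `probeRegistrar` interface, and `mockManager` are assumptions for illustration. The `checker` type copies the shape of controller-runtime's `healthz.Checker` so that the sketch needs only the standard library.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// checker has the same shape as controller-runtime's healthz.Checker.
type checker func(req *http.Request) error

// probeRegistrar is the narrow slice of the manager the helper needs
// (hypothetical interface; the method names match manager.Manager).
type probeRegistrar interface {
	AddHealthzCheck(name string, check checker) error
	AddReadyzCheck(name string, check checker) error
}

// ping always reports healthy, in the spirit of healthz.Ping.
func ping(*http.Request) error { return nil }

// registerProbeChecks registers both checks unless probes are disabled.
// Assumption: an empty bind address means "skip registration".
func registerProbeChecks(m probeRegistrar, bindAddress string) error {
	if bindAddress == "" {
		return nil
	}
	if err := m.AddHealthzCheck("healthz", ping); err != nil {
		return fmt.Errorf("add healthz check: %w", err)
	}
	if err := m.AddReadyzCheck("readyz", ping); err != nil {
		return fmt.Errorf("add readyz check: %w", err)
	}
	return nil
}

// mockManager records registered check names and can simulate a failure,
// so the helper can be tested without envtest or a real manager.
type mockManager struct {
	healthz, readyz []string
	failReadyz      bool
}

func (m *mockManager) AddHealthzCheck(name string, _ checker) error {
	m.healthz = append(m.healthz, name)
	return nil
}

func (m *mockManager) AddReadyzCheck(name string, _ checker) error {
	if m.failReadyz {
		return errors.New("readyz registration failed")
	}
	m.readyz = append(m.readyz, name)
	return nil
}

func main() {
	m := &mockManager{}
	err := registerProbeChecks(m, ":8081")
	// prints: <nil> [healthz] [readyz]
	fmt.Println(err, m.healthz, m.readyz)
}
```

Depending on a two-method interface instead of the full manager is what keeps the mock tiny.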
@PRAteek-singHWY: PR needs rebase.
Summary
This is the first of two PRs for #353. It closes the K8s-facing gaps in sandbox-manager so operators can actually tell what's going on:
- /healthz and /readyz HTTP endpoints so Kubernetes liveness/readiness probes work properly.
- How long each Sandbox reconcile takes, plus removal of a hot-path JSON dump that allocated on every status update.
The other three sub-pieces (peer-state metrics, key-cache hit/miss counters, and log-message cleanup) will land in a follow-up PR against the same issue.
Why this matters
sandbox-manager already publishes a lot of sandbox-level metrics (sandbox_creation_duration, sandbox_status_phase, etc.), but operators have no way to answer simpler questions like: …
That's the gap this PR closes.
What's new
(a) Health probes
The controller-runtime manager already supports /healthz and /readyz; the option was just disabled in pkg/cache/cache.go (HealthProbeBindAddress: ""). This PR enables it via a new --health-probe-bind-address flag.
- Default: :8081 (kubebuilder convention).
- Disable with --health-probe-bind-address="".
- Returns 200 OK once the manager has started.
(b) Reconcile metrics
New histogram: sandbox_reconcile_duration_seconds{namespace,result}, where result is success/requeue/error.
Also drops a utils.DumpJson(newStatus) call from the success log line in updateSandboxStatus. That JSON marshal happened on every status update; it is replaced with three structured fields (phase, observedGeneration, updateRevision), which are cheaper to allocate and easier to query in log aggregators.
(c) Webhook admission metrics
Two new metrics:
- sandbox_admission_duration_seconds{webhook,operation,allowed}: how long each handler takes.
- sandbox_admission_total{webhook,operation,allowed}: admit/deny counts.
Implementation is a tiny instrumentedHandler decorator wrapped around every existing handler at registration. No handler logic changes: Handle() returns whatever the inner handler returns.
Backward compatibility
:8081 is the kubebuilder convention; operators who don't want the new port can disable it with an empty string.
Tests
- … HealthProbeBindAddress field.
- … (success/requeue/error).
All affected packages green: pkg/sandbox-manager/..., pkg/cache/..., pkg/servers/e2b/..., pkg/webhook/..., pkg/controller/..., cmd/....
Checklist
fixes #353 (sub-pieces a, b, c). Sub-pieces (d) memberlist peer-state, (e) key-cache hit/miss, and (f) structured-field logging on hot paths will land in a follow-up PR against the same issue.
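For the result label on sandbox_reconcile_duration_seconds described under (b), one plausible way to derive success/requeue/error from a reconcile outcome is sketched below. The `result` struct is a hypothetical stand-in for controller-runtime's `reconcile.Result`, `resultLabel` and `timed` are illustrative names, and the `observe` callback takes the place of a Prometheus histogram. How the PR itself classifies outcomes is not shown in this description, so the precedence used here (an error wins over a requeue) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// result is a hypothetical stand-in for controller-runtime's reconcile.Result.
type result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// errBoom is a sample error used by the demo below.
var errBoom = errors.New("boom")

// resultLabel maps a reconcile outcome to one of the three label values.
// Assumption: an error takes precedence over a requested requeue.
func resultLabel(res result, err error) string {
	switch {
	case err != nil:
		return "error"
	case res.Requeue || res.RequeueAfter > 0:
		return "requeue"
	default:
		return "success"
	}
}

// timed runs reconcile and reports namespace, result label, and elapsed
// seconds to observe, which stands in for a histogram's Observe call.
func timed(
	namespace string,
	observe func(ns, result string, seconds float64),
	reconcile func() (result, error),
) (result, error) {
	start := time.Now()
	res, err := reconcile()
	observe(namespace, resultLabel(res, err), time.Since(start).Seconds())
	return res, err
}

func main() {
	observe := func(ns, res string, seconds float64) {
		fmt.Printf("namespace=%s result=%s seconds=%.6f\n", ns, res, seconds)
	}
	_, _ = timed("default", observe, func() (result, error) {
		return result{RequeueAfter: time.Second}, nil
	})
	_, _ = timed("default", observe, func() (result, error) {
		return result{}, errBoom
	})
}
```

Keeping the label to three fixed values bounds the metric's cardinality to three series per namespace.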