Skip to content

feat(observability): add /healthz, /readyz, reconcile metrics, and webhook admission metrics#354

Open
PRAteek-singHWY wants to merge 2 commits into
openkruise:masterfrom
PRAteek-singHWY:feat/health-readiness-endpoints
Open

feat(observability): add /healthz, /readyz, reconcile metrics, and webhook admission metrics#354
PRAteek-singHWY wants to merge 2 commits into
openkruise:masterfrom
PRAteek-singHWY:feat/health-readiness-endpoints

Conversation

@PRAteek-singHWY
Copy link
Copy Markdown
Contributor

@PRAteek-singHWY PRAteek-singHWY commented May 8, 2026

Summary

This is the first of two PRs for #353. It closes the K8s-facing gaps in sandbox-manager so operators can actually tell what's going on:

  • (a) Adds /healthz and /readyz HTTP endpoints so Kubernetes liveness/readiness probes work properly.
  • (b) Adds a Prometheus histogram for how long each Sandbox reconcile takes, and removes a hot-path JSON dump that allocated on every status update.
  • (c) Adds Prometheus latency + count metrics for every admission webhook, by wrapping each handler in a small decorator. No webhook changes its behavior.

The other three sub-pieces — peer-state metrics, key-cache hit/miss counters, and log-message cleanup — will land in a follow-up PR against the same issue.

Why this matters

sandbox-manager already publishes a lot of sandbox-level metrics (sandbox_creation_duration, sandbox_status_phase, etc.) — but operators have no way to answer simpler questions like:

  • "Is this pod actually healthy, or just bound to a port?"
  • "Why are reconciles slow today?"
  • "Which webhook is making my API server feel slow?"

That's the gap this PR closes.

What's new

(a) Health probes

The controller-runtime manager already supports /healthz and /readyz — the option was just disabled in pkg/cache/cache.go (HealthProbeBindAddress: ""). This PR enables it via a new --health-probe-bind-address flag.

  • Default: :8081 (kubebuilder convention).
  • Disable: pass --health-probe-bind-address="".
  • When enabled, both endpoints respond 200 OK once the manager has started.

(b) Reconcile metrics

New histogram: sandbox_reconcile_duration_seconds{namespace,result}, where result is success / requeue / error.

Also drops a utils.DumpJson(newStatus) call from the success log line in updateSandboxStatus. That JSON marshal happened on every status update; replaced with three structured fields (phase, observedGeneration, updateRevision) which are cheaper to allocate and easier to query in log aggregators.

(c) Webhook admission metrics

Two new metrics:

  • sandbox_admission_duration_seconds{webhook,operation,allowed} — how long each handler takes.
  • sandbox_admission_total{webhook,operation,allowed} — admit/deny counts.

Implementation is a tiny instrumentedHandler decorator wrapped around every existing handler at registration. No handler logic changes — Handle() returns whatever the inner handler returns.

Backward compatibility

  • Default port :8081 is the kubebuilder convention; operators who don't want the new port can disable it with an empty string.
  • The new metrics are purely additive — existing metrics are untouched.
  • The webhook decorator is transparent: every existing admission decision passes through unchanged.
  • No new dependencies.

Tests

  • Round-trip test for the new HealthProbeBindAddress field.
  • 4 cases for the result-label mapping (success / requeue / error).
  • Histogram observation count assertion for reconcile duration.
  • 5 cases for the webhook decorator: allowed, denied, errored, response pass-through, and duration recorded.

All affected packages green: pkg/sandbox-manager/..., pkg/cache/..., pkg/servers/e2b/..., pkg/webhook/..., pkg/controller/..., cmd/....

Note: end-to-end probe testing would need an envtest-backed *rest.Config (controller-runtime's manager dials the API server during construction). This repo doesn't have envtest set up yet — happy to file a separate issue for that if it's useful.

Checklist

  • Build + vet pass
  • All affected-package tests pass
  • No new dependencies

fixes #353 (sub-pieces a, b, c). Sub-pieces (d) memberlist peer-state, (e) key-cache hit/miss, and (f) structured-field logging on hot paths will land in a follow-up PR against the same issue.

…bhook admission metrics

Closes the K8s-facing observability gaps in sandbox-manager — first of two
PRs for issue openkruise#353. Subsequent PR will cover memberlist peer state,
key-cache hit/miss counters, and structured-field logging on hot paths.

(a) Health probes — enables controller-runtime's built-in /healthz and
    /readyz via a new --health-probe-bind-address flag (default :8081,
    kubebuilder convention; empty disables for backward compatibility).

(b) Reconcile metrics — adds sandbox_reconcile_duration_seconds
    {namespace,result} histogram and drops a hot-path utils.DumpJson
    allocation from the status-update path.

(c) Webhook admission metrics — wraps every registered admission handler
    in a Prometheus-instrumented decorator, adding
    sandbox_admission_duration_seconds and sandbox_admission_total, both
    labelled by webhook path, operation, and allowed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kruise-bot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zmberg for approval by writing /assign @zmberg in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 94.02985% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.54%. Comparing base (866ba50) to head (e0181b8).
⚠️ Report is 22 commits behind head on master.

Files with missing lines Patch % Lines
pkg/cache/cache.go 72.72% 3 Missing ⚠️
pkg/webhook/server.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #354      +/-   ##
==========================================
+ Coverage   75.31%   75.54%   +0.22%     
==========================================
  Files         143      144       +1     
  Lines       10235    10275      +40     
==========================================
+ Hits         7709     7762      +53     
+ Misses       2193     2180      -13     
  Partials      333      333              
Flag Coverage Δ
unittests 75.54% <94.02%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…ntroller plumbing

Addresses CI feedback on openkruise#354:

- gofmt: realign Controller struct fields after the healthProbeBindAddress
  addition disturbed tab alignment in pkg/servers/e2b/core.go.

- coverage: extract the /healthz + /readyz registration logic in
  NewControllerManager into a small registerProbeChecks helper. The
  helper can now be unit tested with a tiny mock manager, lifting the
  six previously uncovered lines without needing envtest.

- coverage: add TestNewController_FieldPlumbing and
  TestNewController_DisabledHealthProbe to exercise NewController and
  sandboxManagerOptions for both the enabled and disabled probe-address
  paths.

Net coverage gain on the patch: ~16 of the 17 previously uncovered lines
are now exercised by unit tests. The single line in pkg/webhook/server.go
remains uncovered because it sits inside SetupWithManager, which would
need a real manager.Manager instance — the wrapper itself
(newInstrumentedHandler) is already at 100% via pkg/webhook/metrics_test.go.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kruise-bot
Copy link
Copy Markdown

@PRAteek-singHWY: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Observability] Add operator-facing metrics and health probes across sandbox-manager subsystems

2 participants