feat(observability): add memberlist peer-state metrics, key-cache hit/miss counters, and structured-log cleanup #355
…/miss counters, and structured-log cleanup
Closes the distributed-side observability gaps in sandbox-manager. This is the second of two PRs for issue openkruise#353. PR 1 covered K8s-facing observability (health probes, reconcile metrics, webhook admission metrics).
(d) Memberlist peer-state metrics: adds a sandbox_peer_state{node,state} gauge wired into the existing eventDelegate (NotifyJoin/NotifyLeave), plus a sandbox_peer_join_duration_seconds histogram observed once per process, the first time a peer is seen after Start().
(e) Key-cache hit/miss counters: adds e2b_key_cache_hits_total and e2b_key_cache_misses_total, both labelled by storage backend (mysql/secret) and lookup path (by_key/by_id). Increments happen at the cache-check points in both backends.
(f) Structured-log cleanup: converts the 6 remaining unstructured klog.Infof/Errorf/Warningf calls that fire on real events to klog.InfoS/ErrorS with named fields. One-time startup messages are left as-is (no benefit from conversion).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report
@@            Coverage Diff             @@
##           master     #355      +/-   ##
==========================================
+ Coverage   75.31%   75.39%   +0.07%
==========================================
  Files         143      145       +2
  Lines       10235    10266      +31
==========================================
+ Hits         7709     7740      +31
  Misses       2193     2193
  Partials      333      333
@PRAteek-singHWY: PR needs rebase.
What this PR does
Second and final PR for #353. Adds operator-facing observability for the distributed parts of sandbox-manager, that is, the parts that PR #354 didn't cover.
What it handles
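- (d) Memberlist peer-state metrics: sandbox_peer_state{node,state} gauge (alive/dead) + sandbox_peer_join_duration_seconds histogram.
- (e) Key-cache hit/miss counters: e2b_key_cache_hits_total and e2b_key_cache_misses_total, both labelled by storage backend (mysql/secret) and lookup path (by_key/by_id).
- (f) Structured-log cleanup: converts the 6 remaining unstructured klog.Infof/Errorf/Warningf calls that fire on real events to klog.InfoS/ErrorS with named fields.
Relation to PR #354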
PR #354 closed the K8s-facing gaps: health probes (/healthz, /readyz), reconcile-loop metrics, and webhook admission metrics. Together with this PR, all 6 sub-pieces from issue #353 are addressed.
Why it matters
Two parts of sandbox-manager have been completely silent in Prometheus until now:
- traffic skews.
- there's no signal at all about whether the cache is doing useful work.
This PR exposes both.
Implementation notes
(d) Memberlist metrics
New pkg/peers/metrics.go wires into the existing eventDelegate.NotifyJoin/NotifyLeave hooks, so no extra polling is needed.
The sandbox_peer_join_duration_seconds histogram observes once per process, the first time another peer is seen after Start().
(e) Cache counters
New pkg/servers/e2b/keys/metrics.go. Increments are added at the cache-check points in both mysql.go and secret.go.
Pure addition with no behavior change. Operators can finally answer "what's our auth cache hit rate?" and "is the by_id path even being used?"
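The counting pattern described above can be sketched without any dependencies. This is a rough illustration, not the code in mysql.go or secret.go: counterVec is a plain-map stand-in for a labelled Prometheus counter, and keyCache, lookupByKey, and hitRate are hypothetical names introduced only for this example. The label values (mysql, by_key) come from the PR description.

```go
package main

import (
	"fmt"
	"sync"
)

// labels mirrors the two label dimensions named in the PR:
// storage backend (mysql/secret) and lookup path (by_key/by_id).
type labels struct {
	Backend string
	Path    string
}

// counterVec is a dependency-free stand-in for a labelled counter.
// It is NOT the Prometheus client type; it only models per-label counts.
type counterVec struct {
	mu     sync.Mutex
	values map[labels]uint64
}

func newCounterVec() *counterVec {
	return &counterVec{values: make(map[labels]uint64)}
}

// Inc adds one to the series identified by l.
func (c *counterVec) Inc(l labels) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[l]++
}

// Get returns the current count for l (zero if never incremented).
func (c *counterVec) Get(l labels) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.values[l]
}

// keyCache is a hypothetical in-memory cache in front of one backend.
type keyCache struct {
	backend      string
	entries      map[string]string
	hits, misses *counterVec
}

// lookupByKey increments exactly one of hits/misses at the cache-check point
// and does not change what the lookup returns.
func (k *keyCache) lookupByKey(key string) (string, bool) {
	l := labels{Backend: k.backend, Path: "by_key"}
	v, ok := k.entries[key]
	if ok {
		k.hits.Inc(l)
	} else {
		k.misses.Inc(l)
	}
	return v, ok
}

// hitRate is the ratio an operator would derive from the two counters.
func hitRate(hits, misses uint64) float64 {
	total := hits + misses
	if total == 0 {
		return 0
	}
	return float64(hits) / float64(total)
}

func main() {
	hits, misses := newCounterVec(), newCounterVec()
	cache := &keyCache{
		backend: "mysql",
		entries: map[string]string{"k1": "team-a"},
		hits:    hits,
		misses:  misses,
	}
	cache.lookupByKey("k1")
	cache.lookupByKey("k1")
	cache.lookupByKey("unknown")

	l := labels{Backend: "mysql", Path: "by_key"}
	fmt.Printf("hit rate: %.2f\n", hitRate(hits.Get(l), misses.Get(l)))
}
```

Keeping hits and misses as two separate counters, rather than one counter with a result label, matches the two metric names in this PR and lets the hit rate be computed per backend and per lookup path.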
(f) Log cleanup — smaller than originally scoped
A grep across non-test code surfaced only 11 unstructured klog calls, none of them in hot paths (the main hot-path offender, utils.DumpJson(newStatus), was already removed in PR #354). Of those 11, 5 are one-time startup messages ("Started X successfully") with no benefit from conversion. The other 6 fire on real events and were converted. The remaining 5 are intentionally left alone to avoid churn.
Backward compatibility
Tests
- pkg/peers/metrics_test.go: 3 tests (gauge toggling, label isolation across nodes, histogram increment).
- pkg/servers/e2b/keys/metrics_test.go: 3 tests (counter increments and label-partition isolation).
All affected packages green.
Checklist
fixes #353 (sub-pieces d, e, f). Sub-pieces a, b, c shipped in #354.