Skip to content

feat(observability): add memberlist peer-state metrics, key-cache hit/miss counters, and structured-log cleanup#355

Open
PRAteek-singHWY wants to merge 1 commit into
openkruise:masterfrom
PRAteek-singHWY:feat/observability-distributed
Open

feat(observability): add memberlist peer-state metrics, key-cache hit/miss counters, and structured-log cleanup#355
PRAteek-singHWY wants to merge 1 commit into
openkruise:masterfrom
PRAteek-singHWY:feat/observability-distributed

Conversation

@PRAteek-singHWY
Copy link
Copy Markdown
Contributor

@PRAteek-singHWY PRAteek-singHWY commented May 8, 2026

What this PR does

Second and final PR for #353. Adds operator-facing observability for the distributed parts of sandbox-manager — the
parts that PR #354 didn't cover.

What it handles

Sub-piece What it adds
(d) Memberlist peer-state metrics sandbox_peer_state{node,state} gauge (alive/dead) +
sandbox_peer_join_duration_seconds histogram.
(e) Key-cache hit/miss counters e2b_key_cache_hits_total and e2b_key_cache_misses_total, both labelled by
storage backend (mysql / secret) and lookup path (by_key / by_id).
(f) Structured-log cleanup Converts the 6 remaining unstructured klog.Infof / Errorf / Warningf calls that
fire on real events to klog.InfoS / ErrorS with named fields.

Relation to PR #354

PR #354 closed the K8s-facing gaps — health probes (/healthz, /readyz), reconcile-loop metrics, and webhook
admission metrics. Together with this PR, all 6 sub-pieces from issue #353 are addressed.

PR #354 (sub-pieces a, b, c) This PR (sub-pieces d, e, f)
Scope K8s-facing Distributed
Surfaces health probes, reconcile, webhooks memberlist, auth cache, log hygiene

Why it matters

Two parts of sandbox-manager have been completely silent in Prometheus until now:

  • Memberlist gossip. A peer can drop out of the cluster and nothing tells the operator — they only notice when
    traffic skews.
  • API key cache. Every authenticated request runs through two layers of in-memory cache before hitting the DB, but
    there's no signal at all about whether the cache is doing useful work.

This PR exposes both.

Implementation notes

(d) Memberlist metrics

New pkg/peers/metrics.go wires into the existing eventDelegate.NotifyJoin / NotifyLeave hooks — no extra polling.
The peer_join_duration histogram observes once per process, the first time another peer is seen after Start().

Tracking memberlist's suspect (intermediate "maybe dead") state would need polling of list.Members() rather than
event hooks. Left for a follow-up.

(e) Cache counters

New pkg/servers/e2b/keys/metrics.go. Increments are added at the cache-check points in both mysql.go and secret.go.
Pure addition — no behavior change. Operators can finally answer "what's our auth cache hit rate?" and "is the by_id
path even being used?".

(f) Log cleanup — smaller than originally scoped

A grep across non-test code surfaced only 11 unstructured klog calls, none of them in hot paths (the main hot-path
offender — utils.DumpJson(newStatus) — was already removed in PR #354). Of those 11, 5 are one-time startup
messages
("Started X successfully") with no benefit from conversion. The other 6 fire on real events and were
converted. The remaining 5 are intentionally left alone to avoid churn.

Backward compatibility

  • All new metrics are purely additive.
  • The (f) log conversions change only the call shape, not what gets logged.
  • No new dependencies.

Tests

  • pkg/peers/metrics_test.go — 3 tests: gauge toggling, label isolation across nodes, histogram increment.
  • pkg/servers/e2b/keys/metrics_test.go — 3 tests: counter increments and label-partition isolation.

All affected packages green.

Checklist

  • Build + vet pass
  • All affected-package tests pass
  • No new dependencies

fixes #353 (sub-pieces d, e, f). Sub-pieces a, b, c shipped in #354.

…/miss counters, and structured-log cleanup

Closes the distributed-side observability gaps in sandbox-manager — second
of two PRs for issue openkruise#353. PR 1 covered K8s-facing observability (health
probes, reconcile metrics, webhook admission metrics).

(d) Memberlist peer-state metrics — adds sandbox_peer_state{node,state}
    gauge wired into the existing eventDelegate (NotifyJoin/NotifyLeave),
    plus sandbox_peer_join_duration_seconds histogram observed once per
    process the first time a peer is seen after Start().

(e) Key-cache hit/miss counters — adds e2b_key_cache_hits_total and
    e2b_key_cache_misses_total, both labelled by storage backend
    (mysql/secret) and lookup path (by_key/by_id). Increments at the
    cache-check points in both backends.

(f) Structured-log cleanup — converts the 6 remaining unstructured
    klog.Infof/Errorf/Warningf calls that fire on real events to
    klog.InfoS/ErrorS with named fields. One-time startup messages are
    left as-is (no benefit from conversion).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kruise-bot
Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zmberg for approval by writing /assign @zmberg in a comment. For more information see:The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov
Copy link
Copy Markdown

codecov Bot commented May 8, 2026

Codecov Report

❌ Patch coverage is 79.48718% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.39%. Comparing base (866ba50) to head (a4ebeec).
⚠️ Report is 12 commits behind head on master.

Files with missing lines Patch % Lines
pkg/utils/webhookutils/writer/secret.go 0.00% 3 Missing ⚠️
.../utils/webhookutils/configuration/configuration.go 0.00% 2 Missing ⚠️
pkg/webhook/webhook_controller.go 0.00% 2 Missing ⚠️
...controller/sandboxclaim/sandboxclaim_controller.go 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #355      +/-   ##
==========================================
+ Coverage   75.31%   75.39%   +0.07%     
==========================================
  Files         143      145       +2     
  Lines       10235    10266      +31     
==========================================
+ Hits         7709     7740      +31     
  Misses       2193     2193              
  Partials      333      333              
Flag Coverage Δ
unittests 75.39% <79.48%> (+0.07%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@kruise-bot
Copy link
Copy Markdown

@PRAteek-singHWY: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Observability] Add operator-facing metrics and health probes across sandbox-manager subsystems

2 participants