pkg/election: add lease keepalive metrics #10622
JmPotato wants to merge 2 commits into tikv:master
Conversation
📝 Walkthrough
Lease now owns per-purpose Prometheus metrics initialized in Grant. KeepAlive measures response intervals, updates a local TTL gauge from observed/computed expire times, treats non-positive TTLs as invalid (increments the invalid-TTL counter and stops), and increments a lease-expired counter on watchdog timeout.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Caller
    participant Lease as Lease
    participant Lessor as Etcd Lessor
    participant Stream as KeepAlive Stream
    participant Watchdog as Watchdog Timer
    participant Metrics as Prometheus Metrics
    rect rgba(200,220,255,0.5)
        Client->>Lease: Grant() -> initialize metrics
        Lease->>Lessor: KeepAlive(ctx, leaseID) -> open stream
    end
    rect rgba(220,255,200,0.5)
        Stream->>Lease: keepalive response (TTL, leaseID)
        Lease->>Lease: loadExpireTime() / compute monotonic expireTime
        Lease->>Metrics: observe keepalive_response_interval_seconds
        Lease->>Metrics: set local_ttl_remaining_seconds (max of observed/computed)
    end
    rect rgba(255,230,200,0.5)
        Watchdog->>Lease: tick -> sample expireTime
        alt TTL expired (watchdog)
            Lease->>Metrics: increment renewal_failure_total (watchdog_timeout)
            Lease->>Lease: stop keepalive processing
        else invalid TTL received (TTL <= 0)
            Stream->>Lease: keepalive response (TTL<=0)
            Lease->>Metrics: increment renewal_failure_total (invalid_ttl)
            Lease->>Lease: stop keepalive processing (no expire published)
        end
    end
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (3)
pkg/election/metrics.go (1)
19-72: LGTM — clean metrics declarations.
The collectors follow Prometheus conventions (namespace + subsystem + unit suffix), labels are low-cardinality (`purpose` and `reason` are bounded enums), and the help texts are descriptive — particularly the `localTTLRemaining` help text, which explicitly disclaims that it isn't the etcd-authoritative TTL. The reason constants are exported with proper GoDoc and reused at the call sites in `lease.go`.
One small ergonomic suggestion (optional): the three `prometheus.MustRegister` calls in `init()` can be collapsed into a single varargs call.
♻️ Optional: combine MustRegister calls
```diff
 func init() {
-	prometheus.MustRegister(keepAliveResponseInterval)
-	prometheus.MustRegister(renewalFailureTotal)
-	prometheus.MustRegister(localTTLRemaining)
+	prometheus.MustRegister(
+		keepAliveResponseInterval,
+		renewalFailureTotal,
+		localTTLRemaining,
+	)
 }
```
Based on learnings: "Use Prometheus-style metrics with subsystem + unit; avoid high-cardinality labels".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/metrics.go` around lines 19 - 72, The init() function currently calls prometheus.MustRegister three times; replace those three calls with a single variadic call to prometheus.MustRegister(keepAliveResponseInterval, renewalFailureTotal, localTTLRemaining) to make registration concise and idiomatic while preserving the same collectors (refer to the init function and the collector variables keepAliveResponseInterval, renewalFailureTotal, and localTTLRemaining).
pkg/election/lease_test.go (2)
361-370: Minor: thread `purpose` through the helper.
Most callers of `newTestLeaseWithKeepAliveCh` immediately overwrite `lease.Purpose = purpose` afterwards (lines 215, 241, 263, 283, 302, 316, 342). Threading the purpose into the helper would remove the seven-line repetition and keep the test setup atomic.
♻️ Suggested helper change
```diff
-func newTestLeaseWithKeepAliveCh(keepAliveCh chan *clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) *Lease {
+func newTestLeaseWithKeepAliveCh(purpose string, keepAliveCh chan *clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) *Lease {
 	lease := &Lease{
-		Purpose:      "test_lease",
+		Purpose:      purpose,
 		lease:        &fakeLease{keepAliveCh: keepAliveCh},
 		leaseTimeout: leaseTimeout,
 	}
 	lease.ID.Store(clientv3.LeaseID(1))
 	lease.expireTime.Store(time.Now().Add(-time.Second))
 	return lease
 }
```
Then at each call site:
```diff
-	lease := newTestLeaseWithKeepAliveCh(keepAliveCh, time.Hour)
-	lease.Purpose = purpose
+	lease := newTestLeaseWithKeepAliveCh(purpose, keepAliveCh, time.Hour)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease_test.go` around lines 361 - 370, Change newTestLeaseWithKeepAliveCh to accept a purpose string and assign it to lease.Purpose so callers don't need to overwrite it; specifically, modify the helper signature newTestLeaseWithKeepAliveCh(keepAliveCh chan *clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) to include a purpose parameter and set lease.Purpose = purpose inside the function, then update all call sites that currently set lease.Purpose after creation to pass the desired purpose into the newTestLeaseWithKeepAliveCh call (e.g., replace post-construction assignments like lease.Purpose = purpose with passing purpose into the helper).
200-208: Sleep-based monotonicity probe is acceptable but worth noting.
`time.Sleep(100 * time.Millisecond)` at line 205 is the only nondeterministic synchronization in the new tests. It's pragmatic — the test fails only if the monotonicity guard regresses, and a too-short sleep trivially passes — so a missing guard wouldn't cause spurious failures, only spurious passes. Fine to keep, but if you'd like a more deterministic approach, a follow-up could send a third response with a larger TTL and use `Eventually` to wait for it to take effect, which proves the loop has drained the smaller TTL.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease_test.go` around lines 200 - 208, The test currently uses time.Sleep(100 * time.Millisecond) to wait for the keep-alive processing loop; replace this nondeterministic sleep with a deterministic probe: after sendKeepAliveResponse(t, keepAliveCh, 1) send a third keep-alive with a larger TTL (e.g., via sendKeepAliveResponse with TTL > previous), then use a retry/assertion helper (like re.Eventually or require.Eventually) to wait until lease.getExpireTime() reflects the later larger TTL, which guarantees the loop drained the smaller TTL rather than relying on an arbitrary sleep.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/election/lease.go`:
- Around line 226-241: The watchdog case can increment renewalFailureTotal
spuriously during shutdown races; modify the watchdog.C branch in
pkg/election/lease.go so after computing remaining := time.Until(actualExpire)
and before logging/incrementing you check the context (ctx.Err() == nil) —
mirror the guard used in the channel-closed branch — and only call logger.Info,
renewalFailureTotal.WithLabelValues(...).Inc(), and return when ctx is still
active; otherwise treat it as a clean shutdown (reset or continue as
appropriate). Ensure you reference watchdog.C, l.getExpireTime(),
localTTLRemaining, renewalFailureTotal, ReasonWatchdogTimeout and l.Purpose in
the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b252437b-213b-4a52-9752-a592ac7ce8bb
📒 Files selected for processing (3)
pkg/election/lease.go, pkg/election/lease_test.go, pkg/election/metrics.go
Force-pushed from 02b6b75 to b0c7b7f (Compare)
Actionable comments posted: 2
🧹 Nitpick comments (2)
pkg/election/lease_test.go (1)
392-401: Histogram introspection LGTM, but worth a brief comment.
The `observer.(prometheus.Metric)` assertion works because `client_golang` histogram instances implement both `Observer` and `Metric`. This is intentional but not obvious; a one-line comment explaining the assertion would help future readers and guard against regressions if someone replaces the histogram with a wrapped observer.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease_test.go` around lines 392 - 401, In keepAliveResponseIntervalCount, add a one-line comment above the observer.(prometheus.Metric) type assertion explaining that the current prometheus histogram implementation also implements prometheus.Metric, so casting the Observer to prometheus.Metric is safe here; also note that this relies on the concrete histogram implementation and would break if the metric were replaced by a wrapped Observer, to warn future maintainers and prevent accidental regressions.
pkg/election/lease.go (1)
86-96: Consider initializing metrics in `NewLease`, not `Grant`.
`l.metrics` is currently initialized only at the end of `Grant()`. If a programmer ever invokes `KeepAlive()` before a successful `Grant()` (or after `Grant()` returns an error), the very first metric write — e.g. `l.metrics.streamFailed.Inc()` at line 178 — will deref a nil `prometheus.Counter` and panic.
Since `newLeaseMetrics()` only depends on `l.Purpose` (set in `NewLease`), initializing there makes the type zero-value-friendlier and matches the guideline "Keep structs zero-value friendly":
🛡️ Suggested move
```diff
 func NewLease(client *clientv3.Client, purpose string) *Lease {
-	return &Lease{
+	l := &Lease{
 		Purpose: purpose,
 		client:  client,
 		lease:   clientv3.NewLease(client),
 	}
+	l.initMetrics()
+	return l
 }
@@
 	l.expireTime.Store(start.Add(time.Duration(leaseResp.TTL) * time.Second))
-	l.initMetrics()
 	return nil
```
As per coding guidelines: "Keep structs zero-value friendly; init maps/slices before use".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease.go` around lines 86 - 96, The metrics field l.metrics should be initialized in NewLease (where l.Purpose is already set) instead of only in Grant to avoid nil derefs when methods like KeepAlive access l.metrics before a successful Grant; call newLeaseMetrics(l.Purpose) during NewLease construction to set l.metrics and remove or keep a no-op initMetrics call from Grant (or guard uses) so Grant no longer is the sole initializer; update references to l.metrics (e.g., streamFailed) to assume it is non-nil after NewLease.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/election/lease_test.go`:
- Around line 275-292: Test uses a brittle 20ms watchdog timeout in
TestRenewalFailureWatchdogTimeout which can flake under CI; update the watchdog
duration passed to newTestLeaseWithKeepAliveCh from 20*time.Millisecond to a
more robust value (e.g., 100–200*time.Millisecond) so the test has enough time
for goroutine startup/scheduling without materially slowing it, leaving the rest
of the test (done channel, KeepAlive call, and assertions on renewalFailureValue
and localTTLRemainingValue) unchanged.
- Around line 187-208: The test uses time.Sleep(100*time.Millisecond) as a
race-prone sync; replace it with a deterministic synchronization so the
KeepAlive consumer has processed the small-TTL response before asserting
monotonicity: after sendKeepAliveResponse(t, keepAliveCh, 1) either (a) send a
marker over an unbuffered channel that the KeepAlive loop reads when it finishes
processing each response (add a test-only opt to lease.KeepAlive or a hook
callback) and wait for that marker, or (b) poll with testutil.Eventually until
lease.getExpireTime() stops changing or until a processed-response counter
increments (expose/read a test metric from the KeepAlive loop) before reading
afterSmall; update TestLeaseKeepAliveKeepsExpireMonotonic to use the chosen
deterministic wait instead of time.Sleep.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b76876a9-1227-4569-9e60-6dc4d576f1b4
📒 Files selected for processing (3)
pkg/election/lease.go, pkg/election/lease_test.go, pkg/election/metrics.go
Force-pushed from b0c7b7f to 85a5524 (Compare)
Actionable comments posted: 1
♻️ Duplicate comments (2)
pkg/election/lease_test.go (1)
153-170: ⚠️ Potential issue | 🟡 Minor: Tight 20 ms watchdog still risks flake on loaded CI.
Same concern as in the previous review: a 20 ms watchdog leaves very little headroom for goroutine startup and scheduler latency. Bumping to ~100–200 ms keeps the test deterministic without materially slowing it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease_test.go` around lines 153 - 170, The 20ms watchdog timeout in TestRenewalFailureWatchdogTimeout is too tight and may flake on CI; update the timeout passed to newTestLeaseWithResponses (currently 20*time.Millisecond) to a larger value (e.g., 150-200ms) to provide scheduler/goroutine headroom while keeping the test fast, leaving the rest of the test (calls to lease.KeepAlive, renewalFailureValue(..., reasonWatchdogTimeout), and localTTLRemainingValue checks) unchanged.
pkg/election/lease.go (1)
155-160: ⚠️ Potential issue | 🟡 Minor: Watchdog branch can record a phantom failure during shutdown races.
When `ctx` is cancelled at almost the same moment `timer.C` fires, Go's `select` may still pick the timer case even though the caller has already abandoned the loop, incrementing `renewalFailureTotal{reason="watchdog_timeout"}` and emitting a "keep alive lease too slow" log on a clean shutdown. Mirroring the early return on `ctx.Err() != nil` keeps the metric/log limited to true watchdog expirations.
`TestNoFailureOnContextCancel` doesn't exercise this race because `leaseTimeout = time.Hour`, so the timer cannot fire within the test window.
🛡️ Suggested guard
```diff
 case <-timer.C:
+	if ctx.Err() != nil {
+		return
+	}
 	actualExpire := l.expireTime.Load().(time.Time)
 	l.metrics.ttlRemaining.Set(time.Until(actualExpire).Seconds())
 	l.metrics.watchdogTimeout.Inc()
 	log.Info("keep alive lease too slow",
 		zap.Duration("timeout-duration", l.leaseTimeout),
 		zap.Time("actual-expire", actualExpire),
 		zap.String("purpose", l.Purpose))
 	return
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease.go` around lines 155 - 160, The watchdog timer branch can misreport failures during shutdown races; in the select case handling timer.C (where you read actualExpire via l.expireTime.Load(), set l.metrics.ttlRemaining, increment l.metrics.watchdogTimeout and call log.Info about "keep alive lease too slow"), first check if ctx.Err() != nil and return early if so, mirroring the existing early-return guard used elsewhere; this prevents emitting the watchdog_timeout metric and log when the caller has already canceled the context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/election/lease_test.go`:
- Around line 259-298: The fakeLease.KeepAliveOnce method mutates shared fields
ttls and err without synchronization, causing a data race when keepAliveWorker
spawns concurrent goroutines; fix by adding a mutex field (e.g., mu sync.Mutex)
to fakeLease and guard all accesses and mutations of l.ttls and l.err in
KeepAliveOnce (lock at start, update/slice l.ttls and read l.err while locked,
then unlock before returning) so concurrent calls are safe.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bb07ab67-6d22-4ea3-ad47-372d15157a58
📒 Files selected for processing (3)
pkg/election/lease.go, pkg/election/lease_test.go, pkg/election/metrics.go
Signed-off-by: JmPotato <github@ipotato.me>
Force-pushed from 85a5524 to 9cf0ea6 (Compare)
♻️ Duplicate comments (3)
pkg/election/lease.go (1)
155-160: ⚠️ Potential issue | 🟡 Minor: Spurious `leaseExpired` increment on shutdown race.
When `ctx.Done()` and `timer.C` become ready simultaneously, `select` may pick the `timer.C` arm and increment `l.metrics.leaseExpired` even though the caller is shutting down cleanly. Consider gating the increment/log on `ctx.Err() == nil`:
🛡️ Suggested guard
```diff
 case <-timer.C:
 	actualExpire := l.expireTime.Load().(time.Time)
 	l.metrics.ttlRemaining.Set(time.Until(actualExpire).Seconds())
+	if ctx.Err() != nil {
+		return
+	}
 	l.metrics.leaseExpired.Inc()
 	log.Info("keep alive lease too slow",
 		zap.Duration("timeout-duration", l.leaseTimeout),
 		zap.Time("actual-expire", actualExpire),
 		zap.String("purpose", l.Purpose))
 	return
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease.go` around lines 155 - 160, The timer.C case in the lease keep-alive loop increments l.metrics.leaseExpired and logs even when shutdown raced with the timer; guard that work by checking the request context state before treating it as an actual lease expiration: in the timer.C branch of the loop (the block referencing l.expireTime.Load(), l.metrics.ttlRemaining, l.metrics.leaseExpired, and log.Info with l.leaseTimeout and l.Purpose) first verify ctx.Err() == nil (or equivalent not-shutting-down check) and only then update ttlRemaining, increment l.metrics.leaseExpired, and emit the "keep alive lease too slow" log; if ctx.Err() != nil, return without counting it as an expiration.
pkg/election/lease_test.go (2)
153-170: ⚠️ Potential issue | 🟡 Minor: Tight 20 ms `leaseTimeout` may flake under load.
The test depends on the timer firing and `KeepAlive` returning within a short window. On a busy CI runner, goroutine startup + scheduler latency + an occasional GC pause can push the actual delivery much later, or even keep `time.Until(actualExpire)` momentarily positive (the helper seeds `expireTime` to `now - 1s`). Bumping to ~100–200 ms keeps the test deterministic without slowing the suite materially.
♻️ Suggested change
```diff
-	lease := newTestLeaseWithResponses(purpose, 20*time.Millisecond)
+	lease := newTestLeaseWithResponses(purpose, 200*time.Millisecond)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease_test.go` around lines 153 - 170, TestRenewalFailureLeaseExpired uses a tight 20ms lease timeout which can flake under CI; update the call to newTestLeaseWithResponses in TestRenewalFailureLeaseExpired (and any similar tests) to use a larger timeout (≈100–200ms) instead of 20*time.Millisecond so the KeepAlive goroutine and timers have headroom (reference: TestRenewalFailureLeaseExpired and newTestLeaseWithResponses).
259-294: ⚠️ Potential issue | 🟠 Major: Data race in `fakeLease.KeepAliveOnce` under `-race`.
`keepAliveWorker` (lease.go line 184) spawns a fresh goroutine every `leaseTimeout/3`. In `TestRenewalFailureLeaseExpired` that's ≈6.6 ms, and the spawned goroutines read `l.err` / read+slice `l.ttls` concurrently with no synchronization. `go test -race` will flag this, and responses become non-deterministic.
🔒 Suggested fix
```diff
 type fakeLease struct {
+	mu   sync.Mutex
 	ttls []int64
 	err  error
 }
@@
 func (l *fakeLease) KeepAliveOnce(context.Context, clientv3.LeaseID) (*clientv3.LeaseKeepAliveResponse, error) {
+	l.mu.Lock()
+	defer l.mu.Unlock()
 	if l.err != nil {
 		return nil, l.err
 	}
 	if len(l.ttls) == 0 {
 		return nil, errors.New("no fake keepalive response")
 	}
 	ttl := l.ttls[0]
 	l.ttls = l.ttls[1:]
 	return &clientv3.LeaseKeepAliveResponse{ID: clientv3.LeaseID(1), TTL: ttl}, nil
 }
```
(Add `"sync"` to the imports.)
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease_test.go` around lines 259 - 294, fakeLease.KeepAliveOnce is racy because keepAliveWorker spawns goroutines that concurrently read/modify l.err and l.ttls; to fix, add synchronization to fakeLease by adding a sync.Mutex field (or sync.RWMutex) and lock/unlock around accesses to l.err and the l.ttls slice inside KeepAliveOnce; also add "sync" to the imports so tests under -race become deterministic (references: fakeLease.KeepAliveOnce, keepAliveWorker, TestRenewalFailureLeaseExpired).
🧹 Nitpick comments (1)
pkg/election/lease.go (1)
90-108: `local_ttl_remaining_seconds` gauge is not reset on `Close()`.
After a lease is revoked, the gauge keeps its last observed value indefinitely (until process restart), which can be misleading in dashboards/alerts that key off "remaining TTL ≤ 0". Consider deleting the per-purpose child (or setting it to 0) in `Close()`:
♻️ Suggested cleanup
```diff
 func (l *Lease) Close() error {
 	if l == nil {
 		return nil
 	}
 	// Reset expire time.
 	l.expireTime.Store(typeutil.ZeroTime)
+	localTTLRemaining.DeleteLabelValues(l.Purpose)
 	// Try to revoke lease to make subsequent elections faster.
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease.go` around lines 90 - 108, The Close() method should clear the per-purpose local_ttl_remaining_seconds metric so it doesn't linger after lease revocation; update Lease.Close() (around where l.Purpose is available) to either set the per-purpose child gauge to 0 or unregister/delete that child for l.Purpose (before/after the revoke and expireTime reset), ensuring any metric registry or gauge variable used for local_ttl_remaining_seconds is referenced and cleaned up for the specific purpose.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: acfbca14-809d-4e64-9806-90cae220fa7c
📒 Files selected for processing (3)
pkg/election/lease.go, pkg/election/lease_test.go, pkg/election/metrics.go
Signed-off-by: JmPotato <github@ipotato.me>
Actionable comments posted: 1
♻️ Duplicate comments (1)
pkg/election/lease.go (1)
165-170: ⚠️ Potential issue | 🟡 Minor: Watchdog branch still lacks a `ctx.Err()` guard.
When `ctx` is cancelled at nearly the same instant `timer.C` fires, Go's `select` may pick the timer case even though the caller has already shut down the loop. As written, this unconditionally increments `metrics.leaseExpired` and logs "keep alive lease too slow", producing a phantom failure event on a clean shutdown of an already-expired lease. Mirroring the early-return behaviour of the `ctx.Done()` branch keeps the failure counter focused on real renewal failures.
🛡️ Suggested guard
```diff
 case <-timer.C:
 	actualExpire := l.loadExpireTime()
 	l.metrics.ttlRemaining.Set(time.Until(actualExpire).Seconds())
+	if ctx.Err() != nil {
+		return
+	}
 	l.metrics.leaseExpired.Inc()
 	log.Info("keep alive lease too slow",
 		zap.Duration("timeout-duration", l.leaseTimeout),
 		zap.Time("actual-expire", actualExpire),
 		zap.String("purpose", l.Purpose))
 	return
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/election/lease.go` around lines 165 - 170, The timer.C case in the select (where you compute actualExpire via l.loadExpireTime(), call l.metrics.leaseExpired.Inc(), and log.Info("keep alive lease too slow", ...)) should first check ctx.Err() and return without incrementing or logging when the context is canceled; change the watchdog branch to read ctx.Err() (or check ctx.Done()) and only perform l.metrics.leaseExpired.Inc() and the log.Info call if ctx.Err() == nil so shutdown races don't produce phantom lease-failure metrics/events.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@metrics/grafana/pd.json`:
- Around line 16400-16410: The Grafana query uses unsupported label filters on
the metric service_member_role (defined in pkg/member/metrics.go with only the
service label), so remove the nonexistent k8s_cluster and tidb_cluster selectors
from the expr; replace service_member_role{k8s_cluster="$k8s_cluster",
tidb_cluster="$tidb_cluster"} with a valid selector such as service_member_role
or service_member_role{service="$service"} (or any actual label exposed by
service_member_role) so the table will return rows when a leader exists.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 00e39094-4c7d-48fb-85b5-cc54013e24b1
📒 Files selected for processing (3)
metrics/grafana/pd.json, pkg/election/lease.go, pkg/election/lease_test.go
/retest
@JmPotato: The following test failed, say
Full PR test history. Your PR dashboard. Details
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
What problem does this PR solve?
Issue Number: ref #9389
This PR adds lease keepalive observability to the current `master` implementation. It is independent from #10618 and can be merged before the streaming keepalive PR.
What is changed and how does it work?
Add Prometheus metrics for the current lease keepalive loop:
- `pd_lease_keepalive_response_interval_seconds{purpose}` for valid `KeepAliveOnce` response intervals.
- `pd_lease_renewal_failure_total{purpose,reason}` for current implementation failures: `invalid_ttl` and `lease_expired`.
- `pd_lease_local_ttl_remaining_seconds{purpose}` as an event-sampled local TTL estimate, not an etcd-authoritative TTL.
Metric children are cached on `Lease` after a successful `Grant()`, so the keepalive path does not call `.WithLabelValues`. Streaming-only reasons such as stream setup or channel-close failures are intentionally not introduced in this PR; they can be added when #10618 lands.
Additional review follow-ups:
- `loadExpireTime()` for nil leases, empty values, and unexpected value types, returning `typeutil.ZeroTime` so a missing/invalid expire time is treated as expired instead of panicking.
- `Leader` row at the end of the PD Grafana dashboard.
- `Leader/Primary`, `Raft term`, lease local TTL remaining, lease keepalive response interval p99, and lease renewal failures in the `Leader` row.
- `job` first, then `purpose`/`reason`/`instance`, to avoid hiding the service source in multi-service deployments.
Check List
Tests
Unit test
Manual test (dashboard JSON validation)
- `make gotest GOTEST_ARGS='./pkg/election -count=1'`
- `make static PACKAGE_DIRECTORIES='./pkg/election' SUBMODULES=`
- `jq empty metrics/grafana/pd.json`
- `jq -e '.panels[-1].title == "Leader" and .panels[-1].id == 1700 and ([.panels[-1].panels[].title] == ["Leader/Primary", "Raft term", "Lease local TTL remaining", "Lease keepalive response interval", "Lease renewal failures"]) and (.panels[-1].panels[] | select(.title == "Lease keepalive response interval") | .targets | length == 1)' metrics/grafana/pd.json`
- `git diff --check`
Release note
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores