pkg/election: add lease keepalive metrics #10622

Open
JmPotato wants to merge 2 commits into tikv:master from JmPotato:bob/lease-keepalive-metrics

Conversation

Member

@JmPotato JmPotato commented Apr 25, 2026

What problem does this PR solve?

Issue Number: ref #9389

This PR adds lease keepalive observability to the current master implementation. It is independent from #10618 and can be merged before the streaming keepalive PR.

What is changed and how does it work?

pkg/election: add lease keepalive metrics

Add Prometheus metrics for the current lease keepalive loop:

  • pd_lease_keepalive_response_interval_seconds{purpose} for valid KeepAliveOnce response intervals.
  • pd_lease_renewal_failure_total{purpose,reason} for current implementation failures: invalid_ttl and lease_expired.
  • pd_lease_local_ttl_remaining_seconds{purpose} as an event-sampled local TTL estimate, not an etcd authoritative TTL.

Metric children are cached on Lease after a successful Grant(), so the keepalive path does not call .WithLabelValues. Streaming-only reasons such as stream setup or channel-close failures are intentionally not introduced in this PR; they can be added when #10618 lands.

Additional review follow-ups:

  • Harden loadExpireTime() for nil leases, empty values, and unexpected value types, returning typeutil.ZeroTime so missing/invalid expire time is treated as expired instead of panicking.
  • Add a collapsed Leader row at the end of the PD Grafana dashboard.
  • Show Leader/Primary, Raft term, lease local TTL remaining, lease keepalive response interval p99, and lease renewal failures in the Leader row.
  • Keep lease dashboard legends scoped by job first, then purpose/reason/instance, to avoid hiding service source in multi-service deployments.

Check List

Tests

  • Unit test

  • Manual test (dashboard JSON validation)

  • make gotest GOTEST_ARGS='./pkg/election -count=1'

  • make static PACKAGE_DIRECTORIES='./pkg/election' SUBMODULES=

  • jq empty metrics/grafana/pd.json

  • jq -e '.panels[-1].title == "Leader" and .panels[-1].id == 1700 and ([.panels[-1].panels[].title] == ["Leader/Primary", "Raft term", "Lease local TTL remaining", "Lease keepalive response interval", "Lease renewal failures"]) and (.panels[-1].panels[] | select(.title == "Lease keepalive response interval") | .targets | length == 1)' metrics/grafana/pd.json

  • git diff --check


Release note

None.

Summary by CodeRabbit

  • New Features

    • Added Prometheus metrics and updated Grafana dashboard panels to surface lease TTL, keepalive response intervals, and renewal failures.
  • Bug Fixes

    • More robust lease expiration checks and handling of invalid/non-positive TTLs to prevent incorrect expirations.
  • Tests

    • Added a unit test covering expire-time loading behavior across edge cases.
  • Chores

    • Improved local TTL tracking and explicit renewal-failure reporting for better observability.

@ti-chi-bot ti-chi-bot Bot added release-note-none Denotes a PR that doesn't merit a release note. dco-signoff: yes Indicates the PR's author has signed the dco. labels Apr 25, 2026
Contributor

ti-chi-bot Bot commented Apr 25, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign okjiang for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai Bot commented Apr 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Lease now owns per-purpose Prometheus metrics initialized in Grant. KeepAlive measures response intervals, updates a local TTL gauge from observed/computed expire times, treats non‑positive TTLs as invalid (increments invalid counter and stops), and increments a lease‑expired counter on watchdog timeout.

Changes

  • Lease core (pkg/election/lease.go): Added metrics field and initialization in Grant(); KeepAlive() records keepalive response intervals, computes monotonic expire times via loadExpireTime(), updates local_ttl_remaining_seconds from channel updates and watchdog sampling, treats res.TTL <= 0 as invalid (increments invalidTTL and stops without publishing expire), and increments leaseExpired on watchdog timeout.
  • Metrics (pkg/election/metrics.go): New Prometheus collectors under namespace pd, subsystem lease: keepalive_response_interval_seconds (histogram by purpose), renewal_failure_total (counter by purpose,reason with reasons like invalid_ttl and watchdog_timeout), and local_ttl_remaining_seconds (gauge by purpose). Adds the leaseMetrics type and newLeaseMetrics(purpose) factory; registers the collectors.
  • Tests (pkg/election/lease_test.go): Added TestLoadExpireTime covering Lease.loadExpireTime for nil/empty/invalid/valid expireTime cases; imports typeutil for ZeroTime comparisons.
  • Dashboards (metrics/grafana/pd.json): Appended a collapsed "Leader" row with five panels: leader table, Raft term, PD lease local TTL remaining metric, 99th-percentile keepalive response interval (5m histogram_quantile over rate), and lease renewal failures (per-minute rate).

Sequence Diagram(s)

sequenceDiagram
  participant Client as Caller
  participant Lease as Lease
  participant Lessor as Etcd Lessor
  participant Stream as KeepAlive Stream
  participant Watchdog as Watchdog Timer
  participant Metrics as Prometheus Metrics

  rect rgba(200,220,255,0.5)
  Client->>Lease: Grant() -> initialize metrics
  Lease->>Lessor: KeepAlive(ctx, leaseID) -> open stream
  end

  rect rgba(220,255,200,0.5)
  Stream->>Lease: keepalive response (TTL, leaseID)
  Lease->>Lease: loadExpireTime() / compute monotonic expireTime
  Lease->>Metrics: observe keepalive_response_interval_seconds
  Lease->>Metrics: set local_ttl_remaining_seconds (max of observed/computed)
  end

  rect rgba(255,230,200,0.5)
  Watchdog->>Lease: tick -> sample expireTime
  alt TTL expired (watchdog)
    Lease->>Metrics: increment renewal_failure_total (watchdog_timeout)
    Lease->>Lease: stop keepalive processing
  else invalid TTL received (TTL <= 0)
    Stream->>Lease: keepalive response (TTL<=0)
    Lease->>Metrics: increment renewal_failure_total (invalid_ttl)
    Lease->>Lease: stop keepalive processing (no expire published)
  end
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I time the ticks between each keepalive call,
Gauges hum softly and histograms sprawl,
When zeros arrive I count the doleful bell,
Watchdog naps wake me — metrics rise and swell,
A rabbit’s hop keeps leases safe for all.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 33.33%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title clearly and concisely summarizes the main change: adding lease keepalive metrics to the election package.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Description check ✅ Passed: The PR description comprehensively addresses all required template sections with a clear problem statement, detailed implementation explanation, thorough test coverage, and appropriate release notes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@ti-chi-bot ti-chi-bot Bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 25, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
pkg/election/metrics.go (1)

19-72: LGTM — clean metrics declarations.

The collectors follow Prometheus conventions (namespace + subsystem + unit suffix), labels are low-cardinality (purpose, reason are bounded enums), and the help texts are descriptive — particularly the localTTLRemaining help text which explicitly disclaims it isn't the etcd-authoritative TTL. The reason constants are exported with proper GoDoc and reused at the call sites in lease.go.

One small ergonomic suggestion (optional): the three prometheus.MustRegister calls in init() can be collapsed into a single varargs call.

♻️ Optional: combine MustRegister calls
 func init() {
-	prometheus.MustRegister(keepAliveResponseInterval)
-	prometheus.MustRegister(renewalFailureTotal)
-	prometheus.MustRegister(localTTLRemaining)
+	prometheus.MustRegister(
+		keepAliveResponseInterval,
+		renewalFailureTotal,
+		localTTLRemaining,
+	)
 }

Based on learnings: "Use Prometheus-style metrics with subsystem + unit; avoid high-cardinality labels".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/metrics.go` around lines 19 - 72, The init() function currently
calls prometheus.MustRegister three times; replace those three calls with a
single variadic call to prometheus.MustRegister(keepAliveResponseInterval,
renewalFailureTotal, localTTLRemaining) to make registration concise and
idiomatic while preserving the same collectors (refer to the init function and
the collector variables keepAliveResponseInterval, renewalFailureTotal, and
localTTLRemaining).
pkg/election/lease_test.go (2)

361-370: Minor: thread purpose through the helper.

Most callers of newTestLeaseWithKeepAliveCh immediately overwrite lease.Purpose = purpose afterwards (lines 215, 241, 263, 283, 302, 316, 342). Threading the purpose into the helper would remove the seven-line repetition and keep the test setup atomic.

♻️ Suggested helper change
-func newTestLeaseWithKeepAliveCh(keepAliveCh chan *clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) *Lease {
+func newTestLeaseWithKeepAliveCh(purpose string, keepAliveCh chan *clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) *Lease {
 	lease := &Lease{
-		Purpose:      "test_lease",
+		Purpose:      purpose,
 		lease:        &fakeLease{keepAliveCh: keepAliveCh},
 		leaseTimeout: leaseTimeout,
 	}
 	lease.ID.Store(clientv3.LeaseID(1))
 	lease.expireTime.Store(time.Now().Add(-time.Second))
 	return lease
 }

Then at each call site:

-	lease := newTestLeaseWithKeepAliveCh(keepAliveCh, time.Hour)
-	lease.Purpose = purpose
+	lease := newTestLeaseWithKeepAliveCh(purpose, keepAliveCh, time.Hour)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 361 - 370, Change
newTestLeaseWithKeepAliveCh to accept a purpose string and assign it to
lease.Purpose so callers don't need to overwrite it; specifically, modify the
helper signature newTestLeaseWithKeepAliveCh(keepAliveCh chan
*clientv3.LeaseKeepAliveResponse, leaseTimeout time.Duration) to include a
purpose parameter and set lease.Purpose = purpose inside the function, then
update all call sites that currently set lease.Purpose after creation to pass
the desired purpose into the newTestLeaseWithKeepAliveCh call (e.g., replace
post-construction assignments like lease.Purpose = purpose with passing purpose
into the helper).

200-208: Sleep-based monotonicity probe is acceptable but worth noting.

time.Sleep(100 * time.Millisecond) at line 205 is the only nondeterministic synchronization in the new tests. It's pragmatic — the test fails only if the monotonicity guard regresses, and a too-short sleep trivially passes — so a missing guard wouldn't cause spurious failures, only spurious passes. Fine to keep, but if you'd like a more deterministic approach, a follow-up could send a third response with a larger TTL and use Eventually to wait for it to take effect, which proves the loop has drained the smaller TTL.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 200 - 208, The test currently uses
time.Sleep(100 * time.Millisecond) to wait for the keep-alive processing loop;
replace this nondeterministic sleep with a deterministic probe: after
sendKeepAliveResponse(t, keepAliveCh, 1) send a third keep-alive with a larger
TTL (e.g., via sendKeepAliveResponse with TTL > previous), then use a
retry/assertion helper (like re.Eventually or require.Eventually) to wait until
lease.getExpireTime() reflects the later larger TTL, which guarantees the loop
drained the smaller TTL rather than relying on an arbitrary sleep.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/election/lease.go`:
- Around line 226-241: The watchdog case can increment renewalFailureTotal
spuriously during shutdown races; modify the watchdog.C branch in
pkg/election/lease.go so after computing remaining := time.Until(actualExpire)
and before logging/incrementing you check the context (ctx.Err() == nil) —
mirror the guard used in the channel-closed branch — and only call logger.Info,
renewalFailureTotal.WithLabelValues(...).Inc(), and return when ctx is still
active; otherwise treat it as a clean shutdown (reset or continue as
appropriate). Ensure you reference watchdog.C, l.getExpireTime(),
localTTLRemaining, renewalFailureTotal, ReasonWatchdogTimeout and l.Purpose in
the change.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b252437b-213b-4a52-9752-a592ac7ce8bb

📥 Commits

Reviewing files that changed from the base of the PR and between ee0fa30 and 02b6b75.

📒 Files selected for processing (3)
  • pkg/election/lease.go
  • pkg/election/lease_test.go
  • pkg/election/metrics.go

@JmPotato JmPotato force-pushed the bob/lease-keepalive-metrics branch from 02b6b75 to b0c7b7f Compare April 25, 2026 14:33

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
pkg/election/lease_test.go (1)

392-401: Histogram introspection LGTM, but worth a brief comment.

The observer.(prometheus.Metric) assertion works because client_golang histogram instances implement both Observer and Metric. This is intentional but not obvious; a one-line comment explaining the assertion would help future readers and guard against regressions if someone replaces the histogram with a wrapped observer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 392 - 401, In
keepAliveResponseIntervalCount, add a one-line comment above the
observer.(prometheus.Metric) type assertion explaining that the current
prometheus histogram implementation also implements prometheus.Metric, so
casting the Observer to prometheus.Metric is safe here; also note that this
relies on the concrete histogram implementation and would break if the metric
were replaced by a wrapped Observer, to warn future maintainers and prevent
accidental regressions.
pkg/election/lease.go (1)

86-96: Consider initializing metrics in NewLease, not Grant.

l.metrics is currently initialized only at the end of Grant(). If a programmer ever invokes KeepAlive() before a successful Grant() (or after Grant() returns an error), the very first metric write — e.g. l.metrics.streamFailed.Inc() at line 178 — will deref a nil prometheus.Counter and panic.

Since newLeaseMetrics() only depends on l.Purpose (set in NewLease), initializing there makes the type zero-value-friendlier and matches the guideline "Keep structs zero-value friendly":

🛡️ Suggested move
 func NewLease(client *clientv3.Client, purpose string) *Lease {
-	return &Lease{
+	l := &Lease{
 		Purpose: purpose,
 		client:  client,
 		lease:   clientv3.NewLease(client),
 	}
+	l.initMetrics()
+	return l
 }
@@
 	l.expireTime.Store(start.Add(time.Duration(leaseResp.TTL) * time.Second))
-	l.initMetrics()
 	return nil

As per coding guidelines: "Keep structs zero-value friendly; init maps/slices before use".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease.go` around lines 86 - 96, The metrics field l.metrics
should be initialized in NewLease (where l.Purpose is already set) instead of
only in Grant to avoid nil derefs when methods like KeepAlive access l.metrics
before a successful Grant; call newLeaseMetrics(l.Purpose) during NewLease
construction to set l.metrics and remove or keep a no-op initMetrics call from
Grant (or guard uses) so Grant no longer is the sole initializer; update
references to l.metrics (e.g., streamFailed) to assume it is non-nil after
NewLease.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/election/lease_test.go`:
- Around line 275-292: Test uses a brittle 20ms watchdog timeout in
TestRenewalFailureWatchdogTimeout which can flake under CI; update the watchdog
duration passed to newTestLeaseWithKeepAliveCh from 20*time.Millisecond to a
more robust value (e.g., 100–200*time.Millisecond) so the test has enough time
for goroutine startup/scheduling without materially slowing it, leaving the rest
of the test (done channel, KeepAlive call, and assertions on renewalFailureValue
and localTTLRemainingValue) unchanged.
- Around line 187-208: The test uses time.Sleep(100*time.Millisecond) as a
race-prone sync; replace it with a deterministic synchronization so the
KeepAlive consumer has processed the small-TTL response before asserting
monotonicity: after sendKeepAliveResponse(t, keepAliveCh, 1) either (a) send a
marker over an unbuffered channel that the KeepAlive loop reads when it finishes
processing each response (add a test-only opt to lease.KeepAlive or a hook
callback) and wait for that marker, or (b) poll with testutil.Eventually until
lease.getExpireTime() stops changing or until a processed-response counter
increments (expose/read a test metric from the KeepAlive loop) before reading
afterSmall; update TestLeaseKeepAliveKeepsExpireMonotonic to use the chosen
deterministic wait instead of time.Sleep.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b76876a9-1227-4569-9e60-6dc4d576f1b4

📥 Commits

Reviewing files that changed from the base of the PR and between 02b6b75 and b0c7b7f.

📒 Files selected for processing (3)
  • pkg/election/lease.go
  • pkg/election/lease_test.go
  • pkg/election/metrics.go

@JmPotato JmPotato force-pushed the bob/lease-keepalive-metrics branch from b0c7b7f to 85a5524 Compare April 26, 2026 09:59
@ti-chi-bot ti-chi-bot Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Apr 26, 2026

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
pkg/election/lease_test.go (1)

153-170: ⚠️ Potential issue | 🟡 Minor

Tight 20 ms watchdog still risks flake on loaded CI.

Same concern as in the previous review: a 20 ms watchdog leaves very little headroom for goroutine startup and scheduler latency. Bumping to ~100–200 ms keeps the test deterministic without materially slowing it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 153 - 170, The 20ms watchdog timeout
in TestRenewalFailureWatchdogTimeout is too tight and may flake on CI; update
the timeout passed to newTestLeaseWithResponses (currently 20*time.Millisecond)
to a larger value (e.g., 150-200ms) to provide scheduler/goroutine headroom
while keeping the test fast, leaving the rest of the test (calls to
lease.KeepAlive, renewalFailureValue(..., reasonWatchdogTimeout), and
localTTLRemainingValue checks) unchanged.
pkg/election/lease.go (1)

155-160: ⚠️ Potential issue | 🟡 Minor

Watchdog branch can record a phantom failure during shutdown races.

When ctx is cancelled at almost the same moment timer.C fires, Go's select may still pick the timer case even though the caller has already abandoned the loop, incrementing renewalFailureTotal{reason="watchdog_timeout"} and emitting a "keep alive lease too slow" log on a clean shutdown. Mirroring the early-return on ctx.Err() != nil keeps the metric/log limited to true watchdog expirations.

TestNoFailureOnContextCancel doesn't exercise this race because leaseTimeout=time.Hour, so the timer cannot fire within the test window.

🛡️ Suggested guard
 		case <-timer.C:
+			if ctx.Err() != nil {
+				return
+			}
 			actualExpire := l.expireTime.Load().(time.Time)
 			l.metrics.ttlRemaining.Set(time.Until(actualExpire).Seconds())
 			l.metrics.watchdogTimeout.Inc()
 			log.Info("keep alive lease too slow", zap.Duration("timeout-duration", l.leaseTimeout), zap.Time("actual-expire", actualExpire), zap.String("purpose", l.Purpose))
 			return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease.go` around lines 155 - 160, The watchdog timer branch can
misreport failures during shutdown races; in the select case handling timer.C
(where you read actualExpire via l.expireTime.Load(), set
l.metrics.ttlRemaining, increment l.metrics.watchdogTimeout and call log.Info
about "keep alive lease too slow"), first check if ctx.Err() != nil and return
early if so, mirroring the existing early-return guard used elsewhere; this
prevents emitting the watchdog_timeout metric and log when the caller has
already canceled the context.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/election/lease_test.go`:
- Around line 259-298: The fakeLease.KeepAliveOnce method mutates shared fields
ttls and err without synchronization, causing a data race when keepAliveWorker
spawns concurrent goroutines; fix by adding a mutex field (e.g., mu sync.Mutex)
to fakeLease and guard all accesses and mutations of l.ttls and l.err in
KeepAliveOnce (lock at start, update/slice l.ttls and read l.err while locked,
then unlock before returning) so concurrent calls are safe.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bb07ab67-6d22-4ea3-ad47-372d15157a58

📥 Commits

Reviewing files that changed from the base of the PR and between b0c7b7f and 85a5524.

📒 Files selected for processing (3)
  • pkg/election/lease.go
  • pkg/election/lease_test.go
  • pkg/election/metrics.go

Signed-off-by: JmPotato <github@ipotato.me>
@JmPotato JmPotato force-pushed the bob/lease-keepalive-metrics branch from 85a5524 to 9cf0ea6 Compare April 26, 2026 10:12

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (3)
pkg/election/lease.go (1)

155-160: ⚠️ Potential issue | 🟡 Minor

Spurious leaseExpired increment on shutdown race.

When ctx.Done() and timer.C become ready simultaneously, select may pick the timer.C arm and increment l.metrics.leaseExpired even though the caller is shutting down cleanly. Consider gating the increment/log on ctx.Err() == nil:

🛡️ Suggested guard
 		case <-timer.C:
 			actualExpire := l.expireTime.Load().(time.Time)
 			l.metrics.ttlRemaining.Set(time.Until(actualExpire).Seconds())
+			if ctx.Err() != nil {
+				return
+			}
 			l.metrics.leaseExpired.Inc()
 			log.Info("keep alive lease too slow", zap.Duration("timeout-duration", l.leaseTimeout), zap.Time("actual-expire", actualExpire), zap.String("purpose", l.Purpose))
 			return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease.go` around lines 155 - 160, The timer.C case in the lease
keep-alive loop increments l.metrics.leaseExpired and logs even when shutdown
raced with the timer; guard that work by checking the request context state
before treating it as an actual lease expiration: in the timer.C branch of the
loop (the block referencing l.expireTime.Load(), l.metrics.ttlRemaining,
l.metrics.leaseExpired, and log.Info with l.leaseTimeout and l.Purpose) first
verify ctx.Err() == nil (or equivalent not-shutting-down check) and only then
update ttlRemaining, increment l.metrics.leaseExpired, and emit the "keep alive
lease too slow" log; if ctx.Err() != nil, return without counting it as an
expiration.
pkg/election/lease_test.go (2)

153-170: ⚠️ Potential issue | 🟡 Minor

Tight 20 ms leaseTimeout may flake under load.

The test depends on the timer firing and KeepAlive returning within a short window. On a busy CI runner, goroutine startup + scheduler latency + occasional GC pause can push the actual delivery much later or even keep time.Until(actualExpire) momentarily positive (the helper seeds expireTime to now-1s). Bumping to ~100–200 ms keeps the test deterministic without slowing the suite materially.

♻️ Suggested change
-	lease := newTestLeaseWithResponses(purpose, 20*time.Millisecond)
+	lease := newTestLeaseWithResponses(purpose, 200*time.Millisecond)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 153 - 170,
TestRenewalFailureLeaseExpired uses a tight 20ms lease timeout which can flake
under CI; update the call to newTestLeaseWithResponses in
TestRenewalFailureLeaseExpired (and any similar tests) to use a larger timeout
(≈100–200ms) instead of 20*time.Millisecond so the KeepAlive goroutine and
timers have headroom (reference: TestRenewalFailureLeaseExpired and
newTestLeaseWithResponses).

259-294: ⚠️ Potential issue | 🟠 Major

Data race in fakeLease.KeepAliveOnce under -race.

keepAliveWorker (lease.go line 184) spawns a fresh goroutine every leaseTimeout/3. In TestRenewalFailureLeaseExpired that's ≈6.6 ms, and the spawned goroutines read l.err / read+slice l.ttls concurrently with no synchronization. go test -race will flag this and responses become non-deterministic.

🔒 Suggested fix
 type fakeLease struct {
+	mu   sync.Mutex
 	ttls []int64
 	err  error
 }
@@
 func (l *fakeLease) KeepAliveOnce(context.Context, clientv3.LeaseID) (*clientv3.LeaseKeepAliveResponse, error) {
+	l.mu.Lock()
+	defer l.mu.Unlock()
 	if l.err != nil {
 		return nil, l.err
 	}
 	if len(l.ttls) == 0 {
 		return nil, errors.New("no fake keepalive response")
 	}
 	ttl := l.ttls[0]
 	l.ttls = l.ttls[1:]
 	return &clientv3.LeaseKeepAliveResponse{ID: clientv3.LeaseID(1), TTL: ttl}, nil
 }

(Add "sync" to the imports.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease_test.go` around lines 259 - 294, fakeLease.KeepAliveOnce
is racy because keepAliveWorker spawns goroutines that concurrently read/modify
l.err and l.ttls; to fix, add synchronization to fakeLease by adding a
sync.Mutex field (or sync.RWMutex) and lock/unlock around accesses to l.err and
the l.ttls slice inside KeepAliveOnce; also add "sync" to the imports so tests
under -race become deterministic (references: fakeLease.KeepAliveOnce,
keepAliveWorker, TestRenewalFailureLeaseExpired).
🧹 Nitpick comments (1)
pkg/election/lease.go (1)

90-108: local_ttl_remaining_seconds gauge is not reset on Close().

After a lease is revoked, the gauge keeps its last observed value indefinitely (until process restart), which can be misleading in dashboards/alerts that key off "remaining TTL ≤ 0". Consider deleting the per-purpose child (or setting it to 0) in Close():

♻️ Suggested cleanup
 func (l *Lease) Close() error {
 	if l == nil {
 		return nil
 	}
 	// Reset expire time.
 	l.expireTime.Store(typeutil.ZeroTime)
+	localTTLRemaining.DeleteLabelValues(l.Purpose)
 	// Try to revoke lease to make subsequent elections faster.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease.go` around lines 90 - 108, The Close() method should clear
the per-purpose local_ttl_remaining_seconds metric so it doesn't linger after
lease revocation; update Lease.Close() (around where l.Purpose is available) to
either set the per-purpose child gauge to 0 or unregister/delete that child for
l.Purpose (before/after the revoke and expireTime reset), ensuring any metric
registry or gauge variable used for local_ttl_remaining_seconds is referenced
and cleaned up for the specific purpose.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: acfbca14-809d-4e64-9806-90cae220fa7c

📥 Commits

Reviewing files that changed from the base of the PR and between 85a5524 and 9cf0ea6.

📒 Files selected for processing (3)
  • pkg/election/lease.go
  • pkg/election/lease_test.go
  • pkg/election/metrics.go

Signed-off-by: JmPotato <github@ipotato.me>
@ti-chi-bot ti-chi-bot Bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Apr 27, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
pkg/election/lease.go (1)

165-170: ⚠️ Potential issue | 🟡 Minor

Watchdog branch still lacks a ctx.Err() guard.

When ctx is cancelled at nearly the same instant timer.C fires, Go's select may pick the timer case even though the caller has already shut down the loop. As written, this unconditionally increments metrics.leaseExpired and logs "keep alive lease too slow", producing a phantom failure event on a clean shutdown of an already-expired lease. Mirroring the early-return behaviour of the ctx.Done() branch keeps the failure counter focused on real renewal failures.

🛡️ Suggested guard
 		case <-timer.C:
 			actualExpire := l.loadExpireTime()
 			l.metrics.ttlRemaining.Set(time.Until(actualExpire).Seconds())
+			if ctx.Err() != nil {
+				return
+			}
 			l.metrics.leaseExpired.Inc()
 			log.Info("keep alive lease too slow", zap.Duration("timeout-duration", l.leaseTimeout), zap.Time("actual-expire", actualExpire), zap.String("purpose", l.Purpose))
 			return
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/election/lease.go` around lines 165 - 170, The timer.C case in the select
(where you compute actualExpire via l.loadExpireTime(), call
l.metrics.leaseExpired.Inc(), and log.Info("keep alive lease too slow", ...))
should first check ctx.Err() and return without incrementing or logging when the
context is canceled; change the watchdog branch to read ctx.Err() (or check
ctx.Done()) and only perform l.metrics.leaseExpired.Inc() and the log.Info call
if ctx.Err() == nil so shutdown races don't produce phantom lease-failure
metrics/events.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@metrics/grafana/pd.json`:
- Around line 16400-16410: The Grafana query uses unsupported label filters on
the metric service_member_role (defined in pkg/member/metrics.go with only the
service label), so remove the nonexistent k8s_cluster and tidb_cluster selectors
from the expr; replace service_member_role{k8s_cluster="$k8s_cluster",
tidb_cluster="$tidb_cluster"} with a valid selector such as service_member_role
or service_member_role{service="$service"} (or any actual label exposed by
service_member_role) so the table will return rows when a leader exists.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 00e39094-4c7d-48fb-85b5-cc54013e24b1

📥 Commits

Reviewing files that changed from the base of the PR and between 9cf0ea6 and f902b2e.

📒 Files selected for processing (3)
  • metrics/grafana/pd.json
  • pkg/election/lease.go
  • pkg/election/lease_test.go

Comment thread: metrics/grafana/pd.json
@JmPotato
Member Author

/retest

@ti-chi-bot
Contributor

ti-chi-bot Bot commented Apr 27, 2026

@JmPotato: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-unit-test-next-gen-2 | f902b2e | link | true | `/test pull-unit-test-next-gen-2` |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

dco-signoff: yes Indicates the PR's author has signed the dco. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
