GH-2226: fix(executor): resolve stale threshold defaults treating 0 as unset#2229

Merged
alekspetrov merged 2 commits into main from pilot/GH-2222 on Apr 7, 2026

@alekspetrov
Collaborator

Summary

  • The effective*Threshold() methods on DispatcherConfig checked > 0, causing explicitly-set 0 thresholds (meaning "immediately stale") to fall through to the 30-minute default
  • Replaced the three effective* methods with a single resolveDefaults() called once at construction in NewDispatcher
  • Zero values are now honoured as-is; only StaleRecoveryInterval gets a fallback (5m) since a 0-interval ticker would panic
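The difference between the two approaches can be sketched as follows. This is a minimal illustration, not the code from the PR: the trimmed-down `DispatcherConfig`, the function name `oldEffectiveRunningThreshold`, the constant names, the value receiver, and the `<= 0` check on the interval are all assumptions. Only the field names, the 30m and 5m values, and the `> 0` check come from the description above. How an unset threshold receives its default in the real code is not shown here.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical constant names; the 30m and 5m values are taken from the PR description.
const (
	defaultStaleThreshold        = 30 * time.Minute
	defaultStaleRecoveryInterval = 5 * time.Minute
)

// DispatcherConfig is a trimmed-down stand-in for the real struct.
type DispatcherConfig struct {
	StaleRunningThreshold time.Duration
	StaleQueuedThreshold  time.Duration
	StaleRecoveryInterval time.Duration
}

// oldEffectiveRunningThreshold mirrors the buggy pattern: a "> 0" check
// cannot tell an explicit 0 ("immediately stale") apart from "unset",
// so both fall through to the default.
func oldEffectiveRunningThreshold(c DispatcherConfig) time.Duration {
	if c.StaleRunningThreshold > 0 {
		return c.StaleRunningThreshold
	}
	return defaultStaleThreshold
}

// resolveDefaults leaves both thresholds exactly as configured and only
// replaces a non-positive recovery interval, because time.NewTicker
// panics when given a non-positive duration.
func resolveDefaults(c DispatcherConfig) DispatcherConfig {
	if c.StaleRecoveryInterval <= 0 {
		c.StaleRecoveryInterval = defaultStaleRecoveryInterval
	}
	return c
}

func main() {
	cfg := DispatcherConfig{StaleRunningThreshold: 0} // explicit "immediately stale"

	fmt.Println(oldEffectiveRunningThreshold(cfg))         // 30m0s (the bug)
	fmt.Println(resolveDefaults(cfg).StaleRunningThreshold) // 0s (honoured as-is)
	fmt.Println(resolveDefaults(cfg).StaleRecoveryInterval) // 5m0s (fallback)
}
```

Resolving defaults once at construction also means the rest of the dispatcher can read the config fields directly instead of calling a helper at every use.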

Test plan

  • go test -race ./internal/executor/... passes (all 4 previously-failing tests now pass)
  • go build ./... clean
  • go test ./... full suite passes

Fixes #2226
Depends on #2222

…ecutor/`)

- Add StaleRunningThreshold, StaleQueuedThreshold, StaleRecoveryInterval
  to DispatcherConfig (StaleTaskDuration kept as backwards-compat alias)
- Rewrite recoverStaleTasks() to recover both running and queued orphans,
  marking them failed (not re-queued — re-queuing without a worker just
  recreates the orphan)
- Change Start() to accept context.Context and launch runStaleRecoveryLoop
  goroutine that ticks every StaleRecoveryInterval
- Add summary log "stale recovery complete, reset N tasks" on every pass
  (even when 0) for diagnosability
- Update all Start() callers in main.go and tests
- Add tests: TestRecoverStaleTasks_QueuedAndRunning,
  TestRecoverStaleTasks_RespectsThresholds, TestRunStaleRecoveryLoop_Periodic,
  TestQueueTask_AfterRecovery
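
The periodic loop described in this commit might look roughly like the sketch below. It is an illustration under assumptions, not the PR's implementation: the signature of `runStaleRecoveryLoop`, the `recoverFn` callback, and the `countPasses` helper are hypothetical, and the real method presumably lives on the dispatcher and runs in its own goroutine. Only the ticker-per-interval behaviour, the context-based shutdown, and the summary log text come from the commit message.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runStaleRecoveryLoop runs recoverFn once per interval until ctx is
// cancelled. recoverFn returns the number of tasks it reset.
func runStaleRecoveryLoop(ctx context.Context, interval time.Duration, recoverFn func() int) {
	// time.NewTicker panics on a non-positive interval, which is why the
	// interval needs a fallback before the loop starts.
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			n := recoverFn()
			// Logged on every pass, even when n is 0, for diagnosability.
			fmt.Printf("stale recovery complete, reset %d tasks\n", n)
		}
	}
}

// countPasses is a hypothetical helper for this sketch: it runs the loop
// synchronously for the given total duration and reports how many
// recovery passes happened.
func countPasses(total, interval time.Duration) int {
	ctx, cancel := context.WithTimeout(context.Background(), total)
	defer cancel()

	passes := 0
	runStaleRecoveryLoop(ctx, interval, func() int {
		passes++
		return 0 // nothing to reset in this demo
	})
	return passes
}

func main() {
	passes := countPasses(60*time.Millisecond, 10*time.Millisecond)
	fmt.Println("passes:", passes)
}
```

Passing the context into the loop lets the caller stop recovery by cancelling it, which matches the change of Start() to accept a context.Context.
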

The effective*Threshold() methods checked `> 0`, so explicitly setting
thresholds to 0 (meaning "immediately stale") fell through to the 30m
default. Replace the three methods with a single resolveDefaults()
called once at construction — zero values are now honoured as-is,
and only StaleRecoveryInterval gets a fallback (5m) since a 0-interval
ticker panics.

Fixes GH-2226
@alekspetrov mentioned this pull request on Apr 7, 2026
@alekspetrov merged commit 6d92123 into main on Apr 7, 2026 (4 checks passed)
@alekspetrov deleted the pilot/GH-2222 branch on April 7, 2026 12:18

Development

Successfully merging this pull request may close these issues.

Fix CI failure from PR #2225