feat: add task lifecycle diagnostics#120
Conversation
📝 WalkthroughWalkthroughThe PR adds lifecycle telemetry infrastructure spanning CLI logging tools, event logging/serialization, and turn-level execution supervision. New code enables structured log filtering and rendering via command-line, lifecycle event construction with prompt preview redaction, executor lifecycle hook callbacks, and turn supervisor orchestration tracking task phases with deadline-based failure detection. Changes
Sequence DiagramsequenceDiagram
participant Client as Caller
participant TS as TurnSupervisor
participant Executor as Pi Executor
participant Sink as Lifecycle Sink
Client->>TS: Run(context, runFunc)
activate TS
TS->>TS: recordStartTime()
TS->>Sink: Phase("task_started", fields)
TS->>Client: runFunc(context, TurnHooks)
activate Client
Client->>Executor: Run(...)
activate Executor
Executor->>Executor: hooks.PromptWritten()
Executor->>TS: PromptWritten via TurnHooks
deactivate Executor
TS->>TS: scheduleDeadline(FirstEvent)
TS->>Sink: Phase("prompt_written", fields)
Executor->>Executor: hooks.FirstEventSeen()
Executor->>TS: FirstEventSeen via TurnHooks
TS->>Sink: Phase("first_event_seen", fields)
TS->>TS: scheduleDeadline(ToolIdle)
Executor->>Executor: hooks.ToolStarted()
Executor->>TS: ToolStarted(id, name) via TurnHooks
TS->>Sink: Phase("tool_started", fields)
Executor->>Executor: hooks.ToolFinished()
Executor->>TS: ToolFinished(id, name) via TurnHooks
TS->>Sink: Phase("tool_finished", fields)
Client-->>TS: return
deactivate Client
TS->>Sink: Phase("task_timed_out", fields)
TS->>TS: Result() = {FailureCode, Retryable}
TS-->>Client: return error
deactivate TS
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Review rate limit: 6/10 reviews remaining, refill in 22 minutes and 8 seconds. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/logs_pretty.go`:
- Around line 47-49: The code currently treats any mode != colorNever the same
and calls colorizePrettyLine(), causing ANSI escapes even when piping; change
the condition around the call in the pretty-printing path so that
colorizePrettyLine(line, event.Phase) is only called when mode==colorAlways OR
when mode==colorAuto AND the output is a TTY (use an isTerminal/isTTY check on
the writer e.g., os.Stdout or the logger output). Update the branch using the
variables mode, colorNever and the function colorizePrettyLine to perform that
TTY check (or introduce a small helper like isTerminal()) so color=auto disables
ANSI when output is not a terminal.
In `@cmd/logs.go`:
- Around line 156-158: When logsOptions.lastTask is true and you call
latestTaskID(lines), handle the empty-string case instead of leaving
filter.TaskID blank: check the return value of latestTaskID(lines) and if it's
an empty string, return an error (or exit with a clear message) indicating "no
task-scoped entries found for --last-task" rather than assigning "" to
filter.TaskID (which causes filterLogLines() to match everything); update the
branch that currently sets filter.TaskID to call latestTaskID(lines) and bail
out with a clear error or non-zero exit when the result is empty.
- Around line 107-116: hasStructuredMode() currently treats color and noColor as
selectors for structured rendering; remove opts.color != "auto" and opts.noColor
from the hasStructuredMode() boolean expression so color flags do not change the
mode, and add validation in the command's option parsing/validation (where
logsOptionsState is validated) to return an error if --color/--no-color is
provided without --pretty (i.e., require opts.pretty when opts.color != "auto"
or opts.noColor is true) so color flags only affect pretty output once
structured/pretty mode is explicitly selected.
In `@internal/session/turn_supervisor.go`:
- Around line 75-125: The One-shot timers started in TurnHooks (the FirstEvent
timer in PromptWritten and the ToolIdle timer in ToolStarted) must be tracked on
the TurnSupervisor (add fields like firstEventTimer and toolIdleTimer) and
cancelled when the turn advances or finishes: start and store the timers instead
of calling time.AfterFunc inline in PromptWritten/ToolStarted, call Stop() (and
drain the channel if needed) from ToolFinished and from the cleanup path just
before run() returns or when the supervisor is marked complete, and also clear
the timer fields when stopped; ensure PromptWritten/ToolStarted/ToolFinished and
the post-run return path consistently stop any outstanding timers so stale
timers cannot call s.timeout after the turn has already completed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro Plus
Run ID: ef1b9ae3-cc18-4907-bb40-da4259db2122
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (14)
cmd/logs.gocmd/logs_filter.gocmd/logs_pretty.gocmd/logs_test.gogo.modinternal/logging/lifecycle.gointernal/logging/lifecycle_test.gointernal/pi/executor.gointernal/pi/executor_test.gointernal/pi/extension/plan-tools.jsinternal/pi/worker.gointernal/session/actor.gointernal/session/turn_supervisor.gointernal/session/turn_supervisor_test.go
| if mode != colorNever { | ||
| line = colorizePrettyLine(line, event.Phase) | ||
| } |
There was a problem hiding this comment.
color=auto currently behaves like always.
Every mode except never runs colorizePrettyLine(), so piping pretty logs to a file or another command still injects ANSI escapes. auto needs a TTY check before deciding whether to colorize.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/logs_pretty.go` around lines 47 - 49, The code currently treats any mode
!= colorNever the same and calls colorizePrettyLine(), causing ANSI escapes even
when piping; change the condition around the call in the pretty-printing path so
that colorizePrettyLine(line, event.Phase) is only called when mode==colorAlways
OR when mode==colorAuto AND the output is a TTY (use an isTerminal/isTTY check
on the writer e.g., os.Stdout or the logger output). Update the branch using the
variables mode, colorNever and the function colorizePrettyLine to perform that
TTY check (or introduce a small helper like isTerminal()) so color=auto disables
ANSI when output is not a terminal.
| func (opts logsOptionsState) hasStructuredMode() bool { | ||
| return opts.sessionID != "" || | ||
| opts.taskID != "" || | ||
| opts.lastTask || | ||
| opts.since > 0 || | ||
| opts.level != "" || | ||
| opts.pretty || | ||
| opts.json || | ||
| opts.color != "auto" || | ||
| opts.noColor |
There was a problem hiding this comment.
Keep color flags from changing the command mode.
hasStructuredMode() treats --color and --no-color as selectors for structured rendering, so gsd-cloud logs --no-color stops tailing the raw file and switches to a one-shot pretty snapshot. These flags should only modify pretty output once structured mode is already selected, or validation should require --pretty.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/logs.go` around lines 107 - 116, hasStructuredMode() currently treats
color and noColor as selectors for structured rendering; remove opts.color !=
"auto" and opts.noColor from the hasStructuredMode() boolean expression so color
flags do not change the mode, and add validation in the command's option
parsing/validation (where logsOptionsState is validated) to return an error if
--color/--no-color is provided without --pretty (i.e., require opts.pretty when
opts.color != "auto" or opts.noColor is true) so color flags only affect pretty
output once structured/pretty mode is explicitly selected.
| if logsOptions.lastTask { | ||
| filter.TaskID = latestTaskID(lines) | ||
| } |
There was a problem hiding this comment.
Don't fall back to the entire log when --last-task finds nothing.
If latestTaskID(lines) returns "", filter.TaskID stays empty and filterLogLines() passes every event through. On a log file with no task-scoped entries yet, --last-task would silently dump the full log instead of returning no matches or a clear error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/logs.go` around lines 156 - 158, When logsOptions.lastTask is true and
you call latestTaskID(lines), handle the empty-string case instead of leaving
filter.TaskID blank: check the return value of latestTaskID(lines) and if it's
an empty string, return an error (or exit with a clear message) indicating "no
task-scoped entries found for --last-task" rather than assigning "" to
filter.TaskID (which causes filterLogLines() to match everything); update the
branch that currently sets filter.TaskID to call latestTaskID(lines) and bail
out with a clear error or non-zero exit when the result is empty.
| hooks := TurnHooks{ | ||
| PromptWritten: func() { | ||
| s.emit("prompt_written", nil) | ||
| if s.opts.Deadlines.FirstEvent > 0 { | ||
| time.AfterFunc(s.opts.Deadlines.FirstEvent, func() { | ||
| s.mu.Lock() | ||
| shouldTimeout := !s.firstEvent && s.result.FailureCode == "" | ||
| s.mu.Unlock() | ||
| if shouldTimeout { | ||
| s.timeout("no_first_event_timeout", true) | ||
| } | ||
| }) | ||
| } | ||
| }, | ||
| FirstEventSeen: func() { | ||
| s.mu.Lock() | ||
| s.firstEvent = true | ||
| s.mu.Unlock() | ||
| s.emit("first_event_seen", nil) | ||
| }, | ||
| FirstVisibleEventSeen: func() { | ||
| s.mu.Lock() | ||
| s.firstVisible = true | ||
| s.retrySafe = false | ||
| s.mu.Unlock() | ||
| s.emit("first_visible_event_seen", nil) | ||
| }, | ||
| ToolStarted: func(toolCallID string, toolName string) { | ||
| s.mu.Lock() | ||
| s.retrySafe = false | ||
| s.mu.Unlock() | ||
| s.emit("tool_started", map[string]any{"toolCallId": toolCallID, "toolName": toolName}) | ||
| if s.opts.Deadlines.ToolIdle > 0 { | ||
| time.AfterFunc(s.opts.Deadlines.ToolIdle, func() { | ||
| s.timeout("tool_idle_timeout", false) | ||
| }) | ||
| } | ||
| }, | ||
| ToolFinished: func(toolCallID string, toolName string) { | ||
| s.emit("tool_finished", map[string]any{"toolCallId": toolCallID, "toolName": toolName}) | ||
| }, | ||
| } | ||
|
|
||
| err := run(ctx, hooks) | ||
| s.mu.Lock() | ||
| hasFailure := s.result.FailureCode != "" | ||
| s.mu.Unlock() | ||
| if hasFailure { | ||
| return turnFailureError{result: s.Result()} | ||
| } | ||
| return err |
There was a problem hiding this comment.
Cancel outstanding deadline timers when the turn advances or exits.
ToolStarted() arms a one-shot ToolIdle timer, but ToolFinished() never stops it and Run() never marks the supervisor as complete before returning. That means a turn can start a tool, finish it quickly, keep streaming for longer than ToolIdle, and still get canceled by the stale timer. The same late-fire problem exists for the FirstEvent timer after run() has already returned. Track these timers on TurnSupervisor and stop them on state transitions / final completion before returning.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@internal/session/turn_supervisor.go` around lines 75 - 125, The One-shot
timers started in TurnHooks (the FirstEvent timer in PromptWritten and the
ToolIdle timer in ToolStarted) must be tracked on the TurnSupervisor (add fields
like firstEventTimer and toolIdleTimer) and cancelled when the turn advances or
finishes: start and store the timers instead of calling time.AfterFunc inline in
PromptWritten/ToolStarted, call Stop() (and drain the channel if needed) from
ToolFinished and from the cleanup path just before run() returns or when the
supervisor is marked complete, and also clear the timer fields when stopped;
ensure PromptWritten/ToolStarted/ToolFinished and the post-run return path
consistently stop any outstanding timers so stale timers cannot call s.timeout
after the turn has already completed.
Summary
Dependency
Verification
Post-merge
Summary by CodeRabbit
Release Notes
New Features
Chores