tweak codebuff-local-cli from runs by gpt-5.4

jahooma · jahooma · commit 1070287ae3c9 · 2026-03-06T17:21:09.000-08:00
diff --git a/.agents/codebuff-local-cli.ts b/.agents/codebuff-local-cli.ts
@@ -12,6 +12,16 @@ const baseDefinition = createCliAgent({
     'No permission flags needed for Codebuff local dev server.',
   model: 'anthropic/claude-opus-4.6',
   skipPrepPhase: true,
+  cliSpecificDocs: `## Codebuff CLI Specific Guidance
+
+- The ready state is the Codebuff banner, working directory, and bordered input box with the agent selector.
+- For smoke tests, \`/help\` is useful because it validates the overlay, shortcuts, features, and credits copy in one step.
+- For implementation-oriented tests, prefer asking the CLI to inspect or reason about a specific file rather than making edits unless the parent prompt explicitly asks for edits.
+- Long Codebuff responses live in a scrollable viewport. If the bottom of the answer already shows the core recommendation, do not spend many extra steps trying to reconstruct every hidden line.
+- Avoid key combinations like Shift+Arrow or repeated history/navigation probing unless you have a clear reason; they can open overlays or mutate the input state unexpectedly.
+- A good implementation-test flow is usually: initial ready capture → task sent/in-progress capture → response-complete capture → optional follow-up-ready or follow-up-complete capture.
+- If you need a follow-up, keep it narrow and specific rather than re-asking the whole task.
+- If the current session becomes clearly unusable, report that failure; do not silently start a replacement session and continue as though nothing happened.`,
   spawnerPromptExtras: `**Purpose:** E2E visual testing of the Codebuff CLI itself. This agent starts a local dev Codebuff CLI instance and interacts with it to verify UI behavior.
 
 **When to use:**
@@ -97,7 +107,7 @@ const definition: AgentDefinition = {
       input: {
         role: 'user',
         content: 'A ' + CLI_NAME + ' tmux session has been started: `' + sessionName + '`\n\n' +
-          'Use this session for all CLI interactions. The session name must be included in your final output.\n\n' +
+          'Use this session for all CLI interactions. Treat it as the canonical session for this run. If it fails, report that explicitly instead of silently starting another session. The session name must be included in your final output.\n\n' +
           'Proceed with the task using the helper scripts:\n' +
           '- Send commands: `./scripts/tmux/tmux-cli.sh send "' + sessionName + '" "..."`\n' +
           '- Capture output: `./scripts/tmux/tmux-cli.sh capture "' + sessionName + '" --label "..."`\n' +
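
The canonical-session rule added in this hunk could also be checked after a run from the commands the tester issued. The following is a minimal sketch, assuming the harness can supply those commands as plain strings. The function names are hypothetical and not part of the repo, and the pattern only recognizes the `send` and `capture` forms shown in the prompt above.

```typescript
// Hypothetical post-run check: which tmux sessions did the tester address?
// Assumes commands look like: ./scripts/tmux/tmux-cli.sh send "<session>" "..."
function sessionsUsed(commands: string[]): string[] {
  const seen = new Set<string>()
  for (const command of commands) {
    // A fresh regex per command avoids carrying lastIndex between strings.
    const pattern = /tmux-cli\.sh\s+(?:send|capture)\s+"([^"]+)"/g
    let match = pattern.exec(command)
    while (match !== null) {
      seen.add(match[1])
      match = pattern.exec(command)
    }
  }
  // Set iteration preserves first-seen order.
  return Array.from(seen)
}

// Returns every session name other than the canonical one.
function unexpectedSessions(commands: string[], canonical: string): string[] {
  return sessionsUsed(commands).filter((name) => name !== canonical)
}
```

A non-empty result from `unexpectedSessions` would flag the kind of silent replacement session that the new prompt text asks the agent to report instead.
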
diff --git a/.agents/lib/cli-agent-prompts.ts b/.agents/lib/cli-agent-prompts.ts
@@ -111,6 +111,16 @@ export function getSystemPrompt(config: CliAgentConfig): string {
 
 **Important:** ${config.permissionNote}
 ${cliSpecificSection}
+## Operating Heuristics
+
+- Treat the provided tmux session as the single source of truth. Do not start a second session unless the current one has clearly failed and you are explicitly recovering from that failure.
+- Prefer fewer, higher-value captures over many overlapping captures.
+- A capture is worth taking when the UI meaningfully changes: startup ready state, help overlay open, task in progress, task complete, clean follow-up-ready state, or an error state.
+- Avoid exploratory key presses that can mutate the UI state unless they are necessary for the task.
+- If the CLI already shows enough evidence in the current viewport, do not keep scrolling or recapturing just to get a more perfect screenshot.
+- If a long response is partially off-screen, prefer summarizing from the visible evidence instead of repeatedly trying viewport-recovery tricks unless the missing content is essential.
+- Do not use \`read_files\` on tmux capture artifacts from inside the CLI tester run; rely on the terminal capture output you already obtained and let the parent agent inspect saved capture files later if needed.
+
 ## Helper Scripts
 
 Use these scripts in \`scripts/tmux/\` to interact with the CLI session:
@@ -238,6 +248,8 @@ Use ${config.cliName} to complete implementation tasks like building features, f
    ./scripts/tmux/tmux-cli.sh capture "$SESSION" --label "work-continued" --wait 30
    \`\`\`
 
+   Take at most 1-2 progress captures before deciding whether you already have enough evidence.
+
 4. **Send follow-up prompts** if needed to refine or continue the work:
    \`\`\`bash
    ./scripts/tmux/tmux-cli.sh send "$SESSION" "<follow-up instructions>"
@@ -258,7 +270,7 @@ Use ${config.cliName} to complete implementation tasks like building features, f
 ### Tips
 
 - Break complex tasks into smaller prompts
-- Capture frequently to track progress
+- Prefer high-value captures tied to meaningful UI changes rather than frequent overlapping captures
 - Use descriptive labels for captures
 - Check intermediate results before moving on`
 }
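
The "fewer, higher-value captures" guidance suggests one simple harness-side measure: how many captures repeat the capture taken immediately before them. The following is a sketch under stated assumptions. It assumes capture texts are available as strings in the order they were taken, and both helper names are hypothetical. Trailing whitespace is normalized as a design choice so that padding alone does not make two captures look different.

```typescript
// Hypothetical harness helper: strip trailing whitespace per line and trailing blank lines.
function normalizeCapture(text: string): string {
  return text
    .split('\n')
    .map((line) => line.replace(/\s+$/, ''))
    .join('\n')
    .replace(/\n+$/, '')
}

// Counts captures whose normalized text equals the previous capture's normalized text.
function countRedundantCaptures(captures: string[]): number {
  let redundant = 0
  for (let i = 1; i < captures.length; i++) {
    if (normalizeCapture(captures[i]) === normalizeCapture(captures[i - 1])) {
      redundant++
    }
  }
  return redundant
}
```

This only catches exact repeats. Captures that overlap heavily but differ by a spinner frame or timestamp would need a looser comparison.
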
diff --git a/.agents/sessions/03-06-0850-cli-tester-efficiency/LESSONS.md b/.agents/sessions/03-06-0850-cli-tester-efficiency/LESSONS.md
@@ -0,0 +1,73 @@
+# Lessons: CLI tester efficiency and CLI knowledge improvements
+
+## What went well
+
+- The SDK-driven harness made it straightforward to collect full event streams, stream chunks, structured outputs, and tmux capture paths for repeated `codebuff-local-cli` runs.
+- The baseline runs clearly exposed behavior patterns instead of relying on intuition.
+- The Codebuff CLI itself was capable and informative during implementation-oriented runs; most inefficiency came from the tester agent’s workflow rather than the CLI under test.
+
+## What was tricky
+
+- The `codebuff-local-cli` agent uses only `run_terminal_command`, `add_message`, and `set_output`, so all tester intelligence has to come from prompt/instruction quality rather than richer tooling.
+- Long Codebuff CLI responses live in a scrollable viewport. The tester spent many extra steps trying to recover hidden content even when the visible portion already contained enough evidence.
+- One smoke run silently started a second tmux session mid-run, showing that the current guidance was too weak about preserving session continuity and treating failure recovery explicitly.
+- Reading tmux capture artifacts from inside the tester run does not work because the agent lacks `read_files`; the tester should rely on the capture output it already has and avoid trying to recover more evidence unless the current viewport is truly insufficient.
+
+## Quantified before/after findings
+
+### Smoke scenario
+
+- Baseline smoke runs: `27` and `38` total events, with one run silently starting a replacement tmux session mid-run.
+- Post-change smoke run: `27` total events, `10` tool calls, `3` captures, no replacement session, and clearer capture labels (`initial-state`, `after-help`, `after-2plus2`).
+
+### Implementation scenario
+
+- Baseline implementation runs:
+  - tool calls: `19` and `21`
+  - captures: `8` and `7`
+  - total cost: `30` and `40`
+  - strong evidence of wasted viewport-recovery actions (page up/down, history keys, extra captures, direct tmux scrollback commands)
+- Post-change implementation run:
+  - tool calls: `10`
+  - captures: `4`
+  - total cost: `14`
+  - no viewport-recovery thrashing; the tester captured the ready state, in-progress state, response, and follow-up response and then stopped.
+
+## Baseline findings
+
+- Smoke runs were mostly efficient, but their capture labels were generic and the agent did not explicitly reason about why each capture was worth taking.
+- One smoke run restarted the session instead of treating the original session as canonical, inflating event/tool counts.
+- Implementation runs showed the biggest inefficiency: excessive viewport recovery actions (page up/down, arrow keys, extra captures, direct tmux scrollback commands) after the key recommendation was already visible.
+- The tester lacked Codebuff-specific guidance about:
+  - what the ready state looks like,
+  - when `/help` is especially valuable,
+  - how to structure a good implementation-oriented test,
+  - and when to stop chasing perfect captures of long responses.
+
+## What changed behavior most
+
+- Adding a canonical-session instruction prevented silent session replacement and made failure-handling expectations explicit.
+- Adding the shared “high-value capture” heuristic reduced redundant captures and discouraged overlapping progress snapshots.
+- Adding explicit guidance to stop chasing hidden viewport text eliminated the biggest source of waste in implementation-oriented runs.
+- Adding Codebuff-specific flow guidance improved follow-up quality and reduced exploratory key usage.
+
+## Changes made from baseline evidence
+
+- Added shared operating heuristics to bias CLI testers toward fewer, higher-value captures and away from unnecessary UI mutation.
+- Added explicit guidance to avoid `read_files` on tmux artifacts from inside the tester run.
+- Added Codebuff-specific testing guidance covering ready state, smoke-test flow, implementation-test flow, long-response behavior, and session continuity expectations.
+- Added best-effort harness cleanup when a run throws after a tmux session has already been created.
+
+## Cautionary note
+
+- Different runs may disagree about whether adjacent edge cases are worth fixing. For example, one post-change implementation run argued that the original-case `isEnvFile` call path was acceptable because `.env` files are conventionally lowercase, while earlier baseline runs framed nearby case handling as security-sensitive. Future work should settle those questions with source-of-truth tests or project policy, not by trusting a single run’s opinion.
+
+## Known limitation
+
+- The analysis harness now does best-effort tmux cleanup when a run throws after a session has already been created, but it still does not implement a hard per-run abort/timeout with guaranteed teardown if `client.run()` stalls indefinitely. Future iterations should add explicit run cancellation once the preferred timeout mechanism is settled.
+
+## What we intentionally did not change
+
+- We did not change the tmux helper scripts because the baseline problems were primarily agent-behavior issues, not script failures.
+- We did not broaden the tester’s tool access; this pass focuses on making the current workflow smarter rather than increasing power.
+- We did not change the shared output schema because the existing `set_output` contract was sufficient for analysis once the agent behavior improved.
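
The implementation-scenario figures in LESSONS.md can be turned into relative reductions against the baseline mean. The sketch below copies the numbers from that section. The helper names are illustrative, and the unit of "total cost" is not stated in the source, so it is treated as an opaque number. With two baseline runs and one post-change run, the results are indicative only.

```typescript
// Figures copied from the "Implementation scenario" section of LESSONS.md.
type Metric = 'toolCalls' | 'captures' | 'totalCost'

const baselineRuns: Record<Metric, number[]> = {
  toolCalls: [19, 21],
  captures: [8, 7],
  totalCost: [30, 40],
}

const postChangeRun: Record<Metric, number> = {
  toolCalls: 10,
  captures: 4,
  totalCost: 14,
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length
}

// Percentage reduction of `after` relative to the baseline mean, rounded to one decimal.
function reductionPercent(baseline: number[], after: number): number {
  const base = mean(baseline)
  return Math.round(((base - after) / base) * 1000) / 10
}

const reductions: Record<Metric, number> = {
  toolCalls: reductionPercent(baselineRuns.toolCalls, postChangeRun.toolCalls), // 50
  captures: reductionPercent(baselineRuns.captures, postChangeRun.captures), // 46.7
  totalCost: reductionPercent(baselineRuns.totalCost, postChangeRun.totalCost), // 60
}
```
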
diff --git a/.agents/sessions/03-06-0850-cli-tester-efficiency/PLAN.md b/.agents/sessions/03-06-0850-cli-tester-efficiency/PLAN.md
@@ -0,0 +1,57 @@
+# Plan: CLI tester efficiency and CLI knowledge improvements
+
+## Implementation Steps
+
+1. Build an SDK-driven analysis harness for the CLI tester runs.
+   - Add a reproducible script or test helper that runs `codebuff-local-cli` through the SDK with `handleEvent` and `handleStreamChunk` collection.
+   - Standardize artifact naming for comparison (for example `baseline-smoke-run1`, `baseline-implementation-run2`, `post-smoke-run1`).
+   - Define and persist a consistent metrics schema per run, including event counts by type, tool-call counts, unique tool names, spawned-agent counts, capture counts, and notable wait/capture observations.
+   - Build in explicit failure-path handling for missing API key, auth failure, tmux startup failure, and hung runs, including cleanup where possible.
+
+2. Execute baseline mixed-scenario runs and document findings.
+   - Run the smoke scenario twice and the implementation scenario twice.
+   - Keep the comparison controlled by using the same prompts, logging granularity, and timeout policy across baseline runs.
+   - Inspect each run’s SDK trace and tmux session logs.
+   - Record concrete inefficiencies, wasted actions, and missing Codebuff-CLI knowledge to drive the prompt/template changes.
+
+3. Improve the shared CLI tester prompt layer.
+   - Update `.agents/lib/cli-agent-prompts.ts` so CLI testers have sharper workflow guidance.
+   - Add targeted guidance on when to gather prep context, when to capture, how to detect progress/completion, and how to avoid low-value repeated actions.
+   - Keep knowledge additions evidence-based and avoid prompt bloat.
+
+4. Improve shared CLI tester orchestration and the concrete `codebuff-local-cli` agent.
+   - Update `.agents/lib/create-cli-agent.ts` if shared orchestration behavior needs refinement.
+   - Update `.agents/codebuff-local-cli.ts` with Codebuff-CLI-specific knowledge and workflow refinements informed by baseline evidence.
+   - Ensure the agent remains focused on CLI UI testing and uses the tmux helper scripts efficiently.
+   - Keep output contract compatibility intact.
+
+5. Add or update validation coverage.
+   - Add tests for shared CLI-agent prompt/template behavior and/or the analysis harness.
+   - Include compatibility-oriented checks for the shared CLI-agent layer.
+   - At minimum, verify the `.agents` layer still typechecks and that `claude-code-cli`, `codex-cli`, `gemini-cli`, and `codebuff-local-cli` still satisfy shared construction/schema expectations.
+
+6. Re-run post-change verification scenarios.
+   - Run at least one smoke and one implementation scenario after changes using the same prompts and comparison controls.
+   - Compare outputs/artifacts against the baseline.
+   - Treat the step as successful if the post-change runs show at least two improvement signals such as fewer duplicate captures, fewer redundant waits/follow-ups, clearer evidence in captures/output, or better scenario-specific verification behavior.
+
+7. Write session documentation and capture durable lessons.
+   - Record before/after findings in `LESSONS.md`.
+   - Document what was intentionally not changed and why.
+   - Update relevant skill files only with broadly reusable insights.
+
+## Dependencies / Ordering
+
+- Step 1 must happen before baseline analysis in Step 2.
+- Step 2 should happen before Steps 3–4 so improvements are evidence-based.
+- Step 3 should happen before or alongside Step 4 because shared prompt guidance informs the concrete agent behavior.
+- Step 5 should follow implementation so tests validate the actual behavior.
+- Step 6 depends on Steps 3–5 being complete.
+- Step 7 should happen after validation so lessons reflect the final state.
+
+## Risk Areas
+
+- The requested `cli-ui-tester` name does not exist directly in the repo, so the harness must target the correct concrete agent (`codebuff-local-cli`) and shared template layer consistently.
+- SDK-driven CLI runs may fail due to auth, tmux availability, or local CLI startup issues; the harness should make failures inspectable rather than opaque.
+- Richer CLI knowledge can easily become prompt bloat, so additions must stay targeted to observed failures.
+- Shared-layer changes can affect multiple CLI tester agents, so compatibility checks are important.
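
Step 1 of the plan calls for a consistent per-run metrics schema. One possible shape is sketched below. The `TraceEvent` type is a placeholder assumed for illustration, not the SDK's actual event type, and the sketch treats any event carrying a `toolName` as a tool call. Captures are counted by matching the `tmux-cli.sh capture` command used in the prompts. Spawned-agent counts and free-form observations from the plan are omitted for brevity.

```typescript
// Hypothetical event shape; the real SDK event type is not shown in this commit.
interface TraceEvent {
  type: string
  toolName?: string
  command?: string // shell command, when the tool call ran one
}

interface RunMetrics {
  eventCountsByType: Record<string, number>
  toolCallCount: number
  uniqueToolNames: string[]
  captureCount: number
}

function summarizeRun(events: TraceEvent[]): RunMetrics {
  const eventCountsByType: Record<string, number> = {}
  const toolNames = new Set<string>()
  let toolCallCount = 0
  let captureCount = 0

  for (const event of events) {
    eventCountsByType[event.type] = (eventCountsByType[event.type] ?? 0) + 1
    // Assumption: an event with a toolName is a tool call.
    if (event.toolName !== undefined) {
      toolCallCount++
      toolNames.add(event.toolName)
    }
    if (event.command !== undefined && /tmux-cli\.sh\s+capture\b/.test(event.command)) {
      captureCount++
    }
  }

  return {
    eventCountsByType,
    toolCallCount,
    uniqueToolNames: Array.from(toolNames).sort(),
    captureCount,
  }
}
```

Persisting one `RunMetrics` object per artifact name (for example `baseline-smoke-run1`) would keep baseline and post-change runs directly comparable.
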
diff --git a/.agents/sessions/03-06-0850-cli-tester-efficiency/SPEC.md b/.agents/sessions/03-06-0850-cli-tester-efficiency/SPEC.md
diff --git a/.agents/skills/meta/SKILL.md b/.agents/skills/meta/SKILL.md