| 1 | +# Lessons: CLI tester efficiency and CLI knowledge improvements |
| 2 | + |
| 3 | +## What went well |
| 4 | + |
| 5 | +- The SDK-driven harness made it straightforward to collect full event streams, stream chunks, structured outputs, and tmux capture paths for repeated `codebuff-local-cli` runs. |
| 6 | +- The baseline runs clearly exposed behavior patterns instead of relying on intuition. |
| 7 | +- The Codebuff CLI itself was capable and informative during implementation-oriented runs; most inefficiency came from the tester agent’s workflow rather than the CLI under test. |
| 8 | + |
| 9 | +## What was tricky |
| 10 | + |
| 11 | +- The `codebuff-local-cli` agent uses only `run_terminal_command`, `add_message`, and `set_output`, so all tester intelligence has to come from prompt/instruction quality rather than richer tooling. |
| 12 | +- Long Codebuff CLI responses live in a scrollable viewport. The tester spent many extra steps trying to recover hidden content even when the visible portion already contained enough evidence. |
| 13 | +- One smoke run silently started a second tmux session mid-run, which showed that the existing guidance was too weak on preserving session continuity and on handling failure recovery explicitly. |
| 14 | +- Reading tmux capture artifacts from inside the tester run is ineffective because the agent does not have the `read_files` tool; attempts to recover more evidence should therefore be avoided unless the current viewport is truly insufficient. |
| 15 | + |
| 16 | +## Quantified before/after findings |
| 17 | + |
| 18 | +### Smoke scenario |
| 19 | + |
| 20 | +- Baseline smoke runs: `27` and `38` total events, with one run silently starting a replacement tmux session mid-run. |
| 21 | +- Post-change smoke run: `27` total events, `10` tool calls, `3` captures, no replacement session, and clearer capture labels (`initial-state`, `after-help`, `after-2plus2`). |
| 22 | + |
| 23 | +### Implementation scenario |
| 24 | + |
| 25 | +- Baseline implementation runs: |
| 26 | + - tool calls: `19` and `21` |
| 27 | + - captures: `8` and `7` |
| 28 | + - total cost: `30` and `40` |
| 29 | + - strong evidence of wasted viewport-recovery actions (page up/down, history keys, extra captures, direct tmux scrollback commands) |
| 30 | +- Post-change implementation run: |
| 31 | + - tool calls: `10` |
| 32 | + - captures: `4` |
| 33 | + - total cost: `14` |
| 34 | + - no viewport-recovery thrashing; the tester captured the ready state, in-progress state, response, and follow-up response and then stopped. |
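To put the implementation-scenario figures above on a common scale, the post-change run can be compared against the mean of the two baseline runs. The sketch below is illustrative arithmetic only: the variable and function names are assumptions, and with two baseline runs and one post-change run the percentages are indicative rather than statistically meaningful.

```python
from statistics import mean

# Figures copied from the implementation scenario above.
baseline = {"tool_calls": [19, 21], "captures": [8, 7], "total_cost": [30, 40]}
post_change = {"tool_calls": 10, "captures": 4, "total_cost": 14}


def percent_reduction(before: float, after: float) -> float:
    """Return the relative drop from `before` to `after` as a percentage."""
    return (before - after) / before * 100


# Compare each post-change value with the mean of its two baseline runs.
reductions = {
    metric: round(percent_reduction(mean(runs), post_change[metric]), 1)
    for metric, runs in baseline.items()
}

for metric, value in reductions.items():
    print(f"{metric}: {value}% lower than the baseline mean")
```

Measured this way, the post-change run used roughly 50% fewer tool calls, 46.7% fewer captures, and 60% less total cost than the baseline mean.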
| 35 | + |
| 36 | +## Baseline findings |
| 37 | + |
| 38 | +- Smoke runs were mostly efficient, but their capture labels were generic and the agent did not explicitly reason about why each capture was worth taking. |
| 39 | +- One smoke run restarted the session instead of treating the original session as canonical, inflating event/tool counts. |
| 40 | +- Implementation runs showed the biggest inefficiency: excessive viewport recovery actions (page up/down, arrow keys, extra captures, direct tmux scrollback commands) after the key recommendation was already visible. |
| 41 | +- The tester lacked Codebuff-specific guidance about: |
| 42 | + - what the ready state looks like, |
| 43 | + - when `/help` is especially valuable, |
| 44 | + - how to structure a good implementation-oriented test, |
| 45 | + - and when to stop chasing perfect captures of long responses. |
| 46 | + |
| 47 | +## What changed behavior most |
| 48 | + |
| 49 | +- Adding a canonical-session instruction prevented silent session replacement and made failure-handling expectations explicit. |
| 50 | +- Adding the shared “high-value capture” heuristic reduced redundant captures and discouraged overlapping progress snapshots. |
| 51 | +- Adding explicit guidance to stop chasing hidden viewport text eliminated the biggest source of waste in implementation-oriented runs. |
| 52 | +- Adding Codebuff-specific flow guidance improved follow-up quality and reduced exploratory key usage. |
| 53 | + |
| 54 | +## Changes made from baseline evidence |
| 55 | + |
| 56 | +- Added shared operating heuristics to bias CLI testers toward fewer, higher-value captures and away from unnecessary UI mutation. |
| 57 | +- Added explicit guidance to avoid `read_files` on tmux artifacts from inside the tester run. |
| 58 | +- Added Codebuff-specific testing guidance covering ready state, smoke-test flow, implementation-test flow, long-response behavior, and session continuity expectations. |
| 59 | +- Added best-effort harness cleanup when a run throws after a tmux session has already been created. |
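The best-effort cleanup in the last bullet could look roughly like the following sketch. The harness code is not shown in this document, so `run_with_cleanup`, `kill_tmux_session`, and the `get_session_name` callback are hypothetical names, and Python is used purely for illustration; the only real interface relied on is the `tmux kill-session -t <name>` command.

```python
import subprocess
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def kill_tmux_session(name: str) -> None:
    """Best-effort teardown: never raises, even if tmux or the session is missing."""
    try:
        # check=False so a non-zero exit (e.g. session already gone) is ignored.
        subprocess.run(["tmux", "kill-session", "-t", name],
                       check=False, capture_output=True)
    except OSError:
        pass  # tmux is not installed or could not be started


def run_with_cleanup(
    run: Callable[[], T],
    get_session_name: Callable[[], Optional[str]],
    kill: Callable[[str], None] = kill_tmux_session,
) -> T:
    """Run `run()`; if it throws after a session exists, tear the session down."""
    try:
        return run()
    except Exception:
        name = get_session_name()  # hypothetical: however the harness tracks it
        if name is not None:
            try:
                kill(name)
            except Exception:
                pass  # cleanup failures must not mask the original error
        raise  # always surface the original failure to the caller
```

The design choice worth noting is that cleanup errors are swallowed while the original exception is re-raised, so a failed teardown never hides the reason the run failed.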
| 60 | + |
| 61 | +## Cautionary note |
| 62 | + |
| 63 | +- Different runs may disagree about whether adjacent edge cases are worth fixing. For example, one post-change implementation run argued that the original-case `isEnvFile` call path was acceptable because `.env` files are conventionally lowercase, while earlier baseline runs framed nearby case handling as security-sensitive. Future work should settle those questions with source-of-truth tests or project policy, not by trusting a single run’s opinion. |
| 64 | + |
| 65 | +## Known limitation |
| 66 | + |
| 67 | +- The analysis harness now does best-effort tmux cleanup when a run throws after a session has already been created, but it still does not implement a hard per-run abort/timeout with guaranteed teardown if `client.run()` stalls indefinitely. Future iterations should add explicit run cancellation once the preferred timeout mechanism is settled. |
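One possible shape for the missing hard timeout is sketched below. It is not the harness's implementation: `run_with_deadline`, `start_run`, and `teardown` are hypothetical names, Python's `asyncio` stands in for whatever timeout mechanism is eventually chosen, and the sketch assumes the run is exposed as an awaitable that responds to cancellation.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def run_with_deadline(
    start_run: Callable[[], Awaitable[T]],
    teardown: Callable[[], Awaitable[None]],
    timeout_s: float,
) -> T:
    """Await the run, cancel it after `timeout_s` seconds, always call teardown."""
    try:
        # wait_for cancels the awaited task and raises a timeout error on expiry.
        return await asyncio.wait_for(start_run(), timeout=timeout_s)
    finally:
        # Runs on success, failure, and timeout alike.
        await teardown()
```

A caller would pass something like `lambda: client.run(...)` as `start_run` and a coroutine that kills the tmux session as `teardown`. Cancellation is cooperative, so a run that blocks without yielding control would not be interrupted by this approach and would need a process-level kill instead.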
| 68 | + |
| 69 | +## What we intentionally did not change |
| 70 | + |
| 71 | +- We did not change the tmux helper scripts because the baseline problems were primarily agent-behavior issues, not script failures. |
| 72 | +- We did not broaden the tester’s tool access; this pass focuses on making the current workflow smarter rather than increasing power. |
| 73 | +- We did not change the shared output schema because the existing `set_output` contract was sufficient for analysis once the agent behavior improved. |