garrytan · boinger · Mar 18, 2026 · Mar 20, 2026 · Mar 20, 2026 · Mar 20, 2026
diff --git a/.agents/skills/gstack-codebase-audit/SKILL.md b/.agents/skills/gstack-codebase-audit/SKILL.md
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -542,7 +542,7 @@ Thanks to @osc, @Explorer1092, @Qike-Li, @francoisaubert1, @itstimwhite, @yinanl
 - **Claude now has an adversarial mode.** A fresh Claude subagent with no checklist bias reviews your code like an attacker — finding edge cases, race conditions, security holes, and silent data corruption that the structured review might miss. Findings are classified as FIXABLE (auto-fixed) or INVESTIGATE (your call).
 - **Review dashboard shows "Adversarial" instead of "Codex Review."** The dashboard row reflects the new multi-model reality — it tracks whichever adversarial passes actually ran, not just Codex.
 
-## [0.9.5.0] - 2026-03-21 — Builder Ethos
+## [0.9.5.0] - 2026-03-21 — Builder Ethos + Codebase Audit
 
 ### Added
 
@@ -554,6 +554,7 @@ Thanks to @osc, @Explorer1092, @Qike-Li, @francoisaubert1, @itstimwhite, @yinanl
 - **`/investigate` searches on hypothesis failure.** When your first debugging hypothesis is wrong, gstack searches for the exact error message and known framework issues before guessing again.
 - **`/design-consultation` three-layer synthesis.** Competitive research now uses the structured Layer 1/2/3 framework to find where your product should deliberately break from category norms.
 - **CEO review saves context when handing off to `/office-hours`.** When `/plan-ceo-review` suggests running `/office-hours` first, it now saves a handoff note with your system audit findings and any discussion so far. When you come back and re-invoke `/plan-ceo-review`, it picks up that context automatically — no more starting from scratch.
+- **`/codebase-audit` — full codebase health check with a fix pipeline.** Run it against any project — new to you, old code, or code you wrote yesterday — and get a structured audit: bugs, security issues, architecture problems, tech debt, test gaps, and improvement opportunities. When it's done, it writes a fix plan and offers to chain into `/plan-eng-review` for the substantive items. Three modes: full audit, quick smoke test (2 min), and regression (diff against previous audit with score tracking). Includes health scoring (100-point scale, calibrated against real projects), dependency CVE scanning, git churn analysis, and machine-readable baseline output.
 
 ## [0.9.4.1] - 2026-03-20
 

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -1,3 +1,7 @@
+# CLAUDE.md
+
+Project instructions for Claude Code working on gstack — a skill and tooling suite for Claude Code.
+
 # gstack development
 
 ## Commands
@@ -307,3 +311,82 @@ The active skill lives at `~/.claude/skills/gstack/`. After making changes:
 3. Rebuild: `cd ~/.claude/skills/gstack && bun run build`
 
 Or copy the binary directly: `cp browse/dist/browse ~/.claude/skills/gstack/browse/dist/browse`
+
+## Template placeholder reference
+
+The generator (`scripts/gen-skill-docs.ts`) resolves `{{PLACEHOLDER}}` tokens in
+`.tmpl` files. The full set (defined in `RESOLVERS` at line ~1090):
+
+| Placeholder | Source | What it expands to |
+|---|---|---|
+| `COMMAND_REFERENCE` | `browse/src/commands.ts` | Categorized command table (Navigation, Reading, etc.) |
+| `SNAPSHOT_FLAGS` | `browse/src/snapshot.ts` | Flag reference table for `snapshot` command |
+| `PREAMBLE` | inline in generator | Shared skill preamble (session awareness, project context) |
+| `BROWSE_SETUP` | inline in generator | `$B` alias setup + binary detection block |
+| `BASE_BRANCH_DETECT` | inline in generator | Shell snippet to detect `main`/`master` dynamically |
+| `QA_METHODOLOGY` | inline in generator | QA health rubric + bug-severity taxonomy |
+| `DESIGN_METHODOLOGY` | inline in generator | Design review rubric + severity levels |
+| `DESIGN_REVIEW_LITE` | inline in generator | Lightweight design review pass for `/review` |
+| `REVIEW_DASHBOARD` | inline in generator | Review summary dashboard format |
+| `TEST_BOOTSTRAP` | inline in generator | Test discovery + run block for `/ship` |
+
+To add a new placeholder, write a resolver function and register it in `RESOLVERS`.
+
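Reviewer note: the placeholder mechanism described above can be pictured with a minimal sketch like the one below. Everything here is an assumption for illustration: the `Resolver` type, the `renderTemplateSketch` function, the use of a `Map`, and the stand-in resolver outputs are hypothetical and are not taken from `scripts/gen-skill-docs.ts`.

```typescript
// Hypothetical sketch of {{PLACEHOLDER}} resolution. Names and structure are
// illustrative; the real generator may be organized differently.
type Resolver = () => string;

// Stand-in resolvers. The real ones build their output from source files
// or inline text in the generator.
const RESOLVERS_SKETCH = new Map<string, Resolver>([
  ["PREAMBLE", () => "<shared preamble>"],
  ["BASE_BRANCH_DETECT", () => "<base-branch snippet>"],
]);

function renderTemplateSketch(template: string): string {
  // Match tokens such as {{PREAMBLE}}: upper-case letters and underscores only.
  return template.replace(/\{\{([A-Z_]+)\}\}/g, (_match: string, name: string) => {
    const resolver = RESOLVERS_SKETCH.get(name);
    if (resolver === undefined) {
      // Failing loudly keeps a typo in a .tmpl file from shipping silently.
      throw new Error(`Unknown placeholder: {{${name}}}`);
    }
    return resolver();
  });
}
```

Under this sketch, adding a placeholder means adding one more entry to the map. Throwing on unknown names is a design choice of the sketch, not a documented behavior of the generator.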
+## Browse architecture
+
+The `browse/` subsystem is a client-server headless browser built on Playwright.
+
+**Daemon model:** The CLI (`cli.ts`) is a thin HTTP client. The server (`server.ts`)
+is a persistent Chromium daemon that stays alive across commands (auto-shutdown
+after 30 min idle). The CLI auto-starts or restarts the server as needed.
+
+**State file:** `.gstack/browse.json` stores `{ pid, port, token, startedAt,
+binaryVersion }`. The CLI reads this to find the server; the server writes it on
+startup. The token is a random UUID used for auth.
+
+**Command dispatch:** Commands are split into 3 sets in `commands.ts`:
+- `READ_COMMANDS` — page inspection (text, html, links, js, console, etc.)
+- `WRITE_COMMANDS` — page mutation (goto, click, fill, scroll, etc.)
+- `META_COMMANDS` — server/tab/visual ops (screenshot, tabs, snapshot, chain, etc.)
+
+The server routes each command to `handleReadCommand`, `handleWriteCommand`, or
+`handleMetaCommand` based on set membership.
+
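Reviewer note: routing by set membership, as described above, could look roughly like this. The set contents are abbreviated from the bullets above, and the handler bodies, the `Handler` type, and the `dispatchSketch` function are hypothetical stand-ins rather than the code in `commands.ts` or `server.ts`.

```typescript
// Hypothetical sketch of set-membership dispatch. Set contents are abbreviated
// and the handlers only return a label so the routing is easy to observe.
const READ_COMMANDS = new Set<string>(["text", "html", "links", "js", "console"]);
const WRITE_COMMANDS = new Set<string>(["goto", "click", "fill", "scroll"]);
const META_COMMANDS = new Set<string>(["screenshot", "tabs", "snapshot", "chain"]);

type Handler = (command: string, args: string[]) => string;

const handleReadCommand: Handler = (command) => `read:${command}`;
const handleWriteCommand: Handler = (command) => `write:${command}`;
const handleMetaCommand: Handler = (command) => `meta:${command}`;

function dispatchSketch(command: string, args: string[] = []): string {
  if (READ_COMMANDS.has(command)) return handleReadCommand(command, args);
  if (WRITE_COMMANDS.has(command)) return handleWriteCommand(command, args);
  if (META_COMMANDS.has(command)) return handleMetaCommand(command, args);
  // A command in none of the three sets is rejected rather than guessed at.
  throw new Error(`Unknown command: ${command}`);
}
```

Keeping the three sets disjoint means a command's category is decided by one lookup, and a new command only needs to be added to the right set.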
+**Ref system:** `snapshot` assigns `@e1`/`@e2`/`@c1` refs to elements, stored in
+`BrowserManager.refMap`. Later commands resolve a ref such as `@e3` to the
+Playwright `Locator` from the last snapshot. The `-C` flag adds `@c` refs for
+non-ARIA clickable elements.
+
+**Logging:** 3 `CircularBuffer` instances (console, network, dialog) in `buffers.ts`,
+flushed to `.gstack/browse-{console,network,dialog}.log`. Each is a ring buffer
+with fixed capacity, so the oldest entries are overwritten once it is full.
+
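Reviewer note: for readers unfamiliar with the fixed-capacity behavior described above, here is a minimal ring buffer sketch. The class name `RingBufferSketch` and its methods are hypothetical; the real `CircularBuffer` in `buffers.ts` may expose a different API and also handles flushing to disk, which this sketch omits.

```typescript
// Hypothetical ring buffer sketch: once `capacity` entries are stored, each
// push overwrites the oldest entry.
class RingBufferSketch<T> {
  private readonly items: (T | undefined)[];
  private readonly capacity: number;
  private next = 0; // index where the next push lands
  private count = 0; // number of valid entries, never above capacity

  constructor(capacity: number) {
    if (!Number.isInteger(capacity) || capacity <= 0) {
      throw new RangeError("capacity must be a positive integer");
    }
    this.capacity = capacity;
    this.items = new Array<T | undefined>(capacity);
  }

  push(item: T): void {
    this.items[this.next] = item;
    this.next = (this.next + 1) % this.capacity;
    this.count = Math.min(this.count + 1, this.capacity);
  }

  // Returns the surviving entries, oldest first.
  toArray(): T[] {
    const start = (this.next - this.count + this.capacity) % this.capacity;
    const out: T[] = [];
    for (let i = 0; i < this.count; i++) {
      out.push(this.items[(start + i) % this.capacity] as T);
    }
    return out;
  }
}
```

With a capacity of 3, pushing five entries keeps only the last three, which is the trade-off the section describes: bounded memory at the cost of dropping old log lines.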
+**Tests:** Direct handler invocation against `BrowserManager` (no HTTP layer),
+using a shared test server in `browse/test/test-server.ts`.
+
+## Test infrastructure internals
+
+Key helpers in `test/helpers/` that the test system depends on:
+
+**`touchfiles.ts`** — Diff-based test selection. Maps test names → file glob
+patterns. `selectTests()` checks `git diff` against base branch, runs only tests
+whose dependencies changed. A change to any of the `GLOBAL_TOUCHFILES`
+(session-runner, eval-store, llm-judge, gen-skill-docs, touchfiles itself,
+test-server) triggers all tests.
+
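Reviewer note: the selection rule described above can be sketched as follows. This is a simplified illustration: `selectTestsSketch`, `globToRegExp`, and the parameter shapes are hypothetical, the changed-file list is passed in rather than read from `git diff`, and the glob matcher supports only `*` and `**`. The real `selectTests()` in `touchfiles.ts` may differ in all of these respects.

```typescript
// Hypothetical sketch of diff-based test selection.

// Minimal glob support: `*` matches within one path segment, `**` matches
// across segments. Other glob features are not handled in this sketch.
function globToRegExp(glob: string): RegExp {
  const pattern = glob
    .replace(/[.+?^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000") // protect `**` before handling `*`
    .replace(/\*/g, "[^/]*")
    .replace(/\u0000/g, ".*");
  return new RegExp(`^${pattern}$`);
}

function matchesAny(file: string, globs: string[]): boolean {
  return globs.some((glob) => globToRegExp(glob).test(file));
}

function selectTestsSketch(
  changedFiles: string[],
  touchfiles: Record<string, string[]>, // test name -> glob patterns
  globalTouchfiles: string[],
): string[] {
  const allTests = Object.keys(touchfiles);
  // A change to shared infrastructure invalidates every test.
  if (changedFiles.some((file) => matchesAny(file, globalTouchfiles))) {
    return allTests;
  }
  // Otherwise run only the tests whose patterns match a changed file.
  return allTests.filter((name) =>
    changedFiles.some((file) => matchesAny(file, touchfiles[name] ?? [])),
  );
}
```

Under these assumptions, a change that touches only unrelated files selects no tests, while a change to any global touchfile selects all of them.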
+**`session-runner.ts`** — Spawns `claude -p` as a subprocess (not Agent SDK),
+streams NDJSON output for real-time progress. Returns `SkillTestResult` with tool
+calls, browse errors, cost estimate, exit reason, and full transcript.
+
+**`eval-store.ts`** — `EvalCollector` accumulates test results, writes them to
+`~/.gstack-dev/evals/{version}-{branch}-{tier}-{timestamp}.json`. It prints a
+summary table and auto-compares with the previous run. Comparison functions are
+exported for the `eval:compare` CLI.
+
+**`llm-judge.ts`** — Two judge types via `callJudge()` (claude-sonnet-4-6):
+- `judge()` — doc quality scorer (clarity, completeness, actionability; 1-5 each)
+- `outcomeJudge()` — planted-bug detection scorer (detection rate, false positives,
+  evidence quality)
+
+**Observability files:**
+- `~/.gstack-dev/e2e-live.json` — heartbeat updated during E2E runs
+- Partial results persisted during long runs for crash recovery
+- NDJSON transcripts saved per-test for debugging
diff --git a/README.md b/README.md
@@ -46,11 +46,11 @@ Fork it. Improve it. Make it yours. And if you want to hate on free open source
 
 Open Claude Code and paste this. Claude does the rest.
 
-> Install gstack: run **`git clone https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup`** then add a "gstack" section to CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, and lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /land-and-deploy, /canary, /benchmark, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /setup-deploy, /retro, /investigate, /document-release, /codex, /cso, /autoplan, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade. Then ask the user if they also want to add gstack to the current project so teammates get it.
+> Install gstack: run **`git clone https://github.com/garrytan/gstack.git ~/.claude/skills/gstack && cd ~/.claude/skills/gstack && ./setup`** then add a "gstack" section to CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, and lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /land-and-deploy, /canary, /benchmark, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /setup-deploy, /retro, /investigate, /document-release, /codebase-audit, /codex, /cso, /autoplan, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade. Then ask the user if they also want to add gstack to the current project so teammates get it.
 
 ### Step 2: Add to your repo so teammates get it (optional)
 
-> Add gstack to this project: run **`cp -Rf ~/.claude/skills/gstack .claude/skills/gstack && rm -rf .claude/skills/gstack/.git && cd .claude/skills/gstack && ./setup`** then add a "gstack" section to this project's CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /land-and-deploy, /canary, /benchmark, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /setup-deploy, /retro, /investigate, /document-release, /codex, /cso, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade, and tells Claude that if gstack skills aren't working, run `cd .claude/skills/gstack && ./setup` to build the binary and register skills.
+> Add gstack to this project: run **`cp -Rf ~/.claude/skills/gstack .claude/skills/gstack && rm -rf .claude/skills/gstack/.git && cd .claude/skills/gstack && ./setup`** then add a "gstack" section to this project's CLAUDE.md that says to use the /browse skill from gstack for all web browsing, never use mcp\_\_claude-in-chrome\_\_\* tools, lists the available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review, /design-consultation, /review, /ship, /land-and-deploy, /canary, /benchmark, /browse, /qa, /qa-only, /design-review, /setup-browser-cookies, /setup-deploy, /retro, /investigate, /document-release, /codebase-audit, /codex, /cso, /autoplan, /careful, /freeze, /guard, /unfreeze, /gstack-upgrade, and tells Claude that if gstack skills aren't working, run `cd .claude/skills/gstack && ./setup` to build the binary and register skills.
 
 Real files get committed to your repo (not a submodule), so `git clone` just works. Everything lives inside `.claude/`. Nothing touches your PATH or runs in the background.
 
@@ -156,6 +156,7 @@ Each skill feeds into the next. `/office-hours` writes a design doc that `/plan-
 | `/canary` | **SRE** | Post-deploy monitoring loop. Watches for console errors, performance regressions, and page failures. |
 | `/benchmark` | **Performance Engineer** | Baseline page load times, Core Web Vitals, and resource sizes. Compare before/after on every PR. |
 | `/document-release` | **Technical Writer** | Update all project docs to match what you just shipped. Catches stale READMEs automatically. |
+| `/codebase-audit` | **Code Auditor** | Full codebase audit from cold. Finds bugs, security issues, tech debt, architecture problems, and test gaps. Report only — never touches code. |
 | `/retro` | **Eng Manager** | Team-aware weekly retro. Per-person breakdowns, shipping streaks, test health trends, growth opportunities. `/retro global` runs across all your projects and AI tools (Claude Code, Codex, Gemini). |
 | `/browse` | **QA Engineer** | Give the agent eyes. Real Chromium browser, real clicks, real screenshots. ~100ms per command. `$B connect` launches your real Chrome as a headed window — watch every action live. |
 | `/setup-browser-cookies` | **Session Manager** | Import cookies from your real browser (Chrome, Arc, Brave, Edge) into the headless session. Test authenticated pages. |
@@ -266,8 +267,8 @@ Use /browse from gstack for all web browsing. Never use mcp__claude-in-chrome__*
 Available skills: /office-hours, /plan-ceo-review, /plan-eng-review, /plan-design-review,
 /design-consultation, /review, /ship, /land-and-deploy, /canary, /benchmark, /browse,
 /qa, /qa-only, /design-review, /setup-browser-cookies, /setup-deploy, /retro,
-/investigate, /document-release, /codex, /cso, /autoplan, /careful, /freeze, /guard,
-/unfreeze, /gstack-upgrade.
+/investigate, /document-release, /codebase-audit, /codex, /cso, /autoplan, /careful,
+/freeze, /guard, /unfreeze, /gstack-upgrade.
 ```
 
 ## License