Coacker orchestrates a multi-phase pipeline with role-separated agents (Blue Team / Red Team) to perform automated security reviews. It discovers vulnerabilities, validates findings with PoC tests, generates fixes, and reviews fix PRs — all driven by a label-based state machine on GitHub Issues.
```
CLI → Brain → Player → Backend → Shared
                          ↓
                       Toolkit (optional: AstAnalyzer, McpClient, Sandbox, RepoMap)
```
- Shared — Types, TOML config loader, colored logger
- Backend — Backend interface + implementations (ClaudeCode, AG/CDP, API) + Toolkit
- Player — Executes multi-step Tasks, manages step sequencing and result collection
- Brain — State machine orchestrators: AuditBrain, ValidateBrain, FixBrain, ReviewBrain
- CLI — Entry point, Pipeline runner, Poller (label-driven state machine)
| Backend | Config | Description |
|---|---|---|
| `claude-code` | `type = "claude-code"` | Recommended. Pure CLI, zero timeouts, best Recall |
| `api` | `type = "api"` | Direct LLM API calls (OpenAI, Anthropic, Gemini) |
| `ag` | `type = "ag"` | Legacy. AG IDE via CDP, unstable |
Coacker runs a closed-loop pipeline from discovery to fix:
```
Audit → Issue → Validate → Fix → PR → Review → (human merge) → Done
                    ↑                     │
                    └── reject: close PR ─┘
```
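The loop above is driven entirely by issue labels. A minimal sketch of the cross-phase transitions (label and verdict names come from this README; the `nextLabel` function itself is illustrative, not Coacker's API):

```typescript
// Hypothetical sketch of Coacker's label-driven state machine.
// Labels and verdicts are taken from the README; the function is illustrative.
type Label =
  | "coacker:pending"
  | "coacker:validated"
  | "coacker:invalid"
  | "coacker:pr-pending"
  | "coacker:unfixable"
  | "coacker:needs-human";

type Event =
  | { phase: "validate"; verdict: "ACCEPT" | "REJECT" }
  | { phase: "fix"; verdict: "FIXED" | "UNFIXABLE" | "RETRY" }
  | { phase: "review"; verdict: "APPROVE" | "REJECT" };

function nextLabel(current: Label, event: Event): Label {
  switch (event.phase) {
    case "validate": // coacker:pending → validated / invalid
      return event.verdict === "ACCEPT" ? "coacker:validated" : "coacker:invalid";
    case "fix": // coacker:validated → pr-pending / unfixable / revert on retry
      if (event.verdict === "FIXED") return "coacker:pr-pending";
      if (event.verdict === "UNFIXABLE") return "coacker:unfixable";
      return "coacker:validated"; // RETRY reverts the issue
    case "review":
      // APPROVE only posts a comment; the label stays until a human merges.
      return event.verdict === "APPROVE" ? current : "coacker:validated";
  }
}
```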
Explores the codebase and discovers vulnerabilities using role-separated agents.
Intention → Explores project → Splits into review sub-tasks
Per sub-task:
Implement → Describes code paths and state changes (facts only)
→ Review → Code quality / security audit (Blue Team)
→ Attack → Business logic flaw hunting (Red Team)
→ Propose → Creates GitHub issues for findings
Gap Analysis → Finds uncovered areas → Spawns new sub-tasks (iterative)
Consolidation → AI synthesizes executive summary
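The iterative sub-task/gap loop can be sketched as follows. This is a simplified illustration: `runAudit`, `reviewSubTask`, `findGaps`, and `Finding` are hypothetical stand-ins, and the default bounds mirror the `maxGapRounds` / `maxSubTasks` settings shown later in this README.

```typescript
// Illustrative skeleton of the audit phase's iterative gap loop.
// reviewSubTask / findGaps are hypothetical stand-ins, not Coacker's API.
interface Finding { title: string; severity: "Critical" | "High" | "Medium" | "Low" }

function runAudit(
  initialSubTasks: string[],
  reviewSubTask: (task: string) => Finding[],
  findGaps: (done: string[]) => string[], // Gap Analysis: uncovered areas
  maxGapRounds = 3,
  maxSubTasks = 25,
): Finding[] {
  const queue = [...initialSubTasks];
  const done: string[] = [];
  const findings: Finding[] = [];
  let round = 0;
  while (queue.length > 0) {
    const task = queue.shift()!;
    findings.push(...reviewSubTask(task)); // Implement → Review → Attack → Propose
    done.push(task);
    // When the queue drains, ask the Gap Analyzer for uncovered areas.
    if (queue.length === 0 && round < maxGapRounds && done.length < maxSubTasks) {
      queue.push(...findGaps(done).slice(0, maxSubTasks - done.length));
      round++;
    }
  }
  return findings; // Consolidation would then synthesize these into a summary
}
```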
Validates each coacker:pending issue by generating PoC test code.
Per issue:
Understand → Read issue + source code, assess testability
Test Gen → Write PoC test code
Test Review → Independent review (new conversation)
→ ACCEPT → Commit tests, push branch, mark coacker:validated
→ REJECT → Retry (max 3) or mark coacker:invalid
Generates fixes for validated issues and creates PRs.
Per validated issue:
Analyze → Deep-read issue, tests, and source code
Fix Gen → Write fix code
Fix Review → Self-review for correctness
→ FIXED → Commit fix + tests, push branch, create PR
→ UNFIXABLE → Mark coacker:unfixable
→ RETRY → Revert to validated (max 3 attempts, then coacker:needs-human)
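The retry bound can be expressed as a small decision function. A hypothetical sketch: only the verdict and label names come from this README; `fixOutcomeLabel` is illustrative.

```typescript
// Hypothetical sketch of the fix phase's retry escalation; the verdict and
// label names come from the README, the function itself is illustrative.
type FixVerdict = "FIXED" | "UNFIXABLE" | "RETRY";

function fixOutcomeLabel(verdict: FixVerdict, attempt: number, maxAttempts = 3): string {
  if (verdict === "FIXED") return "coacker:pr-pending";   // fix PR created
  if (verdict === "UNFIXABLE") return "coacker:unfixable";
  // RETRY: revert to validated until attempts are exhausted, then escalate
  return attempt >= maxAttempts ? "coacker:needs-human" : "coacker:validated";
}
```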
Reviews fix PRs and provides feedback.
Per PR:
Diff Review → Analyze PR diff against issue context
Verdict → Approve, request changes, or comment
→ APPROVE → Post approval comment
→ REJECT → Close PR, post feedback on issue, revert to validated
| Role | Team | Focus |
|---|---|---|
| Intention | — | Project exploration and task breakdown |
| Implement | — | Execution paths, state changes, dependencies (facts only) |
| Reviewer | Blue | Engineering quality: leaks, concurrency, validation |
| Attacker | Red | Business logic flaws: auth bypass, state inconsistency |
| Issue Proposer | — | Creates GitHub issues for Critical/High findings |
| Gap Analyzer | — | Reviews coverage, identifies gaps, deduplicates |
| Consolidator | — | Synthesizes all findings into executive summary |
| Test Generator | Blue | Writes PoC test code for validation |
| Fix Generator | — | Writes fix code for validated issues |
| PR Reviewer | Blue | Reviews fix PRs for correctness |
| Mode | Config | Description |
|---|---|---|
| `audit` | `brain.type = "audit"` | Single audit run (default) |
| `validate` | `brain.type = "validate"` | Validate pending issues |
| `fix` | `brain.type = "fix"` | Fix validated issues |
| `review` | `brain.type = "review"` | Review fix PRs |
| `poller` | `brain.type = "poller"` | Recommended for continuous operation. 4 parallel phase loops, label-driven |
| `pipeline` | `brain.type = "pipeline"` | Sequential loop (legacy) |
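Switching between modes only changes `brain.type`; for example, a minimal continuous-operation config (a sketch using only keys already shown in this README) might look like:

```toml
[backend]
type = "claude-code"

[brain]
type = "poller"  # 4 parallel label-driven phase loops
```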
The poller runs 4 independent loops in parallel, each scanning GitHub issues/PRs by label:
Audit loop (every 1h) → Creates issues with coacker:pending
Validate loop (every 5min) → Scans coacker:pending → validated/invalid
Fix loop (every 5min) → Scans coacker:validated → creates PR
Review loop (every 5min) → Scans coacker:pr-pending → approve/reject
Each phase uses git worktree isolation. On restart, intermediate states are automatically recovered.
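The four loops can be summarized as a schedule table. The intervals and labels come from this README; the data structure itself is an illustrative sketch, not Coacker's internals.

```typescript
// Illustrative summary of the poller's four loops as data; intervals and
// labels come from the README, the structure itself is a sketch.
interface PollLoop {
  name: string;
  scanLabel: string | null; // null: the audit loop creates issues rather than scanning
  intervalMs: number;
}

const POLL_LOOPS: PollLoop[] = [
  { name: "audit", scanLabel: null, intervalMs: 60 * 60 * 1000 }, // every 1h
  { name: "validate", scanLabel: "coacker:pending", intervalMs: 5 * 60 * 1000 },
  { name: "fix", scanLabel: "coacker:validated", intervalMs: 5 * 60 * 1000 },
  { name: "review", scanLabel: "coacker:pr-pending", intervalMs: 5 * 60 * 1000 },
];
```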
```shell
# 1. Install
pnpm install

# 2. Configure
cp config.example.toml config.toml
# Edit config.toml — set project.entry, project.origin, and backend.type

# 3. Run
npx tsx packages/cli/src/main.ts
```

Prerequisites:

- Node.js >= 18
- `gh` CLI authenticated (`gh auth login`)
- pnpm
All settings live in config.toml. See docs/configuration.md for the full reference.
```toml
[project]
root = "."
entry = "src/main.ts"
intent = "Security audit of smart contracts"
origin = "owner/repo"

[backend]
type = "claude-code"

[brain]
type = "audit"

[brain.audit]
inoculation = true
```

The best configuration from 26 optimization runs (F1=44%, Precision=32%, Recall=70%):
```toml
[backend]
type = "claude-code"

[brain.audit]
maxGapRounds = 3
maxSubTasks = 25
inoculation = true    # Most effective single optimization — always keep ON
fineSplitting = true  # Per-contract subtask splitting
deepDive = true       # Second-pass review
postFilter = false    # Hurts Recall — do not enable
```

Group all findings under a single tracking issue:
```toml
[project]
origin = "owner/repo"
parentIssue = "Coacker Audit: my-project v1.0"
```

At startup, Coacker searches for an existing open issue with that title (or creates one), then automatically links every new finding as a GitHub sub-issue.
When origin is set, Coacker runs preflight checks before any brain mode:
- Label initialization — All 13 `coacker:*` pipeline labels are auto-created with color coding if missing
- Parent issue resolution — Finds or creates the parent tracking issue
If either check fails, Coacker aborts rather than running without proper issue tracking.
All development commands use just (task runner):
```shell
just            # List all recipes
just check      # Type check all packages
just check-pkg brain  # Type check single package
just lint       # Lint
just lint-fix   # Auto-fix lint issues
just fmt        # Format
just test       # Run tests
just ci         # All checks (typecheck + lint + format check)
just run        # Run audit with default config
```

```shell
# LLM-as-Judge evaluation against benchmark
npx tsx scripts/eval-judge.ts output

# Multi-run comparison
npx tsx scripts/compare-runs.ts

# Retry failed subtasks
npx tsx scripts/retry-failed.ts
```

The benchmark contains 27 known bugs (4 Critical + 4 High + 12 Medium + 7 Low) from Gravity Chain Core Contracts; see scripts/benchmark-baseline.json.
```
output/
├── state.json      — Brain state (crash recovery)
├── conversations/  — Full conversation logs per task
├── reports/        — Per-task structured reports
├── report.md       — Final consolidated audit report
├── validate/       — Validation results
├── fix/            — Fix results
└── review/         — Review results
```
- Inoculation is the most effective single optimization — always keep ON
- Output format constraints hurt quality — improve input (enrichment), don't constrain output
- More than 25 subtasks increases FP without improving Recall — diminishing returns
- SAST alone is negative — but SAST + DeepDive combination improves Precision
- Claude Code backend > AG IDE — zero timeouts, better Recall, no network dependency
- Eval has ±10% variance — only trust large, consistent effects
- TypeScript 5.7 (strict), ES2022 target, ESNext modules
- pnpm workspaces, Vitest, ESLint + Prettier
- Playwright (CDP), smol-toml, tsx
- Slither + Aderyn (SAST), tree-sitter (AST)
MIT