Coacker orchestrates a multi-phase pipeline with role-separated agents (Blue Team / Red Team) to perform automated security reviews. It discovers vulnerabilities, validates findings with PoC tests, generates fixes, and reviews fix PRs — all driven by a label-based state machine on GitHub Issues.
```
CLI → Brain → Player → Backend → Shared
                          ↓
                       Toolkit (optional: AstAnalyzer, McpClient, Sandbox, RepoMap)
```
- Shared — Types, TOML config loader, colored logger
- Backend — Backend interface + implementations (ClaudeCode, AG/CDP, API) + Toolkit
- Player — Executes multi-step Tasks, manages step sequencing and result collection
- Brain — State machine orchestrators: AuditBrain, ValidateBrain, FixBrain, ReviewBrain
- CLI — Entry point, Pipeline runner, Poller (label-driven state machine)
| Backend | Config | Description |
|---|---|---|
| `claude-code` | `type = "claude-code"` | Recommended. Pure CLI, zero timeouts, best Recall |
| `api` | `type = "api"` | Direct LLM API calls (OpenAI, Anthropic, Gemini) |
| `ag` | `type = "ag"` | Legacy. AG IDE via CDP, unstable |
Coacker runs a closed-loop pipeline from discovery to fix:
```
Audit → Issue → Validate → Fix → PR → Review → (human merge) → Done
                    ↑                     │
                    └── reject: close PR ─┘
```
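The loop above is driven entirely by issue labels. A minimal sketch of the cross-phase transitions (label and verdict names come from this README; the `nextLabel` function itself is illustrative, not Coacker's API):

```typescript
// Hypothetical sketch of Coacker's label-driven state machine.
// Labels and verdicts are taken from the README; the function is illustrative.
type Label =
  | "coacker:pending"
  | "coacker:validated"
  | "coacker:invalid"
  | "coacker:pr-pending"
  | "coacker:unfixable"
  | "coacker:needs-human";

type Event =
  | { phase: "validate"; verdict: "ACCEPT" | "REJECT" }
  | { phase: "fix"; verdict: "FIXED" | "UNFIXABLE" | "RETRY" }
  | { phase: "review"; verdict: "APPROVE" | "REJECT" };

function nextLabel(current: Label, event: Event): Label {
  switch (event.phase) {
    case "validate": // coacker:pending → validated / invalid
      return event.verdict === "ACCEPT" ? "coacker:validated" : "coacker:invalid";
    case "fix": // coacker:validated → pr-pending / unfixable / revert on retry
      if (event.verdict === "FIXED") return "coacker:pr-pending";
      if (event.verdict === "UNFIXABLE") return "coacker:unfixable";
      return "coacker:validated"; // RETRY reverts the issue
    case "review":
      // APPROVE only posts a comment; the label stays until a human merges.
      return event.verdict === "APPROVE" ? current : "coacker:validated";
  }
}
```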
Explores the codebase and discovers vulnerabilities using role-separated agents.
Intention → Explores project → Splits into review sub-tasks
Per sub-task:
Implement → Describes code paths and state changes (facts only)
→ Review → Code quality / security audit (Blue Team)
→ Attack → Business logic flaw hunting (Red Team)
→ Propose → Creates GitHub issues for findings
Gap Analysis → Finds uncovered areas → Spawns new sub-tasks (iterative)
Consolidation → AI synthesizes executive summary
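The iterative sub-task/gap loop can be sketched as follows. This is a simplified illustration: `runAudit`, `reviewSubTask`, `findGaps`, and `Finding` are hypothetical stand-ins, and the default bounds mirror the `maxGapRounds` / `maxSubTasks` settings shown later in this README.

```typescript
// Illustrative skeleton of the audit phase's iterative gap loop.
// reviewSubTask / findGaps are hypothetical stand-ins, not Coacker's API.
interface Finding { title: string; severity: "Critical" | "High" | "Medium" | "Low" }

function runAudit(
  initialSubTasks: string[],
  reviewSubTask: (task: string) => Finding[],
  findGaps: (done: string[]) => string[], // Gap Analysis: uncovered areas
  maxGapRounds = 3,
  maxSubTasks = 25,
): Finding[] {
  const queue = [...initialSubTasks];
  const done: string[] = [];
  const findings: Finding[] = [];
  let round = 0;
  while (queue.length > 0) {
    const task = queue.shift()!;
    findings.push(...reviewSubTask(task)); // Implement → Review → Attack → Propose
    done.push(task);
    // When the queue drains, ask the Gap Analyzer for uncovered areas.
    if (queue.length === 0 && round < maxGapRounds && done.length < maxSubTasks) {
      queue.push(...findGaps(done).slice(0, maxSubTasks - done.length));
      round++;
    }
  }
  return findings; // Consolidation would then synthesize these into a summary
}
```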
Validates each coacker:pending issue by generating PoC test code.
Per issue:
Understand → Read issue + source code, assess testability
Test Gen → Write PoC test code
Test Review → Independent review (new conversation)
→ ACCEPT → Commit tests, push branch, mark coacker:validated
→ REJECT → Retry (max 3) or mark coacker:invalid
Generates fixes for validated issues and creates PRs.
Per validated issue:
Analyze → Deep-read issue, tests, and source code
Fix Gen → Write fix code
Fix Review → Self-review for correctness
→ FIXED → Commit fix + tests, push branch, create PR
→ UNFIXABLE → Mark coacker:unfixable
→ RETRY → Revert to validated (max 3 attempts, then coacker:needs-human)
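The retry bound can be expressed as a small decision function. A hypothetical sketch: only the verdict and label names come from this README; `fixOutcomeLabel` is illustrative.

```typescript
// Hypothetical sketch of the fix phase's retry escalation; the verdict and
// label names come from the README, the function itself is illustrative.
type FixVerdict = "FIXED" | "UNFIXABLE" | "RETRY";

function fixOutcomeLabel(verdict: FixVerdict, attempt: number, maxAttempts = 3): string {
  if (verdict === "FIXED") return "coacker:pr-pending";   // fix PR created
  if (verdict === "UNFIXABLE") return "coacker:unfixable";
  // RETRY: revert to validated until attempts are exhausted, then escalate
  return attempt >= maxAttempts ? "coacker:needs-human" : "coacker:validated";
}
```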
Reviews fix PRs and provides feedback.
Per PR:
Diff Review → Analyze PR diff against issue context
Verdict → Approve, request changes, or comment
→ APPROVE → Post approval comment
→ REJECT → Close PR, post feedback on issue, revert to validated
| Role | Team | Focus |
|---|---|---|
| Intention | — | Project exploration and task breakdown |
| Implement | — | Execution paths, state changes, dependencies (facts only) |
| Reviewer | Blue | Engineering quality: leaks, concurrency, validation |
| Attacker | Red | Business logic flaws: auth bypass, state inconsistency |
| Issue Proposer | — | Creates GitHub issues for Critical/High findings |
| Gap Analyzer | — | Reviews coverage, identifies gaps, deduplicates |
| Consolidator | — | Synthesizes all findings into executive summary |
| Test Generator | Blue | Writes PoC test code for validation |
| Fix Generator | — | Writes fix code for validated issues |
| PR Reviewer | Blue | Reviews fix PRs for correctness |
| Mode | Config | Description |
|---|---|---|
| `audit` | `brain.type = "audit"` | Single audit run (default) |
| `validate` | `brain.type = "validate"` | Validate pending issues |
| `fix` | `brain.type = "fix"` | Fix validated issues |
| `review` | `brain.type = "review"` | Review fix PRs |
| `poller` | `brain.type = "poller"` | Recommended for continuous operation. 4 parallel phase loops, label-driven |
| `pipeline` | `brain.type = "pipeline"` | Sequential loop (legacy) |
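Switching between modes only changes `brain.type`; for example, a minimal continuous-operation config (a sketch using only keys already shown in this README) might look like:

```toml
[backend]
type = "claude-code"

[brain]
type = "poller"  # 4 parallel label-driven phase loops
```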
The poller runs 4 independent loops in parallel, each scanning GitHub issues/PRs by label:
Audit loop (every 1h) → Creates issues with coacker:pending
Validate loop (every 5min) → Scans coacker:pending → validated/invalid
Fix loop (every 5min) → Scans coacker:validated → creates PR
Review loop (every 5min) → Scans coacker:pr-pending → approve/reject
Each phase uses git worktree isolation. On restart, intermediate states are automatically recovered.
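The four loops can be summarized as a schedule table. The intervals and labels come from this README; the data structure itself is an illustrative sketch, not Coacker's internals.

```typescript
// Illustrative summary of the poller's four loops as data; intervals and
// labels come from the README, the structure itself is a sketch.
interface PollLoop {
  name: string;
  scanLabel: string | null; // null: the audit loop creates issues rather than scanning
  intervalMs: number;
}

const POLL_LOOPS: PollLoop[] = [
  { name: "audit", scanLabel: null, intervalMs: 60 * 60 * 1000 }, // every 1h
  { name: "validate", scanLabel: "coacker:pending", intervalMs: 5 * 60 * 1000 },
  { name: "fix", scanLabel: "coacker:validated", intervalMs: 5 * 60 * 1000 },
  { name: "review", scanLabel: "coacker:pr-pending", intervalMs: 5 * 60 * 1000 },
];
```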
```shell
# 1. Install
pnpm install

# 2. Configure
cp config.example.toml config.toml
# Edit config.toml — set project.entry, project.origin, and backend.type

# 3. Run
npx tsx packages/cli/src/main.ts
```

Prerequisites:

- Node.js >= 18
- `gh` CLI authenticated (`gh auth login`)
- pnpm
All settings live in config.toml. See docs/configuration.md for the full reference.
```toml
[project]
root = "."
entry = "src/main.ts"
intent = "Security audit of smart contracts"
origin = "owner/repo"

[backend]
type = "claude-code"

[brain]
type = "audit"

[brain.audit]
inoculation = true
```

The best configuration from 26 optimization runs (F1=44%, Precision=32%, Recall=70%):
```toml
[backend]
type = "claude-code"

[brain.audit]
maxGapRounds = 3
maxSubTasks = 25
inoculation = true    # Most effective single optimization — always keep ON
fineSplitting = true  # Per-contract subtask splitting
deepDive = true       # Second-pass review
postFilter = false    # Hurts Recall — do not enable
```

Group all findings under a single tracking issue:
```toml
[project]
origin = "owner/repo"
parentIssue = "Coacker Audit: my-project v1.0"
```

At startup, Coacker searches for an existing open issue with that title (or creates one), then automatically links every new finding as a GitHub sub-issue.
When origin is set, Coacker runs preflight checks before any brain mode:
- Label initialization — All 13 `coacker:*` pipeline labels are auto-created with color coding if missing
- Parent issue resolution — Finds or creates the parent tracking issue
If either check fails, Coacker aborts rather than running without proper issue tracking.
All development commands use just (task runner):
```shell
just            # List all recipes
just check      # Type check all packages
just check-pkg brain  # Type check single package
just lint       # Lint
just lint-fix   # Auto-fix lint issues
just fmt        # Format
just test       # Run tests
just ci         # All checks (typecheck + lint + format check)
just run        # Run audit with default config
```

```shell
# LLM-as-Judge evaluation against benchmark
npx tsx scripts/eval-judge.ts output

# Multi-run comparison
npx tsx scripts/compare-runs.ts

# Retry failed subtasks
npx tsx scripts/retry-failed.ts
```

The benchmark contains 27 known bugs (4 Critical + 4 High + 12 Medium + 7 Low) from Gravity Chain Core Contracts; see scripts/benchmark-baseline.json.
```
output/
├── state.json      — Brain state (crash recovery)
├── conversations/  — Full conversation logs per task
├── reports/        — Per-task structured reports
├── report.md       — Final consolidated audit report
├── validate/       — Validation results
├── fix/            — Fix results
└── review/         — Review results
```
- Inoculation is the most effective single optimization — always keep ON
- Output format constraints hurt quality — improve input (enrichment), don't constrain output
- More than 25 subtasks increases FP without improving Recall — diminishing returns
- SAST alone is negative — but SAST + DeepDive combination improves Precision
- Claude Code backend > AG IDE — zero timeouts, better Recall, no network dependency
- Eval has ±10% variance — only trust large, consistent effects
- TypeScript 5.7 (strict), ES2022 target, ESNext modules
- pnpm workspaces, Vitest, ESLint + Prettier
- Playwright (CDP), smol-toml, tsx
- Slither + Aderyn (SAST), tree-sitter (AST)
MIT