Galxe/gravity-reviewer

Coacker — Multi-Agent AI Code Security Auditor

Coacker orchestrates a multi-phase pipeline with role-separated agents (Blue Team / Red Team) to perform automated security reviews. It discovers vulnerabilities, validates findings with PoC tests, generates fixes, and reviews fix PRs — all driven by a label-based state machine on GitHub Issues.

Architecture

CLI → Brain → Player → Backend → Shared
               ↓
            Toolkit (optional: AstAnalyzer, McpClient, Sandbox, RepoMap)
  • Shared — Types, TOML config loader, colored logger
  • Backend — Backend interface + implementations (ClaudeCode, AG/CDP, API) + Toolkit
  • Player — Executes multi-step Tasks, manages step sequencing and result collection
  • Brain — State machine orchestrators: AuditBrain, ValidateBrain, FixBrain, ReviewBrain
  • CLI — Entry point, Pipeline runner, Poller (label-driven state machine)

Backend Options

Backend       Config                  Description
claude-code   type = "claude-code"    Recommended. Pure CLI, zero timeouts, best Recall
api           type = "api"            Direct LLM API calls (OpenAI, Anthropic, Gemini)
ag            type = "ag"             Legacy. AG IDE via CDP, unstable

Pipeline

Coacker runs a closed-loop pipeline from discovery to fix:

Audit → Issue → Validate → Fix → PR → Review → (human merge) → Done
                  ↑                       │
                  └── reject: close PR ───┘

Phase 1: Audit

Explores the codebase and discovers vulnerabilities using role-separated agents.

Intention       → Explores project → Splits into review sub-tasks
Per sub-task:
  Implement     → Describes code paths and state changes (facts only)
  → Review      → Code quality / security audit (Blue Team)
  → Attack      → Business logic flaw hunting (Red Team)
  → Propose     → Creates GitHub issues for findings
Gap Analysis    → Finds uncovered areas → Spawns new sub-tasks (iterative)
Consolidation   → AI synthesizes executive summary

Phase 2: Validate

Validates each coacker:pending issue by generating PoC test code.

Per issue:
  Understand    → Read issue + source code, assess testability
  Test Gen      → Write PoC test code
  Test Review   → Independent review (new conversation)
  → ACCEPT      → Commit tests, push branch, mark coacker:validated
  → REJECT      → Retry (max 3) or mark coacker:invalid
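
The accept/reject loop above can be sketched as a bounded retry. This is a minimal illustration, not Coacker's actual code: `validateIssue`, `generateTest`, and `reviewTest` are hypothetical names, and it assumes "max 3" means three generate-and-review attempts in total.

```typescript
// Hypothetical types: illustrative only, not Coacker's actual API.
type Verdict = "ACCEPT" | "REJECT";
type ValidateLabel = "coacker:validated" | "coacker:invalid";

interface ValidateResult {
  label: ValidateLabel;
  attempts: number;
}

// Assumption: "max 3" means at most three attempts in total.
const MAX_ATTEMPTS = 3;

function validateIssue(
  generateTest: (attempt: number) => string,
  reviewTest: (testCode: string) => Verdict,
): ValidateResult {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    const testCode = generateTest(attempt);
    // The reviewer only sees the generated test, mirroring the
    // "independent review (new conversation)" step.
    if (reviewTest(testCode) === "ACCEPT") {
      return { label: "coacker:validated", attempts: attempt };
    }
  }
  // All attempts were rejected.
  return { label: "coacker:invalid", attempts: MAX_ATTEMPTS };
}

// Usage: a stand-in reviewer that only accepts the second draft.
const result = validateIssue(
  (attempt) => `poc-draft-${attempt}`,
  (code) => (code === "poc-draft-2" ? "ACCEPT" : "REJECT"),
);
console.log(result); // { label: 'coacker:validated', attempts: 2 }
```

Passing the generator and reviewer in as separate callbacks keeps the review step from sharing state with generation.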

Phase 3: Fix

Generates fixes for validated issues and creates PRs.

Per validated issue:
  Analyze       → Deep-read issue, tests, and source code
  Fix Gen       → Write fix code
  Fix Review    → Self-review for correctness
  → FIXED       → Commit fix + tests, push branch, create PR
  → UNFIXABLE   → Mark coacker:unfixable
  → RETRY       → Revert to validated (max 3 attempts, then coacker:needs-human)

Phase 4: Review

Reviews fix PRs and provides feedback.

Per PR:
  Diff Review   → Analyze PR diff against issue context
  Verdict       → Approve, request changes, or comment
  → APPROVE     → Post approval comment
  → REJECT      → Close PR, post feedback on issue, revert to validated
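
Taken together, the four phases move an issue between labels. The table below is one way to model those transitions, using only the labels named in this README. The outcome names are hypothetical, and it assumes that a created PR moves the issue to `coacker:pr-pending` (the label the Review loop scans). The README does not say which label follows an approval, so that transition is omitted.

```typescript
// Labels named in this README; the full set of 13 is not listed here.
type Label =
  | "coacker:pending"
  | "coacker:validated"
  | "coacker:invalid"
  | "coacker:pr-pending"
  | "coacker:unfixable"
  | "coacker:needs-human";

// Hypothetical outcome names, one per phase result described above.
type Outcome =
  | "validate:accept"
  | "validate:exhausted"
  | "fix:fixed"
  | "fix:unfixable"
  | "fix:retry"
  | "fix:exhausted"
  | "review:reject";

const transitions: Record<Label, Partial<Record<Outcome, Label>>> = {
  "coacker:pending": {
    "validate:accept": "coacker:validated",
    "validate:exhausted": "coacker:invalid",
  },
  "coacker:validated": {
    // Assumption: creating the PR hands the issue to the Review phase.
    "fix:fixed": "coacker:pr-pending",
    "fix:unfixable": "coacker:unfixable",
    "fix:retry": "coacker:validated",
    "fix:exhausted": "coacker:needs-human",
  },
  "coacker:pr-pending": {
    // Rejected PRs are closed and the issue returns to the Fix phase.
    "review:reject": "coacker:validated",
  },
  // Terminal labels: no automated transitions leave them.
  "coacker:invalid": {},
  "coacker:unfixable": {},
  "coacker:needs-human": {},
};

function nextLabel(current: Label, outcome: Outcome): Label {
  const next = transitions[current][outcome];
  if (next === undefined) {
    throw new Error(`No transition from ${current} on ${outcome}`);
  }
  return next;
}

console.log(nextLabel("coacker:pending", "validate:accept")); // coacker:validated
```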

Agent Roles

Role             Team   Focus
Intention        -      Project exploration and task breakdown
Implement        -      Execution paths, state changes, dependencies (facts only)
Reviewer         Blue   Engineering quality: leaks, concurrency, validation
Attacker         Red    Business logic flaws: auth bypass, state inconsistency
Issue Proposer   -      Creates GitHub issues for Critical/High findings
Gap Analyzer     -      Reviews coverage, identifies gaps, deduplicates
Consolidator     -      Synthesizes all findings into executive summary
Test Generator   Blue   Writes PoC test code for validation
Fix Generator    -      Writes fix code for validated issues
PR Reviewer      Blue   Reviews fix PRs for correctness

Operating Modes

Mode       Config                    Description
audit      brain.type = "audit"      Single audit run (default)
validate   brain.type = "validate"   Validate pending issues
fix        brain.type = "fix"        Fix validated issues
review     brain.type = "review"     Review fix PRs
poller     brain.type = "poller"     Recommended for continuous operation. 4 parallel phase loops, label-driven
pipeline   brain.type = "pipeline"   Sequential loop (legacy)

Poller (Label-Driven State Machine)

The poller runs 4 independent loops in parallel, each scanning GitHub issues/PRs by label:

Audit loop       (every 1h)    → Creates issues with coacker:pending
Validate loop    (every 5min)  → Scans coacker:pending → validated/invalid
Fix loop         (every 5min)  → Scans coacker:validated → creates PR
Review loop      (every 5min)  → Scans coacker:pr-pending → approve/reject

Each phase uses git worktree isolation. On restart, intermediate states are automatically recovered.
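
The loop table above can be written down as data. The sketch below is a simplified, single-threaded view of the schedule, not the parallel loops themselves: `PhaseLoop` and `dueLoops` are hypothetical names, while the intervals and scanned labels are taken from the table.

```typescript
// Hypothetical shape for one poller loop.
interface PhaseLoop {
  name: "audit" | "validate" | "fix" | "review";
  intervalMs: number;
  // null: the audit loop creates issues rather than scanning a label.
  scansLabel: string | null;
}

const MINUTE = 60 * 1000;

const loops: PhaseLoop[] = [
  { name: "audit", intervalMs: 60 * MINUTE, scansLabel: null },
  { name: "validate", intervalMs: 5 * MINUTE, scansLabel: "coacker:pending" },
  { name: "fix", intervalMs: 5 * MINUTE, scansLabel: "coacker:validated" },
  { name: "review", intervalMs: 5 * MINUTE, scansLabel: "coacker:pr-pending" },
];

// Returns the loops whose interval has elapsed since their last run.
function dueLoops(
  lastRun: Record<PhaseLoop["name"], number>,
  now: number,
): PhaseLoop[] {
  return loops.filter((loop) => now - lastRun[loop.name] >= loop.intervalMs);
}

// Usage: five minutes after a common start, only the 5-minute loops are due.
const start = { audit: 0, validate: 0, fix: 0, review: 0 };
console.log(dueLoops(start, 5 * MINUTE).map((l) => l.name));
// [ 'validate', 'fix', 'review' ]
```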

Quick Start

# 1. Install
pnpm install

# 2. Configure
cp config.example.toml config.toml
# Edit config.toml — set project.entry, project.origin, and backend.type

# 3. Run
npx tsx packages/cli/src/main.ts

Prerequisites

  • Node.js >= 18
  • gh CLI authenticated (gh auth login)
  • pnpm

Configuration

All settings live in config.toml. See docs/configuration.md for the full reference.

Minimal Config

[project]
root = "."
entry = "src/main.ts"
intent = "Security audit of smart contracts"
origin = "owner/repo"

[backend]
type = "claude-code"

[brain]
type = "audit"

[brain.audit]
inoculation = true

Recommended Config (Run21 Baseline)

The best configuration from 26 optimization runs (F1=44%, Precision=32%, Recall=70%):

[backend]
type = "claude-code"

[brain.audit]
maxGapRounds = 3
maxSubTasks = 25
inoculation = true        # Most effective single optimization — always keep ON
fineSplitting = true      # Per-contract subtask splitting
deepDive = true           # Second-pass review
postFilter = false        # Hurts Recall — do not enable
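
As a check on the figures quoted above, F1 is the harmonic mean of Precision and Recall. The helper name `f1Score` below is illustrative:

```typescript
// F1 = 2PR / (P + R), the harmonic mean of precision and recall.
function f1Score(precision: number, recall: number): number {
  if (precision + recall === 0) return 0; // avoid dividing by zero
  return (2 * precision * recall) / (precision + recall);
}

const f1 = f1Score(0.32, 0.7);
console.log(`${(f1 * 100).toFixed(1)}%`); // 43.9%
```

That rounds to the 44% reported for this configuration.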

Parent Issue Tracking

Group all findings under a single tracking issue:

[project]
origin = "owner/repo"
parentIssue = "Coacker Audit: my-project v1.0"

At startup, Coacker searches for an existing open issue with that title (or creates one), then automatically links every new finding as a GitHub sub-issue.
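
The find-or-create step can be sketched against an in-memory stand-in for GitHub. Everything here is hypothetical (`Issue`, `IssueTracker`, `resolveParentIssue`, `memoryTracker`), and the exact-title match is an assumption about how the search compares titles.

```typescript
// Hypothetical minimal issue shape and tracker interface, for illustration only.
interface Issue {
  number: number;
  title: string;
  state: "open" | "closed";
}

interface IssueTracker {
  list(): Issue[];
  create(title: string): Issue;
}

// Reuse an open issue with this title if one exists; otherwise create it.
function resolveParentIssue(tracker: IssueTracker, title: string): Issue {
  const existing = tracker
    .list()
    .find((issue) => issue.state === "open" && issue.title === title);
  return existing ?? tracker.create(title);
}

// In-memory tracker standing in for GitHub.
function memoryTracker(seed: Issue[]): IssueTracker {
  const issues = [...seed];
  return {
    list: () => issues,
    create: (title) => {
      const issue: Issue = { number: issues.length + 1, title, state: "open" };
      issues.push(issue);
      return issue;
    },
  };
}

// Usage: a closed issue with the same title is ignored, so a new one is created.
const title = "Coacker Audit: my-project v1.0";
const tracker = memoryTracker([{ number: 1, title, state: "closed" }]);
const first = resolveParentIssue(tracker, title);
const second = resolveParentIssue(tracker, title);
console.log(first.number, second.number); // 2 2
```

Because the lookup runs before every create, restarting with the same `parentIssue` title reuses the tracking issue instead of duplicating it.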

Preflight Checks

When origin is set, Coacker runs preflight checks before any brain mode:

  1. Label initialization — All 13 coacker:* pipeline labels are auto-created with color coding if missing
  2. Parent issue resolution — Finds or creates the parent tracking issue

If either check fails, Coacker aborts rather than running without proper issue tracking.

Commands

All development commands use just (task runner):

just              # List all recipes
just check        # Type check all packages
just check-pkg brain  # Type check single package
just lint         # Lint
just lint-fix     # Auto-fix lint issues
just fmt          # Format
just test         # Run tests
just ci           # All checks (typecheck + lint + format check)
just run          # Run audit with default config

Evaluation

# LLM-as-Judge evaluation against benchmark
npx tsx scripts/eval-judge.ts output

# Multi-run comparison
npx tsx scripts/compare-runs.ts

# Retry failed subtasks
npx tsx scripts/retry-failed.ts

Benchmark

27 known bugs (4 Critical + 4 High + 12 Medium + 7 Low) from Gravity Chain Core Contracts. See scripts/benchmark-baseline.json.

Output

output/
├── state.json              — Brain state (crash recovery)
├── conversations/          — Full conversation logs per task
├── reports/                — Per-task structured reports
├── report.md               — Final consolidated audit report
├── validate/               — Validation results
├── fix/                    — Fix results
└── review/                 — Review results

Key Lessons (from 26 runs)

  • Inoculation is the most effective single optimization — always keep ON
  • Output format constraints hurt quality — improve input (enrichment), don't constrain output
  • Going beyond 25 subtasks increases false positives (FP) without improving Recall (diminishing returns)
  • SAST alone has a net negative effect, but the SAST + DeepDive combination improves Precision
  • Claude Code backend > AG IDE — zero timeouts, better Recall, no network dependency
  • Eval has +/-10% variance — only trust large, consistent effects
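
One way to apply the last lesson is to accept a change only when repeated runs all move in the same direction by more than the noise band. This is a hypothetical heuristic, not part of Coacker's eval scripts. It assumes the variance is measured in percentage points, and the scores in the usage lines are made up.

```typescript
// Assumption: the +/-10% eval variance is a band in percentage points.
const NOISE_BAND = 10;

// A change is trusted only if at least two runs all move in the same
// direction and each exceeds the noise band.
function isTrustworthy(baseline: number, runs: number[]): boolean {
  if (runs.length < 2) return false; // one run cannot show consistency
  const deltas = runs.map((score) => score - baseline);
  const sameDirection =
    deltas.every((d) => d > 0) || deltas.every((d) => d < 0);
  return sameDirection && deltas.every((d) => Math.abs(d) > NOISE_BAND);
}

// Usage with made-up scores.
console.log(isTrustworthy(44, [58, 61])); // true
console.log(isTrustworthy(44, [50, 39])); // false
```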

Tech Stack

  • TypeScript 5.7 (strict), ES2022 target, ESNext modules
  • pnpm workspaces, Vitest, ESLint + Prettier
  • Playwright (CDP), smol-toml, tsx
  • Slither + Aderyn (SAST), tree-sitter (AST)

License

MIT
