A deterministic, resumable, tournament-based orchestrator for LLM-driven software development.
Why AutoDev · Design Principles · Architecture · Tournaments · Quick Start
AutoDev is a typed, crash-safe, auditable orchestration engine that turns your existing Claude Code or Cursor subscription into a disciplined multi-agent development pipeline. It decomposes every feature request into a bounded finite state machine — explore → plan → critique → code → review → test → merge — and wraps the two highest-leverage decisions (the plan, and every individual diff) in a tournament-based self-refinement loop: a fresh critic identifies faults, a fresh author proposes a revision, a synthesizer merges the two, and N independent judges rank the variants via Borda count with a conservative tiebreak to the incumbent. Each role runs as a specialized agent with its own system prompt, model tier, and allow-listed tools — so no model ever evaluates its own output. The entire trajectory (delegations, tool calls, diffs, critiques, judge rankings, test results) is appended to a content-addressed JSONL ledger under .autodev/, making every run replayable, diffable, and auditable. If the process dies mid-task, autodev resume reconstructs state via ledger replay and continues at the exact FSM edge it left.
```bash
pip install ai-autodev
```

If you have shipped production code with a single-agent AI tool, you have hit all of these failure modes. AutoDev is designed around the assumption that each is a structural property of the single-agent model, not a prompt-engineering problem.
- **Self-grading is a known failure mode.** A single model evaluating its own output inherits the same blind spots that produced the output. AutoDev routes evaluation through independent judges with fresh context — a structured-refinement algorithm designed to close the generation-evaluation gap between what a model can produce and what it can reliably evaluate. Separation of concerns at the agent level, not just the prompt level.
- **Non-determinism in the model, determinism in the pipeline.** The orchestrator is a pure Python FSM. Every state transition is a ledger append with a SHA-256 content hash chained to the previous entry. Same inputs + same recorded adapter outputs → byte-identical final state. This is what makes replay-based debugging and tournament regression testing tractable.
- **Crash-safety is a contract, not a hope.** The plan ledger uses `filelock` (with `thread_local=False` so asyncio tasks serialize correctly), atomic `tmp → rename` writes, and CAS hash chaining. Corrupted or partial writes are detected on replay; `autodev doctor` will refuse to proceed against a torn ledger.
- **Typed contracts at every boundary.** Every data structure crossing a process or async boundary — delegations, agent responses, evidence bundles, plan snapshots, tournament pass results — is validated by pydantic v2 with `extra="forbid"`. No silent coercion. No "it worked on my laptop" schema drift between runs.
- **No API keys, no vendor lock-in.** Every LLM call is a subprocess invocation of `claude -p` or `cursor agent --print` against your existing subscription. The `PlatformAdapter` is a protocol with four methods; a third adapter is ~150 LOC. Swap in any CLI-accessible model without touching the orchestrator or tournament engine.
- **Cost is bounded and inspectable, not emergent.** Per-task guardrails (`max_tool_calls`, `max_duration_s`, `max_diff_bytes`), tournament hard caps (`max_rounds`, `num_judges`, `convergence_k`), a loop detector that fingerprints repeated invocations, and a `cost_budget_usd_per_plan` circuit breaker. Every run emits a cost projection before execution begins.
- **Extensibility via `entry_points`, not forks.** Custom QA gates, judge providers, and agent extensions load via `importlib.metadata.entry_points(group="autodev.plugins")`. Ship a separate wheel, get picked up at the next `autodev doctor`. No patched core, no monkey-patching.
- **Research-grounded, not vibes-based.** The tournament algorithm, convergence criteria, and Borda aggregation are faithful implementations of a published self-refinement approach — not invented here. We document the deviations (no per-call temperature control in subscription CLIs; we rely on fresh-context stochasticity) so you can reason about when the technique applies.
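The chained-append mechanic behind the ledger is small enough to sketch. This is an illustration, not AutoDev's actual ledger code: the field names `prev` and `hash`, the `GENESIS` sentinel, and the in-memory list standing in for the JSONL file are all assumptions.

```python
import hashlib
import json

def chained_append(ledger_lines: list[str], entry: dict) -> str:
    """Append an entry whose hash covers both its own payload and its
    predecessor's hash, forming a tamper-evident chain.
    (Illustrative sketch; field names are assumptions.)"""
    prev_hash = json.loads(ledger_lines[-1])["hash"] if ledger_lines else "GENESIS"
    payload = json.dumps(entry, sort_keys=True)  # canonical serialization
    digest = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    ledger_lines.append(json.dumps({"prev": prev_hash, "hash": digest, **entry}, sort_keys=True))
    return digest
```

In the real system the append happens under `filelock`; the point of the chain is that editing any earlier entry changes every downstream hash, which is what makes torn or tampered ledgers detectable on replay.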
| Principle | Mechanism |
|---|---|
| Separation of concerns at the agent level | 14 role-specialized agents (architect, explorer, domain_expert, developer, reviewer, test_engineer, critic_sounding_board, critic_drift_verifier, docs, designer, critic_t, architect_b, synthesizer, judge) — each with its own prompt, model tier, and tool allow-list |
| Typed contracts, validated at every boundary | pydantic v2 strict models for all cross-process data; ledger entries are signed with SHA-256 chain hashes |
| Deterministic orchestration over stochastic agents | Pure Python FSM drives transitions; LLM calls are leaves in the graph, never nodes |
| Append-only persistence, derived projections | plan-ledger.jsonl is the source of truth; plan.json is a derived snapshot. Never mutate. |
| Crash-safe by construction | filelock + atomic rename + replay-based reconstruction. autodev resume is idempotent. |
| Parallelism only where safe | Tournaments run judges concurrently via asyncio.gather capped by max_parallel_subprocesses; all other work is serial |
| Fresh context per tournament role | Every critic, author, synthesizer, and judge invocation is a fresh subprocess with no prior session state |
| Everything is a file | All state under .autodev/. No daemons, no sockets, no shared memory. Diffable in git. |
You describe the feature. AutoDev coordinates the rest.
```text
Build a REST API with user registration, login, and JWT auth.
```
```text
1. explorer      → scans the repo, reports structure, conventions, dependencies
2. domain_expert → domain guidance (auth patterns, secret management, JWT gotchas)
3. architect     → drafts a phased plan
4. Plan        ┌─ critic_t    → "here is everything wrong with this plan"
   Tournament →│  architect_b → revised plan B
   (up to 15)  │  synthesizer → plan AB = merge(A, B) with randomized labels
               └─ N judges    → rank [A, B, AB] via Borda; conservative tiebreak to A
5. critic_t → plan-gate: APPROVED / NEEDS_REVISION / REJECTED
6. For each task, serially:
   developer → produces diff_A
   QA gates  → syntax → lint → build → tests → secretscan
               (fail → retry developer with structured feedback, up to qa_retry_limit)
   reviewer  → correctness + architecture check
   test_eng  → writes tests, runs them
   Impl         ┌─ git worktree /a  ← diff_A
   Tournament  →│  git worktree /b  ← developer redo guided by critic
   (always-on)  │  git worktree /ab ← developer synthesis of A and B
                └─ judge ranks; winner merged to main; losers pruned
7. docs + knowledge updated
8. If retries exhausted → escalate to critic_sounding_board (pre-abort sanity check)
```
At every step, state is persisted to .autodev/. If you kill the process, autodev resume continues from the last ledger checkpoint.
```mermaid
stateDiagram-v2
    [*] --> pending
    pending --> in_progress: developer assigned
    pending --> skipped: user skips
    in_progress --> coded: developer output ready
    coded --> auto_gated: QA gates pass
    auto_gated --> reviewed: reviewer APPROVED
    reviewed --> tested: test_engineer completes
    tested --> tournamented: impl tournament finishes
    tournamented --> complete: evidence bundle written
    coded --> in_progress: QA gate failed (retry)
    auto_gated --> in_progress: reviewer NEEDS_CHANGES (retry)
    reviewed --> in_progress: tests failed (retry)
    tested --> in_progress: retry
    in_progress --> blocked: retries exhausted or guardrail breached
    coded --> blocked: guardrail breached
    auto_gated --> blocked: guardrail breached
    reviewed --> blocked: guardrail breached
    tested --> blocked: guardrail breached
    tournamented --> blocked: guardrail breached
    blocked --> in_progress: autodev resume
    complete --> [*]
    skipped --> [*]
    style pending fill:#e3f2fd
    style in_progress fill:#fff3e0
    style complete fill:#c8e6c9
    style blocked fill:#ffcdd2
    style skipped fill:#eeeeee
```
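The diagram's legal edges boil down to a small transition table. A sketch of how a pure-Python FSM can enforce them (illustrative only; AutoDev's real implementation will differ in shape):

```python
# Legal task-state transitions, transcribed from the diagram above.
TRANSITIONS: dict[str, set[str]] = {
    "pending": {"in_progress", "skipped"},
    "in_progress": {"coded", "blocked"},
    "coded": {"auto_gated", "in_progress", "blocked"},
    "auto_gated": {"reviewed", "in_progress", "blocked"},
    "reviewed": {"tested", "in_progress", "blocked"},
    "tested": {"tournamented", "in_progress", "blocked"},
    "tournamented": {"complete", "blocked"},
    "blocked": {"in_progress"},  # autodev resume
    "complete": set(),
    "skipped": set(),
}

def advance(state: str, target: str) -> str:
    """Move to target, or raise if the edge is not in the diagram."""
    if target not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state} -> {target}")
    return target
```

Because illegal edges raise rather than silently proceed, a replayed ledger that attempts an impossible transition is caught immediately.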
Everything is on disk, in a format you can grep, diff, and replay.
```text
.autodev/
├── config.json                  # Versioned pydantic schema
├── spec.md                      # Your feature intent (human-editable)
├── plan.json                    # Derived snapshot — DO NOT EDIT
├── plan-ledger.jsonl            # ★ Source of truth: append-only, CAS-chained
├── knowledge.jsonl              # Per-project lessons (ranked, deduped)
├── rejected_lessons.jsonl       # Block list — prevents re-learning loops
├── evidence/
│   ├── {task_id}-developer.json   # Prompt, response, diff, tool calls
│   ├── {task_id}-review.json      # Reviewer findings
│   ├── {task_id}-test.json        # Test command, stdout/stderr, pass/fail
│   ├── {task_id}-tournament.json  # Full round-by-round tournament trace
│   └── {task_id}.patch            # Applied unified diff
├── tournaments/
│   ├── plan-{plan_id}/
│   │   ├── initial_a.md         # The drafted plan before the tournament
│   │   ├── final_output.md      # What the tournament converged on
│   │   ├── history.json         # Per-pass winners, judge rankings, streak
│   │   └── pass_NN/
│   │       ├── version_a.md     # Incumbent
│   │       ├── critic.md        # Critique text
│   │       ├── version_b.md     # Author B's revision
│   │       ├── version_ab.md    # Synthesizer's merge
│   │       └── result.json      # Judge rankings + Borda scores
│   └── impl-{task_id}/
│       ├── a/ b/ ab/            # Git worktrees (pruned after merge)
│       └── ...
└── sessions/{session_id}/
    ├── events.jsonl             # structlog audit trail
    └── snapshot.json
```
Every decision is reconstructable from disk — why a judge ranked B above A, which critic feedback drove a retry, what the developer's first attempt looked like. This is what separates AutoDev from "prompt chains with extra steps."
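Replay verification is what lets a torn ledger be refused rather than trusted. A minimal sketch, assuming each JSONL entry records `prev` and `hash` fields (an illustration of the idea, not the actual schema):

```python
import hashlib
import json

def verify_chain(lines: list[str]) -> bool:
    """Recompute each entry's hash from its payload plus its predecessor's
    hash; any torn, reordered, or edited line breaks the chain."""
    prev = "GENESIS"
    for line in lines:
        entry = json.loads(line)        # a torn write fails to parse here
        claimed = entry.pop("hash")
        entry.pop("prev")
        payload = json.dumps(entry, sort_keys=True)
        if hashlib.sha256((prev + payload).encode()).hexdigest() != claimed:
            return False
        prev = claimed
    return True
```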
```bash
cd /your/project
autodev init --inline                    # renders agent files for both platforms, writes auto-resume config

# or specify a platform explicitly:
autodev init --inline --platform claude  # Claude Code only
autodev init --inline --platform cursor  # Cursor only
```

Then, in your agent (Claude Code or Cursor), describe what you want built:
```text
Build a REST API with user registration, login, and JWT auth.
```
Your agent reads delegation files from .autodev/delegations/, executes tasks with its own tools, writes responses to .autodev/responses/, and runs autodev resume to continue the pipeline. You never manually switch agents.
```bash
autodev init --platform claude_code   # or cursor
autodev plan "Add subtract(a, b) with a pytest test"
autodev execute
```

Every role spawns a fresh `claude -p` / `cursor agent --print` subprocess. Tournament judges always use subprocesses even in inline mode — fresh context per judge is load-bearing.
```bash
autodev status                     # phase, task FSM state, evidence counts
autodev logs [--session SID]       # tail events.jsonl
autodev doctor                     # health check (CLIs, config, plugins, guardrails)
autodev tournament --phase=plan \  # ad-hoc tournament runner (debugging)
    --input spec.md \
    --dry-run
autodev resume                     # Replays ledger, continues at last FSM edge
autodev prune --older-than 30d     # GC stale tournament artifacts
autodev reset --hard               # Destructive — clears plan state
```

```mermaid
flowchart TB
    subgraph CLI["CLI (click)"]
        direction LR
        init["init"]
        plan["plan"]
        execute["execute"]
        resume["resume"]
        status["status"]
        tournament_cmd["tournament"]
        doctor["doctor"]
        logs["logs"]
        plugins_cmd["plugins"]
        prune["prune"]
        reset["reset"]
    end
    subgraph Orchestrator["Orchestrator (FSM)"]
        direction TB
        fsm["Python FSM drives transitions<br/>LLM calls are leaves, not nodes"]
        plan_phase["PLAN phase"]
        execute_phase["EXECUTE phase"]
        fsm --> plan_phase
        fsm --> execute_phase
    end
    CLI --> Orchestrator
    subgraph Components
        direction LR
        tournament["Tournament Engine<br/>Borda aggregation<br/>conservative tiebreak"]
        registry["Agent Registry<br/>14 roles"]
        qa["QA Gates<br/>syntax / lint / build /<br/>test_runner / secretscan"]
        state["Durable State<br/>SHA-256 chained ledger<br/>+ evidence / tournaments"]
        guardrails["Guardrails<br/>duration / calls /<br/>diff-size / loop detector"]
        plugins_sub["Plugin Registry<br/>entry_points"]
    end
    Orchestrator --> Components
    Components --> PlatformAdapter
    subgraph Adapters["Platform Adapters (Protocol)"]
        direction LR
        inline["Inline<br/>(delegation files)"]
        claude["Claude Code<br/>claude -p"]
        cursor["Cursor<br/>cursor agent --print"]
    end
    PlatformAdapter --> Adapters
    CLI:::cli
    Orchestrator:::orch
    Components:::comp
    Adapters:::adap
    classDef cli fill:#e1f5fe
    classDef orch fill:#fff3e0
    classDef comp fill:#e8f5e9
    classDef adap fill:#f3e5f5
    note1["Tournament judges always use subprocess adapters<br/>for context independence — even in inline mode."]
```
Serial by default. One specialist at a time. Parallelism only inside the tournament — N judges via asyncio.gather, capped by max_parallel_subprocesses. No shared mutable state across agents.
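The judge fan-out pattern — concurrent but capped — is a standard asyncio idiom. A sketch (the coroutine body is a stand-in for the subprocess call; names are illustrative):

```python
import asyncio

async def run_judges(judge_inputs: list[str], max_parallel: int) -> list[str]:
    """Run one judge per input, at most max_parallel concurrently
    (mirrors the max_parallel_subprocesses cap)."""
    sem = asyncio.Semaphore(max_parallel)

    async def one_judge(prompt: str) -> str:
        async with sem:               # cap concurrent subprocesses
            await asyncio.sleep(0)    # stand-in for the real subprocess call
            return f"ranking-for:{prompt}"

    # gather preserves input order, so judge results line up with inputs
    return await asyncio.gather(*(one_judge(p) for p in judge_inputs))
```

A plain `asyncio.Semaphore` is enough here because judges share no mutable state: the only coordination needed is the concurrency cap itself.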
You do not manually switch between these. The orchestrator invokes them.
| Agent | Role | Invoked |
|---|---|---|
| `architect` | Plan drafting, delegation decisions | PLAN phase |
| `explorer` | Codebase reconnaissance | Before planning |
| `domain_expert` | Domain research | During planning |
| `developer` | Implements one task | EXECUTE phase |
| `reviewer` | Correctness + architecture review | After each task |
| `test_engineer` | Writes & runs tests | After each task |
| `critic_sounding_board` | Pre-escalation sanity check | On retry exhaustion |
| `critic_drift_verifier` | Post-phase plan-vs-reality drift check | Before phase_complete |
| `docs` | Post-phase documentation | End of each phase |
| `designer` | UI scaffolds (opt-in) | UI work |
| `critic_t` | Plan-gate + tournament critic (finds problems, no fixes) | PLAN + tournaments |
| `architect_b` | Tournament revision agent | Tournaments |
| `synthesizer` | Merges A + B with randomized labels | Tournaments |
| `judge` | Ranks A/B/AB via Borda count | Tournaments |
Agent prompts live in `src/autodev/agents/prompts/<name>.md`, each with YAML frontmatter declaring role, model tier, and tool allow-list. Python drives delegation — prompts contain no inline `@agent` handoffs, so agents stay focused on their assigned task. Tournament role prompts (critic_t, architect_b, synthesizer, judge) live in `src/autodev/tournament/prompts.py`. Tool allow-lists are enforced via `--allowed-tools` (Claude Code) or prompt-level constraints (Cursor).
After the architect drafts a plan, and after every developer task passes QA gates, AutoDev runs a self-refinement tournament:
```mermaid
flowchart TD
    A["Incumbent A"] --> B["critic_t<br/>What's wrong?"]
    B --> C["architect_b<br/>Revised B"]
    C --> D["synthesizer<br/>Merge X,Y → AB"]
    D --> E["N judges<br/>Rank A,B,AB"]
    E --> F["Borda aggregation"]
    F --> G{winner == A?}
    G -->|Yes| H["streak++"]
    G -->|No| I["streak = 0<br/>incumbent = winner"]
    H --> J{streak ≥ k?}
    J -->|Yes| K["CONVERGED"]
    J -->|No| L["Next pass"]
    I --> L
    L --> B
    style A fill:#e3f2fd
    style K fill:#c8e6c9
    style J fill:#fff9c4
    style G fill:#ffecb3
```
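The loop in the diagram reads as straight-line Python. In this sketch, `critique`, `revise`, `synthesize`, and `rank` are hypothetical callables standing in for the agent invocations:

```python
from typing import Callable

def run_tournament(
    incumbent: str,
    critique: Callable[[str], str],
    revise: Callable[[str, str], str],
    synthesize: Callable[[str, str], str],
    rank: Callable[[dict[str, str]], str],  # returns the winning label
    convergence_k: int = 2,
    max_rounds: int = 15,
) -> str:
    streak = 0
    for _ in range(max_rounds):
        fault_report = critique(incumbent)
        version_b = revise(incumbent, fault_report)
        version_ab = synthesize(incumbent, version_b)
        winner = rank({"A": incumbent, "B": version_b, "AB": version_ab})
        if winner == "A":
            streak += 1                 # incumbent defended its seat
            if streak >= convergence_k:
                return incumbent        # CONVERGED
        else:
            streak = 0
            incumbent = {"B": version_b, "AB": version_ab}[winner]
    return incumbent                    # max_rounds cap reached
```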
**Plan tournament:** `num_judges=3`, `convergence_k=2`, `max_rounds=15` by default. Converges when the incumbent wins two passes in a row.

**Implementation tournament:** `num_judges=1`, `convergence_k=1`, `max_rounds=3` by default. Variants are materialized in git worktrees:
- `/a` ← developer's initial diff
- `/b` ← developer re-run guided by critic_t feedback
- `/ab` ← developer synthesis of A and B
- Judge picks the winner; the winner's diff is merged to the main worktree; `/a`, `/b`, `/ab` are pruned.
The always-on default is tuned for cost: 4 subprocess calls per round × max 3 rounds = a hard ceiling of 12 extra LLM calls per task. Disable per-run with `autodev execute --no-impl-tournament` or per-project in `config.json`.
- Fresh context per role — no in-context contamination. The critic cannot see the author's rationale; the judge cannot see the critic's critique; labels are randomized before judges see variants.
- Borda aggregation with conservative tiebreak — ties default to the incumbent. Prevents thrash on marginal improvements.
- Convergence criterion — the incumbent must win `k` consecutive rounds. Prevents infinite refinement loops.
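Borda aggregation with the conservative tiebreak is small enough to show in full. This is a sketch consistent with the behavior described above, not the project's exact code:

```python
def borda_winner(rankings: list[list[str]], incumbent: str = "A") -> str:
    """Each judge submits a full ranking, best first. A variant earns
    (n_candidates - position - 1) points per ballot; a tie for first
    goes to the incumbent, so "do nothing" can win outright."""
    candidates = rankings[0]
    scores = {c: 0 for c in candidates}
    for ballot in rankings:
        for pos, cand in enumerate(ballot):
            scores[cand] += len(candidates) - pos - 1
    best = max(scores.values())
    leaders = [c for c in candidates if scores[c] == best]
    return incumbent if incumbent in leaders else leaders[0]
```

With three judges over {A, B, AB}, a variant needs a genuine scoring majority to unseat A: any tie at the top resolves to the incumbent, which is what prevents thrash on marginal improvements.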
See docs/design_documentation/tournaments.md for the full algorithm and proof sketches.
| | AutoDev | Single-agent AI coding | Prompt chains |
|---|---|---|---|
| Multiple specialized agents | ✅ 14 roles | ❌ | Partial |
| Plan tournament before coding | ✅ Borda-ranked, converges in `k` passes | ❌ | ❌ |
| Implementation tournament per task | ✅ A/B/AB judged with git worktree isolation | ❌ | ❌ |
| Reviewer ≠ author | ✅ enforced at agent level | ❌ | ❌ |
| QA gates (lint, build, test, secrets) | ✅ with bounded retry + escalation | ❌ | Ad-hoc |
| Append-only CAS ledger | ✅ SHA-256 chained, replay-safe | ❌ | ❌ |
| Crash-safe resume | ✅ `autodev resume` | ❌ | ❌ |
| Works inside your existing agent | ✅ inline mode | N/A | ❌ |
| Plugin ecosystem (`entry_points`) | ✅ | ❌ | ❌ |
| No API keys | ✅ subscription-based | — | — |
| Cost guardrails (duration / calls / budget) | ✅ per-task + per-plan | ❌ | ❌ |
| Typed contracts (pydantic v2 strict) | ✅ everywhere | ❌ | ❌ |
```bash
# Install once, globally
pip install ai-autodev

# Opt in per project
cd /your/project
autodev init --inline
```

The `--inline` flag renders agent files for both platforms (`.claude/agents/<role>.md` and `.cursor/rules/<role>.mdc`) and writes a managed auto-resume section instructing your agent to:
- Watch `.autodev/delegations/` for work.
- Execute tasks using its own tools.
- Write a JSON response to the specified `response_path` (validated against the `AgentResult` pydantic schema).
- Run `autodev resume` to continue.
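A response file that satisfies a strict schema like `AgentResult` might look like the JSON below. The field names are assumptions for illustration, and the sketch uses a hand-rolled check where AutoDev uses pydantic with `extra="forbid"`:

```python
import json

REQUIRED = {"task_id", "status", "summary"}  # illustrative field names

def validate_agent_result(raw: str) -> dict:
    """Reject both missing and unknown keys, mimicking pydantic's
    extra="forbid" behavior for the response contract."""
    data = json.loads(raw)
    missing = REQUIRED - data.keys()
    extra = data.keys() - REQUIRED
    if missing or extra:
        raise ValueError(f"missing={sorted(missing)} extra={sorted(extra)}")
    return data
```

Rejecting unknown keys is the load-bearing part: a misspelled field fails loudly at the boundary instead of silently dropping data between the agent and the orchestrator.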
The managed section is idempotent — running `init --inline` again updates it without clobbering your own content.
```bash
git clone https://github.com/mohamedameen/autodev
cd autodev
uv sync
uv run pytest -v   # 733 tests, 94% coverage
```

`.autodev/config.json` is a versioned, strict pydantic schema. Model defaults are platform-aware — Claude Code uses model aliases (opus/sonnet/haiku) that resolve to the latest version, while Cursor uses explicit models with `auto` for intelligent selection and automatic fallback on rate limits. Each agent also has a configurable `max_turns` — the number of turns the agent gets per invocation (tool-heavy roles like developer get more turns; text-only tournament roles get 1).
> **Note:** On Cursor, `architect` and `architect_b` default to `opus` with automatic fallback to `auto` when rate limited. Roles like `explorer`, `developer`, and `test_engineer` default to `auto` for intelligent model selection.
| Command | Purpose |
|---|---|
| `autodev init [--inline] [--platform …] [--force]` | Scaffold `.autodev/`, render agent files, write auto-resume config |
| `autodev plan "<intent>"` | PLAN phase: explore → domain_expert → architect-draft → plan tournament → critic_t-gate → persist |
| `autodev execute [--task ID] [--dry-run] [--no-impl-tournament]` | EXECUTE phase: developer → QA gates → review → tests → impl tournament → advance |
| `autodev resume` | Replay ledger, continue at last FSM edge |
| `autodev status` | Current phase, task FSM states, evidence counts, knowledge summary |
| `autodev tournament --phase=plan\|impl --input FILE [--dry-run]` | Ad-hoc tournament runner (debugging, experimentation) |
| `autodev doctor` | CLI detection, config validation, plugin discovery, guardrail configuration |
| `autodev logs [--session SID]` | Tail the structlog event stream |
| `autodev plugins` | List discovered plugins (QA gates, judge providers, agent extensions) |
| `autodev prune [--older-than 30d]` | GC stale tournament artifacts |
| `autodev reset [--hard]` | Destructive: clear `.autodev/plan*` (prompts for confirmation) |
AutoDev consumes your existing subscription quota — no API keys, no per-token billing, no surprise invoices. Every call is a subprocess invocation against your logged-in Claude Code or Cursor session.
Rough upper bound per plan (approximate, varies by task complexity):
- Plan phase: 5 – 8 calls (explorer + domain_expert + architect + plan-tournament × up to 15 rounds × ~4 calls + critic_t)
- Per task: 4 – 7 calls (developer + retries + reviewer + test_engineer + impl-tournament × up to 3 rounds × ~4 calls)
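The projection the orchestrator prints is simple arithmetic over the config. A sketch of an upper-bound estimate — the per-round and fixed-call counts below are assumptions drawn from the bullets above, not the exact formula AutoDev uses:

```python
def projected_call_ceiling(
    num_tasks: int,
    plan_max_rounds: int = 15,
    impl_max_rounds: int = 3,
    calls_per_round: int = 4,   # critic + author + synthesizer + judge
    plan_fixed: int = 4,        # explorer, domain_expert, architect, critic_t gate
    per_task_fixed: int = 3,    # developer, reviewer, test_engineer (before retries)
) -> int:
    """Worst-case LLM call count for one plan, ignoring QA retries."""
    plan_calls = plan_fixed + plan_max_rounds * calls_per_round
    per_task = per_task_fixed + impl_max_rounds * calls_per_round
    return plan_calls + num_tasks * per_task
```

Note how the impl tournament contributes exactly `3 × 4 = 12` extra calls per task, matching the hard ceiling quoted earlier; in practice plan tournaments converge well before `max_rounds`, so the real count sits far below this bound.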
Cost reduction levers:
```bash
autodev execute --no-impl-tournament   # skip impl tournaments entirely
```

or in `config.json`:

```json
"tournaments": {
  "plan": { "num_judges": 1, "max_rounds": 5 },
  "impl": { "enabled": false },
  "auto_disable_for_models": ["opus", "sonnet"]
}
```

Before `autodev execute`, the orchestrator prints a projected call count. You can abort before anything runs.
See docs/design_documentation/cost.md for the full breakdown and tuning guide.
```toml
# setup.cfg / pyproject.toml of your plugin package
[project.entry-points."autodev.plugins"]
my_gate  = "my_package.gates:MyCustomGate"    # QAGatePlugin
my_judge = "my_package.judges:MyJudge"        # JudgeProviderPlugin
my_agent = "my_package.agents:MyAgentSpec"    # AgentExtensionPlugin
```

Install the wheel, run `autodev doctor` — your plugin is live. The registry validates against the `QAGatePlugin`, `JudgeProviderPlugin`, and `AgentExtensionPlugin` protocols; invalid plugins are skipped with a warning (no hard fail).
Add a new platform adapter by implementing the PlatformAdapter protocol (four methods: init_workspace, execute, parallel, healthcheck). See src/autodev/adapters/claude_code.py (~250 LOC) as a reference.
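The four-method surface can be sketched as a `typing.Protocol`. Only the method names come from the docs above; the signatures here are illustrative assumptions:

```python
from typing import Any, Protocol

class PlatformAdapter(Protocol):
    """Anything with these four methods can drive a model CLI.
    Signatures are guesses for illustration."""
    def init_workspace(self, project_dir: str) -> None: ...
    def execute(self, role: str, prompt: str) -> dict[str, Any]: ...
    def parallel(self, calls: list[tuple[str, str]]) -> list[dict[str, Any]]: ...
    def healthcheck(self) -> bool: ...

class EchoAdapter:
    """Toy adapter showing that conformance is purely structural."""
    def init_workspace(self, project_dir: str) -> None:
        pass
    def execute(self, role: str, prompt: str) -> dict[str, Any]:
        return {"role": role, "output": prompt.upper()}
    def parallel(self, calls: list[tuple[str, str]]) -> list[dict[str, Any]]:
        return [self.execute(r, p) for r, p in calls]
    def healthcheck(self) -> bool:
        return True

def run(adapter: PlatformAdapter, role: str, prompt: str) -> dict[str, Any]:
    return adapter.execute(role, prompt)
```

A real adapter would shell out to the platform CLI inside `execute` and parse its JSON output; the orchestrator only ever sees the protocol.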
| Platform | Mode | Mechanism |
|---|---|---|
| Claude Code | Inline | .claude/CLAUDE.md managed section, delegation files |
| Claude Code | Standalone | claude -p "<prompt>" --output-format json subprocess per role |
| Cursor | Inline | .cursor/rules/autodev.mdc with alwaysApply: true |
| Cursor | Standalone | cursor agent "<prompt>" --print --output-format json |
Platform selection precedence:
1. `--inline` or `--platform` CLI flag
2. `AUTODEV_PLATFORM` environment variable
3. `config.json` `platform` (when not `"auto"`)
4. Auto-detect: `claude --version` succeeds → `claude_code`; else `cursor --version` → `cursor`; else `autodev doctor` surfaces a diagnostic
AutoDev is opinionated. Things it intentionally does not do:
- **No in-context delegation.** Agents do not `@mention` each other mid-conversation. The Python FSM delegates; the LLM stays focused.
- **No background agents.** Everything is serial except tournament judges. No race conditions by construction.
- **No per-call temperature control.** `claude -p` and `cursor agent --print` don't expose it; the tournament relies on fresh-context stochasticity instead. Documented deviation from the reference algorithm.
- **No implicit auto-merge to `main`.** AutoDev commits to worktrees and updates the main worktree; pushing is your call.
- **No telemetry.** Nothing phones home. Events stay in `.autodev/sessions/`.
- **No monkey-patching the platform CLI.** We invoke the public CLI surface (`claude -p`, `cursor agent --print`) and parse documented JSON output. If the CLI drifts, the adapter breaks loudly, not silently.
```bash
git clone https://github.com/mohamedameen/autodev
cd autodev
uv sync

# Full test suite
uv run pytest -v   # 733 passing, 94% coverage

# Property-based tests (Borda math)
uv run pytest --hypothesis-show-statistics tests/test_tournament_borda_aggregation.py

# Targeted
uv run pytest tests/test_state_ledger.py -v

# Integration tests (mocked adapters)
uv run pytest tests/integration/ -v

# Live smoke tests (requires claude CLI logged in)
AUTODEV_LIVE=1 uv run pytest tests/ -k live -v

# Lint / typecheck
uv run ruff check src/
uv run mypy src/
```

The test suite includes (94% coverage, all source files ≥80%):
- Unit: ledger atomicity, Borda aggregation, parse_ranking, plan manager under contention, knowledge ranking, adapter type round-trips, CLI commands, QA gates, config defaults, autologging
- Integration: tiny-repo E2E with stubbed adapters for determinism, impl tournament full-flow with git worktrees
- Property: Borda invariants via Hypothesis
- Replay: tournament determinism against recorded reference fixtures
- Live: opt-in smoke tests against real Claude Code / Cursor CLIs (`AUTODEV_LIVE=1`)
- `docs/architecture.md` — subsystems, FSM states, data flow
- `docs/design_documentation/tournaments.md` — algorithm, convergence, configuration
- `docs/design_documentation/adapters.md` — platform adapter contract, inline mode, writing a new adapter
- `docs/design_documentation/agents.md` — all 14 agent roles, prompts, tool maps
- `docs/design_documentation/cost.md` — subscription cost model, tuning guide
- `examples/subtract/` — smoke test: add a `subtract(a, b)` function to a tiny repo
- `examples/jwt_auth/` — realistic spec: build JWT authentication end-to-end
AutoDev combines two threads of research:
- Self-refinement with independent evaluators. Iterative LLM improvement closes the generation-evaluation gap when a fresh critic identifies faults, a fresh author proposes a revision, a synthesizer merges the two, and N fresh judges rank the variants via Borda count with a conservative tiebreak — making "do nothing" a first-class winning outcome. AutoDev implements this algorithm as its plan and implementation tournaments.
- Coordinator-led multi-agent orchestration. Serial delegation from a planning coordinator to specialized workers (developer, reviewer, test engineer, critic) — with gates between phases, evidence persisted per task, and bounded retry before escalation — produces more reliable output than monolithic single-agent systems. AutoDev adopts this pattern for its EXECUTE phase.
GPL-3.0 — see LICENSE.
```jsonc
{
  "schema_version": "1.0.0",
  "platform": "auto",  // "claude_code" | "cursor" | "inline" | "auto"
  "agents": {
    // Claude Code defaults (aliases resolve to latest):
    "architect": { "model": "opus", "max_turns": 5 },
    "explorer": { "model": "haiku", "max_turns": 3 },
    "domain_expert": { "model": "sonnet", "max_turns": 3 },
    "developer": { "model": "sonnet", "max_turns": 10 },
    "reviewer": { "model": "sonnet", "max_turns": 3 },
    "test_engineer": { "model": "sonnet", "max_turns": 5 },
    "critic_t": { "model": "sonnet", "max_turns": 1 },
    "architect_b": { "model": "sonnet", "max_turns": 5 },
    "synthesizer": { "model": "sonnet", "max_turns": 1 },
    "judge": { "model": "sonnet", "max_turns": 1 },
    "critic_sounding_board": { "model": "sonnet", "max_turns": 3 },
    "critic_drift_verifier": { "model": "sonnet", "max_turns": 3 },
    "docs": { "model": "sonnet", "max_turns": 3 },
    "designer": { "model": "sonnet", "max_turns": 3 }
    // Cursor defaults (with rate limit fallback):
    // "architect": { "model": "opus" },   // → auto on 429
    // "explorer": { "model": "auto" },    // auto-selects best model
    // "developer": { "model": "auto" },   // auto-selects best model
    // etc.
  },
  "tournaments": {
    "plan": { "enabled": true, "num_judges": 3, "convergence_k": 2, "max_rounds": 15 },
    "impl": { "enabled": true, "num_judges": 1, "convergence_k": 1, "max_rounds": 3 },
    "max_parallel_subprocesses": 3,
    "auto_disable_for_models": ["opus"]  // gains plateau at high-tier models that can reliably self-evaluate
  },
  "qa_gates": {
    "syntax_check": true,
    "lint": true,
    "build_check": true,
    "test_runner": true,
    "secretscan": true,
    "sast_scan": false,
    "mutation_test": false
  },
  "qa_retry_limit": 3,
  "guardrails": {
    "max_tool_calls_per_task": 60,
    "max_duration_s_per_task": 900,
    "max_diff_bytes": 5242880,
    "cost_budget_usd_per_plan": null
  },
  "hive": { "enabled": true, "path": "~/.local/share/autodev/shared-learnings.jsonl" }
}
```