Multi-agent observation framework that orchestrates Claude Code and Cursor to build Supabase apps, logging every decision and failure to analyze where AI coding agents break down.
This is a measurement tool. It answers the question: when AI agents are given a task to build an app with Supabase and minimal hand-holding, where do they still get things wrong?
By logging every step, tool call, verification verdict, and smoke test result, we get a structured dataset that shows:
- Knowledge gaps: which tasks fail consistently? If RLS policy steps fail in 8 out of 10 runs with similar errors, that points to a gap in training data or documentation, not randomness.
- Recovery effectiveness: when the verifier triggers a web search, do the findings actually help the retry succeed? If searches for "supabase realtime" never lead to passing retries, the available docs aren't good enough.
- Tool & model comparison: different agents and models implement the same prompt, so we can compare which ones use more Bash calls, read more files before editing, or recover better from failures.
The intended output is a map of where documentation, examples, or training data need improvement so that these tools build Supabase apps correctly.
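As a rough illustration of the "knowledge gaps" question, the sketch below computes a failure rate per build phase from a list of step records. The record shape is an assumption: `build_phase` is a field named in this README, while `passed` is a hypothetical boolean standing in for whatever the verifier verdict reduces to.

```python
from collections import defaultdict


def failure_rate_by_phase(steps):
    """Return {build_phase: fraction of steps in that phase that failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for step in steps:
        phase = step["build_phase"]
        totals[phase] += 1
        if not step["passed"]:  # `passed` is a hypothetical field name
            failures[phase] += 1
    return {phase: failures[phase] / totals[phase] for phase in totals}


# Hypothetical step records, not real run data.
sample_steps = [
    {"build_phase": "schema", "passed": False},
    {"build_phase": "schema", "passed": False},
    {"build_phase": "schema", "passed": True},
    {"build_phase": "frontend", "passed": True},
]
rates = failure_rate_by_phase(sample_steps)
```

A phase whose rate stays high across many runs is a candidate knowledge gap; a phase that fails only occasionally is more likely noise.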
- Python 3.9+
- Claude Code CLI installed and authenticated
- Cursor installed (if using Cursor as implementer)
- A Supabase project for logging
# Clone and install core dependencies (python-dotenv, supabase)
pip install -r requirements.txt
# For browser tests (optional)
pip install playwright httpx
playwright install chromium
# Configure environment
cp .env.example .env
# Edit .env with your Supabase credentials (for logging)
Run the migration to create log tables in your Supabase project:
# Copy contents of migration.sql and run in Supabase SQL Editor
# Dashboard > SQL Editor > New Query > Paste > Run
This creates:
- `orchestrator_runs`: one row per orchestration session
- `orchestrator_steps`: each plan/implement/verify phase
- `orchestrator_events`: tool calls, file writes, errors
Verify everything is configured correctly:
python preflight.py
This checks CLI tools, Supabase connectivity, and database schema.
# Basic run (uses Claude for planning/verification, Cursor for implementation)
python orchestrator.py "Build a todo app with Supabase auth"
# With runtime testing against a Supabase project
python orchestrator.py "Build a todo app with auth" \
--supabase-url https://xxx.supabase.co \
--supabase-anon-key xxx \
--supabase-service-key xxx
# List previous runs
python orchestrator.py --list-runs
# Resume a failed run from step 3
python orchestrator.py --resume abc123 --start-step 3
If the orchestration is too thin (no verification, no RLS testing, no replanning), the agents produce code that looks right but doesn't work, and you can't tell where it broke. If the orchestration is too thick (the system prompt includes the exact SQL, the exact config entries, the exact implementation patterns), the agents just follow instructions and you learn nothing about their actual knowledge.
The goal is enough structure that failures are meaningful, not so much that we are hiding them. The orchestration catches failures and gives agents a chance to recover, but it doesn't prevent failures from happening in the first place. Logging captures every step of this — what failed, how the agent tried to recover, and whether it succeeded. That's the dataset.
You: "Build a Supabase todo app with auth"
                    │
                    ▼
        ┌───────────────────────┐
        │  Python Orchestrator  │
        └───────────┬───────────┘
                    │
    ┌───────────────┼───────────────┐
    │               │               │
    ▼               ▼               ▼
┌────────┐    ┌──────────┐    ┌──────────┐
│ Agent  │    │  Agent   │    │  Agent   │
│ Plans  │───▶│Implements│───▶│ Verifies │
└────────┘    └──────────┘    └────┬─────┘
    ▲                              │
    │         ┌────────────────────┤
    │         ▼                    ▼
    │    [caveats?]             [retry]
    │         │                    │
    │         ▼                    │
    │   ┌──────────┐               │
    └───│ Replans  │◀──────────────┘
        │if needed │
        └──────────┘
                    │
                    ▼
        ┌───────────────────────┐
        │  Supabase JSONB logs  │
        └───────────────────────┘
One prompt in → fully built project + complete observation dataset out.
For each step, the orchestrator runs:
- Plan: Agent generates a step-by-step implementation plan tagged with build_phase (setup, schema, backend, frontend, testing, deployment)
- Implement: Agent builds the step
- Verify: Agent checks the work and returns a verdict
- Resolve: Based on the verdict:
| Verdict | What happens |
|---|---|
| PROCEED | Run replan checkpoint, then next step |
| RETRY | Append issues, re-run implementation |
| WEB_SEARCH | Search docs, append findings, retry |
| RUN_DIAGNOSTIC | Run a command (npx tsc, npm test), append output, retry |
| SKIP | Skip step with reason |
| MODIFY_PLAN | Trigger replan checkpoint |
- Replan Checkpoint: After step completion, evaluate if remaining steps need adjustment. If implementation diverged, regenerate remaining steps. Completed steps stay locked.
- Log: Everything goes to Supabase
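The "completed steps stay locked" rule of the replan checkpoint can be pictured as a list splice. This is only a sketch: `apply_replan` and its arguments are hypothetical names, and the real orchestrator may represent plans differently.

```python
def apply_replan(plan, completed_count, new_remaining):
    """Keep the first `completed_count` steps untouched and swap in new remaining steps."""
    if not 0 <= completed_count <= len(plan):
        raise ValueError("completed_count must be between 0 and len(plan)")
    locked = plan[:completed_count]  # already-built steps are never rewritten
    return locked + list(new_remaining)


# Hypothetical plan: two steps are done, the rest gets regenerated.
original_plan = ["init project", "create schema", "add auth UI", "deploy"]
updated_plan = apply_replan(
    original_plan,
    completed_count=2,
    new_remaining=["add RLS policies", "add auth UI", "deploy"],
)
```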
Loop controls:
- resolution_count: at most 7 resolution actions (retry, search, diagnostic) per step
- Replan: counted separately from the resolution budget; runs after a step passes
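The resolve loop and its budget can be sketched as follows. It assumes that `implement` and `verify` are plain callables supplied by the caller and that a spent budget is reported with a `BUDGET_EXHAUSTED` label; those names are illustrative and do not come from the orchestrator itself.

```python
MAX_RESOLUTIONS = 7  # per-step budget described in the loop controls above

# Verdicts that consume the resolution budget and trigger another attempt.
RESOLUTION_VERDICTS = {"RETRY", "WEB_SEARCH", "RUN_DIAGNOSTIC"}


def resolve_step(implement, verify, max_resolutions=MAX_RESOLUTIONS):
    """Alternate implement/verify until a non-resolution verdict or the budget runs out.

    Returns (final_verdict, resolution_count).
    """
    context = []  # issues, search findings, or diagnostic output fed to the next attempt
    resolution_count = 0
    while True:
        implement(context)
        verdict, notes = verify()
        if verdict not in RESOLUTION_VERDICTS:
            return verdict, resolution_count  # PROCEED, SKIP, or MODIFY_PLAN
        if resolution_count >= max_resolutions:
            return "BUDGET_EXHAUSTED", resolution_count  # hypothetical label
        resolution_count += 1
        context.append((verdict, notes))


# Usage with a scripted verifier: two resolution actions, then success.
scripted = iter([("RETRY", "missing import"), ("WEB_SEARCH", "docs hint"), ("PROCEED", "")])
result = resolve_step(lambda ctx: None, lambda: next(scripted))
```

Keeping the replan checkpoint outside this loop matches the text: replanning happens after a step passes and does not draw on the same budget.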
After implementation, the orchestrator runs multiple verification layers to catch different failure modes:
| Test Layer | What It Tests | When It Runs |
|---|---|---|
| API Verification | Tables exist, REST endpoints respond | Per-step, on schema changes |
| RLS Tests | Row Level Security policies enforce correctly | Per-step, when step mentions RLS/policies |
| Smoke Test | Build succeeds, app starts, auth works, storage works | After all steps complete |
| Edge Function Tests | Edge functions deploy and execute | Per-step, on backend steps with functions |
| Playwright Browser Tests | E2E user flows work in real browser | After smoke test passes |
Test execution flow:
During Implementation (per-step)
    │
    ├── API Verification ──▶ On schema steps (checks tables exist)
    │
    └── RLS Tests ─────────▶ On steps mentioning RLS/policies
                             (retry loop within step resolution)

After All Steps Complete
    │
    ▼
Smoke Test ──────▶ Fix Loop (max 2 retries)
    │
    ▼
Browser Tests ────▶ Fix Loop (max 2 retries)
    │
    ▼
Results logged to Supabase
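One way to picture the per-step triggers in the flow above is a small selector that maps a step to the test layers it should run. The keyword matching here is an assumption about what "mentions RLS/policies" means; the function name and layer labels are hypothetical.

```python
import re

# Assumed keyword patterns; the orchestrator's actual matching rules may differ.
RLS_PATTERN = re.compile(r"\b(rls|row level security|polic(?:y|ies))\b", re.IGNORECASE)
EDGE_FUNCTION_PATTERN = re.compile(r"\bedge functions?\b", re.IGNORECASE)


def per_step_test_layers(build_phase, description):
    """Return the per-step test layers that apply to one plan step."""
    layers = []
    if build_phase == "schema":
        layers.append("api_verification")  # checks that tables exist
    if RLS_PATTERN.search(description):
        layers.append("rls_tests")
    if build_phase == "backend" and EDGE_FUNCTION_PATTERN.search(description):
        layers.append("edge_function_tests")
    return layers


schema_layers = per_step_test_layers("schema", "Create todos table with RLS policies")
backend_layers = per_step_test_layers("backend", "Deploy an edge function for reminders")
```

The smoke test and browser tests are not selected here because they run once, after all steps complete.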
Playwright browser tests verify:
- Auth redirect — unauthenticated users see login
- Login flow — user can sign in
- Create resource — authenticated CRUD works
- Realtime sync — User B sees User A's changes without refresh (requires two browser contexts)
All test results are logged with pass/fail status, error messages, and duration for analysis.
python orchestrator.py [prompt] [options]
| Argument | Description |
|---|---|
| `prompt` | What to build (the project goal) |

| Argument | Description |
|---|---|
| `--project-dir DIR` | Directory to create project in (default: auto-generated) |

| Argument | Description |
|---|---|
| `--resume RUN_ID` | Resume a previous run |
| `--start-step N` | Start from step N (with `--resume`) |

| Argument | Description |
|---|---|
| `--max-retries N` | Max RETRY verdicts per step (default: 2) |
| `--skip-smoke-test` | Skip the smoke test phase |
| `--encourage-web-search` | Encourage agents to use WebSearch proactively |

| Argument | Description |
|---|---|
| `--planner {claude,cursor}` | Tool for planning (default: claude) |
| `--implementer {claude,cursor}` | Tool for implementation (default: cursor) |
| `--verifier {claude,cursor}` | Tool for verification (default: claude) |

| Argument | Description |
|---|---|
| `--claude-model MODEL` | Model for Claude Code |
| `--cursor-model MODEL` | Model for Cursor Agent |

Inject phase-specific guidance into implementation steps.
| Argument | Description |
|---|---|
| `--skills-mode {none,passive,on-demand}` | Injection mode (default: none) |
| `--skills-source PATH` | Path to skills directory (default: ./skills) |
| `--skills-filter {all,phase-matched}` | File selection strategy (default: phase-matched) |

Modes:
- `none`: No skills injection
- `passive`: Append skill content to system prompt
- `on-demand`: Copy skills directory to project, add prompt hint

Filters:
- `phase-matched`: Load `{build_phase}.md` matching the current step's phase
- `all`: Always load `all.md` regardless of phase

Build phases: setup, schema, backend, frontend, testing, deployment, fix
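The interaction between mode and filter can be sketched as two small functions. The function names and the wording of the on-demand hint are assumptions, and the directory copy performed in on-demand mode is left out.

```python
from pathlib import Path


def select_skill_file(skills_source, skills_filter, build_phase):
    """Pick the guidance file for a step: all.md, or the file named after its phase."""
    name = "all.md" if skills_filter == "all" else f"{build_phase}.md"
    return Path(skills_source) / name


def build_system_prompt(base_prompt, mode, skills_source, skills_filter, build_phase):
    """Return the system prompt for an implementation step under a given skills mode."""
    if mode == "none":
        return base_prompt
    if mode == "passive":
        skill_path = select_skill_file(skills_source, skills_filter, build_phase)
        if skill_path.is_file():
            return base_prompt + "\n\n" + skill_path.read_text(encoding="utf-8")
        return base_prompt  # no matching file: fall back to the plain prompt
    if mode == "on-demand":
        # Copying the skills directory into the project is omitted in this sketch.
        return base_prompt + "\n\nGuidance files are available in ./skills if needed."
    raise ValueError(f"unknown skills mode: {mode}")


schema_skill = select_skill_file("./skills", "phase-matched", "schema")
plain_prompt = build_system_prompt("Build step 3", "none", "./skills", "all", "schema")
```

The practical difference is who decides to read the guidance: passive mode always puts it in front of the agent, while on-demand mode leaves the choice to the agent, which is itself something the logs can then measure.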
| Argument | Description |
|---|---|
| `--supabase-url URL` | REST API URL |
| `--supabase-anon-key KEY` | Anon key |
| `--supabase-service-key KEY` | Service role key (for auth/admin) |
| `--supabase-db-url URL` | Postgres connection string (for migrations) |
| `--supabase-project-ref REF` | Project ref (for Edge Function deployment) |

| Argument | Description |
|---|---|
| `--list-runs` | List all previous runs |

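For readers who want to script against the CLI, the documented flags and defaults map naturally onto `argparse`. This is a reconstruction from the tables above, not the parser in `orchestrator.py`, which may be organized differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented arguments and defaults."""
    parser = argparse.ArgumentParser(prog="orchestrator.py")
    parser.add_argument("prompt", nargs="?", help="What to build (the project goal)")
    parser.add_argument("--project-dir", metavar="DIR")
    parser.add_argument("--resume", metavar="RUN_ID")
    parser.add_argument("--start-step", type=int, metavar="N")
    parser.add_argument("--max-retries", type=int, default=2, metavar="N")
    parser.add_argument("--skip-smoke-test", action="store_true")
    parser.add_argument("--encourage-web-search", action="store_true")
    # Tool selection: each role can run on either agent.
    for role, default in (("planner", "claude"), ("implementer", "cursor"), ("verifier", "claude")):
        parser.add_argument(f"--{role}", choices=["claude", "cursor"], default=default)
    parser.add_argument("--claude-model", metavar="MODEL")
    parser.add_argument("--cursor-model", metavar="MODEL")
    parser.add_argument("--skills-mode", choices=["none", "passive", "on-demand"], default="none")
    parser.add_argument("--skills-source", default="./skills", metavar="PATH")
    parser.add_argument("--skills-filter", choices=["all", "phase-matched"], default="phase-matched")
    for flag, meta in (("url", "URL"), ("anon-key", "KEY"), ("service-key", "KEY"),
                       ("db-url", "URL"), ("project-ref", "REF")):
        parser.add_argument(f"--supabase-{flag}", metavar=meta)
    parser.add_argument("--list-runs", action="store_true")
    return parser


args = build_parser().parse_args(["Build a todo app", "--implementer", "claude"])
```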
Everything is stored as JSONB in Supabase and queryable with SQL:
- Runs & steps: run metadata + step records (phase, tool, build_phase, duration, timestamps)
- Commands executed: shell commands run per step (also included in EXIT_ERROR for debugging)
- Process output: stdout/stderr + exit code per step
- Tool calls/events: tool events (Read, Write, Edit, Bash, WebSearch, WebFetch) stored in `orchestrator_events`
- Verification verdicts: PASS/FAIL/PARTIAL in `parsed_result` with reasoning
- Normalized errors:
  - `PARSED_ERROR`: explicit errors from AI output
  - `EXIT_ERROR`: non-zero exit code + stderr tail + commands_run
- Timing: duration per step
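The `EXIT_ERROR` shape described above (non-zero exit code, stderr tail, commands run) could be produced by a helper along these lines. The dictionary keys and the 20-line tail are assumptions chosen for illustration, not the stored schema.

```python
def normalize_exit_error(exit_code, stderr, commands_run, tail_lines=20):
    """Return an EXIT_ERROR record for a failed process, or None if it exited cleanly."""
    if exit_code == 0:
        return None
    # Keep only the end of stderr, where the actual failure usually appears.
    stderr_tail = "\n".join(stderr.splitlines()[-tail_lines:])
    return {
        "type": "EXIT_ERROR",
        "exit_code": exit_code,
        "stderr_tail": stderr_tail,
        "commands_run": list(commands_run),
    }


error = normalize_exit_error(
    exit_code=1,
    stderr="warning: deprecated flag\nerror TS2304: Cannot find name 'supabase'",
    commands_run=["npx tsc"],
    tail_lines=1,
)
```

Storing the commands alongside the error means a failure can be reproduced from the log row alone, without replaying the whole step.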
After runs complete, use analyzer.py to explore the logged data:
# Full analysis of a run
python analyzer.py <run_id>
# Show only errors
python analyzer.py <run_id> --errors
# Tool usage breakdown (which tools called, how often)
python analyzer.py <run_id> --tools
# Timeline of events
python analyzer.py <run_id> --timeline
# Deep dive on a specific step
python analyzer.py <run_id> --step 3
# Save full analysis to reports/
python analyzer.py <run_id> --save-report
# Export as JSON
python analyzer.py <run_id> --export report
# Compare two runs
python analyzer.py --compare <run_id_1> <run_id_2>
The database also includes views for common queries:
- `orchestrator_run_summary`: aggregated stats per run
- `orchestrator_errors`: all errors across runs
- `orchestrator_tool_usage`: tool call frequency
- `orchestrator_commands`: shell commands executed
A web-based dashboard for exploring run data, viewing step details, and analyzing cross-run patterns.
# Start the dashboard (production mode)
python run_dashboard.py serve
# Start with hot-reload for development
python run_dashboard.py serve --dev
Then open http://localhost:8000 in your browser.
| Page | Description |
|---|---|
| Run List | All runs with status, duration, retry counts, classification breakdown. Filter by status, sort by any column. |
| Run Detail | Step-by-step timeline with expandable details. Click any step to see classification, errors, resolution actions, web searches. |
| Patterns | Cross-run analysis: error category heatmap, top failure patterns, self-correction leaderboard, tool comparison. |
The dashboard includes an AI classifier that analyzes failed steps and categorizes them:
| Classification | Meaning |
|---|---|
| Architectural | Fundamental approach is wrong — needs different strategy |
| Implementation | Right approach, wrong execution — fixable with retries |
| Clean Pass | Step succeeded without retries |
# Classify all unclassified runs
python run_dashboard.py classify
# Classify a specific run
python run_dashboard.py classify <run_id>
# Force reclassification
python run_dashboard.py classify <run_id> --force
Classification requires an ANTHROPIC_API_KEY in your .env file.
# Start dashboard server
python run_dashboard.py serve # Production mode (serves built frontend)
python run_dashboard.py serve --dev # Dev mode (API only, run Vite separately)
# Ingest reports from ./reports/ directory
python run_dashboard.py ingest # Ingest new reports only
python run_dashboard.py ingest --force # Re-ingest all reports
# Run AI classification
python run_dashboard.py classify # Classify all unclassified
python run_dashboard.py classify <id>   # Classify specific run
For frontend development with hot-reload:
# Terminal 1: Start API server
python run_dashboard.py serve --dev
# Terminal 2: Start Vite dev server
cd dashboard/frontend && npm run dev
Then open http://localhost:5173 (Vite proxies API calls to port 8000).
To rebuild the production frontend:
cd dashboard/frontend && npm run build

orchestrator/
├── orchestrator.py          # Main orchestration loop
├── analyzer.py              # Post-run analysis tool
├── run_dashboard.py         # Dashboard CLI entry point
├── preflight.py             # Pre-run verification
├── storage.py               # Supabase storage backend
├── playwright_tests.py      # Browser test runner
├── migration.sql            # Database schema
├── requirements.txt         # Python dependencies
├── .env.example             # Environment template
├── skills/                  # Phase-specific guidance files
│   ├── all.md               # Universal guidance
│   ├── setup.md             # Project setup phase
│   ├── schema.md            # Database schema phase
│   ├── backend.md           # Backend/API phase
│   ├── frontend.md          # Frontend phase
│   ├── testing.md           # Testing phase
│   ├── deployment.md        # Deployment phase
│   └── fix.md               # Error fix phase
└── dashboard/               # Analysis dashboard
    ├── backend/             # FastAPI backend
    │   ├── app.py           # API routes
    │   ├── db.py            # SQLite database layer
    │   ├── ingest.py        # Report ingestion
    │   └── classifier.py    # AI classification
    └── frontend/            # React frontend (Vite)
        └── src/
            ├── pages/       # RunList, RunDetail, Patterns
            └── components/
MIT