Supa Agent Orchestrator

Multi-agent observation framework that orchestrates Claude Code and Cursor to build Supabase apps, logging every decision and failure to analyze where AI coding agents break down.

Purpose

This is a measurement tool. It answers the question: when AI agents are given a task to build an app with Supabase and minimal hand-holding, where do they still get things wrong?

By logging every step, tool call, verification verdict, and smoke test result, we get a structured dataset that shows:

Knowledge gaps — which tasks consistently fail? If RLS policy steps fail 8 out of 10 runs with similar errors, that's a gap in training data or documentation, not randomness
Recovery effectiveness — when the verifier triggers a web search, do the findings actually help the retry succeed? If searches for "supabase realtime" never lead to passing retries, the available docs aren't good enough
Tool & model comparison — different agents and models implement the same prompt, so we can compare which ones use more Bash calls, read more files before editing, or recover better from failures

The intended output is a list of places where documentation, examples, or training data need improvement so that these tools build correctly.

Installation

Prerequisites

Python 3.9+
Claude Code CLI installed and authenticated
Cursor installed (if using Cursor as implementer)
A Supabase project for logging

Setup

# From the cloned repository, install core dependencies (python-dotenv, supabase)
pip install -r requirements.txt

# For browser tests (optional)
pip install playwright httpx
playwright install chromium

# Configure environment
cp .env.example .env
# Edit .env with your Supabase credentials (for logging)

Database Setup

Run the migration to create log tables in your Supabase project:

# Copy contents of migration.sql and run in Supabase SQL Editor
# Dashboard > SQL Editor > New Query > Paste > Run

This creates:

orchestrator_runs — one row per orchestration session
orchestrator_steps — each plan/implement/verify phase
orchestrator_events — tool calls, file writes, errors

Preflight Check

Verify everything is configured correctly:

python preflight.py

This checks CLI tools, Supabase connectivity, and database schema.

Quick Start

# Basic run (uses Claude for planning/verification, Cursor for implementation)
python orchestrator.py "Build a todo app with Supabase auth"

# With runtime testing against a Supabase project
python orchestrator.py "Build a todo app with auth" \
  --supabase-url https://xxx.supabase.co \
  --supabase-anon-key xxx \
  --supabase-service-key xxx

# List previous runs
python orchestrator.py --list-runs

# Resume a failed run from step 3
python orchestrator.py --resume abc123 --start-step 3

The orchestration balancing act

If the orchestration is too thin — no verification, no RLS testing, no replanning — the agents produce code that looks right but doesn't work, and you can't tell where it broke. If the orchestration is too thick — the system prompt includes the exact SQL, the exact config entries, the exact implementation patterns — the agents just follow instructions and you learn nothing about their actual knowledge.

The goal is enough structure that failures are meaningful, not so much that we are hiding them. The orchestration catches failures and gives agents a chance to recover, but it doesn't prevent failures from happening in the first place. Logging captures every step of this — what failed, how the agent tried to recover, and whether it succeeded. That's the dataset.

How It Works

You: "Build a Supabase todo app with auth"                                    
                      │                                                         
                      ▼                                                         
          ┌───────────────────────┐                                             
          │  Python Orchestrator  │                                             
          └───────────┬───────────┘                                             
                      │                                                         
      ┌───────────────┼───────────────┐                                         
      │               │               │                                         
      ▼               ▼               ▼                                         
  ┌────────┐    ┌──────────┐    ┌──────────┐                                    
  │ Agent  │    │  Agent   │    │  Agent   │                                    
  │ Plans  │───▶│Implements│───▶│ Verifies │                                    
  └────────┘    └──────────┘    └────┬─────┘                                    
      ▲                              │                                          
      │         ┌────────────────────┤                                          
      │         ▼                    ▼                                          
      │    [caveats?]            [retry]                                        
      │         │                    │                                          
      │         ▼                    │                                          
      │   ┌──────────┐               │                                          
      └───│ Replans  │◀──────────────┘                                          
          │if needed │                                                          
          └──────────┘                                                          
                  │                                                             
                  ▼                                                             
      ┌───────────────────────┐                                                 
      │  Supabase JSONB logs  │                                                 
      └───────────────────────┘

One prompt in → fully built project + complete observation dataset out.

For each step, the orchestrator runs:

Plan — Agent generates a step-by-step implementation plan tagged with build_phase (setup, schema, backend, frontend, testing, deployment)
Implement — Agent builds the step
Verify — Agent checks the work and returns a verdict
Resolve — Based on the verdict:

  ┌────────────────┬─────────────────────────────────────────────────────────┐                                               
  │    Verdict     │                      What happens                       │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ PROCEED        │ Run replan checkpoint, then next step                   │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ RETRY          │ Append issues, re-run implementation                    │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ WEB_SEARCH     │ Search docs, append findings, retry                     │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ RUN_DIAGNOSTIC │ Run a command (npx tsc, npm test), append output, retry │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ SKIP           │ Skip step with reason                                   │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ MODIFY_PLAN    │ Trigger replan checkpoint                               │                                               
  └────────────────┴─────────────────────────────────────────────────────────┘

Replan Checkpoint — After step completion, evaluate if remaining steps need adjustment. If implementation diverged, regenerate remaining steps. Completed steps stay locked.
Log — Everything goes to Supabase

Loop controls:

resolution_count — max 7 resolution actions (retry, search, diagnostic) per step
Replan — separate from resolution budget, runs after step passes
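The verdict table and the resolution budget above can be sketched as a small loop. This is a minimal illustration, not the orchestrator's actual code: `resolve_step`, `StepResult`, and the `verify` callable are hypothetical names, and the status strings are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MAX_RESOLUTIONS = 7  # per-step budget of resolution actions, as stated above

# Verdicts that consume the resolution budget and lead to another attempt.
RETRYING_VERDICTS = {"RETRY", "WEB_SEARCH", "RUN_DIAGNOSTIC"}


@dataclass
class StepResult:
    status: str  # "passed", "skipped", "replan", or "exhausted" (assumed labels)
    resolution_count: int = 0
    context: List[str] = field(default_factory=list)


def resolve_step(verify: Callable[[List[str]], str]) -> StepResult:
    """Re-run a step until it resolves or the resolution budget is spent.

    `verify` is a hypothetical stand-in for "implement, then verify": it
    receives the accumulated context and returns a verdict string.
    """
    result = StepResult(status="exhausted")
    while True:
        verdict = verify(result.context)
        if verdict == "PROCEED":
            result.status = "passed"
            return result
        if verdict == "SKIP":
            result.status = "skipped"
            return result
        if verdict == "MODIFY_PLAN":
            result.status = "replan"
            return result
        if verdict not in RETRYING_VERDICTS:
            raise ValueError(f"unknown verdict: {verdict}")
        if result.resolution_count >= MAX_RESOLUTIONS:
            return result  # budget spent, status stays "exhausted"
        result.resolution_count += 1
        # Issues, search findings, or diagnostic output would be appended here.
        result.context.append(f"{verdict} #{result.resolution_count}")
```

Keeping the budget as a single counter across retries, searches, and diagnostics mirrors the description above, where all three resolution actions share one limit per step.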

Test Frameworks

After implementation, the orchestrator runs multiple verification layers to catch different failure modes:

Test Layer	What It Tests	When It Runs
API Verification	Tables exist, REST endpoints respond	Per-step, on schema changes
RLS Tests	Row Level Security policies enforce correctly	Per-step, when step mentions RLS/policies
Smoke Test	Build succeeds, app starts, auth works, storage works	After all steps complete
Edge Function Tests	Edge functions deploy and execute	Per-step, on backend steps with functions
Playwright Browser Tests	E2E user flows work in real browser	After smoke test passes

Test execution flow:

During Implementation (per-step)
        │
        ├── API Verification ──▶ On schema steps (checks tables exist)
        │
        └── RLS Tests ─────────▶ On steps mentioning RLS/policies
                                  (retry loop within step resolution)

After All Steps Complete
        │
        ▼
   Smoke Test ──────▶ Fix Loop (max 2 retries)
        │
        ▼
  Browser Tests ────▶ Fix Loop (max 2 retries)
        │
        ▼
    Results logged to Supabase
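The flow above can be illustrated with two small helpers: one that picks the per-step test layers, and one that wraps a test in a bounded fix loop. Both are sketches under assumptions. The function names, the keyword pattern used to detect RLS steps, and the reading of "max 2 retries" as one initial run plus up to two re-runs are not taken from the orchestrator's source.

```python
import re
from typing import Callable, List

# Assumed keyword match for "step mentions RLS/policies".
RLS_PATTERN = re.compile(r"\b(rls|row[- ]level security|polic(?:y|ies))\b", re.IGNORECASE)


def per_step_tests(build_phase: str, description: str, has_edge_functions: bool = False) -> List[str]:
    """Return the names of the test layers that would run for one step."""
    tests = []
    if build_phase == "schema":
        tests.append("api_verification")
    if RLS_PATTERN.search(description):
        tests.append("rls_tests")
    if build_phase == "backend" and has_edge_functions:
        tests.append("edge_function_tests")
    return tests


def run_with_fix_loop(run_test: Callable[[], bool],
                      attempt_fix: Callable[[], None],
                      max_retries: int = 2) -> bool:
    """Run a test, and on failure ask for a fix and re-run, up to max_retries times."""
    for attempt in range(max_retries + 1):
        if run_test():
            return True
        if attempt < max_retries:
            attempt_fix()
    return False
```

The same `run_with_fix_loop` shape would apply to both the smoke test and the browser tests, with the browser tests only started if the smoke test call returns `True`.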

Playwright browser tests verify:

Auth redirect — unauthenticated users see login
Login flow — user can sign in
Create resource — authenticated CRUD works
Realtime sync — User B sees User A's changes without refresh (requires two browser contexts)

All test results are logged with pass/fail status, error messages, and duration for analysis.

Usage

python orchestrator.py [prompt] [options]

Positional

Argument	Description
`prompt`	What to build (the project goal)

Project

Argument	Description
`--project-dir DIR`	Directory to create project in (default: auto-generated)

Resume

Argument	Description
`--resume RUN_ID`	Resume a previous run
`--start-step N`	Start from step N (with --resume)

Execution

Argument	Description
`--max-retries N`	Max RETRY verdicts per step (default: 2)
`--skip-smoke-test`	Skip the smoke test phase
`--encourage-web-search`	Encourage agents to use WebSearch proactively

Agent Tools

Argument	Description
`--planner {claude,cursor}`	Tool for planning (default: claude)
`--implementer {claude,cursor}`	Tool for implementation (default: cursor)
`--verifier {claude,cursor}`	Tool for verification (default: claude)

Models

Argument	Description
`--claude-model MODEL`	Model for Claude Code
`--cursor-model MODEL`	Model for Cursor Agent

Skills Injection

Inject phase-specific guidance into implementation steps.

Argument	Description
`--skills-mode {none,passive,on-demand}`	Injection mode (default: none)
`--skills-source PATH`	Path to skills directory (default: ./skills)
`--skills-filter {all,phase-matched}`	File selection strategy (default: phase-matched)

Modes:

none — No skills injection
passive — Append skill content to system prompt
on-demand — Copy skills directory to project, add prompt hint

Filters:

phase-matched — Load {build_phase}.md matching the current step's phase
all — Always load all.md regardless of phase

Build phases: setup, schema, backend, frontend, testing, deployment, fix
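As a rough illustration of how the mode and filter options combine, the sketch below picks the skill file for passive mode. `skill_file_for_step` is a hypothetical helper, and it assumes the filter only matters for passive mode, since on-demand mode copies the whole directory.

```python
from pathlib import Path
from typing import Optional

BUILD_PHASES = {"setup", "schema", "backend", "frontend", "testing", "deployment", "fix"}


def skill_file_for_step(skills_mode: str, skills_filter: str, build_phase: str,
                        skills_source: str = "./skills") -> Optional[Path]:
    """Return the skill file whose content would be appended to the system prompt.

    Returns None when nothing is appended (mode "none", or "on-demand",
    where the directory is copied into the project instead).
    """
    if skills_mode in ("none", "on-demand"):
        return None
    if skills_mode != "passive":
        raise ValueError(f"unknown skills mode: {skills_mode}")
    if skills_filter == "all":
        return Path(skills_source) / "all.md"
    if skills_filter == "phase-matched":
        if build_phase not in BUILD_PHASES:
            raise ValueError(f"unknown build phase: {build_phase}")
        return Path(skills_source) / f"{build_phase}.md"
    raise ValueError(f"unknown skills filter: {skills_filter}")
```

For example, a schema step run with `--skills-mode passive` and the default filter would resolve to `schema.md` inside the skills directory.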

Supabase (for runtime testing)

Argument	Description
`--supabase-url URL`	REST API URL
`--supabase-anon-key KEY`	Anon key
`--supabase-service-key KEY`	Service role key (for auth/admin)
`--supabase-db-url URL`	Postgres connection string (for migrations)
`--supabase-project-ref REF`	Project ref (for Edge Function deployment)

Other

Argument	Description
`--list-runs`	List all previous runs
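To show how the documented flags and defaults fit together, here is a small `argparse` sketch covering a subset of them. It is an illustration built from the tables above, not a copy of the parser in orchestrator.py. Making the prompt optional is an assumption, made so that `--list-runs` can be used on its own.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(prog="orchestrator.py")
    # Assumption: the prompt is optional so that --list-runs works without it.
    parser.add_argument("prompt", nargs="?", help="What to build (the project goal)")
    parser.add_argument("--max-retries", type=int, default=2,
                        help="Max RETRY verdicts per step")
    parser.add_argument("--skip-smoke-test", action="store_true")
    # Planner, implementer, and verifier each accept the same two tools.
    for role, default in (("planner", "claude"), ("implementer", "cursor"), ("verifier", "claude")):
        parser.add_argument(f"--{role}", choices=["claude", "cursor"], default=default)
    parser.add_argument("--skills-mode", choices=["none", "passive", "on-demand"], default="none")
    parser.add_argument("--skills-filter", choices=["all", "phase-matched"], default="phase-matched")
    parser.add_argument("--skills-source", default="./skills")
    parser.add_argument("--list-runs", action="store_true")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```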

What Gets Logged

Everything is stored as JSONB in Supabase and queryable with SQL:

Runs & steps: run metadata + step records (phase, tool, build_phase, duration, timestamps)
Commands executed: shell commands run per step (also included in EXIT_ERROR for debugging)
Process output: stdout/stderr + exit code per step
Tool calls/events: tool events (Read, Write, Edit, Bash, WebSearch, WebFetch) stored in orchestrator_events
Verification verdicts: PASS/FAIL/PARTIAL in parsed_result with reasoning
Normalized errors:
- PARSED_ERROR: explicit errors from AI output
- EXIT_ERROR: non-zero exit code + stderr tail + commands_run
Timing: duration per step
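The EXIT_ERROR shape described above (non-zero exit code, stderr tail, commands run) could be assembled along these lines. The field names and the 20-line tail length are assumptions for illustration; the actual JSONB layout is defined by the orchestrator and migration.sql.

```python
from typing import Any, Dict, List, Optional

STDERR_TAIL_LINES = 20  # assumed tail length, not taken from the source


def normalize_exit_error(exit_code: int, stderr: str,
                         commands_run: List[str]) -> Optional[Dict[str, Any]]:
    """Build an EXIT_ERROR record, or return None when the process succeeded."""
    if exit_code == 0:
        return None
    tail = stderr.splitlines()[-STDERR_TAIL_LINES:]
    return {
        "type": "EXIT_ERROR",
        "exit_code": exit_code,
        "stderr_tail": "\n".join(tail),
        "commands_run": list(commands_run),
    }
```

Keeping only the tail of stderr keeps each record small while still preserving the lines closest to the failure, which are usually the most relevant for debugging.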

Analysis Tools

After runs complete, use analyzer.py to explore the logged data:

# Full analysis of a run
python analyzer.py <run_id>

# Show only errors
python analyzer.py <run_id> --errors

# Tool usage breakdown (which tools called, how often)
python analyzer.py <run_id> --tools

# Timeline of events
python analyzer.py <run_id> --timeline

# Deep dive on a specific step
python analyzer.py <run_id> --step 3

# Save full analysis to reports/
python analyzer.py <run_id> --save-report

# Export as JSON
python analyzer.py <run_id> --export report

# Compare two runs
python analyzer.py --compare <run_id_1> <run_id_2>

The database also includes views for common queries:

orchestrator_run_summary — aggregated stats per run
orchestrator_errors — all errors across runs
orchestrator_tool_usage — tool call frequency
orchestrator_commands — shell commands executed
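Rows fetched from these tables or views can also be aggregated in Python. The sketch below computes a failure rate per build phase, which is the kind of signal the Purpose section describes. The row shape is an assumption: it expects each step row to carry `build_phase` and a `parsed_result` dict with a `verdict` key, so adjust the keys to match the actual schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable


def failure_rate_by_phase(steps: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Return the fraction of FAIL verdicts per build phase."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for step in steps:
        phase = step["build_phase"]
        totals[phase] += 1
        # Assumed layout: parsed_result is a dict holding the verdict string.
        if (step.get("parsed_result") or {}).get("verdict") == "FAIL":
            failures[phase] += 1
    return {phase: failures[phase] / totals[phase] for phase in totals}


if __name__ == "__main__":
    sample = [
        {"build_phase": "schema", "parsed_result": {"verdict": "FAIL"}},
        {"build_phase": "schema", "parsed_result": {"verdict": "PASS"}},
        {"build_phase": "frontend", "parsed_result": {"verdict": "PASS"}},
    ]
    print(failure_rate_by_phase(sample))
```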

Analysis Dashboard

A web-based dashboard for exploring run data, viewing step details, and analyzing cross-run patterns.

Quick Start

# Start the dashboard (production mode)
python run_dashboard.py serve

# Start in dev mode (API only; run the Vite dev server separately, see Development)
python run_dashboard.py serve --dev

Then open http://localhost:8000 in your browser.

Features

Page	Description
Run List	All runs with status, duration, retry counts, classification breakdown. Filter by status, sort by any column.
Run Detail	Step-by-step timeline with expandable details. Click any step to see classification, errors, resolution actions, web searches.
Patterns	Cross-run analysis: error category heatmap, top failure patterns, self-correction leaderboard, tool comparison.

AI Classification

The dashboard includes an AI classifier that analyzes steps and assigns each one a category:

Classification	Meaning
Architectural	Fundamental approach is wrong — needs different strategy
Implementation	Right approach, wrong execution — fixable with retries
Clean Pass	Step succeeded without retries

# Classify all unclassified runs
python run_dashboard.py classify

# Classify a specific run
python run_dashboard.py classify <run_id>

# Force reclassification
python run_dashboard.py classify <run_id> --force

Classification requires an ANTHROPIC_API_KEY in your .env file.

CLI Commands

# Start dashboard server
python run_dashboard.py serve           # Production mode (serves built frontend)
python run_dashboard.py serve --dev     # Dev mode (API only, run Vite separately)

# Ingest reports from ./reports/ directory
python run_dashboard.py ingest          # Ingest new reports only
python run_dashboard.py ingest --force  # Re-ingest all reports

# Run AI classification
python run_dashboard.py classify        # Classify all unclassified
python run_dashboard.py classify <id>   # Classify specific run

Development

For frontend development with hot-reload:

# Terminal 1: Start API server
python run_dashboard.py serve --dev

# Terminal 2: Start Vite dev server
cd dashboard/frontend && npm run dev

Then open http://localhost:5173 (Vite proxies API calls to port 8000).

To rebuild the production frontend:

cd dashboard/frontend && npm run build

Project Structure

orchestrator/
├── orchestrator.py      # Main orchestration loop
├── analyzer.py          # Post-run analysis tool
├── run_dashboard.py     # Dashboard CLI entry point
├── preflight.py         # Pre-run verification
├── storage.py           # Supabase storage backend
├── playwright_tests.py  # Browser test runner
├── migration.sql        # Database schema
├── requirements.txt     # Python dependencies
├── .env.example         # Environment template
├── skills/              # Phase-specific guidance files
│   ├── all.md           # Universal guidance
│   ├── setup.md         # Project setup phase
│   ├── schema.md        # Database schema phase
│   ├── backend.md       # Backend/API phase
│   ├── frontend.md      # Frontend phase
│   ├── testing.md       # Testing phase
│   ├── deployment.md    # Deployment phase
│   └── fix.md           # Error fix phase
└── dashboard/           # Analysis dashboard
    ├── backend/         # FastAPI backend
    │   ├── app.py       # API routes
    │   ├── db.py        # SQLite database layer
    │   ├── ingest.py    # Report ingestion
    │   └── classifier.py # AI classification
    └── frontend/        # React frontend (Vite)
        └── src/
            ├── pages/   # RunList, RunDetail, Patterns
            └── components/

License

MIT
