4L3k51/supa-agent-observer

Supa Agent Orchestrator

Multi-agent observation framework that orchestrates Claude Code and Cursor to build Supabase apps, logging every decision and failure to analyze where AI coding agents break down.

Purpose

This is a measurement tool. It answers the question: when AI agents are given a task to build an app with Supabase and minimal hand-holding, where do they still get things wrong?

By logging every step, tool call, verification verdict, and smoke test result, we get a structured dataset that shows:

  • Knowledge gaps — which tasks consistently fail? If RLS policy steps fail 8 out of 10 runs with similar errors, that's a gap in training data or documentation, not randomness
  • Recovery effectiveness — when the verifier triggers a web search, do the findings actually help the retry succeed? If searches for "supabase realtime" never lead to passing retries, the available docs aren't good enough
  • Tool & model comparison — different agents and models implement the same prompt, so we can compare which ones use more Bash calls, read more files before editing, or recover better from failures

The intended output is a list of the places where documentation, examples, or training data need to improve so that these tools build correctly.

Installation

Prerequisites

  • Python 3.9+
  • Claude Code CLI installed and authenticated
  • Cursor installed (if using Cursor as implementer)
  • A Supabase project for logging

Setup

# From the root of your clone, install core dependencies (python-dotenv, supabase)
pip install -r requirements.txt

# For browser tests (optional)
pip install playwright httpx
playwright install chromium

# Configure environment
cp .env.example .env
# Edit .env with your Supabase credentials (for logging)

Database Setup

Run the migration to create log tables in your Supabase project:

# Copy contents of migration.sql and run in Supabase SQL Editor
# Dashboard > SQL Editor > New Query > Paste > Run

This creates:

  • orchestrator_runs — one row per orchestration session
  • orchestrator_steps — each plan/implement/verify phase
  • orchestrator_events — tool calls, file writes, errors

Preflight Check

Verify everything is configured correctly:

python preflight.py

This checks CLI tools, Supabase connectivity, and database schema.

Quick Start

# Basic run (uses Claude for planning/verification, Cursor for implementation)
python orchestrator.py "Build a todo app with Supabase auth"

# With runtime testing against a Supabase project
python orchestrator.py "Build a todo app with auth" \
  --supabase-url https://xxx.supabase.co \
  --supabase-anon-key xxx \
  --supabase-service-key xxx

# List previous runs
python orchestrator.py --list-runs

# Resume a failed run from step 3
python orchestrator.py --resume abc123 --start-step 3

The orchestration balancing act

If the orchestration is too thin — no verification, no RLS testing, no replanning — the agents produce code that looks right but doesn't work, and you can't tell where it broke. If the orchestration is too thick — the system prompt includes the exact SQL, the exact config entries, the exact implementation patterns — the agents just follow instructions and you learn nothing about their actual knowledge.

The goal is enough structure that failures are meaningful, but not so much that it hides them. The orchestration catches failures and gives agents a chance to recover, but it doesn't prevent failures from happening in the first place. Logging captures every step of this: what failed, how the agent tried to recover, and whether it succeeded. That's the dataset.

How It Works

You: "Build a Supabase todo app with auth"                                    
                      │                                                         
                      ▼                                                         
          ┌───────────────────────┐                                             
          │  Python Orchestrator  │                                             
          └───────────┬───────────┘                                             
                      │                                                         
      ┌───────────────┼───────────────┐                                         
      │               │               │                                         
      ▼               ▼               ▼                                         
  ┌────────┐    ┌──────────┐    ┌──────────┐                                    
  │ Agent  │    │  Agent   │    │  Agent   │                                    
  │ Plans  │───▶│Implements│───▶│ Verifies │                                    
  └────────┘    └──────────┘    └────┬─────┘                                    
      ▲                              │                                          
      │         ┌────────────────────┤                                          
      │         ▼                    ▼                                          
      │    [caveats?]            [retry]                                        
      │         │                    │                                          
      │         ▼                    │                                          
      │   ┌──────────┐               │                                          
      └───│ Replans  │◀──────────────┘                                          
          │if needed │                                                          
          └──────────┘                                                          
                  │                                                             
                  ▼                                                             
      ┌───────────────────────┐                                                 
      │  Supabase JSONB logs  │                                                 
      └───────────────────────┘          

One prompt in → fully built project + complete observation dataset out.

For each step, the orchestrator runs:

  1. Plan — Agent generates a step-by-step implementation plan tagged with build_phase (setup, schema, backend, frontend, testing, deployment)
  2. Implement — Agent builds the step
  3. Verify — Agent checks the work and returns a verdict
  4. Resolve — Based on the verdict:
  ┌────────────────┬─────────────────────────────────────────────────────────┐                                               
  │    Verdict     │                      What happens                       │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ PROCEED        │ Run replan checkpoint, then next step                   │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ RETRY          │ Append issues, re-run implementation                    │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ WEB_SEARCH     │ Search docs, append findings, retry                     │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ RUN_DIAGNOSTIC │ Run a command (npx tsc, npm test), append output, retry │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ SKIP           │ Skip step with reason                                   │                                               
  ├────────────────┼─────────────────────────────────────────────────────────┤                                               
  │ MODIFY_PLAN    │ Trigger replan checkpoint                               │                                               
  └────────────────┴─────────────────────────────────────────────────────────┘                                               
  5. Replan Checkpoint — After step completion, evaluate if remaining steps need adjustment. If implementation diverged, regenerate remaining steps. Completed steps stay locked.
  6. Log — Everything goes to Supabase

Loop controls:

  • resolution_count — max 7 resolution actions (retry, search, diagnostic) per step
  • Replan — separate from resolution budget, runs after step passes
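
The verdict table and the resolution budget above can be read as a small dispatch loop. The following is a minimal sketch under stated assumptions, not the orchestrator's actual code: `resolve_step`, `StepOutcome`, and the `implement`/`verify` callables are hypothetical names, and only the verdict strings and the cap of 7 come from this README.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Verdicts that consume the per-step resolution budget (retry, search, diagnostic).
RESOLUTION_VERDICTS = {"RETRY", "WEB_SEARCH", "RUN_DIAGNOSTIC"}
MAX_RESOLUTIONS = 7  # resolution_count cap stated in this README


@dataclass
class StepOutcome:
    # One of "passed", "skipped", "replan", or "exhausted" (hypothetical labels).
    status: str
    resolution_count: int = 0
    history: List[str] = field(default_factory=list)


def resolve_step(implement: Callable[[List[str]], None],
                 verify: Callable[[], str],
                 max_resolutions: int = MAX_RESOLUTIONS) -> StepOutcome:
    """Run implement/verify until the step resolves or the budget runs out.

    `implement` receives the context accumulated so far (issues, findings,
    diagnostic output); `verify` returns a single verdict string.
    """
    outcome = StepOutcome(status="exhausted")
    context: List[str] = []
    while True:
        implement(context)
        verdict = verify()
        outcome.history.append(verdict)
        if verdict == "PROCEED":
            outcome.status = "passed"
            return outcome
        if verdict == "SKIP":
            outcome.status = "skipped"
            return outcome
        if verdict == "MODIFY_PLAN":
            outcome.status = "replan"
            return outcome
        if verdict not in RESOLUTION_VERDICTS:
            raise ValueError(f"unknown verdict: {verdict}")
        if outcome.resolution_count >= max_resolutions:
            return outcome  # budget spent; status stays "exhausted"
        outcome.resolution_count += 1
        # Stand-in for appending issues, search findings, or diagnostic output.
        context.append(f"{verdict} #{outcome.resolution_count}")
```

In this sketch the replan checkpoint sits outside the loop, which matches the note that replanning does not draw on the resolution budget.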

Test Frameworks

After implementation, the orchestrator runs multiple verification layers to catch different failure modes:

| Test Layer | What It Tests | When It Runs |
|---|---|---|
| API Verification | Tables exist, REST endpoints respond | Per-step, on schema changes |
| RLS Tests | Row Level Security policies enforce correctly | Per-step, when step mentions RLS/policies |
| Smoke Test | Build succeeds, app starts, auth works, storage works | After all steps complete |
| Edge Function Tests | Edge functions deploy and execute | Per-step, on backend steps with functions |
| Playwright Browser Tests | E2E user flows work in real browser | After smoke test passes |
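
The "When It Runs" column implies a per-step selection rule. Here is one way that rule could look, as a rough sketch: the function name, the `description` field, the returned labels, and the keyword patterns are all assumptions for illustration, and the real trigger logic may differ.

```python
import re
from typing import Dict, List

# Assumed keyword triggers; the README only says "mentions RLS/policies"
# and "backend steps with functions".
_RLS_PATTERN = re.compile(r"\b(rls|row level security|polic(y|ies))\b", re.IGNORECASE)
_FUNCTION_PATTERN = re.compile(r"\bedge functions?\b", re.IGNORECASE)


def select_step_tests(step: Dict[str, str]) -> List[str]:
    """Return the per-step test layers that apply to one plan step."""
    tests: List[str] = []
    phase = step.get("build_phase", "")
    text = step.get("description", "")
    if phase == "schema":
        tests.append("api_verification")
    if _RLS_PATTERN.search(text):
        tests.append("rls_tests")
    if phase == "backend" and _FUNCTION_PATTERN.search(text):
        tests.append("edge_function_tests")
    return tests
```

The smoke test and browser tests are left out on purpose, since they run once after all steps complete rather than per step.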

Test execution flow:

During Implementation (per-step)
        │
        ├── API Verification ──▶ On schema steps (checks tables exist)
        │
        └── RLS Tests ─────────▶ On steps mentioning RLS/policies
                                  (retry loop within step resolution)

After All Steps Complete
        │
        ▼
   Smoke Test ──────▶ Fix Loop (max 2 retries)
        │
        ▼
  Browser Tests ────▶ Fix Loop (max 2 retries)
        │
        ▼
    Results logged to Supabase
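
The fix loops in the flow above can be sketched as a bounded retry wrapper. This is an illustrative sketch only: `run_with_fix_loop`, `run_test`, and `apply_fix` are hypothetical names, and it assumes "max 2 retries" means one initial run plus up to two fix-and-rerun attempts.

```python
from typing import Callable, List, Tuple


def run_with_fix_loop(run_test: Callable[[], Tuple[bool, str]],
                      apply_fix: Callable[[str], None],
                      max_fix_retries: int = 2) -> Tuple[bool, List[str]]:
    """Run a test, and on failure hand the error to a fixer and run it again.

    Returns (passed, errors), where errors holds the message from every
    failed attempt in order.
    """
    errors: List[str] = []
    for attempt in range(max_fix_retries + 1):
        passed, error = run_test()
        if passed:
            return True, errors
        errors.append(error)
        # Only attempt a fix if another test run will follow it.
        if attempt < max_fix_retries:
            apply_fix(error)
    return False, errors
```

Returning the full error list rather than only the last error keeps each failed attempt available for logging.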

Playwright browser tests verify:

  1. Auth redirect — unauthenticated users see login
  2. Login flow — user can sign in
  3. Create resource — authenticated CRUD works
  4. Realtime sync — User B sees User A's changes without refresh (requires two browser contexts)

All test results are logged with pass/fail status, error messages, and duration for analysis.

Usage

python orchestrator.py [prompt] [options]

Positional

| Argument | Description |
|---|---|
| prompt | What to build (the project goal) |

Project

| Argument | Description |
|---|---|
| --project-dir DIR | Directory to create project in (default: auto-generated) |

Resume

| Argument | Description |
|---|---|
| --resume RUN_ID | Resume a previous run |
| --start-step N | Start from step N (with --resume) |

Execution

| Argument | Description |
|---|---|
| --max-retries N | Max RETRY verdicts per step (default: 2) |
| --skip-smoke-test | Skip the smoke test phase |
| --encourage-web-search | Encourage agents to use WebSearch proactively |

Agent Tools

| Argument | Description |
|---|---|
| --planner {claude,cursor} | Tool for planning (default: claude) |
| --implementer {claude,cursor} | Tool for implementation (default: cursor) |
| --verifier {claude,cursor} | Tool for verification (default: claude) |

Models

| Argument | Description |
|---|---|
| --claude-model MODEL | Model for Claude Code |
| --cursor-model MODEL | Model for Cursor Agent |

Skills Injection

Inject phase-specific guidance into implementation steps.

| Argument | Description |
|---|---|
| --skills-mode {none,passive,on-demand} | Injection mode (default: none) |
| --skills-source PATH | Path to skills directory (default: ./skills) |
| --skills-filter {all,phase-matched} | File selection strategy (default: phase-matched) |

Modes:

  • none — No skills injection
  • passive — Append skill content to system prompt
  • on-demand — Copy skills directory to project, add prompt hint

Filters:

  • phase-matched — Load {build_phase}.md matching the current step's phase
  • all — Always load all.md regardless of phase

Build phases: setup, schema, backend, frontend, testing, deployment, fix
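
The two filters amount to a small file-selection rule. The sketch below illustrates it under assumptions: `select_skill_file` is a hypothetical helper, and it only computes the path, since this README does not say what happens when the file is missing.

```python
from pathlib import Path

# Build phases as listed in this README.
BUILD_PHASES = {"setup", "schema", "backend", "frontend", "testing", "deployment", "fix"}


def select_skill_file(skills_dir: Path, build_phase: str,
                      skills_filter: str = "phase-matched") -> Path:
    """Return the skills file that a given filter would load for a step."""
    if skills_filter == "all":
        # Always all.md, regardless of the step's phase.
        return skills_dir / "all.md"
    if skills_filter == "phase-matched":
        if build_phase not in BUILD_PHASES:
            raise ValueError(f"unknown build phase: {build_phase}")
        return skills_dir / f"{build_phase}.md"
    raise ValueError(f"unknown skills filter: {skills_filter}")
```

For example, `select_skill_file(Path("skills"), "schema")` resolves to `skills/schema.md`, while the `all` filter resolves to `skills/all.md` for every phase.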

Supabase (for runtime testing)

| Argument | Description |
|---|---|
| --supabase-url URL | REST API URL |
| --supabase-anon-key KEY | Anon key |
| --supabase-service-key KEY | Service role key (for auth/admin) |
| --supabase-db-url URL | Postgres connection string (for migrations) |
| --supabase-project-ref REF | Project ref (for Edge Function deployment) |

Other

| Argument | Description |
|---|---|
| --list-runs | List all previous runs |

What Gets Logged

Everything is stored as JSONB in Supabase and queryable with SQL:

  • Runs & steps: run metadata + step records (phase, tool, build_phase, duration, timestamps)
  • Commands executed: shell commands run per step (also included in EXIT_ERROR for debugging)
  • Process output: stdout/stderr + exit code per step
  • Tool calls/events: tool events (Read, Write, Edit, Bash, WebSearch, WebFetch) stored in orchestrator_events
  • Verification verdicts: PASS/FAIL/PARTIAL in parsed_result with reasoning
  • Normalized errors:
    • PARSED_ERROR: explicit errors from AI output
    • EXIT_ERROR: non-zero exit code + stderr tail + commands_run
  • Timing: duration per step
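
As an illustration of the EXIT_ERROR shape described above (non-zero exit code, stderr tail, commands run), a normalizer might look like the sketch below. The function name, the dictionary keys, and the 20-line tail are assumptions; the actual record layout is defined by the project's storage code and migration.sql.

```python
from typing import Any, Dict, List, Optional

STDERR_TAIL_LINES = 20  # assumed tail length; not specified in this README


def normalize_exit_error(exit_code: int, stderr: str, commands_run: List[str],
                         tail_lines: int = STDERR_TAIL_LINES) -> Optional[Dict[str, Any]]:
    """Build an EXIT_ERROR record, or return None when the process succeeded."""
    if exit_code == 0:
        return None
    # Keep only the last few stderr lines, where the failure usually surfaces.
    tail = "\n".join(stderr.splitlines()[-tail_lines:])
    return {
        "type": "EXIT_ERROR",
        "exit_code": exit_code,
        "stderr_tail": tail,
        "commands_run": list(commands_run),
    }
```

A dictionary like this serializes directly to JSON, which fits the JSONB storage mentioned above.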

Analysis Tools

After runs complete, use analyzer.py to explore the logged data:

# Full analysis of a run
python analyzer.py <run_id>

# Show only errors
python analyzer.py <run_id> --errors

# Tool usage breakdown (which tools called, how often)
python analyzer.py <run_id> --tools

# Timeline of events
python analyzer.py <run_id> --timeline

# Deep dive on a specific step
python analyzer.py <run_id> --step 3

# Save full analysis to reports/
python analyzer.py <run_id> --save-report

# Export as JSON
python analyzer.py <run_id> --export report

# Compare two runs
python analyzer.py --compare <run_id_1> <run_id_2>

The database also includes views for common queries:

  • orchestrator_run_summary — aggregated stats per run
  • orchestrator_errors — all errors across runs
  • orchestrator_tool_usage — tool call frequency
  • orchestrator_commands — shell commands executed
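
Rows pulled from these tables or views can also be aggregated offline to look for knowledge gaps. The sketch below assumes the step rows have already been fetched as dictionaries; `build_phase` appears in the step records described above, while the boolean `passed` field and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def failure_rate_by_phase(steps: Iterable[Mapping[str, object]]) -> Dict[str, float]:
    """Return the fraction of failed steps for each build phase."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for step in steps:
        phase = str(step["build_phase"])
        totals[phase] += 1
        if not step["passed"]:
            failures[phase] += 1
    return {phase: failures[phase] / totals[phase] for phase in totals}


rows = [
    {"build_phase": "schema", "passed": False},
    {"build_phase": "schema", "passed": True},
    {"build_phase": "frontend", "passed": True},
]
print(failure_rate_by_phase(rows))  # {'schema': 0.5, 'frontend': 0.0}
```

A phase whose rate stays high across many runs is the kind of consistent failure that points to a documentation or training-data gap rather than noise.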

Analysis Dashboard

A web-based dashboard for exploring run data, viewing step details, and analyzing cross-run patterns.

Quick Start

# Start the dashboard (production mode)
python run_dashboard.py serve

# Start in dev mode (API only; run the Vite dev server separately for hot-reload)
python run_dashboard.py serve --dev

Then open http://localhost:8000 in your browser.

Features

| Page | Description |
|---|---|
| Run List | All runs with status, duration, retry counts, classification breakdown. Filter by status, sort by any column. |
| Run Detail | Step-by-step timeline with expandable details. Click any step to see classification, errors, resolution actions, web searches. |
| Patterns | Cross-run analysis: error category heatmap, top failure patterns, self-correction leaderboard, tool comparison. |

AI Classification

The dashboard includes an AI classifier that analyzes failed steps and categorizes them:

| Classification | Meaning |
|---|---|
| Architectural | Fundamental approach is wrong; needs a different strategy |
| Implementation | Right approach, wrong execution; fixable with retries |
| Clean Pass | Step succeeded without retries |

# Classify all unclassified runs
python run_dashboard.py classify

# Classify a specific run
python run_dashboard.py classify <run_id>

# Force reclassification
python run_dashboard.py classify <run_id> --force

Classification requires an ANTHROPIC_API_KEY in your .env file.

CLI Commands

# Start dashboard server
python run_dashboard.py serve           # Production mode (serves built frontend)
python run_dashboard.py serve --dev     # Dev mode (API only, run Vite separately)

# Ingest reports from ./reports/ directory
python run_dashboard.py ingest          # Ingest new reports only
python run_dashboard.py ingest --force  # Re-ingest all reports

# Run AI classification
python run_dashboard.py classify        # Classify all unclassified
python run_dashboard.py classify <id>   # Classify specific run

Development

For frontend development with hot-reload:

# Terminal 1: Start API server
python run_dashboard.py serve --dev

# Terminal 2: Start Vite dev server
cd dashboard/frontend && npm run dev

Then open http://localhost:5173 (Vite proxies API calls to port 8000).

To rebuild the production frontend:

cd dashboard/frontend && npm run build

Project Structure

orchestrator/
├── orchestrator.py      # Main orchestration loop
├── analyzer.py          # Post-run analysis tool
├── run_dashboard.py     # Dashboard CLI entry point
├── preflight.py         # Pre-run verification
├── storage.py           # Supabase storage backend
├── playwright_tests.py  # Browser test runner
├── migration.sql        # Database schema
├── requirements.txt     # Python dependencies
├── .env.example         # Environment template
├── skills/              # Phase-specific guidance files
│   ├── all.md           # Universal guidance
│   ├── setup.md         # Project setup phase
│   ├── schema.md        # Database schema phase
│   ├── backend.md       # Backend/API phase
│   ├── frontend.md      # Frontend phase
│   ├── testing.md       # Testing phase
│   ├── deployment.md    # Deployment phase
│   └── fix.md           # Error fix phase
└── dashboard/           # Analysis dashboard
    ├── backend/         # FastAPI backend
    │   ├── app.py       # API routes
    │   ├── db.py        # SQLite database layer
    │   ├── ingest.py    # Report ingestion
    │   └── classifier.py # AI classification
    └── frontend/        # React frontend (Vite)
        └── src/
            ├── pages/   # RunList, RunDetail, Patterns
            └── components/

License

MIT
