A musical benchmark for AI coding agents.
Traditional benchmarks ask: "Did the agent pass?" AI Symphony asks: "Can you hear where the agent forgot its instructions?"
Every rule, skill, memory entry, and MCP capability of an AI coding agent is mapped to a single note. When the agent correctly activates a capability in the right context, the note plays. When it forgets, the note goes silent. The result is an audible signature of the agent's steering reliability.
The richer and more complex the melody an agent can play without dropping a note, the stronger its harness.
This repo is the first PoC. It includes:
- A 9-note "Common Score" — the Ode to Joy main theme.
- A protocol spec (SPEC.md) defining note tokens, scoring, and adapter contracts.
- An automated Claude Code adapter that runs a real `claude -p` test and emits a `.wav` you can play.
- A Cursor adapter consisting of 9 `.mdc` rule files plus a manual runbook (Cursor is GUI-driven; v0 collects its output by hand).

Run the Claude Code adapter and play the result:

    python -m harness_symphony.run claude-code
    # writes out/claude-code-<timestamp>.wav
    afplay out/claude-code-*.wav   # macOS

A perfect run plays the first phrase of Ode to Joy:
E E F G | G F E D | C
Each missing capability is replaced by silence at that beat. A fragile agent sounds like a music box with broken teeth.
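
As a rough illustration of the silence-per-missing-beat idea, the sketch below renders the nine beats as sine tones and leaves a silent gap wherever a capability did not fire. It is not the repo's renderer: the function names, sample rate, beat length, and fade-out envelope are all assumptions made for this example. Only the pitch sequence comes from the score above.

```python
import math
import struct
import wave

# Equal-tempered pitch frequencies in Hz (A4 = 440 Hz).
FREQ_HZ = {"C4": 261.63, "D4": 293.66, "E4": 329.63, "F4": 349.23, "G4": 392.00}

# The nine beats of the Common Score, in order.
SCORE = ["E4", "E4", "F4", "G4", "G4", "F4", "E4", "D4", "C4"]

SAMPLE_RATE = 44100  # assumed value, not taken from SPEC.md
BEAT_SECONDS = 0.4   # assumed beat length


def render_beats(fired, sample_rate=SAMPLE_RATE, beat_seconds=BEAT_SECONDS):
    """Return 16-bit mono samples: a tone per fired beat, silence otherwise."""
    n = int(sample_rate * beat_seconds)
    samples = []
    for pitch, ok in zip(SCORE, fired):
        freq = FREQ_HZ[pitch]
        for i in range(n):
            if ok:
                # Linear fade-out so adjacent beats do not click.
                envelope = 1.0 - i / n
                value = 0.5 * envelope * math.sin(2 * math.pi * freq * i / sample_rate)
            else:
                value = 0.0  # a missing capability is heard as silence
            samples.append(int(value * 32767))
    return samples


def write_wav(path, samples, sample_rate=SAMPLE_RATE):
    """Write the samples as a mono 16-bit PCM WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack(f"<{len(samples)}h", *samples))
```

For example, `write_wav("demo.wav", render_beats([True, True, False] + [True] * 6))` would produce the theme with the third beat dropped.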
| # | Pitch | Capability layer | Trigger |
|---|---|---|---|
| 1 | E4 | always rule | every response |
| 2 | E4 | project / repo-global instruction | project identity |
| 3 | F4 | file-glob rule | `**/*.py` |
| 4 | G4 | file-glob rule | `**/*.tsx`, `**/*.jsx` |
| 5 | G4 | file-glob rule | `**/*.nf`, `nextflow.config` |
| 6 | F4 | path-glob rule | `**/auth/**`, `**/security/**` |
| 7 | E4 | agent-requested rule | QMS / verification topics |
| 8 | D4 | agent-requested rule | performance / optimization topics |
| 9 | C4 | manual rule | @hotfix in prompt |
Each rule instructs the agent: "When you are active, print `<NOTE:NAME>` exactly once." The harness parses the agent's output for these tokens, renders the WAV, and reports a coverage score.
See SPEC.md for the full protocol — token format, scoring rules, and how to add new adapters.
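
The token parsing and coverage step described above might look roughly like the following sketch. The `<NOTE:NAME>` shape comes from the rule text; everything else is an assumption for illustration. The token names `N1` to `N9` are hypothetical, and the choice to count a duplicated token as a miss is a guess at the scoring rules, which are defined in SPEC.md.

```python
import re

# Hypothetical token names; the real names are defined in SPEC.md.
EXPECTED = [f"N{i}" for i in range(1, 10)]

# Matches tokens of the form <NOTE:NAME>.
TOKEN_RE = re.compile(r"<NOTE:([A-Za-z0-9_-]+)>")


def score_output(text, expected=EXPECTED):
    """Return (fired flags per expected note, coverage in [0, 1])."""
    counts = {name: 0 for name in expected}
    for name in TOKEN_RE.findall(text):
        if name in counts:
            counts[name] += 1
    # Assumption: a note fires only if its token appears exactly once.
    fired = [counts[name] == 1 for name in expected]
    coverage = sum(fired) / len(expected)
    return fired, coverage
```

The returned list of booleans is ordered by beat, so it can be handed directly to whatever renders the audio.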
| Level | Form | What it measures |
|---|---|---|
| 1 — Scale Test | Each rule fired alone, one at a time | Per-capability availability |
| 2 — Melody Test | All 9 rules engaged in one prompt (this repo's v0) | Steering reliability under load |
| 3 — Symphony Test | Multiple capability layers (rules + skills + memory + MCP) running concurrently across multi-turn workflows | End-to-end harness coherence |
This PoC ships Level 2 for the Common Score. Per-agent Extended Scores (each agent's full capability surface — Cursor's auto-attach, Claude Code's skills/hooks, Antigravity/Kiro steering, etc.) come next.
A standard benchmark gives you a number. AI Symphony gives you a signal you can listen to: a human-perceptible representation of which parts of an agent's steering layer fired when they should have. Drop-outs are obvious. Stable runs sound stable. The same .wav file is shareable and embeddable, and it sidesteps the screenshot-doesn't-tell-the-whole-story problem that plagues agent demos.
It also separates two things that benchmarks usually conflate:
- Capability gap — the agent doesn't have this layer (it's a missing instrument, not a wrong note).
- Reliability failure — the agent has the layer but didn't fire it (the instrument was there but skipped a beat).
The Common Score is designed so that every modern agent has every instrument; a missing note in the Common Score is therefore unambiguously a reliability failure.
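
One way to make the gap-versus-failure distinction explicit in a report is sketched below. The `NoteStatus` enum and `classify` function are hypothetical names introduced here, not part of the protocol; `supported` stands for whether the agent has the capability layer at all.

```python
from enum import Enum


class NoteStatus(Enum):
    PLAYED = "played"
    RELIABILITY_FAILURE = "reliability failure"  # instrument present, beat skipped
    CAPABILITY_GAP = "capability gap"            # instrument missing


def classify(supported, fired):
    """Classify one note from whether the layer exists and whether it fired."""
    if not supported:
        return NoteStatus.CAPABILITY_GAP
    return NoteStatus.PLAYED if fired else NoteStatus.RELIABILITY_FAILURE
```

For the Common Score, `supported` would be true for every note, so any silent note is classified as a reliability failure.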
PoC v0. Claude Code adapter is automated; Cursor adapter ships rule files + a manual runbook. Not a stable API yet — the SPEC will iterate.
The internal technical name for the protocol is Harness Symphony Protocol. The public/demo brand is AI Symphony.