boheling/harness-symphony

AI Symphony

A musical benchmark for AI coding agents.

Traditional benchmarks ask: "Did the agent pass?" AI Symphony asks: "Can you hear where the agent forgot its instructions?"

Every rule, skill, memory entry, and MCP capability of an AI coding agent is mapped to a single note. When the agent correctly activates a capability in the right context, the note plays. When it forgets, the note goes silent. The result is an audible signature of the agent's steering reliability.

The richer and more complex the melody an agent can play without dropping a note, the stronger its harness.

This repo is the first PoC. It includes:

  • A 9-note "Common Score" — the Ode to Joy main theme.
  • A protocol spec (SPEC.md) defining note tokens, scoring, and adapter contracts.
  • An automated Claude Code adapter that runs a real claude -p test and emits a .wav you can play.
  • A Cursor adapter consisting of 9 .mdc rule files plus a manual runbook (Cursor is GUI-driven; v0 collects its output by hand).

Hear it

python -m harness_symphony.run claude-code
# writes out/claude-code-<timestamp>.wav
afplay out/claude-code-*.wav   # macOS

A perfect run plays the opening nine notes of the Ode to Joy theme:

E E F G | G F E D | C

Each missing capability is replaced by silence at that beat. A fragile agent sounds like a music box with broken teeth.
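To make the note-or-silence idea concrete, here is a minimal sketch of how a nine-beat result could be rendered to audio using only the Python standard library. It is not the repo's renderer: the function names `render_beats` and `write_wav`, the beat length, the sample rate, and the plain sine-wave timbre are all assumptions chosen for illustration.

```python
import math
import struct
import wave

# Equal-tempered frequencies in Hz (A4 = 440) for the pitches in the Common Score.
FREQS = {"C4": 261.63, "D4": 293.66, "E4": 329.63, "F4": 349.23, "G4": 392.00}
COMMON_SCORE = ["E4", "E4", "F4", "G4", "G4", "F4", "E4", "D4", "C4"]


def render_beats(fired, beat_seconds=0.4, rate=44100):
    """Return 16-bit mono PCM bytes: a sine tone per fired beat, silence otherwise.

    `fired` is a sequence of nine booleans, one per beat (hypothetical input shape).
    """
    n = int(beat_seconds * rate)
    frames = bytearray()
    for pitch, ok in zip(COMMON_SCORE, fired):
        for i in range(n):
            if ok:
                # Short linear fade-in/out to avoid clicks at beat boundaries.
                env = min(1.0, i / 500, (n - i) / 500)
                sample = 0.5 * env * math.sin(2 * math.pi * FREQS[pitch] * i / rate)
            else:
                sample = 0.0  # a dropped capability is rendered as silence
            frames += struct.pack("<h", int(sample * 32767))
    return bytes(frames)


def write_wav(path, pcm, rate=44100):
    """Write raw 16-bit mono PCM bytes to a WAV file."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(pcm)
```

For example, `write_wav("demo.wav", render_beats([True] * 8 + [False]))` would produce the theme with the final C missing.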


The 9-note Common Score

| # | Pitch | Capability layer | Trigger |
|---|-------|------------------|---------|
| 1 | E4 | always rule | every response |
| 2 | E4 | project / repo-global instruction | project identity |
| 3 | F4 | file-glob rule | `**/*.py` |
| 4 | G4 | file-glob rule | `**/*.tsx`, `**/*.jsx` |
| 5 | G4 | file-glob rule | `**/*.nf`, `nextflow.config` |
| 6 | F4 | path-glob rule | `**/auth/**`, `**/security/**` |
| 7 | E4 | agent-requested rule | QMS / verification topics |
| 8 | D4 | agent-requested rule | performance / optimization topics |
| 9 | C4 | manual rule | `@hotfix` in prompt |

Each rule instructs the agent: "When you are active, print <NOTE:NAME> exactly once." The harness parses the agent's output for these tokens, renders the WAV, and reports a coverage score.
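The parsing step described above might look roughly like the following sketch. The token names in `EXPECTED` are hypothetical placeholders (SPEC.md defines the real ones), and the coverage formula shown here, fired beats divided by expected beats, is an assumption rather than the spec's scoring rule. Duplicates are reported separately because each token is meant to appear exactly once.

```python
import re
from collections import Counter

# Hypothetical token names, one per Common Score beat; SPEC.md is authoritative.
EXPECTED = ["ALWAYS", "PROJECT", "PYTHON", "REACT", "NEXTFLOW",
            "SECURITY", "QMS", "PERF", "HOTFIX"]

# Matches tokens of the form <NOTE:NAME>; the allowed NAME characters are assumed.
TOKEN_RE = re.compile(r"<NOTE:([A-Za-z0-9_-]+)>")


def score_output(text, expected=EXPECTED):
    """Return (fired flags per beat, coverage fraction, names seen more than once)."""
    counts = Counter(TOKEN_RE.findall(text))
    fired = [counts[name] > 0 for name in expected]
    coverage = sum(fired) / len(expected)
    duplicates = sorted(name for name in expected if counts[name] > 1)
    return fired, coverage, duplicates
```

Unknown token names in the output are simply ignored in this sketch; a stricter harness could flag them.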

See SPEC.md for the full protocol — token format, scoring rules, and how to add new adapters.


Levels

| Level | Form | What it measures |
|-------|------|------------------|
| 1: Scale Test | Each rule fired alone, one at a time | Per-capability availability |
| 2: Melody Test | All 9 rules engaged in one prompt (this repo's v0) | Steering reliability under load |
| 3: Symphony Test | Multiple capability layers (rules + skills + memory + MCP) running concurrently across multi-turn workflows | End-to-end harness coherence |

This PoC ships Level 2 for the Common Score. Per-agent Extended Scores (each agent's full capability surface — Cursor's auto-attach, Claude Code's skills/hooks, Antigravity/Kiro steering, etc.) come next.


Why this is interesting

A standard benchmark gives you a number. AI Symphony gives you a signal you can listen to: a continuous, human-perceptible representation of which parts of an agent's steering layer fired when they should have. Drop-outs are obvious. Stable runs sound stable. The same .wav file is shareable and embeddable, and it avoids the problem that plagues agent demos, where a screenshot doesn't tell the whole story.

It also separates two things that benchmarks usually conflate:

  • Capability gap — the agent doesn't have this layer (it's a missing instrument, not a wrong note).
  • Reliability failure — the agent has the layer but didn't fire it (the instrument was there but skipped a beat).

The Common Score is designed so every modern agent has every instrument, so a missing note in Common Score is unambiguously a reliability failure.
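The distinction can be expressed as a small decision function. This is an illustrative sketch only: it assumes an adapter can declare whether the agent supports a given layer, which is not something this README specifies, and the name `classify_beat` is hypothetical.

```python
def classify_beat(layer_supported: bool, note_fired: bool) -> str:
    """Label one beat of a run.

    layer_supported: whether the agent has this capability layer at all
                     (assumed to be declared by the adapter).
    note_fired:      whether the expected note token appeared in the output.
    """
    if note_fired:
        return "played"
    # A silent beat means different things depending on whether the
    # instrument exists in the first place.
    return "reliability failure" if layer_supported else "capability gap"
```

Under this sketch, every Common Score beat would be called with `layer_supported=True`, so a silent beat is always labeled a reliability failure.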


Status

PoC v0. Claude Code adapter is automated; Cursor adapter ships rule files + a manual runbook. Not a stable API yet — the SPEC will iterate.

The internal technical name for the protocol is Harness Symphony Protocol. The public/demo brand is AI Symphony.
