Agents Assemble — Workshop Exercise

A PHP utility library with intentional bugs. Your job: let an autonomous AI agent find and fix them.

By the end of Exercise 5, you'll have an agent working through a real codebase — running tests, finding bugs, fixing them, running tests again — without being asked twice. By the end of the Bonus, you'll have two agents running in parallel — one building, one monitoring.

That's not a demo. That's a workflow. Let's build it.

Presentation: https://caseproof.github.io/agents-assemble-workshop-exercise/



Setup (do this first)

git clone https://github.com/caseproof/agents-assemble-workshop-exercise.git
cd agents-assemble-workshop-exercise
composer install

Install the plugins:

/plugin install ralph-loop@claude-plugins-official
npx plugins add sethshoultes/great-minds-plugin

Exercises

Pro tip: You don't have to create any of these files by hand. Claude Code can do it for you. Just say: "Create a TODO.md with three tasks" — and it will. Claude can create files, run bash commands, scaffold entire project structures. These exercises show the file contents so you know what's happening — but in practice, just ask Claude to make it.


Exercise 1: /loop — Learn the On/Off Switch (2 min)

Before you build anything autonomous, learn how to start and stop it.

/loop 1m tell me a fun fact about the current time

Watch it run. Then tell Claude: end the loop

That's it. Now you know the switch. Everything else in this workshop depends on knowing you can stop it.

Exit conditions: Instead of stopping manually, write the exit condition into the prompt: "...when done, say ALL TASKS COMPLETE" and pass --completion-promise "ALL TASKS COMPLETE". The loop stops itself. You'll use this in every Ralph exercise.


Exercise 2: Build a Custom Command (3 min)

One markdown file becomes one reusable slash command.

mkdir -p ~/.claude/commands

Create ~/.claude/commands/explain.md:

---
name: explain
description: Explain the current project
---

Read the README and package.json (or equivalent).
Give me a 3-sentence summary of what this project does.

Run it:

/explain

One markdown file = one command.


Exercise 3: Give Your Agent a Brain (4 min)

First, ask Claude to review something without a CLAUDE.md:

Review the last commit.

Note the response. Now create a CLAUDE.md in this directory:

# CLAUDE.md

You are Margaret Hamilton. You care about:
- Error handling and edge cases
- What happens when things go wrong
- Testing before shipping — always

When reviewing, ask: "What happens when this breaks at 3am?"

Ask again:

Review the last commit.

Twelve lines of markdown just changed how an AI reasons about your codebase. That's not configuration — that's personality.


Exercise 4: Ralph Wiggum — The Persistent Builder (4 min)

Before applying Ralph to real code, learn the pattern with something simple.

Create a TODO.md:

- [ ] Create a file called hello.txt that says "Hello from Ralph"
- [ ] Create a file called goodbye.txt that says "Goodbye from Ralph"
- [ ] Create a file called count.txt with the numbers 1 through 5, one per line

Run it:

/ralph-loop:ralph-loop "Read TODO.md. Pick one unchecked task. Do it. Mark it [x] in TODO.md. When all tasks are checked, say ALL TASKS COMPLETE."
  --completion-promise "ALL TASKS COMPLETE"
  --max-iterations 10

Watch it work through the list — one task, check it off, back for the next. When all three are done, it stops itself.

The TODO.md is the memory. Ralph reads it, sees what's checked, picks the next unchecked item, does it, marks it done, exits. The loop calls it again. The file is the state. This is the same principle behind every Ralph pattern.

To stop early: /ralph-loop:cancel-ralph


Exercise 5: Ralph Wiggum — Fix Until It Passes (the main event)

Now apply that same pattern to a real codebase.

Ralph Wiggum Guide: https://awesomeclaude.ai/ralph-wiggum

Run Tests

The tests live in the tests/ directory.

composer test
# or
vendor/bin/phpunit

You should see multiple failing tests. That's the point.


What's Broken

Three modules, each with intentional bugs:

  • StringUtils: slugify(), truncate(), initials()
  • ArrayUtils: flatten(), groupBy(), unique(), pluck()
  • ValidationUtils: isValidEmail(), isStrongPassword(), isValidUrl()

The tests are correct. The source code is not. Don't modify the tests.
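
To make that concrete, here's a hypothetical sketch of the kind of bug you're hunting. This is not the actual code in src/, just the shape of a boundary error that a failing test will surface:

// Hypothetical example, not the real bug: a truncate() with an off-by-one check.
function truncate(string $text, int $length): string
{
    if (strlen($text) < $length) {   // bug: should be <=, so a string exactly
        return $text;                // at the limit gets needlessly truncated
    }
    return substr($text, 0, $length) . '...';
}

// A test expecting truncate('hello', 5) === 'hello' fails, PHPUnit names the
// file and line, and Ralph has everything it needs to fix it.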

Rules

  1. Never modify files in tests/ — the tests define the correct behavior
  2. Only fix files in src/ — that's where the bugs are
  3. Run vendor/bin/phpunit to verify — green tests = fixed bugs

This repo has intentional bugs in src/. The tests in tests/ are correct. Ralph's job: run the tests, find what's failing, fix the source, run again. Repeat until everything is green.

/ralph-loop:ralph-loop "Run vendor/bin/phpunit. If any tests fail, read the relevant file in src/, find the bug, fix it, and run the tests again. Only modify files in src/ — never modify tests/."
  --completion-promise "OK (36 tests"
  --max-iterations 15

Watch what happens. Ralph runs the tests, reads the failure output, opens the right source file, fixes the bug, runs again. Each module — StringUtils, ArrayUtils, ValidationUtils — gets fixed one failure at a time.

You gave it a broken codebase and a way to measure success. It did the rest.

What's actually happening: Each Ralph iteration is a fresh agent — no memory of the previous run. The test output is the feedback loop. Ralph reads it, finds the failure, fixes it, runs again. The better your feedback loop, the better Ralph performs. PHPUnit's output tells Ralph exactly what broke and where. That's all it needs.

To stop early: /ralph-loop:cancel-ralph


Exercise 6: Ralph With a Real PRD — The Full Pattern (take-home)

This is the original Ralph technique as described by Geoffrey Huntley. Instead of a TODO list, Ralph works from a structured requirements file — and writes its own memory between iterations.

The two-file memory system:

  • prd.json — what needs to be built, structured as user stories with pass/fail status
  • progress.txt — notes the AI writes to itself about what it's done so far

Step 1: Ask Claude to create the PRD

Create a prd.json for a simple command-line calculator that supports add,
subtract, multiply, and divide. Format it as a JSON array of user stories,
each with: id, description, acceptance_criteria, and status (set to "fail").
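
If you want to see the shape before asking Claude, here's a minimal sketch of what the generated prd.json might look like. The field names come from the prompt above; the story text is illustrative:

[
  {
    "id": "US-1",
    "description": "As a user, I can add two numbers from the command line",
    "acceptance_criteria": "calc add 2 3 prints 5",
    "status": "fail"
  },
  {
    "id": "US-2",
    "description": "As a user, I can divide two numbers, with a clear error on divide-by-zero",
    "acceptance_criteria": "calc divide 6 3 prints 2; calc divide 6 0 prints an error",
    "status": "fail"
  }
]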

Step 2: Create an empty progress file

echo "No progress yet. Starting fresh." > progress.txt

Step 3: Run Ralph once — human in the loop

/ralph-loop:ralph-loop "Read prd.json and progress.txt. Pick the highest priority user story where status is fail. Implement it. Write tests. Run them. If they pass, update prd.json to mark it pass. Append progress notes to progress.txt. Make a git commit. If all stories pass, say ALL STORIES PASSING."
  --completion-promise "ALL STORIES PASSING"
  --max-iterations 1

Check: Did it update prd.json? Did it write to progress.txt? Did it commit?

Step 4: Let it run

/ralph-loop:ralph-loop "Read prd.json and progress.txt. Pick the highest priority user story where status is fail. Implement it. Write tests. Run them. If they pass, mark it pass in prd.json. Append progress notes to progress.txt. Only work on ONE story per iteration. Make a git commit. If all stories pass, say ALL STORIES PASSING."
  --completion-promise "ALL STORIES PASSING"
  --max-iterations 20

Walk away. When you come back: every story marked pass, a git commit per feature, and progress.txt as a full log written by Ralph, for Ralph.

This is what the 262-file morning looked like. This is what the $50K contract looked like. A spec, two files, and a loop.

Taking it further: The 13-Day Agent

Notion co-founder Simon Last ran a coding agent for 13 days straight using a refined version of this same pattern. Four structural rules did all the heavy lifting:

  1. Self-verification — design test layers the agent can loop on. It proves correctness itself.
  2. Spec documents — goals, implementation details, and verification criteria in a markdown file the agent iterates against.
  3. Running to-do list — break complex work into a list the agent can see and edit.
  4. Adversarial review — every ~20 iterations, a fresh-context sub-agent reviews the spec and implementation. Loop on its feedback until aligned.

The prompt template:

Before you start work on this project, create three files:
1. spec.md — a complete spec with goals, implementation details,
   and a verification section describing exactly how you'll prove
   each piece works.
2. todo.md — a running to-do list you'll edit as you work. Break
   complex tasks into verifiable sub-tasks.
3. tests/ — a folder of end-to-end tests that let you verify
   everything you build. Loop on them until each passes.

While you work: (a) consult spec.md before every change, (b) check
off todo.md as you go, (c) run tests after every meaningful commit,
(d) every ~20 iterations, call a fresh sub-agent with "review
spec.md and the current implementation for gaps" and loop on its
feedback until alignment is reached.

Do not ask me for clarification on anything you can resolve by
reading the spec and running the tests. Start with the spec.

Notice the pattern: spec.md is your prd.json. todo.md is your progress.txt. The tests are your completion promise. The only new idea is the adversarial sub-agent review — a fresh context checking for drift every 20 iterations. Same Ralph, longer leash.

Source: Simon Last on X


Bonus: Two Agents, One Goal

  1. Write a CLAUDE.md that defines your agent as a senior full-stack developer
  2. Write a TODO.md with 5 tasks that together build a simple web page
  3. Start Ralph working through the list:
    /ralph-loop:ralph-loop "Read TODO.md. Pick one unchecked task. Build it. Mark it [x] when done."
      --completion-promise "ALL TASKS COMPLETE"
      --max-iterations 10
    
  4. While Ralph builds, open a second Claude Code window and run:
    /loop 2m check TODO.md and report how many tasks are complete vs remaining
    

Now you have two agents: one building, one monitoring. That's the beginning of a team.


Try: Great Minds Debate (2 min)

npx plugins add sethshoultes/great-minds-plugin
/agency-debate "Should we build a mobile app or web app first?"

Watch Steve Jobs and Elon Musk argue about it. Then try it on a real decision you're facing.


Going Deeper: Custom Agents + Personas

You've already seen how CLAUDE.md gives an agent an identity. Sub-agents take that further — you define a named agent with a specific role, a system prompt, and a set of allowed tools. Then you call it by name.

# team/qa-agent.md

You are Margaret Hamilton. You care about what breaks at 3am.
Your job: review the code in src/ and run the tests.
If anything fails, write a detailed bug report to reports/qa.md.
Do not fix anything — just report.

Then from your main Claude session:

Use the qa-agent to review the last commit.

Why personas matter here: A generic QA agent gives you generic feedback. Margaret Hamilton asks "what happens when this fails at 3am?" Steve Jobs asks "would I be proud to ship this?" The same review prompt, two completely different outputs — because identity shapes how an agent reasons, not just what it's told to do.

Each sub-agent should have:

  • A clear role (what it does and doesn't do)
  • A persona that shapes how it reasons
  • Defined inputs (what to read) and outputs (what to produce)
  • Explicit tool permissions (don't give a reviewer write access)

The fastest way to create one: just tell Claude Code what you need.

Create a sub-agent for code review. It should focus on security vulnerabilities
and performance issues. Assign it the Margaret Hamilton persona from agents/margaret-hamilton-qa.md.

Claude will generate the agent file, wire up the persona, and drop it in .claude/agents/. You can also build one from scratch with the /agents command, or write the markdown file directly — the format is just frontmatter + a system prompt.
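
As a rough sketch of that file (the frontmatter fields here mirror the explain.md command above and are an assumption; check the sub-agents docs linked below for the exact schema), a review-only agent might look like:

---
name: qa-agent
description: Reviews src/ and runs the tests; reports bugs, never fixes them
tools: Read, Grep, Bash
---

You are Margaret Hamilton. You care about what breaks at 3am.
Review the code in src/ and run the tests.
If anything fails, write a detailed bug report to reports/qa.md.
Do not fix anything; just report.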

14 ready-made personas are in the agents/ folder — Margaret Hamilton, Steve Jobs, Elon Musk, Jony Ive, and 10 more. Clone the repo, and they're ready to use as-is, or reference them as a starting point for your own agents:

Create a sub-agent for writing blog posts. Use agents/maya-angelou-writer.md as the persona.

Read more: https://code.claude.com/docs/en/sub-agents

Agent Persona Builder: https://personas.caseproofagent.com/

Build a custom AI persona backed by Caseproof's knowledge base. Pick a template, customize it, chat with it, and download it for Claude Desktop.


Going Deeper: Agent Teams

Once you've defined individual agents, the next level is coordinating them — multiple Claude instances working in parallel with structured handoffs.

"Build me a three-agent pipeline. Strategist, developer, QA.
Parallel. Loop until QA passes."

One sentence. Claude writes the role definitions and the orchestration. Your agents. Your rules. Your team.

How it works:

  • Each agent runs in an isolated git worktree — no concurrent writes, no conflicts (see the sketch below)
  • The orchestrator dispatches work, collects output, routes failures back to the right agent
  • The pipeline loops until a quality gate passes
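
The isolation piece is plain git. A minimal sketch of what the orchestrator sets up, with branch and directory names that are purely illustrative:

# One worktree and branch per agent, each in its own directory
git worktree add ../agent-strategist -b agent/strategist
git worktree add ../agent-developer  -b agent/developer
git worktree add ../agent-qa         -b agent/qa

# Each agent then runs headless inside its own copy, so writes never collide
(cd ../agent-developer && claude -p "Read TODO.md and build the next unchecked task." --max-turns 20)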

Read more: https://code.claude.com/docs/en/agent-teams


Going Deeper: Memory Systems

By default, agent memory is file-based — MEMORY.md as an index, individual .md files for each memory, written and read by the agent across sessions. This works well for a single agent working on a single project.

When you have multiple agents or projects or need to search across accumulated knowledge, a database is a better tool.

The pattern: SQLite + embeddings. Each memory is stored as a row with a vector embedding alongside it. When an agent needs context, it searches by semantic similarity rather than exact file names.

# Add a learning after a project
memory add --type learning --agent "Margaret Hamilton" \
  --content "PHPUnit completion promise must match exact output string"

# Search before starting work
memory search "how did we handle PHPUnit output matching"
# → returns the most relevant memories by semantic similarity
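
Under the hood, a store like that can be a single table. Here's a minimal PHP + SQLite sketch, assuming one serialized embedding vector per row; the table and column names are illustrative, not the Great Minds schema:

// Hypothetical sketch of a memory store: one row per memory, vector alongside.
$db = new PDO('sqlite:memories.db');
$db->exec('CREATE TABLE IF NOT EXISTS memories (
    id        INTEGER PRIMARY KEY,
    agent     TEXT,
    type      TEXT,
    content   TEXT,
    embedding BLOB   -- serialized float vector for similarity search
)');

// Store a learning with its embedding (produced by whatever model you use)
$embedding = [0.12, -0.08, 0.33];   // stand-in for a real embedding vector
$stmt = $db->prepare('INSERT INTO memories (agent, type, content, embedding)
                      VALUES (?, ?, ?, ?)');
$stmt->execute([
    'Margaret Hamilton',
    'learning',
    'PHPUnit completion promise must match exact output string',
    serialize($embedding),
]);

// Retrieval: embed the query the same way, load candidate rows, and rank them
// by cosine similarity in PHP before handing the top matches to the agent.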

Why this matters for agent teams: Individual agents have no shared memory by default. A database gives the whole team a common brain — Margaret's QA findings are visible to Steve's product review, Jensen's strategic decisions inform Elon's feasibility checks. The memory compounds across projects instead of resetting each session.

The memory: field in agent frontmatter connects an agent to Claude Code's built-in memory system — the same file-based approach, scoped per-agent. For most projects this is enough. When it isn't, a SQLite store is the natural next step.

Implementation reference: The Great Minds agency uses a SQLite + TF-IDF vector store with semantic search across 155+ accumulated memories — see sethshoultes/great-minds for the full implementation.


Going Deeper: claude -p (Headless Mode)

Everything in this workshop — /loop, Ralph Wiggum, /schedule — is built on top of claude -p. Runs in your regular terminal, no chat window, no interactive session.

claude -p "Run vendor/bin/phpunit. If any tests fail, fix the source file and run again." \
  --max-turns 10 \
  --max-budget-usd 0.50

Key flags:

  • -p "...": Run a prompt non-interactively
  • --allowedTools "Read,Write,Edit,Bash": Control what Claude can touch
  • --max-turns 10: Cap how many tool calls Claude gets
  • --max-budget-usd 1.00: Spending cap
  • --output-format json: Structured output for pipelines
  • --continue: Resume the last conversation

When to use what:

  • Inside Claude Code chat: /loop, /ralph-loop, or just talk to Claude
  • From your terminal, one-shot: claude -p
  • In a cron job or CI/CD: claude -p
  • Overnight, laptop closed: /schedule

Going Deeper: Daemons

claude -p runs once and exits. A daemon runs forever — watching for events, dispatching agents, recovering from failures, and reporting back. It's the difference between a task and a system.

When you need a daemon instead of headless mode:

  • You want to watch a directory for new files and trigger an agent when one appears
  • You need a pipeline that runs continuously (debate → plan → build → review → ship)
  • You want health checks, crash recovery, and Telegram notifications when something breaks
  • Your agents need to run on a schedule and in response to events

The basic pattern:

import { watch } from 'chokidar';
import { execSync } from 'child_process';

// Watch for new PRDs and dispatch the pipeline
watch('prds/*.md', { ignoreInitial: true }).on('add', (path) => {
  console.log(`New PRD detected: ${path} — dispatching agents`);
  execSync(`claude -p "Read ${path} and execute the pipeline." --max-turns 50`);
});

A production daemon adds:

  • Retry logic (fail → retry once → try alternative → block and alert human)
  • Hung agent detection (kill agents that exceed a time limit)
  • Token usage tracking (know what each pipeline run costs)
  • Structured logging (timestamps, levels, correlation IDs)
  • Crash recovery with backoff
  • Notification hooks (Telegram, Slack, email on completion or failure)

Implementation reference: The Great Minds agency daemon is a full production example — file watcher, pipeline dispatcher, health checks, dream/consolidation cycles, and SQLite token ledger. Built with the Anthropic Agent SDK. See sethshoultes/great-minds/daemon.


Going Deeper: Routines (Cloud Automation)

Daemons and headless mode run on your machine. Routines are the cloud-native version — they run on Anthropic's infrastructure, so they keep working when your laptop is closed.

A routine is a saved Claude Code configuration: a prompt, one or more repositories, and a set of connectors (Slack, Linear, GitHub, etc.), packaged once and triggered automatically.

Three trigger types:

  • Scheduled — run hourly, nightly, or weekly (like /schedule, but in the cloud)
  • API — an HTTP endpoint you can call from CI/CD pipelines, alerting tools, or deploy scripts
  • GitHub — react to pull requests, releases, or other repo events automatically

Example use cases:

  • Nightly backlog grooming — read new issues, apply labels, post a summary to Slack
  • PR code review — run your team's checklist on every pull_request.opened event
  • Deploy verification — smoke-test a new build and post go/no-go to the release channel
  • Docs drift — scan merged PRs weekly and open update PRs for stale documentation

A single routine can combine triggers. A PR review routine can run nightly and react to every new PR and be callable from a deploy script.

Routines are currently in research preview and available on Pro, Max, Team, and Enterprise plans.

Read more: https://code.claude.com/docs/en/routines


What's Next

  • /schedule — cloud tasks that run without your laptop: /schedule daily at 2am Read TODO.md and complete all unchecked tasks
  • Great Minds Plugin — full 14-persona agent team: npx plugins add sethshoultes/great-minds-plugin, then /agency-debate "your question"
  • Build your own team: "Build me a three-agent pipeline. Strategist, developer, QA. Parallel. Loop until QA passes." One sentence. Claude writes the whole thing.

Glossary

  • claude -p: Headless mode — runs a Claude prompt from your terminal without opening a chat session. One prompt, one result, exits.
  • Chokidar: A Node.js file watcher library (Hindi for "watchman"). Monitors directories for file changes and fires callbacks. Used inside webpack, Vite, and Jest.
  • Completion promise: A string you pass to Ralph (--completion-promise) that tells it when to stop. Ralph watches its output for that exact string and exits the loop when it appears.
  • CLAUDE.md: A markdown file Claude Code reads automatically when it starts in a directory. Defines the agent's identity, rules, and context — like a job description for your AI.
  • Daemon: A long-running background process. Unlike claude -p, which runs once and exits, a daemon stays alive, watches for events, dispatches agents, and recovers from failures.
  • Eventual consistency: The Ralph principle that an agent doesn't have to get everything right in one pass. Each iteration makes progress. Given enough iterations, the system converges on the correct result.
  • Git worktree: A way to check out multiple branches of the same repo simultaneously in different directories. Lets parallel agents work in isolation without overwriting each other.
  • Headless mode: Running Claude Code non-interactively from the terminal via claude -p. No chat window, no UI — just a prompt in, output out. Used in scripts, cron jobs, and daemons.
  • Persona: An identity assigned to an agent via its system prompt. Changes not just what the agent does but how it reasons — Margaret Hamilton asks "will this break at 3am?", Steve Jobs asks "is this beautiful?"
  • PHPUnit: PHP's standard testing framework. Runs test files and reports which pass or fail. Used in Exercise 5 as Ralph's feedback loop.
  • PRD: Product Requirements Document. In the Ralph pattern, a structured list of user stories with pass/fail status that Ralph works through iteratively.
  • Ralph Wiggum: A looping agent pattern where Claude runs repeatedly until a completion condition is met. Each iteration is a fresh context — the file system (TODO.md, prd.json) is the memory. Named after the Simpsons character.
  • Sub-agent: A Claude instance with a defined role, persona, and tool permissions, invoked by an orchestrating agent to handle a specific task. Defined as a markdown file with frontmatter.
  • TF-IDF: Term Frequency–Inverse Document Frequency. A text similarity algorithm used for semantic search without requiring AI embeddings — lightweight and runs locally. Used in the Great Minds memory store.
  • Worktree: See Git worktree.
