Autonomous goal-driven experimentation for Claude Code, Codex, and OpenCode.
13 command protocols · 11 native hooks · background runtime · parallel verified closeout
English · 🇨🇳 中文 · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇫🇷 Français · 🇩🇪 Deutsch · 🇪🇸 Español · 🇧🇷 Português · 🇷🇺 Русский
The idea: tell your agent what you want to improve, then walk away. It modifies your code, verifies the result, keeps or discards, and repeats. You come back to a log of experiments and a better codebase.
Inspired by Karpathy's autoresearch, generalized beyond ML to anything you can verify mechanically: test coverage, type errors, latency, lint warnings, security findings, release readiness — if a command can tell whether it improved, the loop can iterate on it.
Primary path: paste this into your coding agent and let it install Autoresearch for you:
Install Autoresearch in this environment.
Use the installer from:
https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh
Pick the install flag for the current agent:
- Claude Code: --claude
- Codex: --codex
- OpenCode: --opencode
- If you cannot infer the agent, use --all.
Run the installer non-interactively with bash, verify `autoresearch --help`, then tell me the command I should use to start Autoresearch in this agent.
Start commands are `/autoresearch` for Claude Code, `$autoresearch` for Codex, and `/autoresearch` for OpenCode.
Use a global install unless I explicitly asked for a project-local install.
Manual install commands:
Claude Code
curl -fsSL https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh | bash -s -- --yes --claude

Start with `/autoresearch`. If you already have the autoresearch binary installed, `claude plugin add coder-company/agent-autoresearch` also works.
Codex
curl -fsSL https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh | bash -s -- --yes --codex

Start with `$autoresearch`. This installs the binary plus the Codex skill package. Use `$skill-installer install https://github.com/coder-company/agent-autoresearch` only when you want the skill without the source-built binary.
OpenCode
curl -fsSL https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh | bash -s -- --yes --opencode

Start with `/autoresearch`. Mode commands use underscores, such as `/autoresearch_debug`, `/autoresearch_fix`, and `/autoresearch_security`.
Each command downloads the current source archive, builds the Rust binary, installs it on your PATH, and installs the selected agent package. For local/manual Claude installs, copy the generated .claude/commands and .claude/skills/autoresearch package from this repo into your target project.
Open your project and go. The example below uses the Claude Code/OpenCode command; Codex users type $autoresearch instead.
You: /autoresearch
I want to get rid of all the `any` types in my TypeScript code
Agent: I found 47 `any` occurrences across src/**/*.ts.
Results directory: ./autoresearch-results/
Metric: `any` count (current: 47), direction: lower
Verify: grep count + tsc --noEmit as guard
Say "go" to start, or tell me what to change.
You: Go. Run overnight.
Agent: Baseline: 47. Iterating.
Each improvement stacks. Each failure reverts. Everything is logged.
For Codex, start Codex with `codex --dangerously-bypass-approvals-and-sandbox` for the smoothest foreground and background runs. For project-local Codex or OpenCode installs, run the raw installer from the target project with `--local`. From a clone, use `./install.sh --yes --all`, or run `./install.sh` for the guided installer. Add `--vscode` to install the editor extension from `integrations/vscode`. See Getting Started.
You say one sentence → Agent scans & confirms → You say "go"
│
v
┌───────────────────┐
│ The Loop │
│ │
│ modify one thing │
│ trial commit │
│ run verify │
│ improved? keep │
│ worse? revert │
│ log the result │
│ repeat │
└───────────────────┘
That's it. The agent keeps going until the goal is reached, the iteration cap is hit, or you interrupt.
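The loop in the diagram can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation: `run_loop`, `propose`, `measure`, and `revert` are hypothetical names, and the git trial commit and revert are abstracted behind plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class LoopResult:
    best: float
    log: List[dict] = field(default_factory=list)


def run_loop(
    baseline: float,
    propose: Callable[[int], str],   # hypothetical: applies one change, returns a description
    measure: Callable[[], float],    # hypothetical: runs the verify command, returns the metric
    revert: Callable[[], None],      # hypothetical: undoes the trial change
    lower_is_better: bool = True,
    max_iterations: int = 10,
    target: Optional[float] = None,
) -> LoopResult:
    result = LoopResult(best=baseline)
    result.log.append({"iteration": 0, "metric": baseline, "status": "baseline"})
    for i in range(1, max_iterations + 1):
        description = propose(i)     # modify one thing (the trial commit would happen here)
        metric = measure()           # run verify
        improved = metric < result.best if lower_is_better else metric > result.best
        if improved:
            result.best = metric     # improved: keep
            status = "keep"
        else:
            revert()                 # not improved: revert
            status = "discard"
        # Every attempt is logged, including the discarded ones.
        result.log.append(
            {"iteration": i, "metric": metric, "status": status, "description": description}
        )
        if target is not None:
            reached = result.best <= target if lower_is_better else result.best >= target
            if reached:
                break                # goal reached before the iteration cap
    return result


# Toy run with made-up changes of -6, +8, and -3 starting from 47.
state = {"value": 47, "previous": 47}
deltas = [-6, +8, -3]


def propose(i: int) -> str:
    state["previous"] = state["value"]
    state["value"] += deltas[i - 1]
    return f"change {i}"


def revert() -> None:
    state["value"] = state["previous"]


result = run_loop(47, propose, lambda: state["value"], revert, max_iterations=3)
print(result.best)  # 38
```

The second change makes the metric worse, so it is reverted and the third change builds on 41 rather than 49.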
| You say | What happens |
|---|---|
| "Improve my test coverage" | Iterates until the target is reached or you interrupt |
| "Fix the 12 failing tests" | Repairs one by one until zero remain |
| "Why is the API returning 503?" | Hunts root cause with falsifiable hypotheses |
| "Is this code secure?" | STRIDE + OWASP audit, every finding backed by code evidence |
| "Ship it" | 8-phase checklist: test, lint, build, version, push |
| "I want to optimize but don't know what" | Scans repo, suggests metrics, generates config |
| "What could go wrong with this feature?" | Explores edge cases across 12 dimensions |
| "Should we use event sourcing here?" | Adversarial debate with blind judges until convergence |
Behind the scenes, the agent maps your sentence to the right mode. You never need to pick one.
You don't write config. The agent infers everything from your sentence and your repo:
| What it needs | How it gets it | Example |
|---|---|---|
| Goal | Your sentence | "get rid of all any types" |
| Scope | Scans repo structure | src/**/*.ts |
| Metric | Proposes based on goal + tooling | any count (current: 47) |
| Direction | Infers from "improve" / "reduce" / "eliminate" | lower |
| Verify | Matches to repo tooling | grep count + tsc --noEmit |
| Guard | Suggests a baseline-passing regression check | npm test |
Before starting, it always shows what it found and asks you to confirm. Then you say "go."
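As a rough picture of what gets confirmed, the six inferred fields fit in a small record. The sketch below is illustrative only: `RunConfig`, `infer_direction`, and both keyword lists are assumptions made for this example, and the real agent's inference is not documented here.

```python
from dataclasses import dataclass

# Assumed keyword hints; the actual inference rules are not specified in this README.
LOWER_HINTS = ("reduce", "eliminate", "get rid of", "remove")
HIGHER_HINTS = ("improve", "increase", "raise")


def infer_direction(goal: str) -> str:
    """Guess whether the metric should go down or up from the goal sentence."""
    text = goal.lower()
    if any(hint in text for hint in LOWER_HINTS):
        return "lower"
    if any(hint in text for hint in HIGHER_HINTS):
        return "higher"
    return "unknown"  # ambiguous goals would need a confirmation from the user


@dataclass
class RunConfig:
    goal: str
    scope: str
    metric: str
    direction: str
    verify: str
    guard: str


goal = "get rid of all any types"
config = RunConfig(
    goal=goal,
    scope="src/**/*.ts",
    metric="any count",
    direction=infer_direction(goal),
    verify="grep count + tsc --noEmit",
    guard="npm test",
)
print(config.direction)  # lower
```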
Instead of blind retrying, the loop escalates:
| Trigger | Action |
|---|---|
| 3 consecutive failures | REFINE — adjust within current strategy |
| 5 consecutive failures | PIVOT — try a fundamentally different approach |
| 2 PIVOTs without progress | Web search — look for external solutions |
| 3 PIVOTs without progress | Stop — report that human input is needed |
One success resets all counters.
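The escalation table can be read as a small state machine. The following Python sketch is one possible interpretation, with hypothetical names (`Escalation`, `record`); in particular, it assumes the failure streak restarts after each PIVOT, which the table does not state.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    consecutive_failures: int = 0
    pivots_without_progress: int = 0

    def record(self, success: bool) -> str:
        """Update the counters for one iteration and return the next action."""
        if success:
            # One success resets all counters.
            self.consecutive_failures = 0
            self.pivots_without_progress = 0
            return "CONTINUE"
        self.consecutive_failures += 1
        if self.consecutive_failures == 3:
            return "REFINE"
        if self.consecutive_failures >= 5:
            # Assumption: the failure streak starts over after a pivot.
            self.consecutive_failures = 0
            self.pivots_without_progress += 1
            if self.pivots_without_progress >= 3:
                return "STOP"
            if self.pivots_without_progress == 2:
                return "WEB_SEARCH"
            return "PIVOT"
        return "CONTINUE"


tracker = Escalation()
actions = [tracker.record(success=False) for _ in range(15)]
print(actions[2], actions[4], actions[9], actions[14])  # REFINE PIVOT WEB_SEARCH STOP
```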
| Command | Purpose |
|---|---|
| `/autoresearch` | The core loop: improve any metric |
| `/autoresearch:plan` | Don't know where to start? This figures it out |
| `/autoresearch:debug` | Find bugs: hypothesize, test, confirm |
| `/autoresearch:fix` | Kill errors one by one until zero remain |
| `/autoresearch:security` | Full security audit (STRIDE + OWASP) |
| `/autoresearch:ship` | Ship through 8 gates: test, lint, build, version, push |
| `/autoresearch:scenario` | "What could go wrong?" across 12 dimensions |
| `/autoresearch:predict` | Get 5 expert opinions before you act |
| `/autoresearch:learn` | Generate/update documentation automatically |
| `/autoresearch:reason` | Debate a subjective decision with blind judges |
| `/autoresearch:probe` | Interrogate requirements until nothing's ambiguous |
| `/autoresearch:improve` | Research ICP needs and generate product improvement PRDs |
| `/autoresearch:evals` | Analyze past runs: trends, plateaus, anomalies, goal achievement, CI gates, run comparison, parallel worker significance, `--recommend`, `--plateau-window`, and `--chain` guidance |
Just type the command. It asks for what it needs.
Codex: Use `$autoresearch` then the mode as a keyword: `$autoresearch debug`.

OpenCode: Underscore naming: `/autoresearch_debug`, `/autoresearch_fix`, etc.
Every iteration is recorded in autoresearch-results/results.tsv:
iteration commit metric delta guard status description
0 a1b2c3d 47 0 - baseline initial any count
1 b2c3d4e 41 -6 pass keep replace any in auth module
2 - 49 +8 - discard generic wrapper introduced new anys
3 d4e5f6g 38 -3 pass keep type-narrow API response handlers
Failed experiments revert from git but stay in the log. The log is the real audit trail.
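Because the log is a plain TSV, it is easy to post-process. The sketch below assumes the file is tab-separated with exactly the header shown above; `summarize` is a hypothetical helper, not part of the autoresearch CLI.

```python
import csv
import io

# The example rows from above, rebuilt with explicit tab separators.
SAMPLE = "\n".join("\t".join(row) for row in [
    ["iteration", "commit", "metric", "delta", "guard", "status", "description"],
    ["0", "a1b2c3d", "47", "0", "-", "baseline", "initial any count"],
    ["1", "b2c3d4e", "41", "-6", "pass", "keep", "replace any in auth module"],
    ["2", "-", "49", "+8", "-", "discard", "generic wrapper introduced new anys"],
    ["3", "d4e5f6g", "38", "-3", "pass", "keep", "type-narrow API response handlers"],
])


def summarize(tsv_text: str) -> dict:
    """Count kept and discarded iterations and report the latest retained metric."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    retained = [r for r in rows if r["status"] in ("baseline", "keep")]
    return {
        "iterations": sum(r["status"] != "baseline" for r in rows),
        "kept": sum(r["status"] == "keep" for r in rows),
        "discarded": sum(r["status"] == "discard" for r in rows),
        "current_metric": float(retained[-1]["metric"]),
    }


print(summarize(SAMPLE))
# {'iterations': 3, 'kept': 2, 'discarded': 1, 'current_metric': 38.0}
```

For a real run, the same function could be fed the contents of `autoresearch-results/results.tsv`.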
Covered in detail in the guide:
- Cross-run learning: lessons from past runs bias future hypothesis generation
- Session resume: interrupted runs pick up from the last consistent state
- Background runtime control: `autoresearch runtime run` preflights each Codex turn, manages `launch.json`, `runtime.json`, and `runtime.log`, and relaunches until stop or needs-human; `start`/`status`/`supervise`/`stop` remain available for manual control
- Environment probe: `autoresearch env --format json` reports CPU, disk, container state, toolchains, recommended parallelism, and can seed `init --environment-summary auto`
- Live results tailing: `autoresearch watch --lines 20 --format jsonl` follows `autoresearch-results/results.tsv` from the workspace root or any repo subdirectory
- Progress WebSocket: `autoresearch watch --websocket` streams snapshot and row update payloads to real-time dashboards
- Terminal dashboard: `autoresearch dashboard --once` renders status, metric history, escalation, and recent rows in one view
- Compact run status: `autoresearch status --summary` prints monitor-friendly counters without full config payloads
- Metric history sparkline: `autoresearch progress` graphs retained metric history directly in terminal output
- Noise-aware verification: `autoresearch verify --repeat 3 --aggregate median` reruns scalar metrics and returns an aggregate with all samples
- Cost estimates: `autoresearch cost --per-iteration-usd 0.25` projects completed and remaining token/API spend
- Eval checkpoints: `autoresearch checkpoint --format json` runs evals only when the active run reaches its checkpoint interval
- Eval chain handoff: `autoresearch evals --file autoresearch-results/results.tsv --recommend --chain ship` writes analysis, go/no-go, and next-target metadata beside the TSV
- Eval run comparison: `autoresearch evals --file current.tsv --compare previous.tsv --format json` reports improvement, efficiency, plateau deltas, and a winner
- Eval target gate: `autoresearch evals --file results.tsv --target 90 --recommend --format json` reports `goal_achieved` and recommends `goal_met` when the threshold is crossed
- Eval CI exit gate: `autoresearch evals --file results.tsv --target 90 --fail-on goal-not-met --format json` prints the report and exits non-zero when the gate fails
- Protocol re-anchor checks: `autoresearch reanchor --format json` reports 10-iteration fingerprint due state and reload references for long sessions
- Parallel worker execution: `autoresearch parallel prepare/run/closeout/cleanup` creates isolated worker worktrees, launches prompts, merges and verifies the best result, logs `5a`/`5b` audit rows, and cleans up branches
- A/B compare mode: `autoresearch parallel compare --a "..." --b "..."` prepares two explicit hypotheses for head-to-head metric closeout
- Shell completions: `autoresearch completions bash|zsh|fish|elvish|powershell` prints native completion scripts for local installs
- Man pages: `autoresearch manpages --output-dir man/man1` writes a local `autoresearch.1` page for packages and offline docs
- Project defaults: `.autoresearch.toml` stores repeatable init settings such as goal, scope, metric, verify, guard, and iteration cap
- Native planning: `autoresearch plan --goal "..." --format json` suggests scope, metric, direction, verify, guard, and iteration count from repo tooling
- Plan chain handoff: `autoresearch plan --goal "..." --debug` writes the derived config into a downstream handoff
- Ignored artifact defaults: native artifact generators write under `autoresearch-results/<mode>/` unless you pass an explicit output path
- Debug artifact generation: `autoresearch debug --symptom ... --scope ...` writes hypothesis, findings, eliminated, TSV, and handoff artifacts
- Debug investigation controls: `autoresearch debug --depth deep --iterations 12 --severity high` records investigation budget and severity filter metadata
- Fix artifact generation: `autoresearch fix --target ... --scope ... --iterations 7` writes a one-error-at-a-time repair plan, TSV, and handoff under `autoresearch-results/fix`
- Debug-to-fix import: `autoresearch fix --from-debug` imports the latest debug handoff scope and symptom into a repair plan
- Fix chain controls: `autoresearch fix --learn --evals` records downstream handoff and checkpoint propagation metadata
- Improve artifact bundle: `autoresearch improve --goal ... --icp ... --depth deep --iterations 24 --evals` writes research findings, ranked plan, summary, TSV, and handoff with research budget metadata
- Improve research controls: `autoresearch improve --seeds 5 --no-discover --learn` records seed volume, discovery posture, and downstream handoff metadata
- PRD artifact generation: `autoresearch prd --title ... --problem ...` writes improve-mode PRDs with decision markers and ready-to-run config blocks
- Security artifact generation: `autoresearch security --scope ... --focus ...` writes STRIDE, OWASP, findings, recommendations, TSV, and handoff artifacts
- Security gating: `autoresearch security --fail-on high --fix` records CI threshold and downstream repair handoff metadata
- Security audit controls: `autoresearch security --depth deep --iterations 18 --diff --fix --evals` records audit budget, delta mode, repair handoff, and checkpoint metadata
- Ship artifact generation: `autoresearch ship --target ... --type ... --dry-run` writes an 8-phase checklist, summary, ship log, and handoff
- Ship workflow controls: `autoresearch ship --auto --force --rollback --monitor 15 --learn` records approval, rollback, monitoring, and downstream metadata without external side effects
- Scenario artifact generation: `autoresearch scenario --target ... --domain api --format test-scenarios` writes a 12-dimension edge-case matrix grounded in scope
- Scenario exploration controls: `autoresearch scenario --domain web --depth deep --iterations 16 --evals --debug` records domain, exploration budget, checkpoint metadata, and downstream handoff
- Predict artifact generation: `autoresearch predict --proposal ...` writes a five-persona pre-implementation review
- Predict review controls: `autoresearch predict --depth deep --adversarial --fail-on high` records review profile and CI gate metadata
- Predict-to-improve handoff: `autoresearch predict --proposal ... --improve` passes expert findings into product improvement research
- Reason artifact generation: `autoresearch reason --question ...` writes an adversarial candidate debate with a blind-judge rubric
- Reason judge controls: `autoresearch reason --iterations 11 --judges 7 --convergence 4 --temperature 0.2` records budget, panel, convergence, synthesis, and generation hints
- Probe artifact generation: `autoresearch probe --subject ...` writes eight-persona requirement questions and constraint slots
- Probe interrogation controls: `autoresearch probe --mode autonomous --depth deep --iterations 9 --adversarial` records depth, round budget, persona count, and saturation settings
- Probe-to-improve handoff: `autoresearch probe --subject ... --improve` passes discovered constraints into product improvement research
- Learn artifact generation: `autoresearch learn --mode summarize --scope ... --depth comprehensive --iterations 14 --evals` writes summary, validation, TSV, and handoff documentation artifacts with scan and chain metadata
- Config template: `autoresearch config template --output .autoresearch.toml` writes a starter project defaults file
- Config validation: `autoresearch config validate` checks project defaults without running verify or guard
- Stable CLI API manifest: `autoresearch api --format json` emits command, flag, and semver policy metadata for wrappers and agents
- MCP integration: `autoresearch mcp serve` exposes read-only tools, and `autoresearch mcp call` invokes external stdio MCP tools from scripts
- Pre-built release binaries: tag builds publish checksummed Linux, macOS, and Windows archives through GitHub Releases
- Package-manager metadata: `cargo-binstall` archive metadata and a Homebrew formula template track the release assets
- GitHub Action runner: `.github/actions/autoresearch` wraps `exec` mode for checked-in CI optimization loops
- VS Code extension package: `./install.sh --yes --vscode` installs `integrations/vscode`, which exposes status, dashboard, and watch commands by delegating to the binary
- Thin Codex routing skill: `.agents/skills/autoresearch/SKILL.md` stays under 90 lines and defers detailed operations to references
- Documentation site: `book.toml` and `docs/SUMMARY.md` build the full docs set as an mdBook site for GitHub Pages
- Workspace scope expansion: `autoresearch scope expand --format json` resolves primary and companion repo globs and annotates package roots for monorepos
- Cross-repo execution: `autoresearch workspace exec --rollback-on-failure` runs one screened command across all repo targets and restores attempted repos on failure
- Cross-repo guard presets: `autoresearch guard-presets --format json` suggests per-repo guard commands for primary and companion repos
- Shared workspace lessons: `autoresearch lessons --workspace-context --last 5` proves companion repos read the shared workspace lessons log
- Mode plugin manifests: `autoresearch plugin list` and `autoresearch plugin validate` load TOML mode definitions with safety screening
- Plugin marketplace index: `autoresearch plugin marketplace` validates a local community plugin catalog and the manifests it references
- Manual lessons: `autoresearch lessons --add "strategy" --context "why it matters"` appends reusable run knowledge
- Search helper and escalation: `autoresearch search --from-state --log` builds a run-aware query, calls a configured provider, caches results, and records a search meta-iteration; `decide` automatically runs the same helper when escalation reaches Web Search and `AUTORESEARCH_SEARCH_CMD` is configured
- Chaining: `debug --fix`, `probe --plan`, `probe --improve`, `predict --debug`, `predict --improve`
- CI/CD mode (`exec`): non-interactive, JSON output, for automation pipelines
- Dual-gate verification: separate verify (did it improve?) and guard (did anything break?)
- Safety hooks: blocks dangerous commands, secrets exposure, and scope violations automatically
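Two of the items above, noise-aware verification and dual-gate verification, combine naturally. The Python sketch below shows one way the decision could work, assuming the verify command prints the metric on its last output line and the guard command signals success with exit code 0. `run_metric`, `guard_passes`, and `gate` are hypothetical helpers, not the binary's internals.

```python
import statistics
import subprocess
from typing import List


def run_metric(command: str) -> float:
    """Run a verify command and parse its last output line as the metric."""
    done = subprocess.run(command, shell=True, capture_output=True, text=True, check=True)
    return float(done.stdout.strip().splitlines()[-1])


def guard_passes(command: str) -> bool:
    """Run a guard command; exit code 0 is taken to mean nothing broke."""
    return subprocess.run(command, shell=True, capture_output=True).returncode == 0


def gate(best: float, samples: List[float], guard_ok: bool, lower_is_better: bool = True) -> str:
    """Keep a change only if the median sample improved AND the guard passed."""
    metric = statistics.median(samples)  # the median damps one-off noisy runs
    improved = metric < best if lower_is_better else metric > best
    return "keep" if improved and guard_ok else "discard"


print(gate(best=41, samples=[39, 38, 45], guard_ok=True))   # keep
print(gate(best=41, samples=[39, 38, 45], guard_ok=False))  # discard
```

In a real run, `samples` might come from three calls to `run_metric` and `guard_ok` from `guard_passes("npm test")`.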
It only makes small incremental changes. Can it try bigger ideas?
By default the loop favors small, verifiable steps; that is by design. But it can go bigger: describe a larger hypothesis in your prompt (e.g., "try replacing the ORM with raw SQL queries and run the full benchmark"), and it will treat that as a single experiment to verify.
Is this more for optimization than research?
It's strongest when the goal and metric are clear — push coverage up, push errors down, push latency lower. For open-ended exploration where the direction itself is uncertain, use /autoresearch:plan first, then switch to the loop once you know what to measure.
How do I stop it?
Foreground: Ctrl+C. Background: `autoresearch runtime stop`. To cap a run up front, set Iterations: N. The agent commits before verifying, so your last successful state is always in git.
Can it resume after interruption?
Yes. It resumes from autoresearch-results/state.json automatically.
Does it work with any language?
Any language, any framework. If you can express success as a number and write a shell command that outputs it, autoresearch can optimize toward it.
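For instance, a verify command can be any small script that prints one number. The Python sketch below counts likely `any` type usages in TypeScript files; the regex is a deliberately crude assumption (it will also match occurrences inside comments and strings), and `count_any` and `count_any_in_tree` are hypothetical names.

```python
import re
from pathlib import Path

# Crude heuristic: matches `: any`, `as any`, and `<any`.
ANY_PATTERN = re.compile(r"(?::\s*|\bas\s+|<)any\b")


def count_any(source: str) -> int:
    """Count likely `any` type usages in one TypeScript source string."""
    return len(ANY_PATTERN.findall(source))


def count_any_in_tree(root: str) -> int:
    """Sum the count over every .ts file under root (0 if root is missing)."""
    base = Path(root)
    if not base.is_dir():
        return 0
    return sum(
        count_any(path.read_text(encoding="utf-8", errors="ignore"))
        for path in base.rglob("*.ts")
    )


if __name__ == "__main__":
    # The loop only needs this single number on stdout.
    print(count_any_in_tree("src"))
```

Saved as a script, this would be invoked as the verify command, with a type check such as `tsc --noEmit` acting as the guard.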
What if I don't know what to measure?
/autoresearch:plan scans your repo, looks at your tooling, and suggests metrics with ready-to-run verify commands.
Will it break my code?
No. Every change is committed before verification. If it makes things worse, it reverts. If you set a Guard (e.g., npm test), no change persists unless all tests still pass.
| Doc | What it covers |
|---|---|
| Docs Index | Repository documentation map |
| Installation | Claude Code, Codex, OpenCode, source install |
| Guide | Command map, binary operations, artifact contract |
| Examples | Copy-paste configs for common goals and parallel closeout |
| System Architecture | Binary, skill packages, artifacts, runtime flow |
| Project Changelog | Release history entrypoint and current development track |
| Getting Started | Install, first run, what to expect |
| Examples by Domain | Ready configs: coverage, types, bundle, latency, security |
| Chains & Combinations | Piping commands together |
| Hooks | Safety system reference |
| Codex | Skill install, local plugin package, foreground/background runtime |
| OpenCode | Slash commands, underscore modes, global and project-local installs |
| Full Guide Index | Per-command deep dives |
| CONTRIBUTING.md | How to contribute |
Built on ideas from Karpathy's autoresearch. Command surface inspired by uditgoenka/autoresearch. Background patterns from codex-autoresearch.
MIT — see LICENSE.