coder-company/agent-autoresearch: an auto-research plugin for coding agents. Supports Claude Code, Codex CLI, and OpenCode.

Aim. Iterate. Arrive.

Autonomous goal-driven experimentation for Claude Code, Codex, and OpenCode.

13 command protocols · 11 native hooks · background runtime · parallel verified closeout

English · 🇨🇳 中文 · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇫🇷 Français · 🇩🇪 Deutsch · 🇪🇸 Español · 🇧🇷 Português · 🇷🇺 Русский

The idea: tell your agent what you want to improve, then walk away. It modifies your code, verifies the result, keeps or discards the change, and repeats. You come back to a log of experiments and a better codebase.

Inspired by Karpathy's autoresearch, generalized beyond ML to anything you can verify mechanically: test coverage, type errors, latency, lint warnings, security findings, release readiness — if a command can tell whether it improved, the loop can iterate on it.

Quick Start

Primary path: paste this into your coding agent and let it install Autoresearch for you:

Install Autoresearch in this environment.

Use the installer from:
https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh

Pick the install flag for the current agent:
- Claude Code: --claude
- Codex: --codex
- OpenCode: --opencode
- If you cannot infer the agent, use --all.

Run the installer non-interactively with bash, verify `autoresearch --help`, then tell me the command I should use to start Autoresearch in this agent.
Start commands are `/autoresearch` for Claude Code, `$autoresearch` for Codex, and `/autoresearch` for OpenCode.
Use a global install unless I explicitly asked for a project-local install.

Manual install commands:

Claude Code

curl -fsSL https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh | bash -s -- --yes --claude

Start with `/autoresearch`. If you already have the `autoresearch` binary installed, `claude plugin add coder-company/agent-autoresearch` also works.

Codex

curl -fsSL https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh | bash -s -- --yes --codex

Start with `$autoresearch`. This installs the binary plus the Codex skill package. Use `$skill-installer install https://github.com/coder-company/agent-autoresearch` only when you want the skill without the source-built binary.

OpenCode

curl -fsSL https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh | bash -s -- --yes --opencode

Start with `/autoresearch`. Mode commands use underscores, such as `/autoresearch_debug`, `/autoresearch_fix`, and `/autoresearch_security`.

Each command downloads the current source archive, builds the Rust binary, installs it on your PATH, and installs the selected agent package. For local/manual Claude installs, copy the generated .claude/commands and .claude/skills/autoresearch package from this repo into your target project.

Open your project and go. The example below uses the Claude Code/OpenCode command; Codex users type $autoresearch instead.

You:   /autoresearch
       I want to get rid of all the `any` types in my TypeScript code

Agent: I found 47 `any` occurrences across src/**/*.ts.
       Results directory: ./autoresearch-results/
       Metric: `any` count (current: 47), direction: lower
       Verify: grep count + tsc --noEmit as guard

       Say "go" to start, or tell me what to change.

You:   Go. Run overnight.

Agent: Baseline: 47. Iterating.

Each improvement stacks. Each failure reverts. Everything is logged.

For Codex, start Codex with `codex --dangerously-bypass-approvals-and-sandbox` for the smoothest foreground and background runs.

For project-local Codex or OpenCode installs, run the raw installer from the target project with `--local`. From a clone, use `./install.sh --yes --all`, or run `./install.sh` for the guided installer. Add `--vscode` to install the editor extension from `integrations/vscode`. See Getting Started.

How It Works

You say one sentence  →  Agent scans & confirms  →  You say "go"
                                                        │
                                                        v
                                              ┌───────────────────┐
                                              │    The Loop        │
                                              │                    │
                                              │  modify one thing  │
                                              │  trial commit      │
                                              │  run verify        │
                                              │  improved? keep    │
                                              │  worse? revert     │
                                              │  log the result    │
                                              │  repeat            │
                                              └───────────────────┘

That's it. The agent keeps going until the goal is reached, the iteration cap is hit, or you interrupt.
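
The loop in the diagram can be sketched as a small simulation. This is an illustrative model, not the plugin's implementation: `run_loop`, `Row`, and the `(metric, guard_passed)` trial tuples are hypothetical names, the metric direction is assumed to be "lower", and the git commit and revert steps are represented only by the `keep` and `discard` statuses.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Row:
    iteration: int
    metric: float
    delta: float   # change relative to the last kept metric
    status: str    # "baseline", "keep", or "discard"


def run_loop(
    baseline: float,
    trials: Sequence[Tuple[float, bool]],
    target: Optional[float] = None,
) -> List[Row]:
    """Simulate the loop for a metric where lower is better.

    Each trial is (metric after the change, whether the guard passed).
    """
    best = baseline
    log = [Row(0, baseline, 0.0, "baseline")]
    for i, (metric, guard_passed) in enumerate(trials, start=1):
        improved = metric < best
        # Keep only changes that improve the metric AND pass the guard;
        # anything else stands in for reverting the trial commit.
        status = "keep" if improved and guard_passed else "discard"
        log.append(Row(i, metric, metric - best, status))
        if status == "keep":
            best = metric
        if target is not None and best <= target:
            break  # goal reached
    return log
```

Calling `run_loop(47, [(41, True), (49, True), (38, True)])` yields the statuses `baseline`, `keep`, `discard`, `keep`, which is the same shape as the sample in the Results Log section.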

What You Say vs What Happens

You say	What happens
"Improve my test coverage"	Iterates until target or interrupted
"Fix the 12 failing tests"	Repairs one by one until zero remain
"Why is the API returning 503?"	Hunts root cause with falsifiable hypotheses
"Is this code secure?"	STRIDE + OWASP audit, every finding backed by code evidence
"Ship it"	8-phase checklist: test, lint, build, version, push
"I want to optimize but don't know what"	Scans repo, suggests metrics, generates config
"What could go wrong with this feature?"	Explores edge cases across 12 dimensions
"Should we use event sourcing here?"	Adversarial debate with blind judges until convergence

Behind the scenes, the agent maps your sentence to the right mode. You never need to pick one.

What It Figures Out

You don't write config. The agent infers everything from your sentence and your repo:

What it needs	How it gets it	Example
Goal	Your sentence	"get rid of all `any` types"
Scope	Scans repo structure	`src/**/*.ts`
Metric	Proposes based on goal + tooling	`any` count (current: 47)
Direction	Infers from "improve" / "reduce" / "eliminate"	lower
Verify	Matches to repo tooling	`grep` count + `tsc --noEmit`
Guard	Suggests a baseline-passing regression check	`npm test`

Before starting, it always shows what it found and asks you to confirm. Then you say "go."
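
To make the Metric and Verify rows concrete, here is a minimal sketch of an `any` counter. It is not the command the agent generates (the README only says "grep count"). The regular expression is an assumption about what counts as an `any` occurrence (after `:`, after `as`, or inside `<...>`), and `count_any` and `count_any_in_tree` are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumption: an "any occurrence" is `any` in one of three type positions:
# an annotation (`x: any`), a cast (`y as any`), or a type argument (`<any>`).
ANY_TYPE = re.compile(r"(?::\s*|\bas\s+|<\s*)any\b")


def count_any(source: str) -> int:
    """Count `any` type usages in a single TypeScript source string."""
    return len(ANY_TYPE.findall(source))


def count_any_in_tree(root: str, pattern: str = "src/**/*.ts") -> int:
    """Sum the count over every file under `root` that matches the scope glob."""
    return sum(
        count_any(path.read_text(encoding="utf-8"))
        for path in Path(root).glob(pattern)
    )
```

A pattern like this will miss some forms (for example `any[]` inside a union) and may count matches inside comments, which is one reason a separate guard such as `tsc --noEmit` is useful alongside the count.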

When It Gets Stuck

Instead of blind retrying, the loop escalates:

Trigger	Action
3 consecutive failures	REFINE — adjust within current strategy
5 consecutive failures	PIVOT — try a fundamentally different approach
2 PIVOTs without progress	Web search — look for external solutions
3 PIVOTs without progress	Stop — report that human input is needed

One success resets all counters.
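
The escalation table can be read as a small state machine. The sketch below is one possible interpretation, not the plugin's code: the `Escalation` class and action names are hypothetical, and it assumes that a PIVOT restarts the consecutive-failure count, which the table does not specify.

```python
from dataclasses import dataclass


@dataclass
class Escalation:
    consecutive_failures: int = 0
    pivots_without_progress: int = 0

    def record(self, success: bool) -> str:
        """Record one iteration result and return the action to take next."""
        if success:
            # One success resets all counters.
            self.consecutive_failures = 0
            self.pivots_without_progress = 0
            return "CONTINUE"

        self.consecutive_failures += 1
        if self.consecutive_failures == 3:
            return "REFINE"
        if self.consecutive_failures == 5:
            self.pivots_without_progress += 1
            # Assumption: a pivot starts a fresh failure streak.
            self.consecutive_failures = 0
            if self.pivots_without_progress >= 3:
                return "STOP"
            if self.pivots_without_progress == 2:
                return "WEB_SEARCH"
            return "PIVOT"
        return "CONTINUE"
```

Under these assumptions, fifteen failures in a row produce REFINE on the 3rd, PIVOT on the 5th, WEB_SEARCH on the 10th, and STOP on the 15th.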

Commands

Command	Purpose
`/autoresearch`	The core loop — improve any metric
`/autoresearch:plan`	Don't know where to start? This figures it out
`/autoresearch:debug`	Find bugs — hypothesize, test, confirm
`/autoresearch:fix`	Kill errors one by one until zero remain
`/autoresearch:security`	Full security audit (STRIDE + OWASP)
`/autoresearch:ship`	Ship through 8 gates: test, lint, build, version, push
`/autoresearch:scenario`	"What could go wrong?" across 12 dimensions
`/autoresearch:predict`	Get 5 expert opinions before you act
`/autoresearch:learn`	Generate/update documentation automatically
`/autoresearch:reason`	Debate a subjective decision with blind judges
`/autoresearch:probe`	Interrogate requirements until nothing's ambiguous
`/autoresearch:improve`	Research ICP needs and generate product improvement PRDs
`/autoresearch:evals`	Analyze past runs: trends, plateaus, anomalies, goal achievement, CI gates, run comparison, parallel worker significance, `--recommend`, `--plateau-window`, and `--chain` guidance

Just type the command. It asks for what it needs.

Codex: Use `$autoresearch` then the mode as a keyword: `$autoresearch debug`.

OpenCode: Underscore naming: `/autoresearch_debug`, `/autoresearch_fix`, etc.
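
The three naming conventions differ only in prefix and separator. As a sketch for anyone scripting across agents, the mapping can be written as below. `command_for` is a hypothetical helper, and the agent keys simply reuse the installer flag names.

```python
def command_for(agent: str, mode: str = "") -> str:
    """Return the start command for an agent, optionally for a specific mode."""
    agent = agent.lower()
    if agent == "claude":
        # Claude Code: colon-separated mode, e.g. /autoresearch:debug
        return f"/autoresearch:{mode}" if mode else "/autoresearch"
    if agent == "codex":
        # Codex: mode as a keyword after the skill name
        return f"$autoresearch {mode}" if mode else "$autoresearch"
    if agent == "opencode":
        # OpenCode: underscore-separated mode
        return f"/autoresearch_{mode}" if mode else "/autoresearch"
    raise ValueError(f"unknown agent: {agent!r}")
```

For example, `command_for("opencode", "fix")` returns `/autoresearch_fix`.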

Results Log

Every iteration is recorded in autoresearch-results/results.tsv:

iteration  commit   metric  delta   guard  status    description
0          a1b2c3d  47      0       -      baseline  initial any count
1          b2c3d4e  41      -6      pass   keep      replace any in auth module
2          -        49      +8      -      discard   generic wrapper introduced new anys
3          d4e5f6a  38      -3      pass   keep      type-narrow API response handlers

Failed experiments are reverted in git but stay in the log. The log is the real audit trail.
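
Because the log is a plain TSV file, it is easy to post-process. The sketch below assumes the file is tab-separated with exactly the header row shown above. `summarize` and `SAMPLE` are hypothetical names, and the built-in `autoresearch evals` command is the supported way to analyze runs.

```python
import csv
import io


def summarize(tsv_text: str) -> dict:
    """Summarize a results log: baseline, current metric, keep/discard counts."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    baseline = next(float(r["metric"]) for r in rows if r["status"] == "baseline")
    kept = [float(r["metric"]) for r in rows if r["status"] == "keep"]
    discarded = sum(1 for r in rows if r["status"] == "discard")
    # The current metric is the most recent kept result, or the baseline
    # if nothing has been kept yet.
    current = kept[-1] if kept else baseline
    return {
        "baseline": baseline,
        "current": current,
        "kept": len(kept),
        "discarded": discarded,
        "net_change": current - baseline,
    }


# The sample rows from this section, joined with real tab characters.
SAMPLE = "\n".join("\t".join(cols) for cols in [
    ["iteration", "commit", "metric", "delta", "guard", "status", "description"],
    ["0", "a1b2c3d", "47", "0", "-", "baseline", "initial any count"],
    ["1", "b2c3d4e", "41", "-6", "pass", "keep", "replace any in auth module"],
    ["2", "-", "49", "+8", "-", "discard", "generic wrapper introduced new anys"],
    ["3", "d4e5f6a", "38", "-3", "pass", "keep", "type-narrow API response handlers"],
])
```

For the sample rows, `summarize(SAMPLE)` reports a baseline of 47, a current value of 38, two kept iterations, one discarded iteration, and a net change of -9.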

More Features

Covered in detail in the guide:

Cross-run learning — lessons from past runs bias future hypothesis generation
Session resume — interrupted runs pick up from the last consistent state
Background runtime control — autoresearch runtime run preflights each Codex turn, manages launch.json, runtime.json, runtime.log, and relaunches until stop or needs-human; start/status/supervise/stop remain available for manual control
Environment probe — autoresearch env --format json reports CPU, disk, container state, toolchains, recommended parallelism, and can seed init --environment-summary auto
Live results tailing — autoresearch watch --lines 20 --format jsonl follows autoresearch-results/results.tsv from the workspace root or any repo subdirectory
Progress WebSocket — autoresearch watch --websocket streams snapshot and row update payloads to real-time dashboards
Terminal dashboard — autoresearch dashboard --once renders status, metric history, escalation, and recent rows in one view
Compact run status — autoresearch status --summary prints monitor-friendly counters without full config payloads
Metric history sparkline — autoresearch progress graphs retained metric history directly in terminal output
Noise-aware verification — autoresearch verify --repeat 3 --aggregate median reruns scalar metrics and returns an aggregate with all samples
Cost estimates — autoresearch cost --per-iteration-usd 0.25 projects completed and remaining token/API spend
Eval checkpoints — autoresearch checkpoint --format json runs evals only when the active run reaches its checkpoint interval
Eval chain handoff — autoresearch evals --file autoresearch-results/results.tsv --recommend --chain ship writes analysis, go/no-go, and next-target metadata beside the TSV
Eval run comparison — autoresearch evals --file current.tsv --compare previous.tsv --format json reports improvement, efficiency, plateau deltas, and a winner
Eval target gate — autoresearch evals --file results.tsv --target 90 --recommend --format json reports goal_achieved and recommends goal_met when the threshold is crossed
Eval CI exit gate — autoresearch evals --file results.tsv --target 90 --fail-on goal-not-met --format json prints the report and exits non-zero when the gate fails
Protocol re-anchor checks — autoresearch reanchor --format json reports 10-iteration fingerprint due state and reload references for long sessions
Parallel worker execution — autoresearch parallel prepare/run/closeout/cleanup creates isolated worker worktrees, launches prompts, merges and verifies the best result, logs 5a/5b audit rows, and cleans up branches
A/B compare mode — autoresearch parallel compare --a "..." --b "..." prepares two explicit hypotheses for head-to-head metric closeout
Shell completions — autoresearch completions bash|zsh|fish|elvish|powershell prints native completion scripts for local installs
Man pages — autoresearch manpages --output-dir man/man1 writes a local autoresearch.1 page for packages and offline docs
Project defaults — .autoresearch.toml stores repeatable init settings such as goal, scope, metric, verify, guard, and iteration cap
Native planning — autoresearch plan --goal "..." --format json suggests scope, metric, direction, verify, guard, and iteration count from repo tooling
Plan chain handoff — autoresearch plan --goal "..." --debug writes the derived config into a downstream handoff
Ignored artifact defaults — native artifact generators write under autoresearch-results/<mode>/ unless you pass an explicit output path
Debug artifact generation — autoresearch debug --symptom ... --scope ... writes hypothesis, findings, eliminated, TSV, and handoff artifacts
Debug investigation controls — autoresearch debug --depth deep --iterations 12 --severity high records investigation budget and severity filter metadata
Fix artifact generation — autoresearch fix --target ... --scope ... --iterations 7 writes a one-error-at-a-time repair plan, TSV, and handoff under autoresearch-results/fix
Debug-to-fix import — autoresearch fix --from-debug imports the latest debug handoff scope and symptom into a repair plan
Fix chain controls — autoresearch fix --learn --evals records downstream handoff and checkpoint propagation metadata
Improve artifact bundle — autoresearch improve --goal ... --icp ... --depth deep --iterations 24 --evals writes research findings, ranked plan, summary, TSV, and handoff with research budget metadata
Improve research controls — autoresearch improve --seeds 5 --no-discover --learn records seed volume, discovery posture, and downstream handoff metadata
PRD artifact generation — autoresearch prd --title ... --problem ... writes improve-mode PRDs with decision markers and ready-to-run config blocks
Security artifact generation — autoresearch security --scope ... --focus ... writes STRIDE, OWASP, findings, recommendations, TSV, and handoff artifacts
Security gating — autoresearch security --fail-on high --fix records CI threshold and downstream repair handoff metadata
Security audit controls — autoresearch security --depth deep --iterations 18 --diff --fix --evals records audit budget, delta mode, repair handoff, and checkpoint metadata
Ship artifact generation — autoresearch ship --target ... --type ... --dry-run writes an 8-phase checklist, summary, ship log, and handoff
Ship workflow controls — autoresearch ship --auto --force --rollback --monitor 15 --learn records approval, rollback, monitoring, and downstream metadata without external side effects
Scenario artifact generation — autoresearch scenario --target ... --domain api --format test-scenarios writes a 12-dimension edge-case matrix grounded in scope
Scenario exploration controls — autoresearch scenario --domain web --depth deep --iterations 16 --evals --debug records domain, exploration budget, checkpoint metadata, and downstream handoff
Predict artifact generation — autoresearch predict --proposal ... writes a five-persona pre-implementation review
Predict review controls — autoresearch predict --depth deep --adversarial --fail-on high records review profile and CI gate metadata
Predict-to-improve handoff — autoresearch predict --proposal ... --improve passes expert findings into product improvement research
Reason artifact generation — autoresearch reason --question ... writes an adversarial candidate debate with a blind-judge rubric
Reason judge controls — autoresearch reason --iterations 11 --judges 7 --convergence 4 --temperature 0.2 records budget, panel, convergence, synthesis, and generation hints
Probe artifact generation — autoresearch probe --subject ... writes eight-persona requirement questions and constraint slots
Probe interrogation controls — autoresearch probe --mode autonomous --depth deep --iterations 9 --adversarial records depth, round budget, persona count, and saturation settings
Probe-to-improve handoff — autoresearch probe --subject ... --improve passes discovered constraints into product improvement research
Learn artifact generation — autoresearch learn --mode summarize --scope ... --depth comprehensive --iterations 14 --evals writes summary, validation, TSV, and handoff documentation artifacts with scan and chain metadata
Config template — autoresearch config template --output .autoresearch.toml writes a starter project defaults file
Config validation — autoresearch config validate checks project defaults without running verify or guard
Stable CLI API manifest — autoresearch api --format json emits command, flag, and semver policy metadata for wrappers and agents
MCP integration — autoresearch mcp serve exposes read-only tools, and autoresearch mcp call invokes external stdio MCP tools from scripts
Pre-built release binaries — tag builds publish checksummed Linux, macOS, and Windows archives through GitHub Releases
Package-manager metadata — cargo-binstall archive metadata and a Homebrew formula template track the release assets
GitHub Action runner — .github/actions/autoresearch wraps exec mode for checked-in CI optimization loops
VS Code extension package — ./install.sh --yes --vscode installs integrations/vscode, which exposes status, dashboard, and watch commands by delegating to the binary
Thin Codex routing skill — .agents/skills/autoresearch/SKILL.md stays under 90 lines and defers detailed operations to references
Documentation site — book.toml and docs/SUMMARY.md build the full docs set as an mdBook site for GitHub Pages
Workspace scope expansion — autoresearch scope expand --format json resolves primary and companion repo globs and annotates package roots for monorepos
Cross-repo execution — autoresearch workspace exec --rollback-on-failure runs one screened command across all repo targets and restores attempted repos on failure
Cross-repo guard presets — autoresearch guard-presets --format json suggests per-repo guard commands for primary and companion repos
Shared workspace lessons — autoresearch lessons --workspace-context --last 5 proves companion repos read the shared workspace lessons log
Mode plugin manifests — autoresearch plugin list and autoresearch plugin validate load TOML mode definitions with safety screening
Plugin marketplace index — autoresearch plugin marketplace validates a local community plugin catalog and the manifests it references
Manual lessons — autoresearch lessons --add "strategy" --context "why it matters" appends reusable run knowledge
Search helper and escalation — autoresearch search --from-state --log builds a run-aware query, calls a configured provider, caches results, and records a search meta-iteration; decide automatically runs the same helper when escalation reaches Web Search and AUTORESEARCH_SEARCH_CMD is configured
Chaining — debug --fix, probe --plan, probe --improve, predict --debug, predict --improve
CI/CD mode (exec) — non-interactive, JSON output, for automation pipelines
Dual-gate verification — separate verify (did it improve?) and guard (did anything break?)
Safety hooks — blocks dangerous commands, secrets exposure, and scope violations automatically
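
As an illustration of the noise-aware verification item in the list above, the idea behind `--repeat 3 --aggregate median` can be sketched in a few lines. This is a conceptual model only: `median_verify` and the `measure` callable are hypothetical, and the result keys are not the binary's actual output format.

```python
import statistics
from typing import Callable, Dict, List, Union


def median_verify(
    measure: Callable[[], float], repeat: int = 3
) -> Dict[str, Union[float, List[float]]]:
    """Run a noisy measurement several times and report the median with all samples."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    samples = [measure() for _ in range(repeat)]
    return {"value": statistics.median(samples), "samples": samples}
```

Taking the median means a single outlier run, such as a latency spike caused by a cold cache, does not decide on its own whether an experiment is kept or discarded.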

FAQ

It only makes small incremental changes. Can it try bigger ideas? By default the loop favors small, verifiable steps — that's by design. But it can go bigger: describe a larger hypothesis in your prompt (e.g., "try replacing the ORM with raw SQL queries and run the full benchmark"), and it will treat that as a single experiment to verify.

Is this more for optimization than research? It's strongest when the goal and metric are clear — push coverage up, push errors down, push latency lower. For open-ended exploration where the direction itself is uncertain, use /autoresearch:plan first, then switch to the loop once you know what to measure.

How do I stop it? In the foreground, press Ctrl+C. In the background, run `autoresearch runtime stop`. You can also cap the run up front by setting Iterations: N. The agent commits before verifying, so your last successful state is always in git.

Can it resume after interruption? Yes. It resumes from autoresearch-results/state.json automatically.

Does it work with any language? Any language, any framework. If you can express success as a number and write a shell command that outputs it, autoresearch can optimize toward it.
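
"A shell command that outputs a number" can be as simple as the sketch below, which runs a command and reads a metric from its output. It rests on one loud assumption: the last number printed on stdout is treated as the metric, which is not necessarily how autoresearch parses verify output. `read_metric` is a hypothetical helper.

```python
import re
import subprocess
import sys
from typing import List

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def read_metric(command: List[str], timeout: float = 60.0) -> float:
    """Run a verify command and return the last number it prints on stdout."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout, check=True
    )
    matches = NUMBER.findall(result.stdout)
    if not matches:
        raise ValueError("verify command printed no number")
    # Assumption: the final number on stdout is the metric.
    return float(matches[-1])


if __name__ == "__main__":
    # A stand-in verify command that prints a coverage figure.
    print(read_metric([sys.executable, "-c", "print('coverage: 87.5%')"]))  # 87.5
```

Any tool that can print a count, a percentage, or a timing fits this shape, regardless of the language of the project under test.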

What if I don't know what to measure? /autoresearch:plan scans your repo, looks at your tooling, and suggests metrics with ready-to-run verify commands.

Will it break my code? No. Every change is committed before verification. If it makes things worse, it reverts. If you set a Guard (e.g., npm test), no change persists unless all tests still pass.

Documentation

Doc	What it covers
Docs Index	Repository documentation map
Installation	Claude Code, Codex, OpenCode, source install
Guide	Command map, binary operations, artifact contract
Examples	Copy-paste configs for common goals and parallel closeout
System Architecture	Binary, skill packages, artifacts, runtime flow
Project Changelog	Release history entrypoint and current development track
Getting Started	Install, first run, what to expect
Examples by Domain	Ready configs: coverage, types, bundle, latency, security
Chains & Combinations	Piping commands together
Hooks	Safety system reference
Codex	Skill install, local plugin package, foreground/background runtime
OpenCode	Slash commands, underscore modes, global and project-local installs
Full Guide Index	Per-command deep dives
CONTRIBUTING.md	How to contribute

Acknowledgments

Built on ideas from Karpathy's autoresearch. Command surface inspired by uditgoenka/autoresearch. Background patterns from codex-autoresearch.

License

MIT — see LICENSE.
