coder-company/agent-autoresearch


Aim. Iterate. Arrive.

Autonomous goal-driven experimentation for Claude Code, Codex, and OpenCode.

13 command protocols · 11 native hooks · background runtime · parallel verified closeout


English · 🇨🇳 中文 · 🇯🇵 日本語 · 🇰🇷 한국어 · 🇫🇷 Français · 🇩🇪 Deutsch · 🇪🇸 Español · 🇧🇷 Português · 🇷🇺 Русский


The idea: tell your agent what you want to improve, then walk away. It modifies your code, verifies the result, keeps or discards, and repeats. You come back to a log of experiments and a better codebase.

Inspired by Karpathy's autoresearch, generalized beyond ML to anything you can verify mechanically: test coverage, type errors, latency, lint warnings, security findings, release readiness — if a command can tell whether it improved, the loop can iterate on it.

Quick Start

Primary path: paste this into your coding agent and let it install Autoresearch for you:

Install Autoresearch in this environment.

Use the installer from:
https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh

Pick the install flag for the current agent:
- Claude Code: --claude
- Codex: --codex
- OpenCode: --opencode
- If you cannot infer the agent, use --all.

Run the installer non-interactively with bash, verify `autoresearch --help`, then tell me the command I should use to start Autoresearch in this agent.
Start commands are `/autoresearch` for Claude Code, `$autoresearch` for Codex, and `/autoresearch` for OpenCode.
Use a global install unless I explicitly asked for a project-local install.

Manual install commands:

Claude Code

curl -fsSL https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh | bash -s -- --yes --claude

Start with /autoresearch. If you already have the autoresearch binary installed, claude plugin add coder-company/agent-autoresearch also works.

Codex

curl -fsSL https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh | bash -s -- --yes --codex

Start with $autoresearch. This installs the binary plus the Codex skill package. Use $skill-installer install https://github.com/coder-company/agent-autoresearch only when you want the skill without the source-built binary.

OpenCode

curl -fsSL https://raw.githubusercontent.com/coder-company/agent-autoresearch/main/install.sh | bash -s -- --yes --opencode

Start with /autoresearch. Mode commands use underscores, such as /autoresearch_debug, /autoresearch_fix, and /autoresearch_security.

Each command downloads the current source archive, builds the Rust binary, installs it on your PATH, and installs the selected agent package. For local/manual Claude installs, copy the generated .claude/commands and .claude/skills/autoresearch package from this repo into your target project.

Open your project and go. The example below uses the Claude Code/OpenCode command; Codex users type $autoresearch instead.

You:   /autoresearch
       I want to get rid of all the `any` types in my TypeScript code

Agent: I found 47 `any` occurrences across src/**/*.ts.
       Results directory: ./autoresearch-results/
       Metric: `any` count (current: 47), direction: lower
       Verify: grep count + tsc --noEmit as guard

       Say "go" to start, or tell me what to change.

You:   Go. Run overnight.

Agent: Baseline: 47. Iterating.

Each improvement stacks. Each failure reverts. Everything is logged.

For Codex, start Codex with codex --dangerously-bypass-approvals-and-sandbox for the smoothest foreground and background runs. For project-local Codex or OpenCode installs, run the raw installer from the target project with --local. From a clone, use ./install.sh --yes --all, or run ./install.sh for the guided installer. Add --vscode to install the editor extension from integrations/vscode. See Getting Started.

How It Works

You say one sentence  →  Agent scans & confirms  →  You say "go"
                                                        │
                                                        v
                                              ┌───────────────────┐
                                              │    The Loop        │
                                              │                    │
                                              │  modify one thing  │
                                              │  trial commit      │
                                              │  run verify        │
                                              │  improved? keep    │
                                              │  worse? revert     │
                                              │  log the result    │
                                              │  repeat            │
                                              └───────────────────┘

That's it. The agent keeps going until the goal is reached, the iteration cap is hit, the loop stops itself because it is stuck (see When It Gets Stuck), or you interrupt.
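The keep-or-revert cycle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's implementation: `run_loop`, `is_better`, and the `propose`, `measure`, `guard`, and `revert` callables are hypothetical names that stand in for the agent's edit, the verify command, the guard command, and the git revert.

```python
from typing import Callable, List, Optional, Tuple

LogRow = Tuple[int, float, str]  # (iteration, metric, status)


def is_better(candidate: float, best: float, direction: str) -> bool:
    """Strict improvement in the configured direction ("lower" or "higher")."""
    return candidate < best if direction == "lower" else candidate > best


def run_loop(
    baseline: float,
    propose: Callable[[int], None],   # hypothetical: modify one thing
    measure: Callable[[], float],     # hypothetical: run verify, return the metric
    guard: Callable[[], bool],        # hypothetical: regression check passes?
    revert: Callable[[], None],       # hypothetical: undo the trial change
    direction: str = "lower",
    max_iterations: int = 10,
    target: Optional[float] = None,
) -> Tuple[float, List[LogRow]]:
    best = baseline
    log: List[LogRow] = [(0, baseline, "baseline")]
    for iteration in range(1, max_iterations + 1):
        # Stop early once the goal is reached.
        if target is not None and (best == target or is_better(best, target, direction)):
            break
        propose(iteration)
        value = measure()
        if is_better(value, best, direction) and guard():
            best = value                      # improvement stacks
            log.append((iteration, value, "keep"))
        else:
            revert()                          # failure reverts, but is still logged
            log.append((iteration, value, "discard"))
    return best, log
```

Every iteration appends a row whether it is kept or discarded, which mirrors the "everything is logged" behavior described above.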

What You Say vs What Happens

| You say | What happens |
| --- | --- |
| "Improve my test coverage" | Iterates until target or interrupted |
| "Fix the 12 failing tests" | Repairs one by one until zero remain |
| "Why is the API returning 503?" | Hunts root cause with falsifiable hypotheses |
| "Is this code secure?" | STRIDE + OWASP audit, every finding backed by code evidence |
| "Ship it" | 8-phase checklist, including test, lint, build, version, and push |
| "I want to optimize but don't know what" | Scans repo, suggests metrics, generates config |
| "What could go wrong with this feature?" | Explores edge cases across 12 dimensions |
| "Should we use event sourcing here?" | Adversarial debate with blind judges until convergence |

Behind the scenes, the agent maps your sentence to the right mode. You never need to pick one.

What It Figures Out

You don't have to write config. The agent infers what it needs from your sentence and your repo:

| What it needs | How it gets it | Example |
| --- | --- | --- |
| Goal | Your sentence | "get rid of all any types" |
| Scope | Scans repo structure | `src/**/*.ts` |
| Metric | Proposes based on goal + tooling | any count (current: 47) |
| Direction | Infers from "improve" / "reduce" / "eliminate" | lower |
| Verify | Matches to repo tooling | grep count + tsc --noEmit |
| Guard | Suggests a baseline-passing regression check | npm test |

Before starting, it always shows what it found and asks you to confirm. Then you say "go."
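To make the Metric row concrete, here is a rough sketch of a counter for the `any` example. The function name, the regex, and the idea of using Python instead of grep are assumptions for illustration; the source only says the metric is a grep count with `tsc --noEmit` as a guard.

```python
import re
from pathlib import Path

# Rough heuristic, not a TypeScript parser: matches `: any`, `as any`, `<any>`.
ANY_PATTERN = re.compile(r"(:\s*any\b|\bas\s+any\b|<any>)")


def count_any(root: str) -> int:
    """Count `any` usages in every .ts file under root (hypothetical metric)."""
    total = 0
    for path in Path(root).rglob("*.ts"):
        text = path.read_text(encoding="utf-8", errors="ignore")
        total += len(ANY_PATTERN.findall(text))
    return total
```

Calling `print(count_any("src"))` from a script yields a single number on stdout, which is the shape of output a metric with direction "lower" needs.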

When It Gets Stuck

Instead of blind retrying, the loop escalates:

| Trigger | Action |
| --- | --- |
| 3 consecutive failures | REFINE: adjust within current strategy |
| 5 consecutive failures | PIVOT: try a fundamentally different approach |
| 2 PIVOTs without progress | Web search: look for external solutions |
| 3 PIVOTs without progress | Stop: report that human input is needed |

One success resets all counters.
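The escalation ladder can be read as a small state machine. The sketch below is one possible interpretation, not the project's code: the `Escalation` class and its action strings are hypothetical, and it assumes the consecutive-failure count restarts after each pivot and that the second and third pivot points trigger the web search and the stop.

```python
class Escalation:
    """Hypothetical tracker for the REFINE / PIVOT / web search / stop ladder."""

    def __init__(self) -> None:
        self.failures = 0  # consecutive failures since the last success or pivot
        self.pivots = 0    # pivots since the last success

    def record(self, improved: bool) -> str:
        if improved:
            # One success resets all counters.
            self.failures = 0
            self.pivots = 0
            return "CONTINUE"
        self.failures += 1
        if self.failures == 3:
            return "REFINE"
        if self.failures >= 5:
            self.failures = 0  # assumption: counting restarts after a pivot
            self.pivots += 1
            if self.pivots >= 3:
                return "STOP"
            if self.pivots == 2:
                return "WEB_SEARCH"
            return "PIVOT"
        return "CONTINUE"
```

Under these assumptions, fifteen failures in a row walk the whole ladder: REFINE at the 3rd, PIVOT at the 5th, web search at the 10th, and stop at the 15th.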

Commands

| Command | Purpose |
| --- | --- |
| /autoresearch | The core loop: improve any metric |
| /autoresearch:plan | Don't know where to start? This figures it out |
| /autoresearch:debug | Find bugs: hypothesize, test, confirm |
| /autoresearch:fix | Kill errors one by one until zero remain |
| /autoresearch:security | Full security audit (STRIDE + OWASP) |
| /autoresearch:ship | Ship through 8 gates, including test, lint, build, version, and push |
| /autoresearch:scenario | "What could go wrong?" across 12 dimensions |
| /autoresearch:predict | Get 5 expert opinions before you act |
| /autoresearch:learn | Generate/update documentation automatically |
| /autoresearch:reason | Debate a subjective decision with blind judges |
| /autoresearch:probe | Interrogate requirements until nothing's ambiguous |
| /autoresearch:improve | Research ICP needs and generate product improvement PRDs |
| /autoresearch:evals | Analyze past runs: trends, plateaus, anomalies, goal achievement, CI gates, run comparison, parallel worker significance, --recommend, --plateau-window, and --chain guidance |

Just type the command. It asks for what it needs.

Codex: Use $autoresearch then the mode as a keyword: $autoresearch debug.

OpenCode: Underscore naming: /autoresearch_debug, /autoresearch_fix, etc.
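The three naming schemes differ only in the separator between the base command and the mode. A small helper makes the mapping explicit; `command_for` and the agent labels `"claude"`, `"codex"`, and `"opencode"` are hypothetical names used for illustration.

```python
from typing import Optional


def command_for(agent: str, mode: Optional[str] = None) -> str:
    """Return the start command for an agent, optionally with a mode."""
    agent = agent.lower()
    if agent == "claude":
        return "/autoresearch" if mode is None else f"/autoresearch:{mode}"
    if agent == "codex":
        # Codex takes the mode as a keyword after the skill name.
        return "$autoresearch" if mode is None else f"$autoresearch {mode}"
    if agent == "opencode":
        # OpenCode uses underscore naming for mode commands.
        return "/autoresearch" if mode is None else f"/autoresearch_{mode}"
    raise ValueError(f"unknown agent: {agent!r}")
```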

Results Log

Every iteration is recorded in autoresearch-results/results.tsv:

iteration  commit   metric  delta   guard  status    description
0          a1b2c3d  47      0       -      baseline  initial any count
1          b2c3d4e  41      -6      pass   keep      replace any in auth module
2          -        49      +8      -      discard   generic wrapper introduced new anys
3          d4e5f6a  38      -3      pass   keep      type-narrow API response handlers

Failed experiments are reverted in git but stay in the log, so the log is the complete audit trail.
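Because the log is a plain TSV, it is easy to post-process. The sketch below summarizes a run with Python's standard `csv` module. It assumes the file is tab-separated with exactly the header shown above; `load_results` and `summarize` are hypothetical helpers, not part of the autoresearch CLI.

```python
import csv
import io
from typing import Dict, List


def load_results(text: str) -> List[Dict[str, str]]:
    """Parse results.tsv content (assumed tab-separated with a header row)."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def summarize(rows: List[Dict[str, str]]) -> Dict[str, float]:
    """Count kept and discarded iterations and report baseline vs current metric."""
    baseline = next(r for r in rows if r["status"] == "baseline")
    kept = [r for r in rows if r["status"] == "keep"]
    discarded = [r for r in rows if r["status"] == "discard"]
    latest = kept[-1] if kept else baseline  # discarded rows never become current
    return {
        "iterations": len(rows) - 1,
        "kept": len(kept),
        "discarded": len(discarded),
        "baseline": float(baseline["metric"]),
        "current": float(latest["metric"]),
    }
```

A typical call would be `summarize(load_results(open("autoresearch-results/results.tsv").read()))`.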

More Features

Covered in detail in the guide:

  • Cross-run learning: lessons from past runs bias future hypothesis generation
  • Session resume: interrupted runs pick up from the last consistent state
  • Background runtime control: `autoresearch runtime run` preflights each Codex turn, manages launch.json, runtime.json, runtime.log, and relaunches until stop or needs-human; start/status/supervise/stop remain available for manual control
  • Environment probe: `autoresearch env --format json` reports CPU, disk, container state, toolchains, recommended parallelism, and can seed `init --environment-summary auto`
  • Live results tailing: `autoresearch watch --lines 20 --format jsonl` follows autoresearch-results/results.tsv from the workspace root or any repo subdirectory
  • Progress WebSocket: `autoresearch watch --websocket` streams snapshot and row update payloads to real-time dashboards
  • Terminal dashboard: `autoresearch dashboard --once` renders status, metric history, escalation, and recent rows in one view
  • Compact run status: `autoresearch status --summary` prints monitor-friendly counters without full config payloads
  • Metric history sparkline: `autoresearch progress` graphs retained metric history directly in terminal output
  • Noise-aware verification: `autoresearch verify --repeat 3 --aggregate median` reruns scalar metrics and returns an aggregate with all samples
  • Cost estimates: `autoresearch cost --per-iteration-usd 0.25` projects completed and remaining token/API spend
  • Eval checkpoints: `autoresearch checkpoint --format json` runs evals only when the active run reaches its checkpoint interval
  • Eval chain handoff: `autoresearch evals --file autoresearch-results/results.tsv --recommend --chain ship` writes analysis, go/no-go, and next-target metadata beside the TSV
  • Eval run comparison: `autoresearch evals --file current.tsv --compare previous.tsv --format json` reports improvement, efficiency, plateau deltas, and a winner
  • Eval target gate: `autoresearch evals --file results.tsv --target 90 --recommend --format json` reports goal_achieved and recommends goal_met when the threshold is crossed
  • Eval CI exit gate: `autoresearch evals --file results.tsv --target 90 --fail-on goal-not-met --format json` prints the report and exits non-zero when the gate fails
  • Protocol re-anchor checks: `autoresearch reanchor --format json` reports 10-iteration fingerprint due state and reload references for long sessions
  • Parallel worker execution: `autoresearch parallel prepare/run/closeout/cleanup` creates isolated worker worktrees, launches prompts, merges and verifies the best result, logs 5a/5b audit rows, and cleans up branches
  • A/B compare mode: `autoresearch parallel compare --a "..." --b "..."` prepares two explicit hypotheses for head-to-head metric closeout
  • Shell completions: `autoresearch completions bash|zsh|fish|elvish|powershell` prints native completion scripts for local installs
  • Man pages: `autoresearch manpages --output-dir man/man1` writes a local autoresearch.1 page for packages and offline docs
  • Project defaults: `.autoresearch.toml` stores repeatable init settings such as goal, scope, metric, verify, guard, and iteration cap
  • Native planning: `autoresearch plan --goal "..." --format json` suggests scope, metric, direction, verify, guard, and iteration count from repo tooling
  • Plan chain handoff: `autoresearch plan --goal "..." --debug` writes the derived config into a downstream handoff
  • Ignored artifact defaults: native artifact generators write under autoresearch-results/<mode>/ unless you pass an explicit output path
  • Debug artifact generation: `autoresearch debug --symptom ... --scope ...` writes hypothesis, findings, eliminated, TSV, and handoff artifacts
  • Debug investigation controls: `autoresearch debug --depth deep --iterations 12 --severity high` records investigation budget and severity filter metadata
  • Fix artifact generation: `autoresearch fix --target ... --scope ... --iterations 7` writes a one-error-at-a-time repair plan, TSV, and handoff under autoresearch-results/fix
  • Debug-to-fix import: `autoresearch fix --from-debug` imports the latest debug handoff scope and symptom into a repair plan
  • Fix chain controls: `autoresearch fix --learn --evals` records downstream handoff and checkpoint propagation metadata
  • Improve artifact bundle: `autoresearch improve --goal ... --icp ... --depth deep --iterations 24 --evals` writes research findings, ranked plan, summary, TSV, and handoff with research budget metadata
  • Improve research controls: `autoresearch improve --seeds 5 --no-discover --learn` records seed volume, discovery posture, and downstream handoff metadata
  • PRD artifact generation: `autoresearch prd --title ... --problem ...` writes improve-mode PRDs with decision markers and ready-to-run config blocks
  • Security artifact generation: `autoresearch security --scope ... --focus ...` writes STRIDE, OWASP, findings, recommendations, TSV, and handoff artifacts
  • Security gating: `autoresearch security --fail-on high --fix` records CI threshold and downstream repair handoff metadata
  • Security audit controls: `autoresearch security --depth deep --iterations 18 --diff --fix --evals` records audit budget, delta mode, repair handoff, and checkpoint metadata
  • Ship artifact generation: `autoresearch ship --target ... --type ... --dry-run` writes an 8-phase checklist, summary, ship log, and handoff
  • Ship workflow controls: `autoresearch ship --auto --force --rollback --monitor 15 --learn` records approval, rollback, monitoring, and downstream metadata without external side effects
  • Scenario artifact generation: `autoresearch scenario --target ... --domain api --format test-scenarios` writes a 12-dimension edge-case matrix grounded in scope
  • Scenario exploration controls: `autoresearch scenario --domain web --depth deep --iterations 16 --evals --debug` records domain, exploration budget, checkpoint metadata, and downstream handoff
  • Predict artifact generation: `autoresearch predict --proposal ...` writes a five-persona pre-implementation review
  • Predict review controls: `autoresearch predict --depth deep --adversarial --fail-on high` records review profile and CI gate metadata
  • Predict-to-improve handoff: `autoresearch predict --proposal ... --improve` passes expert findings into product improvement research
  • Reason artifact generation: `autoresearch reason --question ...` writes an adversarial candidate debate with a blind-judge rubric
  • Reason judge controls: `autoresearch reason --iterations 11 --judges 7 --convergence 4 --temperature 0.2` records budget, panel, convergence, synthesis, and generation hints
  • Probe artifact generation: `autoresearch probe --subject ...` writes eight-persona requirement questions and constraint slots
  • Probe interrogation controls: `autoresearch probe --mode autonomous --depth deep --iterations 9 --adversarial` records depth, round budget, persona count, and saturation settings
  • Probe-to-improve handoff: `autoresearch probe --subject ... --improve` passes discovered constraints into product improvement research
  • Learn artifact generation: `autoresearch learn --mode summarize --scope ... --depth comprehensive --iterations 14 --evals` writes summary, validation, TSV, and handoff documentation artifacts with scan and chain metadata
  • Config template: `autoresearch config template --output .autoresearch.toml` writes a starter project defaults file
  • Config validation: `autoresearch config validate` checks project defaults without running verify or guard
  • Stable CLI API manifest: `autoresearch api --format json` emits command, flag, and semver policy metadata for wrappers and agents
  • MCP integration: `autoresearch mcp serve` exposes read-only tools, and `autoresearch mcp call` invokes external stdio MCP tools from scripts
  • Pre-built release binaries: tag builds publish checksummed Linux, macOS, and Windows archives through GitHub Releases
  • Package-manager metadata: cargo-binstall archive metadata and a Homebrew formula template track the release assets
  • GitHub Action runner: `.github/actions/autoresearch` wraps exec mode for checked-in CI optimization loops
  • VS Code extension package: `./install.sh --yes --vscode` installs integrations/vscode, which exposes status, dashboard, and watch commands by delegating to the binary
  • Thin Codex routing skill: `.agents/skills/autoresearch/SKILL.md` stays under 90 lines and defers detailed operations to references
  • Documentation site: `book.toml` and `docs/SUMMARY.md` build the full docs set as an mdBook site for GitHub Pages
  • Workspace scope expansion: `autoresearch scope expand --format json` resolves primary and companion repo globs and annotates package roots for monorepos
  • Cross-repo execution: `autoresearch workspace exec --rollback-on-failure` runs one screened command across all repo targets and restores attempted repos on failure
  • Cross-repo guard presets: `autoresearch guard-presets --format json` suggests per-repo guard commands for primary and companion repos
  • Shared workspace lessons: `autoresearch lessons --workspace-context --last 5` proves companion repos read the shared workspace lessons log
  • Mode plugin manifests: `autoresearch plugin list` and `autoresearch plugin validate` load TOML mode definitions with safety screening
  • Plugin marketplace index: `autoresearch plugin marketplace` validates a local community plugin catalog and the manifests it references
  • Manual lessons: `autoresearch lessons --add "strategy" --context "why it matters"` appends reusable run knowledge
  • Search helper and escalation: `autoresearch search --from-state --log` builds a run-aware query, calls a configured provider, caches results, and records a search meta-iteration; `decide` automatically runs the same helper when escalation reaches Web Search and `AUTORESEARCH_SEARCH_CMD` is configured
  • Chaining: `debug --fix`, `probe --plan`, `probe --improve`, `predict --debug`, `predict --improve`
  • CI/CD mode (exec): non-interactive, JSON output, for automation pipelines
  • Dual-gate verification: separate verify (did it improve?) and guard (did anything break?)
  • Safety hooks: blocks dangerous commands, secrets exposure, and scope violations automatically
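As an illustration of the idea behind noise-aware verification, the sketch below reruns a measurement and reports the median together with every sample. It is not the CLI's implementation: `verify_repeated` is a hypothetical helper, and the `"mean"` option is an assumption added for contrast (the list above only mentions `--aggregate median`).

```python
import statistics
from typing import Callable, Dict, List, Union


def verify_repeated(
    measure: Callable[[], float],
    repeat: int = 3,
    aggregate: str = "median",
) -> Dict[str, Union[str, float, List[float]]]:
    """Run a noisy measurement several times and aggregate the samples."""
    if repeat < 1:
        raise ValueError("repeat must be at least 1")
    aggregators = {"median": statistics.median, "mean": statistics.mean}
    if aggregate not in aggregators:
        raise ValueError(f"unsupported aggregate: {aggregate!r}")
    samples = [float(measure()) for _ in range(repeat)]
    return {
        "aggregate": aggregate,
        "value": float(aggregators[aggregate](samples)),
        "samples": samples,  # keep raw samples so outliers stay visible
    }
```

The median is a reasonable default for metrics such as latency, where a single slow run would otherwise decide whether a change is kept or discarded.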

FAQ

It only makes small incremental changes. Can it try bigger ideas?
By default the loop favors small, verifiable steps, and that is by design. But it can go bigger: describe a larger hypothesis in your prompt (e.g., "try replacing the ORM with raw SQL queries and run the full benchmark"), and it will treat that as a single experiment to verify.

Is this more for optimization than research?
It's strongest when the goal and metric are clear: push coverage up, push errors down, push latency lower. For open-ended exploration where the direction itself is uncertain, use /autoresearch:plan first, then switch to the loop once you know what to measure.

How do I stop it?
Foreground: Ctrl+C. Background: autoresearch runtime stop. You can also cap the run up front with Iterations: N. The agent commits before verifying, so your last successful state is always in git.

Can it resume after interruption?
Yes. It resumes from autoresearch-results/state.json automatically.

Does it work with any language?
Any language, any framework. If you can express success as a number and write a shell command that outputs it, autoresearch can optimize toward it.

What if I don't know what to measure?
/autoresearch:plan scans your repo, looks at your tooling, and suggests metrics with ready-to-run verify commands.

Will it break my code?
No. Every change is committed before verification, and reverted if it makes the metric worse. If you set a Guard (e.g., npm test), no change persists unless all tests still pass.
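The two gates mentioned in these answers, a verify command that prints a number and a guard command that passes or fails, can be sketched with the standard `subprocess` module. `metric_from_command`, `guard_passes`, and the rule of taking the last number on stdout are assumptions for illustration, not the tool's documented behavior.

```python
import re
import subprocess
from typing import List

NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def metric_from_command(command: List[str], timeout: float = 600.0) -> float:
    """Run a verify command and return the last number it prints (assumed convention)."""
    result = subprocess.run(
        command, capture_output=True, text=True, timeout=timeout, check=True
    )
    numbers = NUMBER.findall(result.stdout)
    if not numbers:
        raise ValueError("verify command printed no number")
    return float(numbers[-1])


def guard_passes(command: List[str], timeout: float = 600.0) -> bool:
    """Run a guard command; a zero exit code means nothing broke."""
    return subprocess.run(command, capture_output=True, timeout=timeout).returncode == 0
```

For the TypeScript example, the guard call might look like `guard_passes(["npx", "tsc", "--noEmit"])`, assuming `tsc` is available through `npx` in the project.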

Documentation

| Doc | What it covers |
| --- | --- |
| Docs Index | Repository documentation map |
| Installation | Claude Code, Codex, OpenCode, source install |
| Guide | Command map, binary operations, artifact contract |
| Examples | Copy-paste configs for common goals and parallel closeout |
| System Architecture | Binary, skill packages, artifacts, runtime flow |
| Project Changelog | Release history entrypoint and current development track |
| Getting Started | Install, first run, what to expect |
| Examples by Domain | Ready configs: coverage, types, bundle, latency, security |
| Chains & Combinations | Piping commands together |
| Hooks | Safety system reference |
| Codex | Skill install, local plugin package, foreground/background runtime |
| OpenCode | Slash commands, underscore modes, global and project-local installs |
| Full Guide Index | Per-command deep dives |
| CONTRIBUTING.md | How to contribute |

Acknowledgments

Built on ideas from Karpathy's autoresearch. Command surface inspired by uditgoenka/autoresearch. Background patterns from codex-autoresearch.

License

MIT — see LICENSE.
