Public Terminal-Bench 2.0 benchmark results for anode, the open-source coding agent from Coder Company.
These artifacts are published for transparency. The official Terminal-Bench 2.0 leaderboard is currently closed to new submissions while a new submission process is worked out, with an update expected by the end of June 2026.
| Run | Pass | Fail | Err | Pass rate (of 89) |
|---|---|---|---|---|
| v4-tb2-2026-06-02 | 73 | 15 | 1 | 82.02% (73/89) |
V4 is anode's best complete run on Terminal-Bench 2.0 (89 tasks). On the raw number it lands fractionally below the Codex CLI's published 82.2% and about one point under Capy at 83.1%.
See RESULTS.md for the full per-task PASS/FAIL/ERR table.
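The headline figure is passes divided by all 89 tasks, so the single errored trial counts against the run. The sketch below rechecks that arithmetic and puts the gaps to the two published scores in perspective; the function names are illustrative only and are not part of anode or Harbor.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names) for checking the table arithmetic.

# Pass rate in percent over all tasks; errored trials count as not passed.
pass_rate() {
    # $1 = passed, $2 = failed, $3 = errored
    awk -v p="$1" -v f="$2" -v e="$3" \
        'BEGIN { printf "%.2f\n", 100 * p / (p + f + e) }'
}

# Percentage points that one task is worth on a benchmark of $1 tasks.
points_per_task() {
    awk -v n="$1" 'BEGIN { printf "%.2f\n", 100 / n }'
}

# Difference between two scores, in percentage points.
gap() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a - b }'
}

rate=$(pass_rate 73 15 1)
echo "v4 pass rate:        ${rate}%"                      # 82.02%
echo "one task is worth:   $(points_per_task 89) points"  # 1.12 points
echo "gap to 82.2%:        $(gap 82.2 "$rate") points"    # 0.18 points
echo "gap to 83.1%:        $(gap 83.1 "$rate") points"    # 1.08 points
```

Since a single task moves the score by about 1.12 points, the 0.18-point gap to 82.2% is smaller than one task, and the gap to 83.1% is roughly one task.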
Two structural caveats matter when reading the raw numbers:
- The model we tested is the public, regressed `gpt-5.5`. Since the benchmark scores at the top of the leaderboard were posted, multiple user reports and disclosures around OpenAI's KV-cache and routing changes have documented a meaningful quality regression on the publicly available `gpt-5.5` endpoint. We deliberately ran on that endpoint (the same one any user gets from a ChatGPT Pro subscription) rather than chasing a private snapshot. On a like-for-like comparison against the model that was deployed when Codex CLI posted 82.2%, anode would, by merit of its harness, be expected to come out on top.
- Most of the higher-ranked `gpt-5.5` submissions on the leaderboard have been caught cheating. The Terminal-Bench team's own Leaderboard Integrity Update documents specific cases: OpenBlock's OB-1 modifying timeouts and shipping encrypted task solutions in the agent binary; QuantFlow's Pilot uploading the `tests/` folder with the agent; ForgeCode's agent curling solutions from the internet into `AGENTS.md`. Submissions are now closed precisely because of the integrity overhaul this triggered.
Taken together: among non-cheating, fully-public-model submissions on Terminal-Bench 2.0, anode's V4 is the strongest documented run we are aware of. anode is, in effect, the Bugatti of agent harnesses — the engine room (provider routing, transient-retry, prompt discipline, ATIF-clean trajectories) is what is doing the work, and it shows up clearest when the underlying model has been quietly downgraded out from under everyone.
- anode at commit `10d63199` on `main` (engine + provider fixes for headless ask_user, HTTP/2 transient retry, Codex chatgpt-account-id headers, lean prompt with char-trap + tolerance-iteration + clean-artifacts hints)
- Profile: `study` (defaultEffort 5, maxTurns 40, overridden to 360 in adapter)
- Authentication: OpenAI OAuth via ChatGPT Pro subscription (no API key)
- `openai/gpt-5.5` via Codex Responses API
- Reasoning effort: max (5/5)
- Harbor 0.13.0 (`harbor run`)
- Dataset: `terminal-bench/terminal-bench-2` (89 tasks)
- Concurrency `-n 10` on a 32 GB / 8 vCPU VPS
- `--max-turns 360` per task
- Local Harbor adapter at `anode_adapter/` (not in this repo)
```shell
# 1. Install anode (commit 10d63199 or later)
go install github.com/coder-company/anode/cmd/anode@10d63199

# 2. Authenticate with a ChatGPT Pro account
anode login

# 3. Install harbor
uv tool install harbor==0.13.0

# 4. Run the bench
PYTHONPATH=. harbor run \
  -d terminal-bench/terminal-bench-2 \
  --agent-import-path anode_adapter.anode_agent:Anode \
  -n 10 \
  --jobs-dir anode-fullrun \
  --ae ANODE_AUTH_JSON_PATH="$HOME/.config/anode/auth.json" \
  --ae ANODE_CONFIG_JSON_PATH="$HOME/.config/anode/config.json" \
  --ae ANODE_BINARY_PATH="$(which anode)" \
  -y
```

The public Terminal-Bench 2.0 leaderboard is closed as of this writing. From the leaderboard repo front page:
> SUBMISSIONS CLOSED. All PRs opened before May 14th have been reviewed and merged if valid. […] We are working on a new submission process for the Terminal Bench 2.0 Leaderboard. Check back by end of June for an update.
When the new process opens, we plan to run a fully compliant submission. This repository is published in the interim to share what we have.
```
runs/
  v4-tb2-2026-06-02/        # 73/89 = 82.02%
    result.json             # Harbor aggregate result for the run
    <task>__<trialid>/
      result.json           # per-trial result (status, reward, timings)
      config.json           # per-trial config (Harbor task config + agent CLI flags)
      trial.log             # Harbor trial-level log
      exception.txt         # present iff the trial errored
      verifier/
        reward.txt          # final reward (0.0 or 1.0)
        ctrf.json           # CTRF test-results JSON from the verifier
        test-stdout.txt     # verifier stdout
RESULTS.md                  # per-task PASS/FAIL/ERR table
README.md                   # this file
```
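Given the layout above, the PASS/FAIL/ERR split can in principle be recomputed from the per-trial files alone. The following is a minimal sketch, not a Harbor tool: `tally_run` is a hypothetical helper, and it assumes that every trial is a direct subdirectory of the run directory, that `exception.txt` marks an errored trial, and that a passing trial has `1.0` (or `1`) as the first field of `verifier/reward.txt`.

```shell
#!/bin/sh
# Tally PASS/FAIL/ERR for one run directory from the published artifacts.
# Assumptions (taken from the layout in this README, not from Harbor docs):
#   - each trial is a direct subdirectory of the run directory
#   - exception.txt exists only for errored trials
#   - verifier/reward.txt starts with 1.0 (or 1) for a pass
tally_run() {
    run_dir=$1
    pass=0
    fail=0
    err=0
    for trial in "$run_dir"/*/; do
        [ -d "$trial" ] || continue   # glob matched nothing
        if [ -e "${trial}exception.txt" ]; then
            err=$((err + 1))
        elif [ -f "${trial}verifier/reward.txt" ] &&
             awk 'NR == 1 { ok = ($1 + 0 == 1) } END { exit !ok }' \
                 "${trial}verifier/reward.txt"; then
            pass=$((pass + 1))
        else
            # reward of 0.0, or a missing/empty reward file
            fail=$((fail + 1))
        fi
    done
    echo "pass=$pass fail=$fail err=$err total=$((pass + fail + err))"
}

# Defaults to the v4 run when no directory is given.
tally_run "${1:-runs/v4-tb2-2026-06-02}"
```

If the layout assumptions hold, running this from the repository root should reproduce the 73/15/1 split from the results table. Errors are checked first, so a trial with an `exception.txt` is counted as ERR regardless of any reward file.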
Not included in this repository:

- Agent trajectories (`agent/trajectory.json`, `agent/anode-stream.ndjson`, `agent/anode-stderr.log`): these contain the full model conversation, including occasional file contents from the task workdir, which would expose verifier solutions to anyone reading them. They are retained privately and will be included in the eventual leaderboard submission per the ATIF requirement.
- Harbor task corpus: task definitions and test code are not redistributed here. See harborframework/terminal-bench-2.0 for the upstream dataset.
This repository contains only Coder Company-owned benchmark output artifacts (Harbor result JSON, verifier reward and test output) and the README/results analysis. It is published under MIT.
Terminal-Bench is © its authors and licensed separately at github.com/laude-institute/terminal-bench.
- Agent source and issues: github.com/coder-company/anode
- Bench questions: open an issue on this repository