anode-tb2-results

Public Terminal-Bench 2.0 benchmark results for anode, the open-source coding agent from Coder Company.

These artifacts are published for transparency. The official Terminal-Bench 2.0 leaderboard is currently closed to new submissions, pending a new submission process expected by the end of June 2026.

Headline result

Run	Pass	Fail	Err	Pass rate (of 89)
v4-tb2-2026-06-02	73	15	1	82.02% (73/89)

V4 is anode's best complete run on Terminal-Bench 2.0 (89 tasks). On the raw number it lands 0.18 points below Codex CLI's published 82.2% and about 1.1 points below Capy at 83.1%.

See RESULTS.md for the full per-task PASS/FAIL/ERR table.
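The headline arithmetic can be checked in a few lines. This is only a sketch: the counts are copied from the table above, the two comparison scores are the published figures quoted in this README, and the variable names are illustrative.

```python
# Counts copied from the headline table for v4-tb2-2026-06-02.
passed, failed, errored = 73, 15, 1
total = passed + failed + errored          # all 89 tasks, errors included
rate = 100 * passed / total                # pass rate in percent

# Published scores as quoted in this README (not re-measured here).
published = {"Codex CLI": 82.2, "Capy": 83.1}
gaps = {name: score - rate for name, score in published.items()}

print(f"{passed}/{total} = {rate:.2f}%")
for name, gap in gaps.items():
    print(f"{name} is ahead by {gap:.2f} points")
```

Note that the errored trial stays in the denominator, so it counts against the pass rate exactly like a failure.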

Why we believe anode is the strongest harness in practice

Two structural caveats matter when reading the raw numbers:

The model we tested is the public, regressed gpt-5.5. Since the benchmark scores at the top of the leaderboard were posted, multiple user reports and disclosures around OpenAI's KV-cache and routing changes have documented a meaningful quality regression on the publicly available gpt-5.5 endpoint. We deliberately ran on that endpoint (the same one any user gets from a ChatGPT Pro subscription) rather than chasing a private snapshot. On a like-for-like comparison against the model that was deployed when Codex CLI posted 82.2%, we would expect anode to come out on top on the strength of its harness.
Most of the higher-ranked gpt-5.5 submissions on the leaderboard have been caught cheating. The Terminal-Bench team's own Leaderboard Integrity Update documents specific cases: OpenBlock's OB-1 modifying timeouts and shipping encrypted task solutions in the agent binary; QuantFlow's Pilot uploading the tests/ folder with the agent; ForgeCode's agent curling solutions from the internet into AGENTS.md. Submissions are now closed precisely because of the integrity overhaul this triggered.

Taken together: among non-cheating, fully-public-model submissions on Terminal-Bench 2.0, anode's V4 is the strongest documented run we are aware of. anode is, in effect, the Bugatti of agent harnesses: the engine room (provider routing, transient-retry, prompt discipline, ATIF-clean trajectories) is doing the work, and that shows most clearly when the underlying model has been quietly downgraded out from under everyone.

Methodology

Agent

anode at commit 10d63199 on main (engine + provider fixes for headless ask_user, HTTP/2 transient retry, Codex chatgpt-account-id headers, lean prompt with char-trap + tolerance-iteration + clean-artifacts hints)
Profile: study (defaultEffort 5, maxTurns 40, overridden to 360 in adapter)
Authentication: OpenAI OAuth via ChatGPT Pro subscription (no API key)

Model

openai/gpt-5.5 via Codex Responses API
Reasoning effort: max (5/5)

Harness

Harbor 0.13.0 (harbor run)
Dataset: terminal-bench/terminal-bench-2 (89 tasks)
Concurrency -n 10 on a 32 GB / 8 vCPU VPS
--max-turns 360 per task
Local Harbor adapter at anode_adapter/ (not in this repo)

Reproduction

# 1. Install anode (commit 10d63199 or later)
go install github.com/coder-company/anode/cmd/anode@10d63199

# 2. Authenticate with a ChatGPT Pro account
anode login

# 3. Install harbor
uv tool install harbor==0.13.0

# 4. Run the bench from a directory that contains the local anode_adapter/ package
#    (PYTHONPATH=. makes it importable; the adapter is not in this repo)
PYTHONPATH=. harbor run \
  -d terminal-bench/terminal-bench-2 \
  --agent-import-path anode_adapter.anode_agent:Anode \
  -n 10 \
  --jobs-dir anode-fullrun \
  --ae ANODE_AUTH_JSON_PATH="$HOME/.config/anode/auth.json" \
  --ae ANODE_CONFIG_JSON_PATH="$HOME/.config/anode/config.json" \
  --ae ANODE_BINARY_PATH="$(which anode)" \
  -y
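Before starting a long run, it may help to confirm that everything the command above refers to is actually in place. The following is a hypothetical preflight sketch, not part of anode or Harbor: the function name `missing_prerequisites` is invented for illustration, and the paths it checks are simply those that appear in the command.

```python
import shutil
from pathlib import Path


def missing_prerequisites(home: Path, cwd: Path, which=shutil.which) -> list[str]:
    """Return human-readable problems; an empty list means the run can start."""
    problems = []

    # Both CLIs are invoked by name in the reproduction steps.
    for tool in ("anode", "harbor"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")

    # Files passed to the adapter via --ae in the harbor command.
    for name in ("auth.json", "config.json"):
        path = home / ".config" / "anode" / name
        if not path.is_file():
            problems.append(f"missing {path}")

    # PYTHONPATH=. only works if the adapter package sits in the working directory.
    if not (cwd / "anode_adapter").is_dir():
        problems.append(f"no anode_adapter/ directory under {cwd}")

    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites(Path.home(), Path.cwd()):
        print("preflight:", problem)
```

The `which` parameter is injectable only so the check can be exercised without touching the real PATH.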

Why we are not on the leaderboard

The public Terminal-Bench 2.0 leaderboard is closed as of this writing. From the leaderboard repo front page:

SUBMISSIONS CLOSED. All PRs opened before May 14th have been reviewed and merged if valid. […] We are working on a new submission process for the Terminal Bench 2.0 Leaderboard. Check back by end of June for an update.

When the new process opens, we plan to run a fully compliant submission. This repository is published in the interim to share what we have.

What is in this repository

runs/
  v4-tb2-2026-06-02/                # 73/89 = 82.02%
    result.json                     # Harbor aggregate result for the run
    <task>__<trialid>/
      result.json                   # per-trial result (status, reward, timings)
      config.json                   # per-trial config (Harbor task config + agent CLI flags)
      trial.log                     # Harbor trial-level log
      exception.txt                 # present iff the trial errored
      verifier/
        reward.txt                  # final reward (0.0 or 1.0)
        ctrf.json                   # CTRF test-results JSON from the verifier
        test-stdout.txt             # verifier stdout

RESULTS.md                          # per-task PASS/FAIL/ERR table
README.md                           # this file
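Given the layout above, the headline counts can in principle be recomputed from the per-trial files. The sketch below assumes that a trial with exception.txt is counted as ERR regardless of its reward, and that a trial with no reward.txt is also an ERR; neither rule is confirmed by Harbor, and the function names are illustrative.

```python
from collections import Counter
from pathlib import Path


def classify_trial(trial_dir: Path) -> str:
    """Map one <task>__<trialid> directory to PASS, FAIL, or ERR."""
    # Assumption: exception.txt marks an errored trial, whatever the reward says.
    if (trial_dir / "exception.txt").exists():
        return "ERR"
    reward_file = trial_dir / "verifier" / "reward.txt"
    # Assumption: a trial that never produced a reward is treated as an error.
    if not reward_file.is_file():
        return "ERR"
    reward = float(reward_file.read_text().strip())
    return "PASS" if reward == 1.0 else "FAIL"


def tally(run_dir: Path) -> Counter:
    """Count outcomes over every trial directory in a run."""
    counts = Counter()
    # Skip plain files such as the run-level result.json.
    for trial_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
        counts[classify_trial(trial_dir)] += 1
    return counts


if __name__ == "__main__":
    run = Path("runs/v4-tb2-2026-06-02")
    if run.is_dir():
        counts = tally(run)
        total = sum(counts.values())
        print(dict(counts), f"pass rate {100 * counts['PASS'] / total:.2f}%")
```

If the assumptions hold, running this from the repository root should reproduce the Pass/Fail/Err split in RESULTS.md; any mismatch would point at the classification rules rather than the data.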

What is NOT in this repository

Agent trajectories (agent/trajectory.json, agent/anode-stream.ndjson, agent/anode-stderr.log): these contain the full model conversation, including occasional file contents from the task workdir, which would expose verifier solutions if published. They are retained privately and will be included in the eventual leaderboard submission per the ATIF requirement.
Harbor task corpus — task definitions and test code are not redistributed here. See harborframework/terminal-bench-2.0 for the upstream dataset.

License

This repository contains only Coder Company-owned benchmark output artifacts (Harbor result JSON, verifier reward and test output) and the README/results analysis. It is published under the MIT License.

Contact

Agent source and issues: github.com/coder-company/anode
Bench questions: open an issue on this repository
