feat: add Terminal-Bench weekly regression workflow [LET-7791]#1232

Open
devanshrj wants to merge 2 commits into main from devansh/terminal-bench-regression

@devanshrj
Contributor

Summary

  • Add scheduled GitHub Actions workflow (Mon + Thu, 8am UTC) to run a curated subset of Terminal-Bench 2.0 tasks against Letta Code using Harbor
  • Tests two models (sonnet-4.6-low and gpt-5-minimal) at concurrency 4, reports results via a GitHub Issue, and fails the run on regressions
  • Builds Letta Code from source (main branch) so regressions are caught before npm publish

What's included

  • Harbor adapter (benchmarks/terminal_bench/letta_code_agent.py) — BaseInstalledAgent implementation copied from recovery-bench
  • Install script (install-letta-code.sh.j2) — clones repo, builds with bun, installs globally inside TB containers
  • 12 curated tasks (regression-tasks.txt) — 7 easy + 5 medium, covering file ops, coding, debugging, system admin
  • Report script (report.py) — parses Harbor job results, creates/updates a pinned "Terminal-Bench Regression Tracker" GitHub Issue, exits non-zero on regressions
  • Baseline (baseline.json) — empty, to be manually populated after initial runs
  • Workflow (terminal-bench-regression.yml) — cron + manual dispatch, matrix over models, artifact upload
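
For reviewers, here is a minimal sketch of the kind of baseline comparison the report script performs. It is not the code in report.py. The data shape (model name mapped to task name mapped to a pass/fail boolean), the function names, and the results path are assumptions made for illustration, and Harbor's actual job output format is not modeled here.

```python
import json


def find_regressions(baseline, results):
    """Return sorted (model, task) pairs that passed in the baseline but not now.

    Both arguments use a hypothetical shape: {model: {task: bool}}.
    """
    regressions = []
    for model, tasks in baseline.items():
        current = results.get(model, {})
        for task, passed_before in tasks.items():
            # A task missing from the current run is treated as a failure.
            if passed_before and not current.get(task, False):
                regressions.append((model, task))
    return sorted(regressions)


def exit_code(regressions):
    """Non-zero when any regression exists, so the CI job fails."""
    return 1 if regressions else 0


def main(baseline_path, results_path):
    """Load both JSON files (paths supplied by the caller) and return an exit code."""
    with open(baseline_path) as f:
        baseline = json.load(f)
    with open(results_path) as f:
        results = json.load(f)
    regressions = find_regressions(baseline, results)
    for model, task in regressions:
        print(f"REGRESSION: {model} / {task}")
    return exit_code(regressions)
```

With an empty baseline, as shipped in this PR, the comparison finds nothing to regress against, so the first runs always exit zero under this sketch. Only tasks recorded as passing in the baseline can trigger a failure.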

Test plan

  • Verify repo secrets exist: ANTHROPIC_API_KEY, OPENAI_API_KEY, LETTA_API_KEY
  • Trigger manual workflow_dispatch run and verify Harbor executes tasks
  • Confirm GitHub Issue is created with results table
  • Populate baseline.json from first run results
  • Verify subsequent runs compare against baseline correctly
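
One possible way to populate the baseline from the initial runs is sketched below. The rule it applies (a task counts as passing only if it passed in every initial run that included it) is an assumption intended to keep flaky tasks out of the baseline, not something this PR specifies. The {model: {task: bool}} shape and the function name are likewise hypothetical.

```python
import json


def build_baseline(runs):
    """Merge several runs into one baseline.

    Each run uses a hypothetical shape: {model: {task: bool}}.
    A task is recorded as passing only if it passed in every run that included it.
    """
    baseline = {}
    for run in runs:
        for model, tasks in run.items():
            model_baseline = baseline.setdefault(model, {})
            for task, passed in tasks.items():
                # Start from True on first sight, then AND in each result.
                model_baseline[task] = model_baseline.get(task, True) and bool(passed)
    return baseline


def to_json(baseline):
    """Serialize with stable key order so diffs of baseline.json stay small."""
    return json.dumps(baseline, indent=2, sort_keys=True)
```

The serialized output could then be pasted into baseline.json by hand, which keeps the "manually populated" step from this PR intact.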

👾 Generated with Letta Code

devanshrj and others added 2 commits March 3, 2026 00:14
Add scheduled GitHub Actions workflow (Mon + Thu) to run a curated
subset of Terminal-Bench 2.0 tasks against Letta Code using Harbor.
Tests two models (sonnet-4.6-low, gpt-5-minimal) at concurrency 4,
reports results via GitHub Issue, and fails on regressions.

- Harbor BaseInstalledAgent adapter (from recovery-bench)
- 12 curated tasks (7 easy, 5 medium)
- report.py: parses Harbor job results, updates GH Issue tracker
- baseline.json: manually populated after initial runs

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

Build from main branch instead of npm so regression tests
run against the latest code, not the last published release.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>
@devanshrj devanshrj force-pushed the devansh/terminal-bench-regression branch from 68ecfdf to 36c8200 Compare March 3, 2026 00:15
@devanshrj devanshrj changed the title feat: add Terminal-Bench weekly regression workflow feat: add Terminal-Bench weekly regression workflow [LET-7791] Mar 3, 2026