feat: add Terminal-Bench weekly regression workflow [LET-7791] by devanshrj · Pull Request #1232 · letta-ai/letta-code

devanshrj · 2026-03-03T00:13:09Z

Summary

Add a scheduled GitHub Actions workflow (Mon + Thu, 8am UTC) that runs a curated subset of Terminal-Bench 2.0 tasks against Letta Code using Harbor
Test two models (sonnet-4.6-low and gpt-5-minimal) at concurrency 4, report results via a GitHub Issue, and fail on regressions
Build Letta Code from source (main branch) so regressions are caught before npm publish
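
For reviewers who want to sanity-check the schedule, here is a minimal sketch of when the next run would fire, assuming the cron trigger is equivalent to "Monday and Thursday at 08:00 UTC" as stated in the summary. The workflow file itself is not shown here, so `next_scheduled_run` and the constants below are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: runs fire on Monday and Thursday at 08:00 UTC (per the summary).
RUN_WEEKDAYS = {0, 3}  # datetime.weekday(): Monday=0, Thursday=3
RUN_HOUR_UTC = 8


def next_scheduled_run(after: datetime) -> datetime:
    """Return the first scheduled run strictly after `after` (timezone-aware)."""
    after = after.astimezone(timezone.utc)
    candidate = after.replace(hour=RUN_HOUR_UTC, minute=0, second=0, microsecond=0)
    # The longest gap between runs is four days, so eight steps is ample.
    for _ in range(8):
        if candidate.weekday() in RUN_WEEKDAYS and candidate > after:
            return candidate
        candidate += timedelta(days=1)
    raise RuntimeError("no scheduled run found; check RUN_WEEKDAYS")


# The PR was opened on Tuesday 2026-03-03 at 00:13:09 UTC.
opened = datetime(2026, 3, 3, 0, 13, 9, tzinfo=timezone.utc)
print(next_scheduled_run(opened).isoformat())
```

Manual `workflow_dispatch` runs are independent of this schedule.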

What's included

Harbor adapter (benchmarks/terminal_bench/letta_code_agent.py) — BaseInstalledAgent implementation copied from recovery-bench
Install script (install-letta-code.sh.j2) — clones repo, builds with bun, installs globally inside TB containers
12 curated tasks (regression-tasks.txt) — 7 easy + 5 medium, covering file ops, coding, debugging, system admin
Report script (report.py) — parses Harbor job results, creates/updates a pinned "Terminal-Bench Regression Tracker" GitHub Issue, exits non-zero on regressions
Baseline (baseline.json) — empty, to be manually populated after initial runs
Workflow (terminal-bench-regression.yml) — cron + manual dispatch, matrix over models, artifact upload
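
The regression check in the report script could look roughly like the sketch below. This is not the actual `report.py`: the schema (`{model: {task: passed}}`), the function names, and the task names are all assumptions made for illustration, since neither the format of baseline.json nor Harbor's result format is shown in this PR.

```python
import json


def find_regressions(baseline: dict, results: dict) -> list:
    """Return sorted (model, task) pairs that passed in baseline but fail now.

    Assumed schema for both arguments: {model_name: {task_name: bool}}.
    A task missing from `results` is not counted as a regression here; a
    stricter policy could treat missing results as failures instead.
    """
    regressions = []
    for model, tasks in baseline.items():
        for task, passed_before in tasks.items():
            passed_now = results.get(model, {}).get(task)
            if passed_before and passed_now is False:
                regressions.append((model, task))
    return sorted(regressions)


def exit_code(regressions: list) -> int:
    """Non-zero when any regression exists, so the CI job fails."""
    return 1 if regressions else 0


# Hypothetical task names, for illustration only.
baseline = json.loads('{"sonnet-4.6-low": {"task-a": true, "task-b": false}}')
results = {"sonnet-4.6-low": {"task-a": False, "task-b": False}}

found = find_regressions(baseline, results)
print(found, exit_code(found))
```

Under this sketch an empty baseline never reports a regression, which matches starting with an empty baseline.json and populating it after the initial runs.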

Test plan

Verify repo secrets exist: ANTHROPIC_API_KEY, OPENAI_API_KEY, LETTA_API_KEY
Trigger manual workflow_dispatch run and verify Harbor executes tasks
Confirm GitHub Issue is created with results table
Populate baseline.json from first run results
Verify subsequent runs compare against baseline correctly
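
When checking the results table in the tracker issue, it may help to have a picture of what such a table could contain. The sketch below renders a Markdown table from per-model results. The input schema (`{model: {task: passed}}`), the function name `render_results_table`, and the pass/fail/n/a labels are assumptions, not the format `report.py` actually emits.

```python
def render_results_table(results: dict) -> str:
    """Render {model: {task: bool}} as a Markdown table, one row per task."""
    models = sorted(results)
    tasks = sorted({task for per_model in results.values() for task in per_model})

    header = "| Task | " + " | ".join(models) + " |"
    divider = "|" + " --- |" * (len(models) + 1)
    rows = [header, divider]

    for task in tasks:
        cells = []
        for model in models:
            outcome = results[model].get(task)
            if outcome is None:
                cells.append("n/a")  # model has no recorded result for this task
            else:
                cells.append("pass" if outcome else "fail")
        rows.append(f"| {task} | " + " | ".join(cells) + " |")
    return "\n".join(rows)


# Hypothetical task names; the model names come from the PR summary.
example = {
    "gpt-5-minimal": {"task-a": True},
    "sonnet-4.6-low": {"task-a": True, "task-b": False},
}
print(render_results_table(example))
```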

👾 Generated with Letta Code

Add scheduled GitHub Actions workflow (Mon + Thu) to run a curated subset of Terminal-Bench 2.0 tasks against Letta Code using Harbor. Tests two models (sonnet-4.6-low, gpt-5-minimal) at concurrency 4, reports results via GitHub Issue, and fails on regressions.

- Harbor BaseInstalledAgent adapter (from recovery-bench)
- 12 curated tasks (7 easy, 5 medium)
- report.py: parses Harbor job results, updates GH Issue tracker
- baseline.json: manually populated after initial runs

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

devanshrj and others added 2 commits March 3, 2026 00:14

feat: install letta-code from source in TB regression

36c8200

Build from main branch instead of npm so regression tests run against the latest code, not the last published release.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

devanshrj force-pushed the devansh/terminal-bench-regression branch from 68ecfdf to 36c8200 · March 3, 2026 00:15

devanshrj changed the title ~~feat: add Terminal-Bench weekly regression workflow~~ feat: add Terminal-Bench weekly regression workflow [LET-7791] Mar 3, 2026

devanshrj wants to merge 2 commits into main from devansh/terminal-bench-regression
