# Terminal-Bench Regression
Weekly regression tests for Letta Code on Terminal-Bench 2.0.
## How it works

- GitHub Actions (`.github/workflows/terminal-bench-regression.yml`) runs every Monday at 5am PT
- Harbor orchestrates task execution in Modal cloud sandboxes
- Letta Code is built from source (`main` branch) inside each sandbox
- Results are compared against `baseline.json` and posted to a GitHub issue; @devanshrjis is tagged if any model drops >5% from baseline
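The regression rule above (flag any model that drops more than 5% from baseline) can be sketched as follows. This is an illustration, not the actual logic in `report.py`; the function name and data shapes are invented, and it assumes "5%" means five percentage points rather than a relative drop:

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.05,
) -> list[str]:
    """Return models whose current pass rate fell more than
    `threshold` (absolute) below their baseline pass rate."""
    regressed = []
    for model, base_rate in baseline.items():
        cur_rate = current.get(model)
        if cur_rate is not None and base_rate - cur_rate > threshold:
            regressed.append(model)
    return regressed


# Example using the baseline rates from the Models table,
# with hypothetical new results:
baseline = {"sonnet-4.6-xhigh": 0.427, "gpt-5.3-codex-xhigh": 0.640}
current = {"sonnet-4.6-xhigh": 0.350, "gpt-5.3-codex-xhigh": 0.629}
print(find_regressions(baseline, current))  # only the >5-point drop is flagged
```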
## Models

| Model | Baseline |
|---|---|
| `sonnet-4.6-xhigh` | 38/89 (42.7%) |
| `gpt-5.3-codex-xhigh` | 57/89 (64.0%) |
## Files

| File | Description |
|---|---|
| `letta_code_agent.py` | Harbor agent — installs and runs Letta Code CLI in sandbox |
| `install-letta-code.sh.j2` | Jinja2 install script (Node.js, Bun, build from source) |
| `baseline.json` | Per-model, per-task pass/fail baselines |
| `report.py` | Parses results, detects regressions, posts GitHub issue |
## Manual trigger

```sh
gh workflow run terminal-bench-regression.yml --ref main -f concurrency=10
```
## Required secrets

- `LETTA_API_KEY`: Letta Cloud API key
- `ANTHROPIC_API_KEY` / `OPENAI_API_KEY`: LLM provider keys
- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: Modal sandbox credentials
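These can be set on the repository with the GitHub CLI; `gh secret set NAME` prompts for the value interactively, so no keys end up in shell history:

```sh
# Run from the repository root; gh prompts for each secret value.
gh secret set LETTA_API_KEY
gh secret set ANTHROPIC_API_KEY
gh secret set OPENAI_API_KEY
gh secret set MODAL_TOKEN_ID
gh secret set MODAL_TOKEN_SECRET
```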
## Updating baselines

Replace `baseline.json` with results from a new run. Format:
```json
{
  "model-name": {
    "pass_rate": 0.427,
    "tasks": {
      "task-name": true,
      ...
    }
  }
}
```
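When swapping in a new baseline, it is worth checking that each model's stored `pass_rate` actually matches its per-task booleans. A minimal sketch (the function is hypothetical, not part of `report.py`; it allows for the rounding in the stored rate, e.g. 0.427 for 38/89):

```python
import json


def check_baseline(path: str) -> None:
    """Assert that each model's pass_rate agrees with its tasks map."""
    with open(path) as f:
        baseline = json.load(f)
    for model, entry in baseline.items():
        passed = sum(entry["tasks"].values())  # True counts as 1
        total = len(entry["tasks"])
        computed = passed / total
        # Tolerate the three-decimal rounding used in the stored value.
        assert abs(computed - entry["pass_rate"]) < 0.001, (
            f"{model}: stored {entry['pass_rate']}, computed {computed:.4f}"
        )
```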