# Terminal-Bench Regression
Weekly regression tests for Letta Code on Terminal-Bench 2.0.
## How it works

- GitHub Actions (`.github/workflows/terminal-bench-regression.yml`) runs every Monday at 5am PT
- Harbor orchestrates task execution in Modal cloud sandboxes
- Letta Code is built from source (`main` branch) inside each sandbox
- Results are compared against `baseline.json` and posted to a GitHub issue; @devanshrjis is tagged if any model drops >5% from baseline
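The regression rule above (flag any model that drops more than 5% from baseline) can be sketched as follows. This is an illustration, not the actual logic in `report.py`; the function name and data shapes are invented, and it assumes "5%" means five percentage points rather than a relative drop:

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    threshold: float = 0.05,
) -> list[str]:
    """Return models whose current pass rate fell more than
    `threshold` (absolute) below their baseline pass rate."""
    regressed = []
    for model, base_rate in baseline.items():
        cur_rate = current.get(model)
        if cur_rate is not None and base_rate - cur_rate > threshold:
            regressed.append(model)
    return regressed


# Example using the baseline rates from the Models table,
# with hypothetical new results:
baseline = {"sonnet-4.6-xhigh": 0.427, "gpt-5.3-codex-xhigh": 0.640}
current = {"sonnet-4.6-xhigh": 0.350, "gpt-5.3-codex-xhigh": 0.629}
print(find_regressions(baseline, current))  # only the >5-point drop is flagged
```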
## Models

| Model | Baseline |
|---|---|
| `sonnet-4.6-xhigh` | 38/89 (42.7%) |
| `gpt-5.3-codex-xhigh` | 57/89 (64.0%) |
## Files

| File | Description |
|---|---|
| `letta_code_agent.py` | Harbor agent — installs and runs Letta Code CLI in sandbox |
| `install-letta-code.sh.j2` | Jinja2 install script (Node.js, Bun, build from source) |
| `baseline.json` | Per-model, per-task pass/fail baselines |
| `report.py` | Parses results, detects regressions, posts GitHub issue |
## Manual trigger

```sh
gh workflow run terminal-bench-regression.yml --ref main -f concurrency=10
```
## Required secrets

- `LETTA_API_KEY`: Letta Cloud API key
- `ANTHROPIC_API_KEY` / `OPENAI_API_KEY`: LLM provider keys
- `MODAL_TOKEN_ID` / `MODAL_TOKEN_SECRET`: Modal sandbox credentials
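These can be set on the repository with the GitHub CLI; `gh secret set NAME` prompts for the value interactively, so no keys end up in shell history:

```sh
# Run from the repository root; gh prompts for each secret value.
gh secret set LETTA_API_KEY
gh secret set ANTHROPIC_API_KEY
gh secret set OPENAI_API_KEY
gh secret set MODAL_TOKEN_ID
gh secret set MODAL_TOKEN_SECRET
```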
## Updating baselines

Replace `baseline.json` with results from a new run. Format:
```json
{
  "model-name": {
    "pass_rate": 0.427,
    "tasks": {
      "task-name": true,
      ...
    }
  }
}
```
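When swapping in a new baseline, it is worth checking that each model's stored `pass_rate` actually matches its per-task booleans. A minimal sketch (the function is hypothetical, not part of `report.py`; it allows for the rounding in the stored rate, e.g. 0.427 for 38/89):

```python
import json


def check_baseline(path: str) -> None:
    """Assert that each model's pass_rate agrees with its tasks map."""
    with open(path) as f:
        baseline = json.load(f)
    for model, entry in baseline.items():
        passed = sum(entry["tasks"].values())  # True counts as 1
        total = len(entry["tasks"])
        computed = passed / total
        # Tolerate the three-decimal rounding used in the stored value.
        assert abs(computed - entry["pass_rate"]) < 0.001, (
            f"{model}: stored {entry['pass_rate']}, computed {computed:.4f}"
        )
```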