---
title: The Letta Leaderboard
subtitle: Understand which models to use when building your agents
# layout: page
# hide-feedback: true
# no-image-zoom: true
slug: leaderboard
---
<Note>
The Letta Leaderboard is [open source](https://github.com/letta-ai/letta-leaderboard) and we actively encourage contributions! To learn how to add additional results or benchmarking tasks, read our [contributor guide](/leaderboard/contributing).
</Note>

The Letta Leaderboard helps developers select which language models to use in the Letta framework by reporting the performance of popular models on a series of tasks.

Letta is designed for building [stateful agents](/guides/agents/overview) - agents that are long-running and can automatically manage long-term memory to learn and adapt over time.
To implement intelligent memory management, agents in Letta rely heavily on **tool (function) calling**, so models that excel at tool use tend to do well in Letta. Conversely, models that struggle to call tools properly often perform poorly when used to drive Letta agents.

## Memory Benchmarks

The memory benchmarks test the ability of a model to understand a memory hierarchy and manage its own memory. Models that are strong at function calling and aware of their limitations (understanding in-context vs. out-of-context data) typically excel here.

**Overall Score** is the average score across the memory read, write, and update tasks. **Cost** is the approximate cost in USD to run the benchmark. Open-weights models prefixed with `together` were run on [Together's API](/guides/server/providers/together).
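
As a quick illustration of how the Overall Score column is derived - a minimal sketch under the equal-weighting described above; the parameter names are placeholders, not the leaderboard's actual schema:

```python
# Overall Score = plain average of the three memory task scores.
# Task names here are illustrative placeholders.

def overall_score(read: float, write: float, update: float) -> float:
    """Average of the memory read, write, and update task scores."""
    return (read + write + update) / 3

overall_score(0.90, 0.60, 0.75)  # averages to 0.75
```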

[Benchmark breakdown →](#understanding-the-benchmark)<br />
[Model recommendations →](#main-results-and-recommendations)

<div id="letta-leaderboard">
<div id="lb-controls" style="display:flex;align-items:center;max-width:800px;margin:10px auto 12px;gap:10px">
<input id="lb-search" placeholder="Search models…" style="flex:1;padding:8px;border:1px solid #ddd;border-radius:0;font-size:14px" />
</div>
<table id="lb-table" style="width:100%;max-width:800px;margin:auto;border-collapse:collapse;font-size:14px">
<thead style="background:#f2f2f2">
<tr>
<th data-key="model" style="padding:8px;text-align:left">Model</th>
<th data-key="average" style="padding:8px;text-align:center">Overall Score</th>
<th data-key="total_cost" style="padding:8px;text-align:center">Cost</th>
</tr>
</thead>
<tbody id="lb-body"></tbody>
</table>
</div>

<Info>
Try refreshing the page if the leaderboard data is not visible.
</Info>
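
The table above is populated by a page script that is not shown here. For intuition, here is a sketch of the behavior its controls imply - substring search over model names, plus sorting by the `model`, `average`, and `total_cost` columns - written in Python rather than the page's JavaScript; the row schema, function names, and data are assumptions, not real leaderboard code or results:

```python
# Illustrative filter/sort over rows shaped like the table's data-key
# columns ("model", "average", "total_cost"). Placeholder data only.

def filter_rows(rows: list[dict], query: str) -> list[dict]:
    """Case-insensitive substring match against the model name."""
    q = query.lower()
    return [r for r in rows if q in r["model"].lower()]

def sort_rows(rows: list[dict], key: str, descending: bool = True) -> list[dict]:
    """Sort by one column; score columns usually render best-first."""
    return sorted(rows, key=lambda r: r[key], reverse=descending)

rows = [
    {"model": "model-a", "average": 0.82, "total_cost": 15.0},
    {"model": "model-b", "average": 0.71, "total_cost": 4.0},
    {"model": "model-c", "average": 0.77, "total_cost": 9.0},
]
best = sort_rows(rows, "average")[0]["model"]                            # "model-a"
cheapest = sort_rows(rows, "total_cost", descending=False)[0]["model"]   # "model-b"
```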

## Understanding the Benchmark

<Note>
For a more in-depth breakdown of our memory benchmarks, [read our blog](https://www.letta.com/blog/letta-leaderboard).
</Note>

We measure two foundational aspects of context management: **core memory** and **archival memory**. Core memory is what sits inside the agent’s [context window](https://www.letta.com/blog/memory-blocks) (aka "in-context memory"), while archival memory is context managed outside the agent (aka "out-of-context memory", or "external memory"). The benchmark evaluates a stateful agent's fundamental capabilities at _reading_, _writing_, and _updating_ memories.
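
To make the two tiers concrete, here is a toy model of such a hierarchy - a small in-context store the agent edits directly, plus an external store it must explicitly search. This is purely illustrative; it is not Letta's actual implementation:

```python
class ToyMemoryHierarchy:
    """Toy two-tier memory model; not Letta's actual data structures."""

    def __init__(self, core_limit_chars: int = 200):
        self.core: dict[str, str] = {}  # in-context memory (label -> value)
        self.archival: list[str] = []   # out-of-context memory, searched on demand
        self.core_limit = core_limit_chars

    def core_write(self, label: str, value: str) -> None:
        """Create or update an in-context block; the context window is finite."""
        other = sum(len(v) for k, v in self.core.items() if k != label)
        if other + len(value) > self.core_limit:
            raise ValueError("core memory full: move something to archival first")
        self.core[label] = value

    def archival_insert(self, passage: str) -> None:
        """Write a passage to external memory."""
        self.archival.append(passage)

    def archival_search(self, query: str) -> list[str]:
        """Naive substring search standing in for semantic retrieval."""
        return [p for p in self.archival if query.lower() in p.lower()]
```

An agent that understands the hierarchy keeps frequently needed facts in core memory, archives the rest, and searches archival memory only when the answer is not already in context.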

For all tasks in the memory benchmarks, we generate a fictional question-answering dataset with supporting facts to minimize interference from prior knowledge in LLM training data. To evaluate, we use a prompted GPT 4.1 judge to grade the agent-generated answer against the ground-truth answer, following [SimpleQA](https://openai.com/index/introducing-simpleqa/). We also apply a penalty for extraneous memory operations, penalizing models for inefficient or incorrect archival memory accesses.
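
The scoring described above can be sketched as a judge grade minus a charge per extraneous memory operation. The exact penalty form and weight below are our assumptions for illustration; see the open-source repo for the real scoring code:

```python
def task_score(judge_grade: float, ops_used: int, ops_needed: int,
               penalty_per_op: float = 0.05) -> float:
    """judge_grade in [0, 1]; each memory op beyond ops_needed costs penalty_per_op."""
    extraneous = max(0, ops_used - ops_needed)
    return max(0.0, judge_grade - penalty_per_op * extraneous)

# A correct answer reached with three unnecessary archival accesses
# scores below a correct answer reached efficiently:
assert task_score(1.0, ops_used=5, ops_needed=2) < task_score(1.0, ops_used=2, ops_needed=2)
```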

## Main Results and Recommendations

For the **closed** model providers (OpenAI, Anthropic, Google):
* Anthropic Claude Sonnet 4 and OpenAI GPT 4.1 are recommended models for most tasks
* Normalized for cost, Gemini 2.5 Flash and GPT 4o-mini are top choices
* Models that perform well on the archival memory task (e.g. Claude Haiku 3.5) may overuse memory operations when they are unnecessary, and thus receive a lower core memory score due to the extraneous-access penalty
* The o-series reasoning models from OpenAI perform worse than GPT 4.1

For the **open weights** models (Llama, Qwen, Mistral, DeepSeek):
* Qwen3 235B is the best performing (overall)
* Llama 4 Scout 17B performs similarly to GPT 4.1-nano