---
title: The Letta Leaderboard
subtitle: Understand which models to use when building your agents
# layout: page
# hide-feedback: true
# no-image-zoom: true
slug: leaderboard
---
<Note>
The Letta Leaderboard is [open source](https://github.com/letta-ai/letta-leaderboard) and we actively encourage contributions! To learn how to add additional results or benchmarking tasks, read our [contributor guide](/leaderboard/contributing).
</Note>
The Letta Leaderboard helps developers select which language models to use in the Letta framework by reporting the performance of popular models on a series of tasks.
Letta is designed for building [stateful agents](/guides/agents/overview) - agents that are long-running and can automatically manage long-term memory to learn and adapt over time.
To implement intelligent memory management, agents in Letta rely heavily on **tool (function) calling**, so models that excel at tool use tend to do well in Letta. Conversely, models that struggle to call tools properly often perform poorly when used to drive Letta agents.
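To see why tool-calling ability dominates performance, consider the loop a framework like Letta relies on: the model emits a function call as structured JSON, and the runtime parses it and dispatches to the matching tool. The sketch below is illustrative only; the tool name and argument schema are simplified stand-ins, not Letta's exact interface.

```python
import json

def core_memory_append(label: str, content: str) -> str:
    """Illustrative tool: append text to an in-context memory block."""
    return f"appended to {label}"

# The runtime maps tool names to callables.
TOOLS = {"core_memory_append": core_memory_append}

# A well-behaved model emits a parseable call with correct argument names.
model_output = (
    '{"name": "core_memory_append", '
    '"arguments": {"label": "human", "content": "Prefers concise answers."}}'
)

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
```

A model that emits malformed JSON, a nonexistent tool name, or wrong argument names fails at the parse or dispatch step, which is why strong function calling correlates with strong Letta performance.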
## Memory Benchmarks
The memory benchmarks test the ability of a model to understand a memory hierarchy and manage its own memory. Models that are strong at function calling and aware of their limitations (understanding in-context vs out-of-context data) typically excel here.
**Overall Score** is the average of the scores from the memory read, write, and update tasks. **Cost** is the approximate cost in USD to run the benchmark. Open weights models prefixed with `together` were run on [Together's API](/guides/server/providers/together).
[Benchmark breakdown →](#understanding-the-benchmark)<br />
[Model recommendations →](#main-results-and-recommendations)
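For concreteness, the Overall Score is a plain average of the three per-task scores, and the table sorts on it. The model names and numbers below are illustrative placeholders, not real leaderboard data.

```python
def overall_score(read: float, write: float, update: float) -> float:
    """Average of the memory read, write, and update task scores."""
    return (read + write + update) / 3

# Illustrative entries (not actual benchmark results).
entries = [
    {"model": "model-a", "read": 0.90, "write": 0.80, "update": 0.70},
    {"model": "model-b", "read": 0.95, "write": 0.90, "update": 0.85},
]
for e in entries:
    e["average"] = overall_score(e["read"], e["write"], e["update"])

# Sort descending by overall score, as the leaderboard table does.
entries.sort(key=lambda e: e["average"], reverse=True)
```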
<div id="letta-leaderboard">
<div id="lb-controls" style="display:flex;align-items:center;max-width:800px;margin:10px auto 12px;gap:10px">
<input id="lb-search" placeholder="Search models…" style="flex:1;padding:8px;border:1px solid #ddd;border-radius:0;font-size:14px" />
</div>
<table id="lb-table" style="width:100%;max-width:800px;margin:auto;border-collapse:collapse;font-size:14px">
<thead style="background:#f2f2f2">
<tr>
<th data-key="model" style="padding:8px;text-align:left">Model</th>
<th data-key="average" style="padding:8px;text-align:center">Overall Score</th>
<th data-key="total_cost" style="padding:8px;text-align:center">Cost</th>
</tr>
</thead>
<tbody id="lb-body"></tbody>
</table>
</div>
<Info>
Try refreshing the page if the leaderboard data is not visible.
</Info>
## Understanding the Benchmark
<Note>
For a more in-depth breakdown of our memory benchmarks, [read our blog](https://www.letta.com/blog/letta-leaderboard).
</Note>
We measure two foundational aspects of context management: **core memory** and **archival memory**. Core memory is what sits inside the agent's [context window](https://www.letta.com/blog/memory-blocks) (aka "in-context memory"), while archival memory is context managed outside the agent (aka "out-of-context memory", or "external memory"). This benchmark evaluates a stateful agent's fundamental capabilities in _reading_, _writing_, and _updating_ memories.
For all the tasks in the memory benchmarks, we generate a fictional question-answering dataset with supporting facts to minimize overlap with prior knowledge from LLM training. To evaluate, we use a prompted GPT 4.1 to grade the agent-generated answer against the ground-truth answer, following [SimpleQA](https://openai.com/index/introducing-simpleqa/). We then subtract a penalty for extraneous memory operations, discouraging inefficient or incorrect archival memory accesses.
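The scoring rule described above can be sketched as a judge grade minus a per-operation penalty for extraneous memory accesses, floored at zero. The penalty weight here is illustrative; the benchmark's actual value may differ.

```python
def task_score(grade: float, extraneous_ops: int, penalty: float = 0.05) -> float:
    """Judge grade in [0, 1] minus a penalty per extraneous memory
    operation, clamped so the score never goes below zero.

    The penalty weight (0.05) is a hypothetical placeholder, not the
    benchmark's actual value.
    """
    return max(0.0, grade - penalty * extraneous_ops)

# A correct answer with no wasted operations keeps its full grade;
# the same answer with four unnecessary archival accesses is docked.
perfect = task_score(1.0, extraneous_ops=0)
wasteful = task_score(1.0, extraneous_ops=4)
```

This is why a model can answer correctly yet still score lower: unnecessary archival reads or writes eat into its grade.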
## Main Results and Recommendations
For the **closed** model providers (OpenAI, Anthropic, Google):
* Anthropic Claude Sonnet 4 and OpenAI GPT 4.1 are the recommended models for most tasks
* Normalized for cost, Gemini 2.5 Flash and GPT 4o-mini are the top choices
* Models that perform well on the archival memory task (e.g. Claude Haiku 3.5) may overuse memory operations when they are unnecessary, and so receive a lower core memory score due to the extraneous access penalty
* The o-series reasoning models from OpenAI perform worse than GPT 4.1
For the **open weights** models (Llama, Qwen, Mistral, DeepSeek):
* Qwen3 235B is the best-performing model overall
* Llama 4 Scout 17B performs similarly to GPT 4.1-nano