---
title: The Letta Leaderboard
subtitle: Understand which models to use when building your agents
# layout: page
# hide-feedback: true
# no-image-zoom: true
slug: leaderboard
---
<Note>
The Letta Leaderboard is [open source](https://github.com/letta-ai/letta-leaderboard) and we actively encourage contributions! To learn how to add additional results or benchmarking tasks, read our [contributor guide](/leaderboard/contributing).
</Note>
The Letta Leaderboard helps developers select which language models to use in the Letta framework by reporting the performance of popular models on a series of tasks.
Letta is designed for building [stateful agents](/guides/agents/overview) - agents that are long-running and can automatically manage long-term memory to learn and adapt over time.
To implement intelligent memory management, agents in Letta rely heavily on **tool (function) calling**, so models that excel at tool use tend to do well in Letta. Conversely, models that struggle to call tools properly often perform poorly when used to drive Letta agents.
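To see why tool-calling ability dominates performance, consider the loop a framework like Letta relies on: the model emits a function call as structured JSON, and the runtime parses it and dispatches to the matching tool. The sketch below is illustrative only; the tool name and argument schema are simplified stand-ins, not Letta's exact interface.

```python
import json

def core_memory_append(label: str, content: str) -> str:
    """Illustrative tool: append text to an in-context memory block."""
    return f"appended to {label}"

# The runtime maps tool names to callables.
TOOLS = {"core_memory_append": core_memory_append}

# A well-behaved model emits a parseable call with correct argument names.
model_output = (
    '{"name": "core_memory_append", '
    '"arguments": {"label": "human", "content": "Prefers concise answers."}}'
)

call = json.loads(model_output)
result = TOOLS[call["name"]](**call["arguments"])
```

A model that emits malformed JSON, a nonexistent tool name, or wrong argument names fails at the parse or dispatch step, which is why strong function calling correlates with strong Letta performance.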
## Memory Benchmarks
The memory benchmarks test the ability of a model to understand a memory hierarchy and manage its own memory. Models that are strong at function calling and aware of their limitations (understanding in-context vs out-of-context data) typically excel here.
**Overall Score** is the average of the scores from the memory read, write, and update tasks. **Cost** is the approximate cost in USD to run the benchmark. Open weights models prefixed with `together` were run on [Together's API](/guides/server/providers/together).
[Benchmark breakdown →](#understanding-the-benchmark)<br />
[Model recommendations →](#main-results-and-recommendations)
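For concreteness, the Overall Score is a plain average of the three per-task scores, and the table sorts on it. The model names and numbers below are illustrative placeholders, not real leaderboard data.

```python
def overall_score(read: float, write: float, update: float) -> float:
    """Average of the memory read, write, and update task scores."""
    return (read + write + update) / 3

# Illustrative entries (not actual benchmark results).
entries = [
    {"model": "model-a", "read": 0.90, "write": 0.80, "update": 0.70},
    {"model": "model-b", "read": 0.95, "write": 0.90, "update": 0.85},
]
for e in entries:
    e["average"] = overall_score(e["read"], e["write"], e["update"])

# Sort descending by overall score, as the leaderboard table does.
entries.sort(key=lambda e: e["average"], reverse=True)
```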
<div id="letta-leaderboard">
<div id="lb-controls" style="display:flex;align-items:center;max-width:800px;margin:10px auto 12px;gap:10px">
<input id="lb-search" placeholder="Search models…" style="flex:1;padding:8px;border:1px solid #ddd;border-radius:0;font-size:14px" />
</div>
<table id="lb-table" style="width:100%;max-width:800px;margin:auto;border-collapse:collapse;font-size:14px">
<thead style="background:#f2f2f2">
<tr>
<th data-key="model" style="padding:8px;text-align:left">Model</th>
<th data-key="average" style="padding:8px;text-align:center">Overall Score</th>
<th data-key="total_cost" style="padding:8px;text-align:center">Cost</th>
</tr>
</thead>
<tbody id="lb-body"></tbody>
</table>
</div>
<Info>
Try refreshing the page if the leaderboard data is not visible.
</Info>
## Understanding the Benchmark
<Note>
For a more in-depth breakdown of our memory benchmarks, [read our blog](https://www.letta.com/blog/letta-leaderboard).
</Note>
We measure two foundational aspects of context management: **core memory** and **archival memory**. Core memory is what sits inside the agent's [context window](https://www.letta.com/blog/memory-blocks) (aka "in-context memory"), while archival memory is context managed outside the agent (aka "out-of-context memory", or "external memory"). This benchmark evaluates a stateful agent's fundamental capabilities in _reading_, _writing_, and _updating_ memories.
For all the tasks in the memory benchmarks, we generate a fictional question-answering dataset with supporting facts to minimize overlap with prior knowledge from LLM training. To evaluate, we use a prompted GPT 4.1 to grade the agent-generated answer against the ground-truth answer, following [SimpleQA](https://openai.com/index/introducing-simpleqa/). We then subtract a penalty for extraneous memory operations, discouraging inefficient or incorrect archival memory accesses.
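The scoring rule described above can be sketched as a judge grade minus a per-operation penalty for extraneous memory accesses, floored at zero. The penalty weight here is illustrative; the benchmark's actual value may differ.

```python
def task_score(grade: float, extraneous_ops: int, penalty: float = 0.05) -> float:
    """Judge grade in [0, 1] minus a penalty per extraneous memory
    operation, clamped so the score never goes below zero.

    The penalty weight (0.05) is a hypothetical placeholder, not the
    benchmark's actual value.
    """
    return max(0.0, grade - penalty * extraneous_ops)

# A correct answer with no wasted operations keeps its full grade;
# the same answer with four unnecessary archival accesses is docked.
perfect = task_score(1.0, extraneous_ops=0)
wasteful = task_score(1.0, extraneous_ops=4)
```

This is why a model can answer correctly yet still score lower: unnecessary archival reads or writes eat into its grade.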
## Main Results and Recommendations
For the **closed** model providers (OpenAI, Anthropic, Google):
* Anthropic Claude Sonnet 4 and OpenAI GPT 4.1 are the recommended models for most tasks
* Normalized for cost, Gemini 2.5 Flash and GPT 4o-mini are the top choices
* Models that perform well on the archival memory task (e.g. Claude Haiku 3.5) may overuse memory operations when they are unnecessary, and so receive a lower core memory score due to the extraneous access penalty
* The o-series reasoning models from OpenAI perform worse than GPT 4.1
For the **open weights** models (Llama, Qwen, Mistral, DeepSeek):
* Qwen3 235B is the best-performing model overall
* Llama 4 Scout 17B performs similarly to GPT 4.1-nano