---
title: Benchmark Information
subtitle: Understand how we benchmark the different models
# layout: page
# hide-feedback: true
# no-image-zoom: true
slug: leaderboard/benchmarks
---

## Understanding the Letta Memory Benchmark
We measure two foundational aspects of context management: **core memory** and **archival memory**. Core memory is what lives inside the agent's [context window](https://www.letta.com/blog/memory-blocks) (aka "in-context memory"), while archival memory is context managed outside the agent's context window (aka "out-of-context memory", or "external memory"). The benchmark evaluates a stateful agent's fundamental capabilities for _reading_, _writing_, and _updating_ memories.

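
The two memory tiers can be sketched as follows. This is an illustrative sketch only, not Letta's actual API: the class and method names here are hypothetical.

```python
# Illustrative sketch of the two memory tiers (hypothetical names, not Letta's API).

class AgentMemory:
    """Core memory lives in the prompt; archival memory is external storage."""

    def __init__(self):
        self.core = {}      # in-context memory blocks, always visible to the LLM
        self.archival = []  # out-of-context passages, retrieved on demand

    # Core memory read: render the blocks into the context window.
    def render_core(self) -> str:
        return "\n".join(f"<{label}>{text}</{label}>"
                         for label, text in self.core.items())

    # Core memory write/update: rewrite a block in place.
    def core_update(self, label: str, text: str) -> None:
        self.core[label] = text

    # Archival memory write: append a passage to external storage.
    def archival_insert(self, text: str) -> None:
        self.archival.append(text)

    # Archival memory read: retrieve via (here, naive keyword) search.
    def archival_search(self, query: str) -> list[str]:
        return [t for t in self.archival if query.lower() in t.lower()]


mem = AgentMemory()
mem.core_update("human", "Name: Ada. Prefers concise answers.")
mem.archival_insert("Ada adopted a cat named Turing in 2021.")
print(mem.archival_search("cat"))  # -> ['Ada adopted a cat named Turing in 2021.']
```

The key distinction the benchmark probes is that core memory is always visible to the model, while archival memory must be explicitly searched, so the agent has to decide when an external lookup is actually needed.
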

For all tasks in the Letta Memory Benchmark, we generate a fictional question-answering dataset with supporting facts, to minimize interference from prior knowledge in LLM training data. To evaluate, we use a prompted GPT 4.1 grader to compare the agent-generated answer against the ground-truth answer, following [SimpleQA](https://openai.com/index/introducing-simpleqa/). We also apply a penalty for extraneous memory operations, so that models are penalized for inefficient or incorrect archival memory accesses.

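
A minimal sketch of how such a penalized score could be computed. The grade labels follow SimpleQA's scheme, but the function name and the penalty weight are assumptions for illustration, not the benchmark's published implementation.

```python
# Hypothetical sketch: SimpleQA-style grading plus an extraneous-operation
# penalty. The penalty weight (0.05 per operation) is an assumed value.

def penalized_score(grades: list[str], extraneous_ops: int,
                    penalty_per_op: float = 0.05) -> float:
    """grades: per-question labels from the LLM grader
    ("CORRECT" / "INCORRECT" / "NOT_ATTEMPTED", as in SimpleQA).
    extraneous_ops: memory operations judged unnecessary for the task."""
    correct = sum(g == "CORRECT" for g in grades)
    accuracy = correct / len(grades)
    # Clamp at zero so heavy penalties cannot produce a negative score.
    return max(0.0, accuracy - penalty_per_op * extraneous_ops)


score = penalized_score(["CORRECT", "CORRECT", "INCORRECT", "CORRECT"],
                        extraneous_ops=2)
# 0.75 accuracy minus 2 * 0.05 penalty, i.e. approximately 0.65
```

Under a scheme like this, a model that answers correctly but performs many unnecessary archival lookups can still end up below a model with slightly lower raw accuracy, which is the behavior described for Claude Haiku 3.5 in the results section.
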
For more details on the benchmark, see our [blog post](https://www.letta.com/blog/memory-benchmark).

## Main Results and Recommendations
For the **closed** model providers (OpenAI, Anthropic, Google):
* Anthropic Claude Sonnet 4 and OpenAI GPT 4.1 are the recommended models for most tasks
* Normalized for cost, Gemini 2.5 Flash and GPT 4o-mini are top choices
* Models that perform well on the archival memory task (e.g. Claude Haiku 3.5) might overuse memory operations when unnecessary, thus receiving a lower score on core memory due to the extraneous access penalty.
* The o-series reasoner models from OpenAI perform worse than GPT 4.1

For the **open-weights** models (Llama, Qwen, Mistral, DeepSeek):
* Llama 3.1 405B is the best-performing model overall
* Llama 4 Scout 17B and Qwen 2.5 72B perform similarly to GPT 4.1 Mini