---
title: Benchmark Information
subtitle: Understand how we benchmark the different models
# layout: page
# hide-feedback: true
# no-image-zoom: true
slug: leaderboard/benchmarks
---

## Understanding the Letta Memory Benchmark

We measure two foundational aspects of context management: **core memory** and **archival memory**. Core memory is what is inside the agent's [context window](https://www.letta.com/blog/memory-blocks) (aka "in-context memory"), and archival memory is context managed externally to the agent (aka "out-of-context memory", or "external memory"). This benchmark evaluates stateful agents' fundamental capabilities in _reading_, _writing_, and _updating_ memories.

For all the tasks in the Letta Memory Benchmark, we generate a fictional question-answering dataset with supporting facts to minimize prior knowledge from LLM training. To evaluate, we use a prompted GPT 4.1 grader to compare the agent-generated answer against the ground-truth answer, following [SimpleQA](https://openai.com/index/introducing-simpleqa/). We also apply a penalty for extraneous memory operations, penalizing models for inefficient or incorrect archival memory accesses. For more details on the benchmark, refer to our [blog post](https://www.letta.com/blog/memory-benchmark).

## Main Results and Recommendations

For the **closed** model providers (OpenAI, Anthropic, Google):

* Anthropic Claude Sonnet 4 and OpenAI GPT 4.1 are the recommended models for most tasks.
* Normalized for cost, Gemini 2.5 Flash and GPT 4o-mini are the top choices.
* Models that perform well on the archival memory task (e.g. Claude Haiku 3.5) may overuse memory operations when they are unnecessary, and thus receive a lower core memory score due to the extraneous-access penalty.
* The o-series reasoning models from OpenAI perform worse than GPT 4.1.

For the **open weights** models (Llama, Qwen, Mistral, DeepSeek):

* Llama 3.1 405B is the best performing overall.
* Llama 4 Scout 17B and Qwen 2.5 72B perform similarly to GPT 4.1 Mini.
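
The extraneous-operation penalty described above can be sketched roughly as follows. This is a minimal illustration, not Letta's actual scoring code: the function name, the per-operation penalty weight, and the way "needed" operations are counted are all assumptions made for the example.

```python
# Illustrative sketch of a penalized benchmark score.
# ASSUMPTIONS: the 0.05 penalty weight, the flooring at 0.0, and the
# notion of a known "needed" operation count are hypothetical choices
# for this example, not Letta's published scoring formula.

def penalized_score(grader_correct: bool, memory_ops: int,
                    needed_ops: int, penalty_per_op: float = 0.05) -> float:
    """Start from the LLM grader's verdict (1.0 if correct, else 0.0),
    then subtract a penalty for each memory operation beyond what the
    task actually required, flooring the result at 0.0."""
    base = 1.0 if grader_correct else 0.0
    extraneous = max(0, memory_ops - needed_ops)
    return max(0.0, base - penalty_per_op * extraneous)
```

Under this sketch, a model that answers correctly with no wasted operations keeps its full score, while one that answers correctly but issues many unnecessary archival reads is penalized, matching the behavior noted above for models like Claude Haiku 3.5.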