# Getting Started

Run your first Letta agent evaluation in 5 minutes.

## Prerequisites

- Python 3.11 or higher
- A running Letta server (local or Letta Cloud)
- A Letta agent to test, either in agent file format or by ID (see [Targets](/evals/core-concepts/targets) for more details)

## Installation

```bash
pip install letta-evals
```

Or with uv:

```bash
uv pip install letta-evals
```

## Getting an Agent to Test

Export an existing agent to a file using the Letta SDK:

```python
from letta_client import Letta
import os

client = Letta(
    base_url="http://localhost:8283",  # or https://api.letta.com for Letta Cloud
    token=os.getenv("LETTA_API_KEY")   # required for Letta Cloud
)

# Export an agent to a file
agent_file = client.agents.export_file(agent_id="agent-123")

# Save to disk
with open("my_agent.af", "w") as f:
    f.write(agent_file)
```

Or export via the Agent Development Environment (ADE) by selecting "Export Agent".

Then reference it in your suite:

```yaml
target:
  kind: agent
  agent_file: my_agent.af
```

<Note>
**Other options:** You can also use existing agents by ID or programmatically generate agents. See [Targets](/evals/core-concepts/targets) for all agent configuration options.
</Note>

## Quick Start

Let's create your first evaluation in 3 steps:

### 1. Create a Test Dataset

Create a file named `dataset.jsonl`:

```jsonl
{"input": "What's the capital of France?", "ground_truth": "Paris"}
{"input": "Calculate 2+2", "ground_truth": "4"}
{"input": "What color is the sky?", "ground_truth": "blue"}
```

Each line is a JSON object with:
- `input`: The prompt to send to your agent
- `ground_truth`: The expected answer (used for grading)

<Note>
`ground_truth` is optional for some graders (like rubric graders), but required for tool graders like `contains` and `exact_match`.
</Note>
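
If you prefer to build the dataset in code (for example, from an existing test set), a minimal Python sketch that writes the same three samples might look like this:

```python
import json

# Hypothetical example: generate dataset.jsonl programmatically instead of by hand
samples = [
    {"input": "What's the capital of France?", "ground_truth": "Paris"},
    {"input": "Calculate 2+2", "ground_truth": "4"},
    {"input": "What color is the sky?", "ground_truth": "blue"},
]

with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")  # one JSON object per line
```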

Read more about [Datasets](/evals/core-concepts/datasets) for details on how to create your dataset.

### 2. Create a Suite Configuration

Create a file named `suite.yaml`:

```yaml
name: my-first-eval
dataset: dataset.jsonl

target:
  kind: agent
  agent_file: my_agent.af          # Path to your agent file
  base_url: http://localhost:8283  # Your Letta server

graders:
  quality:
    kind: tool
    function: contains         # Check if response contains the ground truth
    extractor: last_assistant  # Use the last assistant message

gate:
  metric_key: quality
  op: gte
  value: 0.75  # Require 75% pass rate
```

The suite configuration defines:
- The [dataset](/evals/core-concepts/datasets) to use
- The [agent](/evals/core-concepts/targets) to test
- The [graders](/evals/core-concepts/graders) to use
- The [gate](/evals/core-concepts/gates) criteria

Read more about [Suites](/evals/core-concepts/suites) for details on how to configure your evaluation.
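
Relative paths in the suite, such as `dataset.jsonl` and `agent_file`, are resolved from the location of the YAML file itself, so a simple layout like the following (just an example, any structure works) keeps everything together:

```
my-first-eval/
├── suite.yaml      # suite configuration
├── dataset.jsonl   # test samples
└── my_agent.af     # exported agent file
```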

### 3. Run the Evaluation

Run your evaluation with the following command:

```bash
letta-evals run suite.yaml
```

You'll see real-time progress as your evaluation runs:

```
Running evaluation: my-first-eval
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%
✓ PASSED (2.25/3.00 avg, 75.0% pass rate)
```

Read more about [CLI Commands](/evals/cli-reference/commands) for details about the available commands and options.

## Understanding the Results

The core evaluation flow is:

**Dataset → Target (Agent) → Extractor → Grader → Gate → Result**

The evaluation runner:
1. Loads your dataset
2. Sends each input to your agent (Target)
3. Extracts the relevant information (using the Extractor)
4. Grades the response (using the Grader function)
5. Computes aggregate metrics
6. Checks if metrics pass the Gate criteria
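
Conceptually, those six steps boil down to a loop like the following Python sketch. This is not the letta-evals implementation, just an illustration of the flow; `send_to_agent` and `extract_last_assistant` are hypothetical stand-ins for the Target and Extractor stages:

```python
import json

def grade_contains(extracted: str, ground_truth: str) -> float:
    # Tool-style grader: 1.0 if the ground truth appears in the extracted text
    return 1.0 if ground_truth.lower() in extracted.lower() else 0.0

def run_suite(dataset_path, send_to_agent, extract_last_assistant, gate_value=0.75):
    scores = []
    with open(dataset_path) as f:
        for line in f:                                 # 1. load the dataset
            sample = json.loads(line)
            messages = send_to_agent(sample["input"])  # 2. send input to the target agent
            answer = extract_last_assistant(messages)  # 3. extract the relevant output
            scores.append(grade_contains(answer, sample["ground_truth"]))  # 4. grade it
    pass_rate = sum(1 for s in scores if s >= 1.0) / len(scores)           # 5. aggregate
    return pass_rate >= gate_value                                         # 6. gate
```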

The output shows:
- **Average score**: Mean score across all samples
- **Pass rate**: Percentage of samples that passed
- **Gate status**: Whether the evaluation passed or failed overall

## Next Steps

Now that you've run your first evaluation, explore more advanced features:

- [Core Concepts](/evals/core-concepts/concepts-overview) - Understand suites, datasets, graders, and extractors
- [Grader Types](/evals/core-concepts/graders) - Learn about tool graders vs rubric graders
- [Multi-Metric Evaluation](/evals/graders/multi-metric-grading) - Test multiple aspects simultaneously
- [Custom Graders](/evals/advanced/custom-graders) - Write custom grading functions
- [Multi-Turn Conversations](/evals/advanced/multi-turn-conversations) - Test conversational memory

## Common Use Cases

### Strict Answer Checking

Use exact matching for cases where the answer must be precisely correct:

```yaml
graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
```

### Subjective Quality Evaluation

Use an LLM judge to evaluate subjective qualities like helpfulness or tone:

```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    extractor: last_assistant
```

Then create `rubric.txt`:

```
Rate the helpfulness and accuracy of the response.
- Score 1.0 if helpful and accurate
- Score 0.5 if partially helpful
- Score 0.0 if unhelpful or wrong
```

### Testing Tool Calls

Verify that your agent calls specific tools with expected arguments:

```yaml
graders:
  tool_check:
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
```
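
For example, assuming the `tool_arguments` extractor returns the `search` tool's arguments as text and `contains` checks that `ground_truth` appears in that text, a dataset line for this grader might look like:

```jsonl
{"input": "Look up recent news about the Mars rover", "ground_truth": "Mars rover"}
```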

### Testing Memory Persistence

Check if the agent correctly updates its memory blocks:

```yaml
graders:
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human
```
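
For example, assuming the `memory_block` extractor returns the contents of the `human` block after the run, a dataset line might check that a fact from the conversation was stored:

```jsonl
{"input": "Hi! My name is Alice and I'm a marine biologist.", "ground_truth": "Alice"}
```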

## Troubleshooting

<Warning>
**"Agent file not found"**

Make sure your `agent_file` path is correct. Paths are relative to the suite YAML file location. Use absolute paths if needed:

```yaml
target:
  agent_file: /absolute/path/to/my_agent.af
```
</Warning>

<Warning>
**"Connection refused"**

Your Letta server isn't running or isn't accessible. Start it with:

```bash
letta server
```

By default, it runs at `http://localhost:8283`.
</Warning>

<Warning>
**"No ground_truth provided"**

Tool graders like `exact_match` and `contains` require `ground_truth` in your dataset. Either:
- Add `ground_truth` to your samples, or
- Use a rubric grader which doesn't require ground truth
</Warning>

<Tip>
**Agent didn't respond as expected**

Try testing your agent manually first using the Letta SDK or Agent Development Environment (ADE) to see how it behaves before running evaluations. See the [Letta documentation](https://docs.letta.com) for more information.
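
As a quick smoke test, you can send a single message through the Python SDK and inspect what comes back (a minimal sketch, assuming the letta-client messages API and an `agent-123` agent ID):

```python
from letta_client import Letta

client = Letta(base_url="http://localhost:8283")

# Send one test input and print every message the agent produces
response = client.agents.messages.create(
    agent_id="agent-123",
    messages=[{"role": "user", "content": "What's the capital of France?"}],
)
for message in response.messages:
    print(message)
```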
</Tip>

For more help, see the [Troubleshooting Guide](/evals/troubleshooting/common-issues).