# Getting Started
Run your first Letta agent evaluation in 5 minutes.
## Prerequisites
- Python 3.11 or higher
- A running Letta server (local or Letta Cloud)
- A Letta agent to test, either in agent file format or by ID (see [Targets](/evals/core-concepts/targets) for more details)
## Installation
```bash
pip install letta-evals
```
Or with uv:
```bash
uv pip install letta-evals
```
## Getting an Agent to Test
Export an existing agent to a file using the Letta SDK:
```python
from letta_client import Letta
import os
client = Letta(
    base_url="http://localhost:8283",  # or https://api.letta.com for Letta Cloud
    token=os.getenv("LETTA_API_KEY")   # required for Letta Cloud
)
# Export an agent to a file
agent_file = client.agents.export_file(agent_id="agent-123")
# Save to disk
with open("my_agent.af", "w") as f:
    f.write(agent_file)
```
Or export via the Agent Development Environment (ADE) by selecting "Export Agent".
Then reference it in your suite:
```yaml
target:
  kind: agent
  agent_file: my_agent.af
```
<Note>
**Other options:** You can also use existing agents by ID or programmatically generate agents. See [Targets](/evals/core-concepts/targets) for all agent configuration options.
</Note>
## Quick Start
Let's create your first evaluation in 3 steps:
### 1. Create a Test Dataset
Create a file named `dataset.jsonl`:
```jsonl
{"input": "What's the capital of France?", "ground_truth": "Paris"}
{"input": "Calculate 2+2", "ground_truth": "4"}
{"input": "What color is the sky?", "ground_truth": "blue"}
```
Each line is a JSON object with:
- `input`: The prompt to send to your agent
- `ground_truth`: The expected answer (used for grading)
<Note>
`ground_truth` is optional for some graders (like rubric graders), but required for tool graders like `contains` and `exact_match`.
</Note>
See [Datasets](/evals/core-concepts/datasets) for details on creating your dataset.
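If you prefer to generate the dataset programmatically, a short script with Python's standard `json` module writes the same file. This is just one way to produce JSONL; the samples mirror the ones above:

```python
import json

# Each sample becomes one line in the JSONL file.
samples = [
    {"input": "What's the capital of France?", "ground_truth": "Paris"},
    {"input": "Calculate 2+2", "ground_truth": "4"},
    {"input": "What color is the sky?", "ground_truth": "blue"},
]

with open("dataset.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```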
### 2. Create a Suite Configuration
Create a file named `suite.yaml`:
```yaml
name: my-first-eval
dataset: dataset.jsonl
target:
  kind: agent
  agent_file: my_agent.af          # Path to your agent file
  base_url: http://localhost:8283  # Your Letta server
graders:
  quality:
    kind: tool
    function: contains         # Check if response contains the ground truth
    extractor: last_assistant  # Use the last assistant message
gate:
  metric_key: quality
  op: gte
  value: 0.75  # Require 75% pass rate
```
The suite configuration defines:
- The [dataset](/evals/core-concepts/datasets) to use
- The [agent](/evals/core-concepts/targets) to test
- The [graders](/evals/core-concepts/graders) to use
- The [gate](/evals/core-concepts/gates) criteria
Read more about [Suites](/evals/core-concepts/suites) for details on how to configure your evaluation.
### 3. Run the Evaluation
Run your evaluation with the following command:
```bash
letta-evals run suite.yaml
```
You'll see real-time progress as your evaluation runs:
```
Running evaluation: my-first-eval
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3/3 100%
✓ PASSED (2.25/3.00 avg, 75.0% pass rate)
```
Read more about [CLI Commands](/evals/cli-reference/commands) for details about the available commands and options.
## Understanding the Results
The core evaluation flow is:
**Dataset → Target (Agent) → Extractor → Grader → Gate → Result**
The evaluation runner:
1. Loads your dataset
2. Sends each input to your agent (Target)
3. Extracts the relevant information (using the Extractor)
4. Grades the response (using the Grader function)
5. Computes aggregate metrics
6. Checks if metrics pass the Gate criteria
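The steps above can be sketched in plain Python. This is a conceptual illustration only, not letta-evals' actual implementation: `run_agent` is a stand-in for the Target call, and the extractor, grader, and gate are simplified versions of what the runner does.

```python
# Conceptual sketch of the evaluation loop (not the library's real code).
dataset = [
    {"input": "What's the capital of France?", "ground_truth": "Paris"},
    {"input": "Calculate 2+2", "ground_truth": "4"},
]

def run_agent(prompt):
    # Stand-in Target: in practice this sends the prompt to your Letta
    # agent and returns the resulting message list.
    canned = {
        "What's the capital of France?": "The capital is Paris.",
        "Calculate 2+2": "2+2 equals 4.",
    }
    return [{"role": "assistant", "content": canned[prompt]}]

def last_assistant(messages):
    # Extractor: pull the text of the last assistant message.
    return [m["content"] for m in messages if m["role"] == "assistant"][-1]

def contains(response, ground_truth):
    # Tool grader: score 1.0 if the ground truth appears in the response.
    return 1.0 if ground_truth in response else 0.0

# Run every sample through Target -> Extractor -> Grader, then aggregate.
scores = [
    contains(last_assistant(run_agent(s["input"])), s["ground_truth"])
    for s in dataset
]
avg = sum(scores) / len(scores)
gate_passed = avg >= 0.75  # gate: metric_key quality, op gte, value 0.75
```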
The output shows:
- **Average score**: Mean score across all samples
- **Pass rate**: Percentage of samples that passed
- **Gate status**: Whether the evaluation passed or failed overall
## Next Steps
Now that you've run your first evaluation, explore more advanced features:
- [Core Concepts](/evals/core-concepts/concepts-overview) - Understand suites, datasets, graders, and extractors
- [Grader Types](/evals/core-concepts/graders) - Learn about tool graders vs rubric graders
- [Multi-Metric Evaluation](/evals/graders/multi-metric-grading) - Test multiple aspects simultaneously
- [Custom Graders](/evals/advanced/custom-graders) - Write custom grading functions
- [Multi-Turn Conversations](/evals/advanced/multi-turn-conversations) - Test conversational memory
## Common Use Cases
### Strict Answer Checking
Use exact matching for cases where the answer must be precisely correct:
```yaml
graders:
  accuracy:
    kind: tool
    function: exact_match
    extractor: last_assistant
```
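The difference between `exact_match` and `contains` matters here. Conceptually (simplified; the real graders may normalize things like whitespace or casing):

```python
def exact_match(response, ground_truth):
    # Passes only if the response is precisely the expected answer.
    return 1.0 if response == ground_truth else 0.0

def contains(response, ground_truth):
    # Passes if the expected answer appears anywhere in the response.
    return 1.0 if ground_truth in response else 0.0

response = "The capital of France is Paris."
strict_score = exact_match(response, "Paris")  # fails: extra words present
loose_score = contains(response, "Paris")      # passes: "Paris" appears
```

Use `exact_match` when the agent is expected to answer with exactly the ground truth and nothing else.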
### Subjective Quality Evaluation
Use an LLM judge to evaluate subjective qualities like helpfulness or tone:
```yaml
graders:
  quality:
    kind: rubric
    prompt_path: rubric.txt
    model: gpt-4o-mini
    extractor: last_assistant
```
Then create `rubric.txt`:
```
Rate the helpfulness and accuracy of the response.
- Score 1.0 if helpful and accurate
- Score 0.5 if partially helpful
- Score 0.0 if unhelpful or wrong
```
### Testing Tool Calls
Verify that your agent calls specific tools with expected arguments:
```yaml
graders:
  tool_check:
    kind: tool
    function: contains
    extractor: tool_arguments
    extractor_config:
      tool_name: search
```
### Testing Memory Persistence
Check if the agent correctly updates its memory blocks:
```yaml
graders:
  memory_check:
    kind: tool
    function: contains
    extractor: memory_block
    extractor_config:
      block_label: human
```
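The extractors in the last two examples select different pieces of the agent's output before grading. A rough conceptual sketch, using a hypothetical simplified transcript shape (real Letta messages and memory carry more fields):

```python
# Hypothetical, simplified transcript for illustration only.
transcript = {
    "messages": [
        {"role": "assistant",
         "tool_call": {"name": "search",
                       "arguments": '{"query": "weather in Paris"}'}},
        {"role": "assistant", "content": "It is sunny in Paris."},
    ],
    "memory_blocks": {"human": "Name is Ada. Lives in Paris."},
}

def extract_tool_arguments(transcript, tool_name):
    # tool_arguments-style extractor: collect the arguments passed to a
    # named tool, so a grader can check them against the ground truth.
    return [m["tool_call"]["arguments"]
            for m in transcript["messages"]
            if m.get("tool_call", {}).get("name") == tool_name]

def extract_memory_block(transcript, block_label):
    # memory_block-style extractor: return the labeled block's value after
    # the run, so a grader can check what the agent remembered.
    return transcript["memory_blocks"][block_label]
```

In both cases the grader (`contains` above) then checks the extracted text, not the assistant's final reply.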
## Troubleshooting
<Warning>
**"Agent file not found"**
Make sure your `agent_file` path is correct. Paths are relative to the suite YAML file location. Use absolute paths if needed:
```yaml
target:
  agent_file: /absolute/path/to/my_agent.af
```
</Warning>
<Warning>
**"Connection refused"**
Your Letta server isn't running or isn't accessible. Start it with:
```bash
letta server
```
By default, it runs at `http://localhost:8283`.
</Warning>
<Warning>
**"No ground_truth provided"**
Tool graders like `exact_match` and `contains` require `ground_truth` in your dataset. Either:
- Add `ground_truth` to your samples, or
- Use a rubric grader which doesn't require ground truth
</Warning>
<Tip>
**Agent didn't respond as expected**
Try testing your agent manually first using the Letta SDK or Agent Development Environment (ADE) to see how it behaves before running evaluations. See the [Letta documentation](https://docs.letta.com) for more information.
</Tip>
For more help, see the [Troubleshooting Guide](/evals/troubleshooting/common-issues).