---
title: Streaming agent responses
slug: guides/agents/streaming
---
Messages from the **Letta server** can be **streamed** to the client.
If you're building a UI on the Letta API, enabling streaming allows your UI to update in real-time as the agent generates a response to an input message.
<Warning>
When working with agents that execute long-running operations (e.g., complex tool calls, extensive searches, or code execution), you may encounter timeouts with the message routes.
See our [tips on handling long-running tasks](/guides/agents/long-running) for more info.
</Warning>
## Quick Start
Letta supports two streaming modes: **step streaming** (default) and **token streaming**.
To enable streaming, use the [`/v1/agents/{agent_id}/messages/stream`](/api-reference/agents/messages/stream) endpoint instead of `/messages`:
<CodeGroup>
```python title="python"
from letta_client import Letta

client = Letta(token="YOUR_API_KEY")

# Step streaming (default) - returns complete messages
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}]
)
for chunk in stream:
    print(chunk)  # Complete message objects

# Token streaming - returns partial chunks for real-time UX
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Hello!"}],
    stream_tokens=True  # Enable token streaming
)
for chunk in stream:
    print(chunk)  # Partial content chunks
```
```typescript title="typescript"
import { LettaClient } from '@letta-ai/letta-client';

const client = new LettaClient({ token: 'YOUR_API_KEY' });

// Step streaming (default) - returns complete messages
const stream = await client.agents.messages.createStream(
  agent.id, {
    messages: [{ role: "user", content: "Hello!" }]
  }
);
for await (const chunk of stream) {
  console.log(chunk); // Complete message objects
}

// Token streaming - returns partial chunks for real-time UX
const tokenStream = await client.agents.messages.createStream(
  agent.id, {
    messages: [{ role: "user", content: "Hello!" }],
    streamTokens: true // Enable token streaming
  }
);
for await (const chunk of tokenStream) {
  console.log(chunk); // Partial content chunks
}
```
</CodeGroup>
## Streaming Modes Comparison
| Aspect | Step Streaming (default) | Token Streaming |
|--------|-------------------------|-----------------|
| **What you get** | Complete messages after each step | Partial chunks as tokens generate |
| **When to use** | Simple implementation | ChatGPT-like real-time UX |
| **Reassembly needed** | No | Yes (by message ID) |
| **Message IDs** | Unique per message | Same ID across chunks |
| **Content format** | Full text in each message | Incremental text pieces |
| **Enable with** | Default behavior | `stream_tokens: true` |
## Understanding Message Flow
### Message Types and Flow Patterns
The messages you receive depend on your agent's configuration:
**With reasoning enabled (default):**
- Simple response: `reasoning_message` → `assistant_message`
- With tool use: `reasoning_message` → `tool_call_message` → `tool_return_message` → `reasoning_message` → `assistant_message`
**With reasoning disabled (`reasoning=false`):**
- Simple response: `assistant_message`
- With tool use: `tool_call_message` → `tool_return_message` → `assistant_message`
### Message Type Reference
- **`reasoning_message`**: Agent's internal thinking process (only when `reasoning=true`)
- **`assistant_message`**: The actual response shown to the user
- **`tool_call_message`**: Request to execute a tool
- **`tool_return_message`**: Result from tool execution
- **`stop_reason`**: Indicates end of response (`end_turn`)
- **`usage_statistics`**: Token usage and step count metrics
### Controlling Reasoning Messages
```python
# With reasoning (default) - includes reasoning_message events
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    # reasoning=True is the default
)

# Without reasoning - no reasoning_message events
agent = client.agents.create(
    model="openai/gpt-4o-mini",
    reasoning=False  # Disable reasoning messages
)
```
## Step Streaming (Default)
Step streaming delivers **complete messages** after each agent step completes. This is the default behavior when you use the streaming endpoint.
### How It Works
1. Agent processes your request through steps (reasoning, tool calls, generating responses)
2. After each step completes, you receive a complete `LettaMessage` via SSE
3. Each message can be processed immediately without reassembly
### Example
<CodeGroup>
```python title="python"
stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "What's 2+2?"}]
)

for chunk in stream:
    if hasattr(chunk, 'message_type'):
        if chunk.message_type == 'reasoning_message':
            print(f"Thinking: {chunk.reasoning}")
        elif chunk.message_type == 'assistant_message':
            print(f"Response: {chunk.content}")
```
```typescript title="typescript"
import { LettaClient } from '@letta-ai/letta-client';
import type { LettaMessage } from '@letta-ai/letta-client/api/types';

const client = new LettaClient({ token: 'YOUR_API_KEY' });

const stream = await client.agents.messages.createStream(
  agent.id, {
    messages: [{ role: "user", content: "What's 2+2?" }]
  }
);

for await (const chunk of stream as AsyncIterable<LettaMessage>) {
  if (chunk.messageType === 'reasoning_message') {
    console.log(`Thinking: ${(chunk as any).reasoning}`);
  } else if (chunk.messageType === 'assistant_message') {
    console.log(`Response: ${(chunk as any).content}`);
  }
}
```
```bash title="curl"
curl -N --request POST \
--url https://api.letta.com/v1/agents/$AGENT_ID/messages/stream \
--header "Authorization: Bearer $LETTA_API_KEY" \
--header 'Content-Type: application/json' \
--data '{"messages": [{"role": "user", "content": "What is 2+2?"}]}'
# For self-hosted: Replace https://api.letta.com with http://localhost:8283
```
</CodeGroup>
### Example Output
```
data: {"id":"msg-123","message_type":"reasoning_message","reasoning":"User is asking a simple math question."}
data: {"id":"msg-456","message_type":"assistant_message","content":"2 + 2 equals 4!"}
data: {"message_type":"stop_reason","stop_reason":"end_turn"}
data: {"message_type":"usage_statistics","completion_tokens":50,"total_tokens":2821}
data: [DONE]
```
## Token Streaming
Token streaming provides **partial content chunks** as they're generated by the LLM, enabling a ChatGPT-like experience where text appears character by character.
### How It Works
1. Set `stream_tokens: true` in your request
2. Receive multiple chunks with the **same message ID**
3. Each chunk contains a piece of the content
4. Client must accumulate chunks by ID to rebuild complete messages
### Example with Reassembly
<CodeGroup>
```python title="python"
# Token streaming with reassembly
message_accumulators = {}

stream = client.agents.messages.create_stream(
    agent_id=agent.id,
    messages=[{"role": "user", "content": "Tell me a joke"}],
    stream_tokens=True
)

for chunk in stream:
    if hasattr(chunk, 'id') and hasattr(chunk, 'message_type'):
        msg_id = chunk.id
        msg_type = chunk.message_type

        # Initialize accumulator for new messages
        if msg_id not in message_accumulators:
            message_accumulators[msg_id] = {
                'type': msg_type,
                'content': ''
            }

        # Extract this chunk's piece of content
        piece = ''
        if msg_type == 'reasoning_message':
            piece = chunk.reasoning
        elif msg_type == 'assistant_message':
            piece = chunk.content
        message_accumulators[msg_id]['content'] += piece

        # Print only the new piece so text appears incrementally
        print(piece, end='', flush=True)
```
```typescript title="typescript"
import { LettaClient } from '@letta-ai/letta-client';
import type { LettaMessage } from '@letta-ai/letta-client/api/types';

const client = new LettaClient({ token: 'YOUR_API_KEY' });

// Token streaming with reassembly
interface MessageAccumulator {
  type: string;
  content: string;
}
const messageAccumulators = new Map<string, MessageAccumulator>();

const stream = await client.agents.messages.createStream(
  agent.id, {
    messages: [{ role: "user", content: "Tell me a joke" }],
    streamTokens: true // Note: camelCase
  }
);

for await (const chunk of stream as AsyncIterable<LettaMessage>) {
  if (chunk.id && chunk.messageType) {
    const msgId = chunk.id;
    const msgType = chunk.messageType;

    // Initialize accumulator for new messages
    if (!messageAccumulators.has(msgId)) {
      messageAccumulators.set(msgId, { type: msgType, content: '' });
    }

    const acc = messageAccumulators.get(msgId)!;
    // Only accumulate if the type matches (in case types share IDs)
    let piece = '';
    if (acc.type === msgType) {
      if (msgType === 'reasoning_message') {
        piece = (chunk as any).reasoning || '';
      } else if (msgType === 'assistant_message') {
        piece = (chunk as any).content || '';
      }
    }
    acc.content += piece;

    // Write only the new piece so text appears incrementally
    process.stdout.write(piece);
  }
}
```
```bash title="curl"
curl -N --request POST \
--url https://api.letta.com/v1/agents/$AGENT_ID/messages/stream \
--header "Authorization: Bearer $LETTA_API_KEY" \
--header 'Content-Type: application/json' \
--data '{
"messages": [{"role": "user", "content": "Tell me a joke"}],
"stream_tokens": true
}'
```
</CodeGroup>
### Example Output
```
# Same ID across chunks of the same message
data: {"id":"msg-abc","message_type":"assistant_message","content":"Why"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" did"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" the"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" scarecrow"}
data: {"id":"msg-abc","message_type":"assistant_message","content":" win"}
# ... more chunks with same ID
data: [DONE]
```
## Implementation Tips
### Universal Handling Pattern
The accumulator pattern shown above works for **both** streaming modes:
- **Step streaming**: Each message is complete (single chunk per ID)
- **Token streaming**: Multiple chunks per ID need accumulation
This means you can write your client code once to handle both cases.
### SSE Format Notes
All streaming responses follow the Server-Sent Events (SSE) format:
- Each event starts with `data: ` followed by JSON
- Stream ends with `data: [DONE]`
- Empty lines separate events
Learn more about SSE format [here](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events/Using_server-sent_events).
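If you are consuming the raw HTTP response rather than an SDK, the framing described above can be handled in a few lines. This is a sketch of the `data:`-line handling only, not a full SSE parser (it ignores `event:`, `id:`, and multi-line `data:` fields):

```python
import json

def parse_sse_lines(lines):
    """Yield decoded JSON events from raw SSE lines; stop at the [DONE] sentinel."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip the empty separator lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # end of stream
        yield json.loads(payload)

raw = [
    'data: {"id":"msg-123","message_type":"assistant_message","content":"2 + 2 equals 4!"}',
    '',
    'data: [DONE]',
]
events = list(parse_sse_lines(raw))
print(events[0]["content"])  # 2 + 2 equals 4!
```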
### Handling Different LLM Providers
If your Letta server connects to multiple LLM providers, some may not support token streaming. Your client code will still work: the server automatically falls back to step streaming when token streaming isn't available.