refactor: add extract_usage_statistics returning LettaUsageStatistics (#9065)
👾 Generated with [Letta Code](https://letta.com)

Co-authored-by: Letta <noreply@letta.com>
committed by Caren Thomas
parent 2bccd36382
commit 221b4e6279
43 .skills/llm-provider-usage-statistics/SKILL.md Normal file
@@ -0,0 +1,43 @@
---
name: llm-provider-usage-statistics
description: Reference guide for token counting and prefix caching across LLM providers (OpenAI, Anthropic, Gemini). Use when debugging token counts or optimizing prefix caching.
---

# LLM Provider Usage Statistics

Reference documentation for how different LLM providers report token usage.

## Quick Reference: Token Counting Semantics

| Provider | `input_tokens` meaning | Cache tokens | Must add cache to get total? |
|----------|------------------------|--------------|------------------------------|
| OpenAI | TOTAL (includes cached) | `cached_tokens` is subset | No |
| Anthropic | NON-cached only | `cache_read_input_tokens` + `cache_creation_input_tokens` | **Yes** |
| Gemini | TOTAL (includes cached) | `cached_content_token_count` is subset | No |

**Critical difference:** Anthropic's `input_tokens` excludes cached tokens, so you must add them:

```
total_input = input_tokens + cache_read_input_tokens + cache_creation_input_tokens
```
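The table above can be collapsed into a small normalization helper. A minimal sketch, not Letta's actual interface: the `provider` strings and plain-dict `usage` shape are illustrative assumptions, while the field names come from the table.

```python
def total_input_tokens(provider: str, usage: dict) -> int:
    """Return total input tokens per each provider's semantics.

    OpenAI and Gemini already report the total (cached tokens are a
    subset); Anthropic reports only NON-cached tokens, so cache reads
    and cache writes must be added back in.
    """
    if provider == "anthropic":
        return (
            usage.get("input_tokens", 0)
            + usage.get("cache_read_input_tokens", 0)
            + usage.get("cache_creation_input_tokens", 0)
        )
    if provider == "openai":
        return usage.get("prompt_tokens", 0)  # already includes cached_tokens
    if provider == "gemini":
        return usage.get("prompt_token_count", 0)  # already includes cached
    raise ValueError(f"unknown provider: {provider}")
```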

## Quick Reference: Prefix Caching

| Provider | Min tokens | How to enable | TTL |
|----------|-----------|---------------|-----|
| OpenAI | 1,024 | Automatic | ~5-10 min |
| Anthropic | 1,024 | Requires `cache_control` breakpoints | 5 min |
| Gemini 2.0+ | 1,024 | Automatic (implicit) | Variable |

## Quick Reference: Reasoning/Thinking Tokens

| Provider | Field name | Models |
|----------|-----------|--------|
| OpenAI | `reasoning_tokens` | o1, o3 models |
| Anthropic | N/A | (thinking is in content blocks, not usage) |
| Gemini | `thoughts_token_count` | Gemini 2.0 with thinking enabled |

## Provider Reference Files

- **OpenAI:** [references/openai.md](references/openai.md) - Chat Completions vs Responses API, reasoning models, cached_tokens
- **Anthropic:** [references/anthropic.md](references/anthropic.md) - cache_control setup, beta headers, cache token fields
- **Gemini:** [references/gemini.md](references/gemini.md) - implicit caching, thinking tokens, usage_metadata fields
83 .skills/llm-provider-usage-statistics/references/anthropic.md Normal file
@@ -0,0 +1,83 @@
# Anthropic Usage Statistics

## Response Format

```
response.usage.input_tokens                 # NON-cached input tokens only
response.usage.output_tokens                # Output tokens
response.usage.cache_read_input_tokens      # Tokens read from cache
response.usage.cache_creation_input_tokens  # Tokens written to cache
```

## Critical: Token Calculation

**Anthropic's `input_tokens` is NOT the total.** To get total input tokens:

```python
total_input = input_tokens + cache_read_input_tokens + cache_creation_input_tokens
```

This is different from OpenAI/Gemini, where the reported prompt token count is already the total.

## Prefix Caching (Prompt Caching)

**Requirements:**
- Minimum 1,024 tokens for Claude Sonnet/Opus models
- Minimum 2,048 tokens for Claude Haiku models
- Requires explicit `cache_control` breakpoints in messages
- TTL: 5 minutes
**How to enable:**
Add `cache_control` to message content:

```python
{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "...",
            "cache_control": {"type": "ephemeral"}
        }
    ]
}
```

**Beta header required:**

```python
betas = ["prompt-caching-2024-07-31"]
```
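Putting the pieces together, a cacheable request might be assembled like this. A hedged sketch: the kwargs dict mirrors the snippets above, the model name and system text are placeholders, and the commented-out call assumes the official `anthropic` SDK.

```python
# Assemble kwargs for a cacheable Anthropic request. The long system
# prompt carries the cache_control breakpoint so the shared prefix can
# be cached across requests (it must clear the minimum token threshold).
request_kwargs = {
    "model": "claude-3-5-sonnet-20241022",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a helpful assistant. " * 200,  # long shared prefix
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Hello"}],
}

# With the official SDK (assumed), the call would look like:
# import anthropic
# client = anthropic.Anthropic()
# response = client.messages.create(
#     **request_kwargs,
#     extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
# )
# response.usage.cache_creation_input_tokens  # expect > 0 on first call
```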

## Cache Behavior

- `cache_creation_input_tokens`: Tokens that were cached on this request (cache write)
- `cache_read_input_tokens`: Tokens that were read from existing cache (cache hit)
- On the first request: expect `cache_creation_input_tokens > 0`
- On subsequent requests with the same prefix: expect `cache_read_input_tokens > 0`

## Streaming

In streaming mode, usage is reported in two events:

1. **`message_start`**: Initial usage (may include cache info)
```python
event.message.usage.input_tokens
event.message.usage.output_tokens
event.message.usage.cache_read_input_tokens
event.message.usage.cache_creation_input_tokens
```

2. **`message_delta`**: Cumulative output tokens
```python
event.usage.output_tokens  # This is CUMULATIVE, not incremental
```

**Important:** Per Anthropic docs, `message_delta` token counts are cumulative, so assign (don't accumulate).
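The assign-don't-accumulate rule can be sketched with a tiny tracker. The dict-shaped events below are stand-ins for the SDK's stream events, not its actual types:

```python
class UsageTracker:
    """Track usage across an Anthropic stream.

    message_start carries input/cache counts once; message_delta carries
    CUMULATIVE output counts, so we overwrite rather than add.
    """

    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0
        self.cache_read_input_tokens = 0
        self.cache_creation_input_tokens = 0

    def on_message_start(self, usage: dict):
        # Reported once at stream start; cache fields may be absent.
        self.input_tokens = usage.get("input_tokens", 0)
        self.cache_read_input_tokens = usage.get("cache_read_input_tokens") or 0
        self.cache_creation_input_tokens = usage.get("cache_creation_input_tokens") or 0

    def on_message_delta(self, usage: dict):
        # Cumulative per Anthropic docs: assign, never +=
        self.output_tokens = usage.get("output_tokens", 0)
```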

## Letta Implementation

- **Client:** `letta/llm_api/anthropic_client.py`
- **Streaming interfaces:**
  - `letta/interfaces/anthropic_streaming_interface.py`
  - `letta/interfaces/anthropic_parallel_tool_call_streaming_interface.py` (tracks cache tokens)
- **Extract method:** `AnthropicClient.extract_usage_statistics()`
- **Cache control:** `_add_cache_control_to_system_message()`, `_add_cache_control_to_messages()`
81 .skills/llm-provider-usage-statistics/references/gemini.md Normal file
@@ -0,0 +1,81 @@
# Gemini Usage Statistics

## Response Format

Gemini returns usage in `usage_metadata`:

```
response.usage_metadata.prompt_token_count          # Total input tokens
response.usage_metadata.candidates_token_count      # Output tokens
response.usage_metadata.total_token_count           # Sum
response.usage_metadata.cached_content_token_count  # Tokens from cache (optional)
response.usage_metadata.thoughts_token_count        # Reasoning tokens (optional)
```

## Token Counting

- `prompt_token_count` is the TOTAL (includes cached)
- `cached_content_token_count` is a subset (when present)
- Similar to OpenAI's semantics

## Implicit Caching (Gemini 2.0+)

**Requirements:**
- Minimum 1,024 tokens
- Automatic (no opt-in required)
- Available on Gemini 2.0 Flash and later models

**Behavior:**
- Caching is probabilistic and server-side
- `cached_content_token_count` may or may not be present
- When present, it indicates tokens that were served from cache

**Note:** Unlike Anthropic, Gemini doesn't have explicit `cache_control`. Caching is implicit and managed by Google's infrastructure.

## Reasoning/Thinking Tokens

For models with extended thinking (like Gemini 2.0 with thinking enabled):
- `thoughts_token_count` reports tokens used for reasoning
- These are similar to OpenAI's `reasoning_tokens`

**Enabling thinking:**

```python
generation_config = {
    "thinking_config": {
        "thinking_budget": 1024  # Max thinking tokens
    }
}
```

## Streaming

In streaming mode:
- `usage_metadata` is typically in the **final chunk**
- Same fields as non-streaming
- May not be present in intermediate chunks

**Important:** `stream_async()` returns an async generator (not an awaitable):

```python
# Correct:
stream = client.stream_async(request_data, llm_config)
async for chunk in stream:
    ...

# Incorrect (will error):
stream = await client.stream_async(...)  # TypeError!
```

## APIs

Gemini has two APIs:
- **Google AI (google_ai):** Uses the `google.genai` SDK
- **Vertex AI (google_vertex):** Uses the same SDK with different auth

Both share the same response format.

## Letta Implementation

- **Client:** `letta/llm_api/google_vertex_client.py` (handles both google_ai and google_vertex)
- **Streaming interface:** `letta/interfaces/gemini_streaming_interface.py`
- **Extract method:** `GoogleVertexClient.extract_usage_statistics()`
- Response is a `GenerateContentResponse` object with a `.usage_metadata` attribute
61 .skills/llm-provider-usage-statistics/references/openai.md Normal file
@@ -0,0 +1,61 @@
# OpenAI Usage Statistics

## APIs and Response Formats

OpenAI has two APIs with different response structures:

### Chat Completions API

```
response.usage.prompt_tokens                                # Total input tokens (includes cached)
response.usage.completion_tokens                            # Output tokens
response.usage.total_tokens                                 # Sum
response.usage.prompt_tokens_details.cached_tokens          # Subset that was cached
response.usage.completion_tokens_details.reasoning_tokens   # For o1/o3 models
```

### Responses API (newer)

```
response.usage.input_tokens                             # Total input tokens
response.usage.output_tokens                            # Output tokens
response.usage.total_tokens                             # Sum
response.usage.input_tokens_details.cached_tokens       # Subset that was cached
response.usage.output_tokens_details.reasoning_tokens   # For reasoning models
```
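The two shapes can be normalized in one place. A hedged sketch with plain dicts standing in for the SDK's usage objects; the normalized output keys are an illustrative choice, while the input field names are the ones listed above:

```python
def normalize_openai_usage(usage: dict) -> dict:
    """Map Chat Completions or Responses API usage to one shape.

    Chat Completions uses prompt_/completion_tokens; the Responses API
    uses input_/output_tokens. Both already report TOTAL input tokens
    (the cached subset lives in the *_details objects).
    """
    if "prompt_tokens" in usage:  # Chat Completions shape
        in_details = usage.get("prompt_tokens_details") or {}
        out_details = usage.get("completion_tokens_details") or {}
        return {
            "input_tokens": usage["prompt_tokens"],
            "output_tokens": usage["completion_tokens"],
            "cached_tokens": in_details.get("cached_tokens", 0),
            "reasoning_tokens": out_details.get("reasoning_tokens", 0),
        }
    # Responses API shape
    in_details = usage.get("input_tokens_details") or {}
    out_details = usage.get("output_tokens_details") or {}
    return {
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "cached_tokens": in_details.get("cached_tokens", 0),
        "reasoning_tokens": out_details.get("reasoning_tokens", 0),
    }
```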

## Prefix Caching

**Requirements:**
- Minimum 1,024 tokens in the prefix
- Automatic (no opt-in required)
- Cached in 128-token increments
- TTL: approximately 5-10 minutes of inactivity

**Supported models:** GPT-4o, GPT-4o-mini, o1, o1-mini, o3-mini

**Cache behavior:**
- `cached_tokens` will be a multiple of 128
- A cache hit means those tokens were not re-processed
- Cost: cached tokens are cheaper than non-cached tokens

## Reasoning Models (o1, o3)

For reasoning models, additional tokens are used for "thinking":
- `reasoning_tokens` in `completion_tokens_details`
- These are output tokens used for internal reasoning
- Not visible in the response content

## Streaming

In streaming mode, usage is reported in the **final chunk** when `stream_options.include_usage=True`:

```python
request_data["stream_options"] = {"include_usage": True}
```

The final chunk will have `chunk.usage` with the same structure as non-streaming responses.

## Letta Implementation

- **Client:** `letta/llm_api/openai_client.py`
- **Streaming interface:** `letta/interfaces/openai_streaming_interface.py`
- **Extract method:** `OpenAIClient.extract_usage_statistics()`
- Uses the OpenAI SDK's pydantic models (`ChatCompletion`) for type-safe parsing