refactor: add extract_usage_statistics returning LettaUsageStatistics (#9065)
👾 Generated with [Letta Code](https://letta.com)

Co-authored-by: Letta <noreply@letta.com>
committed by Caren Thomas
parent 2bccd36382
commit 221b4e6279
43 .skills/llm-provider-usage-statistics/SKILL.md Normal file
@@ -0,0 +1,43 @@
---
name: llm-provider-usage-statistics
description: Reference guide for token counting and prefix caching across LLM providers (OpenAI, Anthropic, Gemini). Use when debugging token counts or optimizing prefix caching.
---

# LLM Provider Usage Statistics

Reference documentation for how different LLM providers report token usage.

## Quick Reference: Token Counting Semantics

| Provider | `input_tokens` meaning | Cache tokens | Must add cache to get total? |
|----------|------------------------|--------------|------------------------------|
| OpenAI | TOTAL (includes cached) | `cached_tokens` is subset | No |
| Anthropic | NON-cached only | `cache_read_input_tokens` + `cache_creation_input_tokens` | **Yes** |
| Gemini | TOTAL (includes cached) | `cached_content_token_count` is subset | No |

**Critical difference:** Anthropic's `input_tokens` excludes cached tokens, so you must add them:

```
total_input = input_tokens + cache_read_input_tokens + cache_creation_input_tokens
```
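The table above can be collapsed into a small normalization helper. A minimal sketch, not Letta's actual interface: the `provider` strings and plain-dict `usage` shape are illustrative assumptions, while the field names come from the table.

```python
def total_input_tokens(provider: str, usage: dict) -> int:
    """Return total input tokens per each provider's semantics.

    OpenAI and Gemini already report the total (cached tokens are a
    subset); Anthropic reports only NON-cached tokens, so cache reads
    and cache writes must be added back in.
    """
    if provider == "anthropic":
        return (
            usage.get("input_tokens", 0)
            + usage.get("cache_read_input_tokens", 0)
            + usage.get("cache_creation_input_tokens", 0)
        )
    if provider == "openai":
        return usage.get("prompt_tokens", 0)  # already includes cached_tokens
    if provider == "gemini":
        return usage.get("prompt_token_count", 0)  # already includes cached
    raise ValueError(f"unknown provider: {provider}")
```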

## Quick Reference: Prefix Caching

| Provider | Min tokens | How to enable | TTL |
|----------|-----------|---------------|-----|
| OpenAI | 1,024 | Automatic | ~5-10 min |
| Anthropic | 1,024 | Requires `cache_control` breakpoints | 5 min |
| Gemini 2.0+ | 1,024 | Automatic (implicit) | Variable |

## Quick Reference: Reasoning/Thinking Tokens

| Provider | Field name | Models |
|----------|-----------|--------|
| OpenAI | `reasoning_tokens` | o1, o3 models |
| Anthropic | N/A | (thinking is in content blocks, not usage) |
| Gemini | `thoughts_token_count` | Gemini 2.0 with thinking enabled |

## Provider Reference Files

- **OpenAI:** [references/openai.md](references/openai.md) - Chat Completions vs Responses API, reasoning models, cached_tokens
- **Anthropic:** [references/anthropic.md](references/anthropic.md) - cache_control setup, beta headers, cache token fields
- **Gemini:** [references/gemini.md](references/gemini.md) - implicit caching, thinking tokens, usage_metadata fields
83 .skills/llm-provider-usage-statistics/references/anthropic.md Normal file
@@ -0,0 +1,83 @@
# Anthropic Usage Statistics

## Response Format

```
response.usage.input_tokens                 # NON-cached input tokens only
response.usage.output_tokens                # Output tokens
response.usage.cache_read_input_tokens      # Tokens read from cache
response.usage.cache_creation_input_tokens  # Tokens written to cache
```

## Critical: Token Calculation

**Anthropic's `input_tokens` is NOT the total.** To get total input tokens:

```python
total_input = input_tokens + cache_read_input_tokens + cache_creation_input_tokens
```

This is different from OpenAI/Gemini, where the reported prompt token count is already the total.

## Prefix Caching (Prompt Caching)

**Requirements:**
- Minimum 1,024 tokens for Claude Sonnet/Opus models
- Minimum 2,048 tokens for Claude Haiku models
- Requires explicit `cache_control` breakpoints in messages
- TTL: 5 minutes
**How to enable:**
Add `cache_control` to message content:

```python
{
    "role": "user",
    "content": [
        {
            "type": "text",
            "text": "...",
            "cache_control": {"type": "ephemeral"}
        }
    ]
}
```

**Beta header required:**

```python
betas = ["prompt-caching-2024-07-31"]
```
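Putting the pieces together, a cacheable request might be assembled like this. A hedged sketch: the kwargs dict mirrors the snippets above, the model name and system text are placeholders, and the commented-out call assumes the official `anthropic` SDK.

```python
# Assemble kwargs for a cacheable Anthropic request. The long system
# prompt carries the cache_control breakpoint so the shared prefix can
# be cached across requests (it must clear the minimum token threshold).
request_kwargs = {
    "model": "claude-3-5-sonnet-20241022",  # placeholder model name
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": "You are a helpful assistant. " * 200,  # long shared prefix
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Hello"}],
}

# With the official SDK (assumed), the call would look like:
# import anthropic
# client = anthropic.Anthropic()
# response = client.messages.create(
#     **request_kwargs,
#     extra_headers={"anthropic-beta": "prompt-caching-2024-07-31"},
# )
# response.usage.cache_creation_input_tokens  # expect > 0 on first call
```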

## Cache Behavior

- `cache_creation_input_tokens`: Tokens that were cached on this request (cache write)
- `cache_read_input_tokens`: Tokens that were read from existing cache (cache hit)
- On the first request: expect `cache_creation_input_tokens > 0`
- On subsequent requests with the same prefix: expect `cache_read_input_tokens > 0`

## Streaming

In streaming mode, usage is reported in two events:

1. **`message_start`**: Initial usage (may include cache info)
```python
event.message.usage.input_tokens
event.message.usage.output_tokens
event.message.usage.cache_read_input_tokens
event.message.usage.cache_creation_input_tokens
```

2. **`message_delta`**: Cumulative output tokens
```python
event.usage.output_tokens  # This is CUMULATIVE, not incremental
```

**Important:** Per Anthropic docs, `message_delta` token counts are cumulative, so assign (don't accumulate).
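The assign-don't-accumulate rule can be sketched with a tiny tracker. The dict-shaped events below are stand-ins for the SDK's stream events, not its actual types:

```python
class UsageTracker:
    """Track usage across an Anthropic stream.

    message_start carries input/cache counts once; message_delta carries
    CUMULATIVE output counts, so we overwrite rather than add.
    """

    def __init__(self):
        self.input_tokens = 0
        self.output_tokens = 0
        self.cache_read_input_tokens = 0
        self.cache_creation_input_tokens = 0

    def on_message_start(self, usage: dict):
        # Reported once at stream start; cache fields may be absent.
        self.input_tokens = usage.get("input_tokens", 0)
        self.cache_read_input_tokens = usage.get("cache_read_input_tokens") or 0
        self.cache_creation_input_tokens = usage.get("cache_creation_input_tokens") or 0

    def on_message_delta(self, usage: dict):
        # Cumulative per Anthropic docs: assign, never +=
        self.output_tokens = usage.get("output_tokens", 0)
```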

## Letta Implementation

- **Client:** `letta/llm_api/anthropic_client.py`
- **Streaming interfaces:**
  - `letta/interfaces/anthropic_streaming_interface.py`
  - `letta/interfaces/anthropic_parallel_tool_call_streaming_interface.py` (tracks cache tokens)
- **Extract method:** `AnthropicClient.extract_usage_statistics()`
- **Cache control:** `_add_cache_control_to_system_message()`, `_add_cache_control_to_messages()`
81 .skills/llm-provider-usage-statistics/references/gemini.md Normal file
@@ -0,0 +1,81 @@
# Gemini Usage Statistics

## Response Format

Gemini returns usage in `usage_metadata`:

```
response.usage_metadata.prompt_token_count          # Total input tokens
response.usage_metadata.candidates_token_count      # Output tokens
response.usage_metadata.total_token_count           # Sum
response.usage_metadata.cached_content_token_count  # Tokens from cache (optional)
response.usage_metadata.thoughts_token_count        # Reasoning tokens (optional)
```

## Token Counting

- `prompt_token_count` is the TOTAL (includes cached)
- `cached_content_token_count` is a subset (when present)
- Similar to OpenAI's semantics

## Implicit Caching (Gemini 2.0+)

**Requirements:**
- Minimum 1,024 tokens
- Automatic (no opt-in required)
- Available on Gemini 2.0 Flash and later models

**Behavior:**
- Caching is probabilistic and server-side
- `cached_content_token_count` may or may not be present
- When present, it indicates tokens that were served from cache

**Note:** Unlike Anthropic, Gemini doesn't have explicit `cache_control`. Caching is implicit and managed by Google's infrastructure.

## Reasoning/Thinking Tokens

For models with extended thinking (like Gemini 2.0 with thinking enabled):
- `thoughts_token_count` reports tokens used for reasoning
- These are similar to OpenAI's `reasoning_tokens`

**Enabling thinking:**

```python
generation_config = {
    "thinking_config": {
        "thinking_budget": 1024  # Max thinking tokens
    }
}
```

## Streaming

In streaming mode:
- `usage_metadata` is typically in the **final chunk**
- Same fields as non-streaming
- May not be present in intermediate chunks

**Important:** `stream_async()` returns an async generator (not an awaitable):

```python
# Correct:
stream = client.stream_async(request_data, llm_config)
async for chunk in stream:
    ...

# Incorrect (will error):
stream = await client.stream_async(...)  # TypeError!
```

## APIs

Gemini has two APIs:
- **Google AI (google_ai):** Uses the `google.genai` SDK
- **Vertex AI (google_vertex):** Uses the same SDK with different auth

Both share the same response format.

## Letta Implementation

- **Client:** `letta/llm_api/google_vertex_client.py` (handles both google_ai and google_vertex)
- **Streaming interface:** `letta/interfaces/gemini_streaming_interface.py`
- **Extract method:** `GoogleVertexClient.extract_usage_statistics()`
- Response is a `GenerateContentResponse` object with a `.usage_metadata` attribute
61 .skills/llm-provider-usage-statistics/references/openai.md Normal file
@@ -0,0 +1,61 @@
# OpenAI Usage Statistics

## APIs and Response Formats

OpenAI has two APIs with different response structures:

### Chat Completions API

```
response.usage.prompt_tokens                                # Total input tokens (includes cached)
response.usage.completion_tokens                            # Output tokens
response.usage.total_tokens                                 # Sum
response.usage.prompt_tokens_details.cached_tokens          # Subset that was cached
response.usage.completion_tokens_details.reasoning_tokens   # For o1/o3 models
```

### Responses API (newer)

```
response.usage.input_tokens                             # Total input tokens
response.usage.output_tokens                            # Output tokens
response.usage.total_tokens                             # Sum
response.usage.input_tokens_details.cached_tokens       # Subset that was cached
response.usage.output_tokens_details.reasoning_tokens   # For reasoning models
```
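The two shapes can be normalized in one place. A hedged sketch with plain dicts standing in for the SDK's usage objects; the normalized output keys are an illustrative choice, while the input field names are the ones listed above:

```python
def normalize_openai_usage(usage: dict) -> dict:
    """Map Chat Completions or Responses API usage to one shape.

    Chat Completions uses prompt_/completion_tokens; the Responses API
    uses input_/output_tokens. Both already report TOTAL input tokens
    (the cached subset lives in the *_details objects).
    """
    if "prompt_tokens" in usage:  # Chat Completions shape
        in_details = usage.get("prompt_tokens_details") or {}
        out_details = usage.get("completion_tokens_details") or {}
        return {
            "input_tokens": usage["prompt_tokens"],
            "output_tokens": usage["completion_tokens"],
            "cached_tokens": in_details.get("cached_tokens", 0),
            "reasoning_tokens": out_details.get("reasoning_tokens", 0),
        }
    # Responses API shape
    in_details = usage.get("input_tokens_details") or {}
    out_details = usage.get("output_tokens_details") or {}
    return {
        "input_tokens": usage["input_tokens"],
        "output_tokens": usage["output_tokens"],
        "cached_tokens": in_details.get("cached_tokens", 0),
        "reasoning_tokens": out_details.get("reasoning_tokens", 0),
    }
```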

## Prefix Caching

**Requirements:**
- Minimum 1,024 tokens in the prefix
- Automatic (no opt-in required)
- Cached in 128-token increments
- TTL: approximately 5-10 minutes of inactivity

**Supported models:** GPT-4o, GPT-4o-mini, o1, o1-mini, o3-mini

**Cache behavior:**
- `cached_tokens` will be a multiple of 128
- A cache hit means those tokens were not re-processed
- Cost: cached tokens are cheaper than non-cached tokens

## Reasoning Models (o1, o3)

For reasoning models, additional tokens are used for "thinking":
- `reasoning_tokens` in `completion_tokens_details`
- These are output tokens used for internal reasoning
- Not visible in the response content

## Streaming

In streaming mode, usage is reported in the **final chunk** when `stream_options.include_usage=True`:

```python
request_data["stream_options"] = {"include_usage": True}
```

The final chunk will have `chunk.usage` with the same structure as non-streaming responses.

## Letta Implementation

- **Client:** `letta/llm_api/openai_client.py`
- **Streaming interface:** `letta/interfaces/openai_streaming_interface.py`
- **Extract method:** `OpenAIClient.extract_usage_statistics()`
- Uses the OpenAI SDK's pydantic models (`ChatCompletion`) for type-safe parsing