# Gemini Usage Statistics
## Response Format
Gemini returns usage in `usage_metadata`:
```python
response.usage_metadata.prompt_token_count          # Total input tokens
response.usage_metadata.candidates_token_count      # Output tokens
response.usage_metadata.total_token_count           # Sum
response.usage_metadata.cached_content_token_count  # Tokens from cache (optional)
response.usage_metadata.thoughts_token_count        # Reasoning tokens (optional)
```
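Since the optional fields may be absent or `None`, a minimal sketch (assuming a `GenerateContentResponse` from the `google.genai` SDK) reads them defensively:

```python
def summarize_usage(response) -> dict:
    """Collect token counts, tolerating optional fields that may be None or absent (sketch)."""
    u = response.usage_metadata
    return {
        "input": u.prompt_token_count or 0,
        "output": u.candidates_token_count or 0,
        "cached": getattr(u, "cached_content_token_count", None) or 0,
        "thinking": getattr(u, "thoughts_token_count", None) or 0,
        "total": u.total_token_count or 0,
    }
```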
## Token Counting
- `prompt_token_count` is the TOTAL (includes cached)
- `cached_content_token_count` is a subset (when present)
- Similar to OpenAI's semantics (see the sketch below)
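For instance, since `prompt_token_count` already includes any cached tokens, the non-cached portion of the input can be derived by subtraction. A minimal sketch using the field names above:

```python
def uncached_input_tokens(usage) -> int:
    """Input tokens NOT served from cache (sketch; `usage` is a usage_metadata object)."""
    cached = getattr(usage, "cached_content_token_count", None) or 0
    return (usage.prompt_token_count or 0) - cached
```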
## Implicit Caching (Gemini 2.0+)
Requirements:
- Minimum 1,024 tokens
- Automatic (no opt-in required)
- Available on Gemini 2.0 Flash and later models
Behavior:
- Caching is probabilistic and server-side
- `cached_content_token_count` may or may not be present
- When present, it indicates tokens that were served from cache
Note: Unlike Anthropic, Gemini doesn't have explicit `cache_control`. Caching is implicit and managed by Google's infrastructure.
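Because caching is implicit, a cache hit can only be observed after the fact from the usage metadata. A hedged sketch for logging the hit rate (the logging setup is illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def log_cache_hit(usage) -> None:
    """Log the share of input tokens served from Gemini's implicit cache (sketch)."""
    cached = getattr(usage, "cached_content_token_count", None) or 0
    total_input = usage.prompt_token_count or 0
    if cached and total_input:
        logger.info("Implicit cache hit: %d/%d input tokens (%.0f%%)",
                    cached, total_input, 100.0 * cached / total_input)
```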
## Reasoning/Thinking Tokens
For models with extended thinking (like Gemini 2.0 with thinking enabled):
- `thoughts_token_count` reports tokens used for reasoning
- These are similar to OpenAI's `reasoning_tokens`
Enabling thinking:
```python
generation_config = {
    "thinking_config": {
        "thinking_budget": 1024  # Max thinking tokens
    }
}
```
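Put together, a request with thinking enabled might look like the following sketch against the `google.genai` SDK (the model name is illustrative; check availability):

```python
from google import genai

client = genai.Client()  # picks up GOOGLE_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",  # illustrative; use a thinking-capable model
    contents="Briefly explain dynamic programming.",
    config={"thinking_config": {"thinking_budget": 1024}},
)
print(response.usage_metadata.thoughts_token_count)  # reasoning tokens (may be None)
```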
## Streaming
In streaming mode:
- `usage_metadata` is typically in the final chunk
- Same fields as non-streaming
- May not be present in intermediate chunks
Important: `stream_async()` returns an async generator (not awaitable):
```python
# Correct:
stream = client.stream_async(request_data, llm_config)
async for chunk in stream:
    ...

# Incorrect (will error):
stream = await client.stream_async(...)  # TypeError!
```
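A sketch of draining the stream while capturing usage, assuming chunks expose the same `usage_metadata` attribute described above:

```python
async def consume(stream):
    """Drain a Gemini stream, keeping the last usage_metadata seen (sketch)."""
    final_usage = None
    async for chunk in stream:
        if getattr(chunk, "usage_metadata", None) is not None:
            final_usage = chunk.usage_metadata  # usually only on the final chunk
        # ... handle the chunk's content here ...
    return final_usage

# Usage: final_usage = await consume(client.stream_async(request_data, llm_config))
```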
## APIs
Gemini has two APIs:
- Google AI (`google_ai`): Uses the `google.genai` SDK
- Vertex AI (`google_vertex`): Uses the same SDK with different auth
Both share the same response format.
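A sketch of constructing the same SDK client for each backend (the project and location values are placeholders):

```python
from google import genai

# Google AI (google_ai): API-key auth
google_ai_client = genai.Client(api_key="YOUR_API_KEY")

# Vertex AI (google_vertex): same SDK, Google Cloud auth instead of an API key
vertex_client = genai.Client(vertexai=True, project="my-project", location="us-central1")
```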
## Letta Implementation
- Client: `letta/llm_api/google_vertex_client.py` (handles both `google_ai` and `google_vertex`)
- Streaming interface: `letta/interfaces/gemini_streaming_interface.py`
- Extract method: `GoogleVertexClient.extract_usage_statistics()`
- Response is a `GenerateContentResponse` object with a `.usage_metadata` attribute (see the illustrative sketch below)
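For illustration only (this is not Letta's actual code), an extract method along these lines would map `usage_metadata` onto a simple usage record; the `UsageStats` dataclass here is a hypothetical stand-in:

```python
from dataclasses import dataclass

@dataclass
class UsageStats:  # hypothetical stand-in for Letta's usage type
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

def extract_usage_statistics(response) -> UsageStats:
    """Sketch: map a GenerateContentResponse's usage_metadata to a usage record."""
    u = response.usage_metadata
    return UsageStats(
        prompt_tokens=u.prompt_token_count or 0,
        completion_tokens=u.candidates_token_count or 0,
        total_tokens=u.total_token_count or 0,
    )
```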