* feat(core): add image support in tool returns [LET-7140]

  Enable tool_return to support both string and ImageContent content parts, matching the pattern used for user message inputs. This allows tools executed client-side to return images back to the agent.

  Changes:
  - Add LettaToolReturnContentUnion type for text/image content parts
  - Update ToolReturn schema to accept Union[str, List[content parts]]
  - Update converters for each provider:
    - OpenAI Chat Completions: placeholder text for images
    - OpenAI Responses API: full image support
    - Anthropic: full image support with base64
    - Google: placeholder text for images
  - Add resolve_tool_return_images() for URL-to-base64 conversion
  - Make create_approval_response_message_from_input() async

* fix(core): support images in Google tool returns as sibling parts

  Following the gemini-cli pattern: images in tool returns are sent as sibling inlineData parts alongside the functionResponse, rather than inside it.

* test(core): add integration tests for multi-modal tool returns [LET-7140]

  Tests verify that:
  - Models with image support (Anthropic, OpenAI Responses API) can see images in tool returns and identify the secret text
  - Models without image support (Chat Completions) get placeholder text and cannot see the actual image content
  - Tool returns with images persist correctly in the database

  Uses the secret.png test image containing the hidden text "FIREBRAWL" that models must identify to pass the test. Also fixes a misleading comment claiming Anthropic only supports base64 images: they support URLs too; we just pre-resolve for consistency.

* refactor: simplify tool return image support implementation

  Reduce code verbosity while maintaining all functionality:
  - Extract _resolve_url_to_base64() helper in message_helper.py (eliminates duplication)
  - Add _get_text_from_part() helper for text extraction
  - Add _get_base64_image_data() helper for image data extraction
  - Add _tool_return_to_google_parts() to simplify the Google implementation
  - Add _image_dict_to_data_url() for the OpenAI Responses format
  - Use the walrus operator and list comprehensions where appropriate
  - Add integration_test_multi_modal_tool_returns.py to the CI workflow

  Net change: -120 lines while preserving all features and test coverage.

* fix(tests): improve prompts for multi-modal tool return tests

  Make prompts more direct to reduce LLM flakiness:
  - Simplify the tool description: "Retrieves a secret image with hidden text. Call this function to get the image."
  - Change the user prompt from a verbose request to a direct command: "Call the get_secret_image function now."
  - Apply to both test methods

  This reduces ambiguity and makes tool calling more reliable across different LLM models.

* fix bugs

* test(core): add google_ai/gemini-2.0-flash-exp to multi-modal tests

  Add the Gemini model to test coverage for multi-modal tool returns. Google AI already supports images in tool returns via sibling inlineData parts.

* fix(ui): handle multi-modal tool_return type in frontend components

  Convert Union<string, LettaToolReturnContentUnion[]> to string for display:
  - ViewRunDetails: convert the array to an '[Image here]' placeholder
  - ToolCallMessageComponent: convert the array to an '[Image here]' placeholder

  Fixes TypeScript errors in the web, desktop-ui, and docker-ui type-checks.

---------

Co-authored-by: Letta <noreply@letta.com>
Co-authored-by: Caren Thomas <carenthomas@gmail.com>
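A minimal sketch of the new return shape described above (illustrative only: the module path and the idea that a client-side tool simply returns a list of content parts are assumptions based on this PR description, not confirmed API):

```python
import base64

# Schema types defined in the file below; the module path is an assumption.
from letta.schemas.letta_message_content import Base64Image, ImageContent, TextContent


def get_secret_image():
    """Hypothetical client-side tool that returns text plus an image part."""
    with open("secret.png", "rb") as f:
        data = base64.b64encode(f.read()).decode("ascii")
    # With this PR, a tool return may be a plain string (as before) or a
    # list of content parts like the one below.
    return [
        TextContent(text="Here is the secret image."),
        ImageContent(source=Base64Image(media_type="image/png", data=data)),
    ]
```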
from enum import Enum
from typing import Annotated, List, Literal, Optional, Union

from openai.types import Reasoning
from pydantic import BaseModel, Field

class MessageContentType(str, Enum):
    text = "text"
    image = "image"
    tool_call = "tool_call"
    tool_return = "tool_return"
    # For Anthropic extended thinking
    reasoning = "reasoning"
    redacted_reasoning = "redacted_reasoning"
    # Generic "hidden" (unavailable) reasoning
    omitted_reasoning = "omitted_reasoning"
    # For OpenAI Responses API
    summarized_reasoning = "summarized_reasoning"

class MessageContent(BaseModel):
    type: MessageContentType = Field(..., description="The type of the message.")

    def to_text(self) -> Optional[str]:
        """Extract text representation from this content type.

        Returns:
            Text representation of the content, None if no text available.
        """
        return None

# -------------------------------
# Text Content
# -------------------------------


class TextContent(MessageContent):
    type: Literal[MessageContentType.text] = Field(default=MessageContentType.text, description="The type of the message.")
    text: str = Field(..., description="The text content of the message.")
    signature: Optional[str] = Field(
        default=None, description="Stores a unique identifier for any reasoning associated with this text content."
    )

    def to_text(self) -> str:
        """Return the text content."""
        return self.text

# -------------------------------
# Image Content
# -------------------------------


class ImageSourceType(str, Enum):
    url = "url"
    base64 = "base64"
    letta = "letta"


class ImageSource(BaseModel):
    type: ImageSourceType = Field(..., description="The source type for the image.")


class UrlImage(ImageSource):
    type: Literal[ImageSourceType.url] = Field(default=ImageSourceType.url, description="The source type for the image.")
    url: str = Field(..., description="The URL of the image.")


class Base64Image(ImageSource):
    type: Literal[ImageSourceType.base64] = Field(default=ImageSourceType.base64, description="The source type for the image.")
    media_type: str = Field(..., description="The media type for the image.")
    data: str = Field(..., description="The base64 encoded image data.")
    detail: Optional[str] = Field(
        default=None,
        description="What level of detail to use when processing and understanding the image (low, high, or auto to let the model decide)",
    )


class LettaImage(ImageSource):
    type: Literal[ImageSourceType.letta] = Field(default=ImageSourceType.letta, description="The source type for the image.")
    file_id: str = Field(..., description="The unique identifier of the image file persisted in storage.")
    media_type: Optional[str] = Field(default=None, description="The media type for the image.")
    data: Optional[str] = Field(default=None, description="The base64 encoded image data.")
    detail: Optional[str] = Field(
        default=None,
        description="What level of detail to use when processing and understanding the image (low, high, or auto to let the model decide)",
    )


ImageSourceUnion = Annotated[Union[UrlImage, Base64Image, LettaImage], Field(discriminator="type")]


class ImageContent(MessageContent):
    type: Literal[MessageContentType.image] = Field(default=MessageContentType.image, description="The type of the message.")
    source: ImageSourceUnion = Field(..., description="The source of the image.")

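# Illustrative example (kept in a comment so nothing runs at import time): the
# discriminated "type" field lets pydantic v2 resolve the concrete source class
# from a raw dict. A minimal sketch, assuming pydantic v2's TypeAdapter:
#
#     from pydantic import TypeAdapter
#
#     src = TypeAdapter(ImageSourceUnion).validate_python(
#         {"type": "url", "url": "https://example.com/cat.png"}  # hypothetical URL
#     )
#     assert isinstance(src, UrlImage)
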
# -------------------------------
# User Content Types
# -------------------------------


LettaUserMessageContentUnion = Annotated[
    Union[TextContent, ImageContent],
    Field(discriminator="type"),
]


def create_letta_user_message_content_union_schema():
    return {
        "oneOf": [
            {"$ref": "#/components/schemas/TextContent"},
            {"$ref": "#/components/schemas/ImageContent"},
        ],
        "discriminator": {
            "propertyName": "type",
            "mapping": {
                "text": "#/components/schemas/TextContent",
                "image": "#/components/schemas/ImageContent",
            },
        },
    }


def get_letta_user_message_content_union_str_json_schema():
    return {
        "anyOf": [
            {
                "type": "array",
                "items": {
                    "$ref": "#/components/schemas/LettaUserMessageContentUnion",
                },
            },
            {"type": "string"},
        ],
    }

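# Illustrative example: under this schema a user message's content may be either
# a plain string or a mixed list of parts, e.g. (hypothetical payload):
#
#     content = [
#         {"type": "text", "text": "What does this image say?"},
#         {"type": "image", "source": {"type": "url", "url": "https://example.com/secret.png"}},
#     ]
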
# -------------------------------
# Tool Return Content Types
# -------------------------------


LettaToolReturnContentUnion = Annotated[
    Union[TextContent, ImageContent],
    Field(discriminator="type"),
]


def create_letta_tool_return_content_union_schema():
    return {
        "oneOf": [
            {"$ref": "#/components/schemas/TextContent"},
            {"$ref": "#/components/schemas/ImageContent"},
        ],
        "discriminator": {
            "propertyName": "type",
            "mapping": {
                "text": "#/components/schemas/TextContent",
                "image": "#/components/schemas/ImageContent",
            },
        },
    }


def get_letta_tool_return_content_union_str_json_schema():
    """Schema that accepts either a string or a list of content parts for tool returns."""
    return {
        "anyOf": [
            {
                "type": "array",
                "items": {
                    "$ref": "#/components/schemas/LettaToolReturnContentUnion",
                },
            },
            {"type": "string"},
        ],
    }

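# Illustrative example: both accepted tool-return shapes from the PR description
# validate against Union[str, List[LettaToolReturnContentUnion]]. A minimal
# sketch, assuming pydantic v2's TypeAdapter:
#
#     from pydantic import TypeAdapter
#
#     payload = TypeAdapter(Union[str, List[LettaToolReturnContentUnion]])
#     payload.validate_python("done")  # legacy plain-string return
#     payload.validate_python([        # multi-modal return
#         {"type": "text", "text": "Here is the secret image."},
#         {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": "<b64>"}},
#     ])
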
# -------------------------------
# Assistant Content Types
# -------------------------------


LettaAssistantMessageContentUnion = Annotated[
    Union[TextContent],
    Field(discriminator="type"),
]


def create_letta_assistant_message_content_union_schema():
    return {
        "oneOf": [
            {"$ref": "#/components/schemas/TextContent"},
        ],
        "discriminator": {
            "propertyName": "type",
            "mapping": {
                "text": "#/components/schemas/TextContent",
            },
        },
    }


def get_letta_assistant_message_content_union_str_json_schema():
    return {
        "anyOf": [
            {
                "type": "array",
                "items": {
                    "$ref": "#/components/schemas/LettaAssistantMessageContentUnion",
                },
            },
            {"type": "string"},
        ],
    }

# -------------------------------
# Intermediate Step Content Types
# -------------------------------


class ToolCallContent(MessageContent):
    type: Literal[MessageContentType.tool_call] = Field(
        default=MessageContentType.tool_call, description="Indicates this content represents a tool call event."
    )
    id: str = Field(..., description="A unique identifier for this specific tool call instance.")
    name: str = Field(..., description="The name of the tool being called.")
    input: dict = Field(
        ..., description="The parameters being passed to the tool, structured as a dictionary of parameter names to values."
    )
    signature: Optional[str] = Field(
        default=None, description="Stores a unique identifier for any reasoning associated with this tool call."
    )

    def to_text(self) -> str:
        """Return a text representation of the tool call."""
        import json

        input_str = json.dumps(self.input, indent=2)
        return f"Tool call: {self.name}({input_str})"

class ToolReturnContent(MessageContent):
    type: Literal[MessageContentType.tool_return] = Field(
        default=MessageContentType.tool_return, description="Indicates this content represents a tool return event."
    )
    tool_call_id: str = Field(..., description="References the ID of the ToolCallContent that initiated this tool call.")
    content: str = Field(..., description="The content returned by the tool execution.")
    is_error: bool = Field(..., description="Indicates whether the tool execution resulted in an error.")

    def to_text(self) -> str:
        """Return the tool return content."""
        prefix = "Tool error: " if self.is_error else "Tool result: "
        return f"{prefix}{self.content}"

class ReasoningContent(MessageContent):
    """Sent via the Anthropic Messages API"""

    type: Literal[MessageContentType.reasoning] = Field(
        default=MessageContentType.reasoning, description="Indicates this is a reasoning/intermediate step."
    )
    is_native: bool = Field(..., description="Whether the reasoning content was generated by a reasoner model that processed this step.")
    reasoning: str = Field(..., description="The intermediate reasoning or thought process content.")
    signature: Optional[str] = Field(default=None, description="A unique identifier for this reasoning step.")

    def to_text(self) -> str:
        """Return the reasoning content."""
        return self.reasoning

class RedactedReasoningContent(MessageContent):
    """Sent via the Anthropic Messages API"""

    type: Literal[MessageContentType.redacted_reasoning] = Field(
        default=MessageContentType.redacted_reasoning, description="Indicates this is a redacted thinking step."
    )
    data: str = Field(..., description="The redacted or filtered intermediate reasoning content.")

class OmittedReasoningContent(MessageContent):
    """A placeholder for reasoning content we know is present, but isn't returned by the provider (e.g. OpenAI GPT-5 on ChatCompletions)"""

    type: Literal[MessageContentType.omitted_reasoning] = Field(
        default=MessageContentType.omitted_reasoning, description="Indicates this is an omitted reasoning step."
    )
    signature: Optional[str] = Field(default=None, description="A unique identifier for this reasoning step.")
    # NOTE: dropping because we don't track this kind of information for the other reasoning types
    # tokens: int = Field(..., description="The reasoning token count for intermediate reasoning content.")

class SummarizedReasoningContentPart(BaseModel):
    index: int = Field(..., description="The index of the summary part.")
    text: str = Field(..., description="The text of the summary part.")

class SummarizedReasoningContent(MessageContent):
    """The style of reasoning content returned by the OpenAI Responses API"""

    # TODO consider expanding ReasoningContent to support this superset?
    # Or alternatively, rename `ReasoningContent` to `AnthropicReasoningContent`,
    # and rename this one to `OpenAIReasoningContent`?

    # NOTE: I think the argument for putting this in ReasoningContent as an additional "summary" field is that it keeps the
    # rendering and GET / listing code a lot simpler: you just need to know how to render "TextContent" and "ReasoningContent"
    # vs breaking out into having to know how to render additional types.
    # NOTE: I think the main issue is that we need to track the provenance of which provider the reasoning came from,
    # so that we don't attempt e.g. to put Anthropic encrypted reasoning into a GPT-5 Responses payload.
    type: Literal[MessageContentType.summarized_reasoning] = Field(
        default=MessageContentType.summarized_reasoning, description="Indicates this is a summarized reasoning step."
    )

    # OpenAI requires holding a string
    id: str = Field(..., description="The unique identifier for this reasoning step.")  # NOTE: I don't think this is actually needed?
    # OpenAI returns a list of summary objects, each a string.
    # Straying a bit from the OpenAI schema so that we can enforce ordering on the deltas that come out:
    # summary: List[str] = Field(..., description="Summaries of the reasoning content.")
    summary: List[SummarizedReasoningContentPart] = Field(..., description="Summaries of the reasoning content.")
    encrypted_content: Optional[str] = Field(default=None, description="The encrypted reasoning content.")

    # Temporary stop-gap until the SDKs are updated
    def to_reasoning_content(self) -> Optional[ReasoningContent]:
        # Merge the non-empty summary parts with a "\n\n" join
        parts = [s.text for s in self.summary if s.text != ""]
        if not parts:
            return None
        combined_summary = "\n\n".join(parts)
        return ReasoningContent(
            is_native=True,
            reasoning=combined_summary,
            signature=self.encrypted_content,
        )

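# Illustrative example (hypothetical values): two summary parts merge into a
# single ReasoningContent via the "\n\n" join above:
#
#     src = SummarizedReasoningContent(
#         id="rs_123",
#         summary=[
#             SummarizedReasoningContentPart(index=0, text="Plan the steps."),
#             SummarizedReasoningContentPart(index=1, text="Execute them."),
#         ],
#         encrypted_content="<opaque blob>",
#     )
#     rc = src.to_reasoning_content()  # rc.reasoning == "Plan the steps.\n\nExecute them."
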
LettaMessageContentUnion = Annotated[
    Union[
        TextContent,
        ImageContent,
        ToolCallContent,
        ToolReturnContent,
        ReasoningContent,
        RedactedReasoningContent,
        OmittedReasoningContent,
        SummarizedReasoningContent,
    ],
    Field(discriminator="type"),
]

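# Illustrative example: a heterogeneous message history parses into the correct
# concrete classes via the shared "type" discriminator. A minimal sketch,
# assuming pydantic v2's TypeAdapter:
#
#     from pydantic import TypeAdapter
#
#     parts = TypeAdapter(List[LettaMessageContentUnion]).validate_python([
#         {"type": "text", "text": "hi"},
#         {"type": "tool_call", "id": "call_1", "name": "get_secret_image", "input": {}},
#         {"type": "tool_return", "tool_call_id": "call_1", "content": "ok", "is_error": False},
#     ])
#     # -> [TextContent, ToolCallContent, ToolReturnContent]
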
def create_letta_message_content_union_schema():
    return {
        "oneOf": [
            {"$ref": "#/components/schemas/TextContent"},
            {"$ref": "#/components/schemas/ImageContent"},
            {"$ref": "#/components/schemas/ToolCallContent"},
            {"$ref": "#/components/schemas/ToolReturnContent"},
            {"$ref": "#/components/schemas/ReasoningContent"},
            {"$ref": "#/components/schemas/RedactedReasoningContent"},
            {"$ref": "#/components/schemas/OmittedReasoningContent"},
        ],
        "discriminator": {
            "propertyName": "type",
            "mapping": {
                "text": "#/components/schemas/TextContent",
                "image": "#/components/schemas/ImageContent",
                "tool_call": "#/components/schemas/ToolCallContent",
                "tool_return": "#/components/schemas/ToolReturnContent",
                "reasoning": "#/components/schemas/ReasoningContent",
                "redacted_reasoning": "#/components/schemas/RedactedReasoningContent",
                "omitted_reasoning": "#/components/schemas/OmittedReasoningContent",
            },
        },
    }

def get_letta_message_content_union_str_json_schema():
    return {
        "anyOf": [
            {
                "type": "array",
                "items": {
                    "$ref": "#/components/schemas/LettaMessageContentUnion",
                },
            },
            {"type": "string"},
        ],
    }