fix(core): handle ResponseIncompleteEvent in OpenAI Responses API streaming (#9535)
* fix(core): handle ResponseIncompleteEvent in OpenAI Responses API streaming
When a reasoning model (gpt-5.x) exhausts its max_output_tokens budget
during chain-of-thought reasoning, OpenAI emits a ResponseIncompleteEvent
instead of a ResponseCompletedEvent. This event was previously unhandled,
leaving final_response as None, so get_content() and
get_tool_call_objects() returned empty results and the partial response
was silently dropped.
Now ResponseIncompleteEvent is handled identically to
ResponseCompletedEvent (extracting partial content, usage stats, and
token details), with an additional warning log indicating the incomplete
reason.
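A minimal sketch of this handling, using the OpenAI Python SDK event types
named above; drain_stream and its surrounding structure are illustrative,
not the actual Letta interface:

    import logging

    from openai.types.responses import ResponseCompletedEvent, ResponseIncompleteEvent

    logger = logging.getLogger(__name__)

    async def drain_stream(stream):
        # Both event types carry a full Response snapshot, so a completed
        # and an incomplete response can be captured the same way.
        final_response = None
        async for event in stream:
            if isinstance(event, (ResponseCompletedEvent, ResponseIncompleteEvent)):
                final_response = event.response
                if isinstance(event, ResponseIncompleteEvent):
                    details = getattr(event.response, "incomplete_details", None)
                    logger.warning(
                        "Responses stream ended incomplete (reason=%s); keeping partial output",
                        getattr(details, "reason", "unknown"),
                    )
        return final_response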
* fix(core): propagate finish_reason for Responses API incomplete events
- Guard usage extraction against None usage payload in
ResponseIncompleteEvent handler
- Add _finish_reason override to LettaLLMAdapter so streaming adapters
can explicitly set finish_reason without a chat_completions_response
- Map incomplete_details.reason="max_output_tokens" to
finish_reason="length" in SimpleLLMStreamAdapter, matching the Chat
Completions API convention
- This allows the agent loop's _decide_continuation to correctly return
stop_reason="max_tokens_exceeded" instead of "end_turn" when the model
exhausts its output token budget during reasoning (see the sketch below)
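A hedged sketch of that mapping; decide_stop_reason is an illustrative
stand-in for the continuation check, not the real _decide_continuation
signature:

    def decide_stop_reason(finish_reason: str | None) -> str:
        # "length" is the Chat Completions convention for token exhaustion,
        # which the adapter now also emits for Responses API incomplete events.
        if finish_reason == "length":
            return "max_tokens_exceeded"
        # Other finish reasons (or None) fall through to a normal turn end.
        return "end_turn"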
* fix(core): handle empty content parts in incomplete ResponseOutputMessage
When a model hits max_output_tokens after starting a ResponseOutputMessage
but before producing any content parts, the message has content=[]. This
previously raised ValueError("Got 0 content parts, expected 1"). Now it
logs a warning and skips the empty message, allowing reasoning-only
incomplete responses to be processed cleanly.
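A minimal sketch of the guard, using the SDK's ResponseOutputMessage type;
extract_single_part is an illustrative helper name, not the actual Letta
function:

    import logging

    from openai.types.responses import ResponseOutputMessage

    logger = logging.getLogger(__name__)

    def extract_single_part(message: ResponseOutputMessage):
        if not message.content:
            # Incomplete response cut off before any content part was produced.
            logger.warning("Skipping ResponseOutputMessage with empty content")
            return None
        if len(message.content) > 1:
            raise ValueError(f"Got {len(message.content)} content parts, expected 1")
        return message.content[0]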
* fix(core): map all incomplete reasons to finish_reason, not just max_output_tokens
Handle content_filter and any future unknown incomplete reasons from the
Responses API instead of silently leaving finish_reason as None.
@@ -55,6 +55,7 @@ class LettaLLMAdapter(ABC):
         self.usage: LettaUsageStatistics = LettaUsageStatistics()
         self.telemetry_manager: TelemetryManager = TelemetryManager()
         self.llm_request_finish_timestamp_ns: int | None = None
+        self._finish_reason: str | None = None
 
     @abstractmethod
     async def invoke_llm(
@@ -92,6 +93,8 @@ class LettaLLMAdapter(ABC):
         Returns:
             str | None: The finish_reason if available, None otherwise
         """
+        if self._finish_reason is not None:
+            return self._finish_reason
         if self.chat_completions_response and self.chat_completions_response.choices:
             return self.chat_completions_response.choices[0].finish_reason
         return None
@@ -198,6 +198,22 @@ class SimpleLLMStreamAdapter(LettaLLMStreamAdapter):
         # Store any additional data from the interface
         self.message_id = self.interface.letta_message_id
 
+        # Populate finish_reason for downstream continuation logic.
+        # In Responses streaming, max_output_tokens is expressed via incomplete_details.reason.
+        if hasattr(self.interface, "final_response") and self.interface.final_response is not None:
+            resp = self.interface.final_response
+            incomplete_details = getattr(resp, "incomplete_details", None)
+            incomplete_reason = getattr(incomplete_details, "reason", None) if incomplete_details else None
+            if incomplete_reason == "max_output_tokens":
+                self._finish_reason = "length"
+            elif incomplete_reason == "content_filter":
+                self._finish_reason = "content_filter"
+            elif incomplete_reason is not None:
+                # Unknown incomplete reason; preserve it as-is for diagnostics
+                self._finish_reason = incomplete_reason
+            elif getattr(resp, "status", None) == "completed":
+                self._finish_reason = "stop"
+
         # Log request and response data
         self.log_provider_trace(step_id=step_id, actor=actor)
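Design note: the base-class getter checks the explicit _finish_reason
override before falling back to chat_completions_response, so streaming
paths that never materialize a ChatCompletion (such as the Responses
adapter above) can still report a finish_reason, while the existing Chat
Completions behavior is unchanged.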