letta-server/letta/services/clickhouse_otel_traces.py
Kian Jones 9418ab9815 feat: add provider trace backend abstraction for multi-backend telemetry (#8814)
* feat: add provider trace backend abstraction for multi-backend telemetry

Introduces a pluggable backend system for provider traces:
- Base class with async/sync create and read interfaces
- PostgreSQL backend (existing behavior)
- ClickHouse backend (via OTEL instrumentation)
- Socket backend (writes to Unix socket for crouton sidecar)
- Factory for instantiating backends from config

Refactors TelemetryManager to use backends with support for:
- Multi-backend writes (concurrent via asyncio.gather)
- Primary backend for reads (first in config list)
- Graceful error handling per backend

Config: LETTA_TELEMETRY_PROVIDER_TRACE_BACKEND (comma-separated)
Example: "postgres,socket" for dual-write to Postgres and crouton
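
The dual-write behaviour described above (concurrent writes, reads from the first configured backend, per-backend error handling) can be sketched roughly as follows. This is only an illustration under stated assumptions: `InMemoryBackend`, `parse_backend_names`, and `write_to_all` are hypothetical names invented for this sketch, not the interfaces added by this change.

```python
import asyncio
from typing import Optional


class InMemoryBackend:
    """Hypothetical stand-in for a provider trace backend (not the real base class)."""

    def __init__(self, name: str, fail: bool = False):
        self.name = name
        self.fail = fail
        self.records = []

    async def create_async(self, record: dict) -> dict:
        if self.fail:
            raise RuntimeError(f"{self.name} unavailable")
        self.records.append(record)
        return record

    async def read_async(self, step_id: str) -> Optional[dict]:
        return next((r for r in self.records if r["step_id"] == step_id), None)


def parse_backend_names(raw: str) -> list:
    """Split a comma-separated config value such as "postgres,socket"."""
    return [name.strip() for name in raw.split(",") if name.strip()]


async def write_to_all(backends: list, record: dict) -> dict:
    """Write to every backend concurrently and collect errors per backend.

    return_exceptions=True keeps one failing backend from cancelling the others.
    Returns a mapping of backend name to None (success) or the raised exception.
    """
    results = await asyncio.gather(
        *(backend.create_async(record) for backend in backends),
        return_exceptions=True,
    )
    return {
        backend.name: (result if isinstance(result, Exception) else None)
        for backend, result in zip(backends, results)
    }
```

In this sketch the first entry of the backend list plays the role of the primary backend for reads, so a failed secondary write does not affect what readers see.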

🐙 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* feat: add protocol version to socket backend records

Adds PROTOCOL_VERSION constant to socket backend:
- Included in every telemetry record sent to crouton
- Must match ProtocolVersion in apps/crouton/main.go
- Enables crouton to detect and reject incompatible messages
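
A minimal sketch of how a version-stamped record might be produced and checked. The wire format (newline-delimited JSON), the `protocol_version` field name, and the value `1` are assumptions made for illustration; the real constant and framing live in the socket backend and in apps/crouton/main.go.

```python
import json

# Hypothetical value; the real constant must match ProtocolVersion in crouton.
PROTOCOL_VERSION = 1


def encode_record(payload: dict) -> bytes:
    """Stamp the payload with the protocol version and frame it as one JSON line."""
    record = {**payload, "protocol_version": PROTOCOL_VERSION}
    return (json.dumps(record) + "\n").encode("utf-8")


def decode_record(line: bytes, expected_version: int = PROTOCOL_VERSION) -> dict:
    """Parse one record and reject it if the sender's version does not match."""
    record = json.loads(line)
    found = record.get("protocol_version")
    if found != expected_version:
        raise ValueError(f"incompatible protocol version: {found!r}")
    return record
```

Rejecting on mismatch, rather than attempting a best-effort parse, means a version skew between the server and the sidecar surfaces as an explicit error.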

🐙 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: remove organization_id from ProviderTraceCreate calls

The organization_id is now handled via the actor parameter in the
telemetry manager, not through ProviderTraceCreate schema. This fixes
validation errors after changing ProviderTraceCreate to inherit from
BaseProviderTrace which forbids extra fields.

🐙 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* consolidate provider trace

* add clickhouse-connect to fix bug on main lmao

* auto generated sdk changes, and deployment details, and clickhouse prefix bug and added fields to runs trace return api

* consolidate provider trace

* consolidate provider trace bug fix

---------

Co-authored-by: Letta <noreply@letta.com>
2026-01-19 15:54:43 -08:00

99 lines
3.2 KiB
Python

import asyncio
from typing import Any
from urllib.parse import urlparse

from letta.helpers.singleton import singleton
from letta.settings import settings


def _parse_clickhouse_endpoint(endpoint: str) -> tuple[str, int, bool]:
    parsed = urlparse(endpoint)

    if parsed.scheme in ("http", "https"):
        host = parsed.hostname or ""
        port = parsed.port or (8443 if parsed.scheme == "https" else 8123)
        secure = parsed.scheme == "https"
        return host, port, secure

    # Fallback: accept raw hostname (possibly with :port)
    if ":" in endpoint:
        host, port_str = endpoint.rsplit(":", 1)
        return host, int(port_str), True

    return endpoint, 8443, True


@singleton
class ClickhouseOtelTracesReader:
    def __init__(self):
        pass

    def _get_client(self):
        import clickhouse_connect

        if not settings.clickhouse_endpoint:
            raise ValueError("CLICKHOUSE_ENDPOINT is required")

        host, port, secure = _parse_clickhouse_endpoint(settings.clickhouse_endpoint)
        if not host:
            raise ValueError("Invalid CLICKHOUSE_ENDPOINT")

        database = settings.clickhouse_database or "otel"
        username = settings.clickhouse_username or "default"
        password = settings.clickhouse_password
        if not password:
            raise ValueError("CLICKHOUSE_PASSWORD is required")

        return clickhouse_connect.get_client(
            host=host,
            port=port,
            username=username,
            password=password,
            database=database,
            secure=secure,
            verify=True,
        )

    def _get_traces_by_trace_id_sync(self, trace_id: str, limit: int, filter_ui_spans: bool = False) -> list[dict[str, Any]]:
        client = self._get_client()

        if filter_ui_spans:
            # Only return spans used by the trace viewer UI:
            # - agent_step: step events
            # - *._execute_tool: tool execution details
            # - root spans (no parent): request info
            # - time_to_first_token: TTFT measurement
            query = """
                SELECT *
                FROM otel_traces
                WHERE TraceId = %(trace_id)s
                  AND (
                    SpanName = 'agent_step'
                    OR SpanName LIKE '%%._execute_tool'
                    OR ParentSpanId = ''
                    OR SpanName = 'time_to_first_token'
                  )
                ORDER BY Timestamp ASC
                LIMIT %(limit)s
            """
        else:
            query = """
                SELECT *
                FROM otel_traces
                WHERE TraceId = %(trace_id)s
                ORDER BY Timestamp ASC
                LIMIT %(limit)s
            """

        result = client.query(query, parameters={"trace_id": trace_id, "limit": limit})
        if not result or not result.result_rows:
            return []

        cols = list(result.column_names)
        return [dict(zip(cols, row)) for row in result.result_rows]

    async def get_traces_by_trace_id_async(
        self, *, trace_id: str, limit: int = 1000, filter_ui_spans: bool = False
    ) -> list[dict[str, Any]]:
        return await asyncio.to_thread(self._get_traces_by_trace_id_sync, trace_id, limit, filter_ui_spans)
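
The endpoint parsing rules are easiest to see with a few concrete inputs. The sketch below repeats the parsing function from this file under a hypothetical public name (`parse_clickhouse_endpoint`) so that it runs without the `letta` package; the hostnames are made-up examples.

```python
from urllib.parse import urlparse


def parse_clickhouse_endpoint(endpoint: str) -> tuple:
    """Standalone copy of the file's parsing logic: returns (host, port, secure)."""
    parsed = urlparse(endpoint)

    if parsed.scheme in ("http", "https"):
        host = parsed.hostname or ""
        # Default ports: 8443 for https, 8123 for http.
        port = parsed.port or (8443 if parsed.scheme == "https" else 8123)
        return host, port, parsed.scheme == "https"

    # No http(s) scheme: treat the value as "host" or "host:port" and assume TLS.
    if ":" in endpoint:
        host, port_str = endpoint.rsplit(":", 1)
        return host, int(port_str), True

    return endpoint, 8443, True


# Example inputs (hypothetical hosts) covering each branch.
examples = [
    "https://ch.example.com",
    "http://localhost",
    "https://ch.example.com:9440",
    "ch.example.com:9440",
    "ch.example.com",
]
for endpoint in examples:
    print(endpoint, "->", parse_clickhouse_endpoint(endpoint))
```

Note that a raw hostname is always treated as secure, so a plain-HTTP ClickHouse instance has to be configured with an explicit `http://` URL. With valid settings in place, the reader itself would be called along the lines of `await ClickhouseOtelTracesReader().get_traces_by_trace_id_async(trace_id=..., limit=100, filter_ui_spans=True)`.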