letta-server/letta/services/telemetry_manager.py
Kian Jones 9418ab9815 feat: add provider trace backend abstraction for multi-backend telemetry (#8814)
* feat: add provider trace backend abstraction for multi-backend telemetry

Introduces a pluggable backend system for provider traces:
- Base class with async/sync create and read interfaces
- PostgreSQL backend (existing behavior)
- ClickHouse backend (via OTEL instrumentation)
- Socket backend (writes to Unix socket for crouton sidecar)
- Factory for instantiating backends from config
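
A minimal sketch of how such a pluggable backend layer might be shaped, assuming nothing about the real classes. `TraceBackend`, `InMemoryBackend`, `make_backend`, and the `"memory"` registry key are hypothetical names used only for illustration; traces are plain dicts here rather than `ProviderTrace` objects.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Dict, Optional, Type


class TraceBackend(ABC):
    """Hypothetical base class: async and sync create, plus an async read."""

    @abstractmethod
    async def create_async(self, actor: str, provider_trace: dict) -> Optional[dict]: ...

    @abstractmethod
    def create_sync(self, actor: str, provider_trace: dict) -> Optional[dict]: ...

    @abstractmethod
    async def get_by_step_id_async(self, step_id: str, actor: str) -> Optional[dict]: ...


class InMemoryBackend(TraceBackend):
    """Stand-in backend that keeps traces in a dict keyed by step_id."""

    def __init__(self) -> None:
        self._traces: Dict[str, dict] = {}

    def create_sync(self, actor: str, provider_trace: dict) -> Optional[dict]:
        self._traces[provider_trace["step_id"]] = provider_trace
        return provider_trace

    async def create_async(self, actor: str, provider_trace: dict) -> Optional[dict]:
        # The async path reuses the sync path because nothing here blocks.
        return self.create_sync(actor, provider_trace)

    async def get_by_step_id_async(self, step_id: str, actor: str) -> Optional[dict]:
        return self._traces.get(step_id)


# Hypothetical registry; a real factory would map names such as "postgres" or "socket".
_REGISTRY: Dict[str, Type[TraceBackend]] = {"memory": InMemoryBackend}


def make_backend(name: str) -> TraceBackend:
    """Instantiate a backend by its configured name."""
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown backend: {name!r}") from None
```

Keeping the factory as a name-to-class lookup means a new backend only needs a registry entry, and a misspelled name fails loudly at startup.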

Refactors TelemetryManager to use backends with support for:
- Multi-backend writes (concurrent via asyncio.gather)
- Primary backend for reads (first in config list)
- Graceful error handling per backend
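
The write path described above can be sketched in isolation as follows. `OkBackend`, `FailingBackend`, `safe_create`, and `write_all` are hypothetical stand-ins, not the Letta classes; the point is only to show that wrapping each write before passing it to `asyncio.gather` keeps one failing backend from cancelling the others.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class OkBackend:
    """Hypothetical backend whose write always succeeds."""

    async def create_async(self, trace):
        return {"stored": trace}


class FailingBackend:
    """Hypothetical backend whose write always fails."""

    async def create_async(self, trace):
        raise ConnectionError("backend unavailable")


async def safe_create(backend, trace):
    """Write to one backend; log and swallow any error so siblings keep running."""
    try:
        return await backend.create_async(trace)
    except Exception as exc:
        logger.warning("Failed to write to %s: %s", type(backend).__name__, exc)
        return None


async def write_all(backends, trace):
    """Write to every backend concurrently and return the first successful result."""
    # gather preserves input order, so results[0] belongs to the first backend.
    results = await asyncio.gather(*(safe_create(b, trace) for b in backends))
    return next((r for r in results if r is not None), None)
```

Because results come back in backend order, the returned value is the primary backend's result whenever the primary write succeeds, and otherwise the first secondary that did.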

Config: LETTA_TELEMETRY_PROVIDER_TRACE_BACKEND (comma-separated)
Example: "postgres,socket" for dual-write to Postgres and crouton
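
As an illustration of the comma-separated setting, the sketch below parses such a value into an ordered backend list. `parse_backends` is a hypothetical helper, and the whitespace trimming, lowercasing, and fallback to `postgres` for an empty value are assumptions rather than confirmed behavior of the real factory.

```python
import os
from typing import List, Optional


def parse_backends(raw: Optional[str], default: str = "postgres") -> List[str]:
    """Split a comma-separated backend list, dropping empty entries."""
    names = [part.strip().lower() for part in (raw or "").split(",")]
    names = [name for name in names if name]
    # Assumption: an unset or empty value falls back to a single default backend.
    return names or [default]


configured = parse_backends(os.environ.get("LETTA_TELEMETRY_PROVIDER_TRACE_BACKEND"))
primary = configured[0]  # the first entry is the one used for reads
```

Order matters in this scheme: `postgres,socket` and `socket,postgres` write to the same backends but read from different ones.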

🐙 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* feat: add protocol version to socket backend records

Adds PROTOCOL_VERSION constant to socket backend:
- Included in every telemetry record sent to crouton
- Must match ProtocolVersion in apps/crouton/main.go
- Enables crouton to detect and reject incompatible messages
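
A rough sketch of the versioning idea, written in Python for both sides even though crouton itself is a Go program. The newline-delimited JSON framing, the `protocol_version` field name, and the version value `1` are all assumptions made for illustration; `encode_record` and `decode_record` are hypothetical helpers.

```python
import json

PROTOCOL_VERSION = 1  # assumed value; the real constant must match the sidecar's


def encode_record(payload: dict) -> bytes:
    """Sender side: stamp the record with the protocol version and frame it."""
    # The version is added last so a payload key cannot overwrite it.
    record = {**payload, "protocol_version": PROTOCOL_VERSION}
    return json.dumps(record).encode("utf-8") + b"\n"


def decode_record(line: bytes, expected_version: int = PROTOCOL_VERSION) -> dict:
    """Receiver side: reject any record whose version does not match."""
    record = json.loads(line)
    got = record.get("protocol_version")
    if got != expected_version:
        raise ValueError(f"incompatible protocol version: {got!r}")
    return record
```

Checking the version on every record, rather than once per connection, lets the receiver reject a stale sender even when the two are redeployed at different times.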

🐙 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: remove organization_id from ProviderTraceCreate calls

The organization_id is now handled via the actor parameter in the
telemetry manager, not through ProviderTraceCreate schema. This fixes
validation errors after changing ProviderTraceCreate to inherit from
BaseProviderTrace which forbids extra fields.

🐙 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* consolidate provider trace

* add clickhouse-connect to fix bug on main lmao

* auto-generated SDK changes, deployment details, ClickHouse prefix bug fix, and added fields to runs trace return API

* auto-generated SDK changes, deployment details, ClickHouse prefix bug fix, and added fields to runs trace return API

* consolidate provider trace

* consolidate provider trace bug fix

---------

Co-authored-by: Letta <noreply@letta.com>
2026-01-19 15:54:43 -08:00

118 lines
3.9 KiB
Python

import asyncio

from letta.helpers.singleton import singleton
from letta.log import get_logger
from letta.otel.tracing import trace_method
from letta.schemas.provider_trace import ProviderTrace
from letta.schemas.user import User as PydanticUser
from letta.services.provider_trace_backends import get_provider_trace_backend, get_provider_trace_backends
from letta.utils import enforce_types

logger = get_logger(__name__)


class TelemetryManager:
    """
    Manages provider trace telemetry using configurable backends.

    Supports multiple backends for dual-write scenarios (e.g., migration).

    Configure via LETTA_TELEMETRY_PROVIDER_TRACE_BACKEND (comma-separated):
    - postgres: Store in PostgreSQL (default)
    - clickhouse: Store in ClickHouse via OTEL instrumentation
    - socket: Store via Unix socket to Crouton sidecar (which writes to GCS)

    Example: LETTA_TELEMETRY_PROVIDER_TRACE_BACKEND=postgres,socket

    Multi-backend behavior:
    - Writes: Sent to ALL configured backends concurrently via asyncio.gather.
      Errors in one backend don't affect others (logged but not raised).
    - Reads: Only from PRIMARY backend (first in the comma-separated list).
      Secondary backends are write-only for this manager.
    """

    def __init__(self):
        self._backends = get_provider_trace_backends()
        # The first configured backend serves reads; fall back to the single default backend.
        self._primary_backend = self._backends[0] if self._backends else get_provider_trace_backend()

    @enforce_types
    @trace_method
    async def get_provider_trace_by_step_id_async(
        self,
        step_id: str,
        actor: PydanticUser,
    ) -> ProviderTrace | None:
        # Read from primary backend only
        return await self._primary_backend.get_by_step_id_async(step_id=step_id, actor=actor)

    @enforce_types
    @trace_method
    async def create_provider_trace_async(
        self,
        actor: PydanticUser,
        provider_trace: ProviderTrace,
    ) -> ProviderTrace | None:
        # Write to all backends concurrently
        tasks = [self._safe_create_async(backend, actor, provider_trace) for backend in self._backends]
        results = await asyncio.gather(*tasks)
        # gather preserves backend order, so this is the primary backend's result
        # unless that write failed; it is None only if no backend returned a trace.
        return next((r for r in results if r is not None), None)

    async def _safe_create_async(
        self,
        backend,
        actor: PydanticUser,
        provider_trace: ProviderTrace,
    ) -> ProviderTrace | None:
        """Create trace in a backend, catching and logging errors."""
        try:
            return await backend.create_async(actor=actor, provider_trace=provider_trace)
        except Exception as e:
            logger.warning(f"Failed to write to {backend.__class__.__name__}: {e}")
            return None

    def create_provider_trace(
        self,
        actor: PydanticUser,
        provider_trace: ProviderTrace,
    ) -> ProviderTrace | None:
        """Synchronous version - writes to all backends sequentially."""
        result = None
        for backend in self._backends:
            try:
                r = backend.create_sync(actor=actor, provider_trace=provider_trace)
                # Keep the first non-None result, mirroring the async path.
                if result is None:
                    result = r
            except Exception as e:
                logger.warning(f"Failed to write to {backend.__class__.__name__}: {e}")
        return result


@singleton
class NoopTelemetryManager(TelemetryManager):
    """Noop implementation of TelemetryManager."""

    def __init__(self):
        pass  # Don't initialize backends; every method below returns None

    async def create_provider_trace_async(
        self,
        actor: PydanticUser,
        provider_trace: ProviderTrace,
    ) -> ProviderTrace | None:
        return None

    async def get_provider_trace_by_step_id_async(
        self,
        step_id: str,
        actor: PydanticUser,
    ) -> ProviderTrace | None:
        return None

    def create_provider_trace(
        self,
        actor: PydanticUser,
        provider_trace: ProviderTrace,
    ) -> ProviderTrace | None:
        return None