* fix(core): prevent event loop saturation from ClickHouse and socket trace writes

  Two issues were causing the event loop watchdog to fire and liveness probes
  to fail under load:

  1. LLMTraceWriter held an asyncio.Lock across each ClickHouse write, and
     wait_for_async_insert=1 meant each write held that lock for ~1s. Under
     high request volume, N background tasks all queued for the lock
     simultaneously, saturating the event loop with task-management overhead.

     Fix: switch to wait_for_async_insert=0 (ClickHouse async_insert handles
     server-side batching, so no acknowledgment wait is needed) and remove the
     lock (clickhouse_connect uses a thread-safe connection pool). The sync
     insert still runs in asyncio.to_thread, so it never blocks the event
     loop. No traces are dropped.

  2. SocketProviderTraceBackend spawned one OS thread per trace with a 60s
     socket timeout. During crouton restarts, threads accumulated blocking on
     sock.sendall for up to 3 minutes each (3 retries x 60s).

     Fix: reduce the socket timeout from 60s to 5s. The socket is local (a
     Unix socket), so 5s is already generous, and fast failure lets retries
     resolve before threads pile up.

  Root cause analysis: event_loop_watchdog.py was detecting saturation
  (lag >2s) every ~60s on gke-letta-default-pool-c6915745-fmq6 via thread
  dumps. The saturated event loop caused k8s liveness probes to time out,
  triggering restarts.

* chore(core): sync socket backend with main and document ClickHouse thread
  safety
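The fixed write path for issue 1 can be sketched as follows. This is a minimal illustration, not the actual LLMTraceWriter code: `blocking_insert` stands in for the real `clickhouse_connect` client call (whose connection pool is thread-safe), and the settings dict shows the `async_insert=1` / `wait_for_async_insert=0` combination described above. The key point is that the blocking insert runs in `asyncio.to_thread` with no shared lock, so concurrent writers never serialize on the event loop.

```python
import asyncio
import time

def blocking_insert(rows, settings):
    # Stand-in for the real clickhouse_connect client.insert(...) call.
    # With async_insert=1 and wait_for_async_insert=0, the server batches
    # inserts and acknowledges immediately instead of holding the caller
    # for ~1s per write.
    time.sleep(0.05)  # simulate a short round trip
    return len(rows)

async def write_trace(rows):
    # The sync insert runs in a worker thread via asyncio.to_thread, so the
    # event loop is never blocked, and no asyncio.Lock serializes writers.
    settings = {"async_insert": 1, "wait_for_async_insert": 0}
    return await asyncio.to_thread(blocking_insert, rows, settings)

async def main():
    # Many concurrent background writers no longer queue behind one lock:
    # all 20 inserts overlap in the thread pool.
    return await asyncio.gather(*(write_trace([i]) for i in range(20)))

if __name__ == "__main__":
    print(asyncio.run(main()))
```

With the lock removed, total wall time for N concurrent writes is bounded by the thread pool, not N times the per-write latency.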
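For issue 2, the retry loop in the socket backend looks roughly like this (hypothetical function and path names; the real SocketProviderTraceBackend differs in detail). With a 60s timeout, 3 retries could pin one OS thread for up to 3 minutes; at 5s the worst case is 15s, so threads drain quickly during crouton restarts instead of piling up.

```python
import socket

# Was 60.0; the peer is a local Unix socket, so 5s is already generous.
TRACE_SOCKET_TIMEOUT = 5.0

def send_trace(path, payload, retries=3, timeout=TRACE_SOCKET_TIMEOUT):
    """Best-effort trace delivery over a local Unix socket.

    Each attempt fails fast: connect/sendall errors and timeouts abort the
    attempt after at most `timeout` seconds instead of blocking the thread.
    """
    for attempt in range(retries):
        try:
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
                sock.settimeout(timeout)
                sock.connect(path)
                sock.sendall(payload)
                return True
        except OSError:  # includes socket.timeout and connection errors
            if attempt == retries - 1:
                return False
    return False
```

A missing or dead socket fails on `connect` immediately, so the retries resolve in well under the timeout budget rather than blocking on `sendall`.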