Commit Graph

6690 Commits

Author SHA1 Message Date
cthomas
22b9ed254a feat: skip persisting redundant messages for proxy (#6819) 2025-12-15 12:03:09 -08:00
Sarah Wooders
0634aa13a1 fix: avoid holding sessions open (#6769) 2025-12-15 12:03:09 -08:00
Sarah Wooders
c9ad2fd7c4 chore: move things to debug logging (#6610) 2025-12-15 12:03:09 -08:00
Ari Webb
fecf503ad9 feat: xhigh reasoning for gpt-5.2 (#6735) 2025-12-15 12:03:09 -08:00
cthomas
bffb9064b8 fix: step logging error (#6755) 2025-12-15 12:03:08 -08:00
cthomas
fd8e471b2e chore: improve logging for proxy (#6754) 2025-12-15 12:03:08 -08:00
cthomas
2dac75a223 fix: remove project id before proxying (#6750) 2025-12-15 12:03:08 -08:00
jnjpng
4be813b956 fix: migrate sandbox and agent environment variables to encrypted only (#6623)
* base

* remove unnecessary db migration

* update

* fix

* update

* update

* comments

* fix

* revert

* anotha

---------

Co-authored-by: Letta Bot <noreply@letta.com>
2025-12-15 12:03:08 -08:00
cthomas
799ddc9fe8 chore: api sync (#6747) 2025-12-15 12:03:07 -08:00
cthomas
b3561631da feat: create agents with default project for proxy [LET-6488] (#6716)
* feat: create agents with default project for proxy

* make change less invasive
2025-12-15 12:02:53 -08:00
Kian Jones
0a19c4010d chore: bump from 14.1 to 15.2 for compaction settings (#6727)
bump from 14.1 to 15.2 for compaction settings
2025-12-15 12:02:51 -08:00
jnjpng
714c537dc5 chore: change e2b sandbox error logs from debug to warning (#6726)
Update log level for tool execution errors in e2b sandbox from debug
to warning for better visibility when troubleshooting issues.

Co-authored-by: Jin Peng <jinjpeng@users.noreply.github.com>
2025-12-15 12:02:34 -08:00
Sarah Wooders
7ea297231a feat: add compaction_settings to agents (#6625)
* initial commit

* Add database migration for compaction_settings field

This migration adds the compaction_settings column to the agents table
to support customized summarization configuration for each agent.

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix

* rename

* update apis

* fix tests

* update web test

---------

Co-authored-by: Letta <noreply@letta.com>
Co-authored-by: Kian Jones <kian@letta.com>
2025-12-15 12:02:34 -08:00
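Editor's note: the compaction_settings commit above mentions a database migration that adds the column to the agents table. A minimal sketch of what such an Alembic migration could look like, assuming a nullable JSON column; the revision identifiers and column type are illustrative assumptions, not taken from the actual migration.

```python
"""add compaction_settings to agents (illustrative sketch)"""

import sqlalchemy as sa
from alembic import op

# Revision identifiers are placeholders, not the real ones from the repo.
revision = "xxxx_add_compaction_settings"
down_revision = None


def upgrade() -> None:
    # Nullable so existing agent rows keep working without a backfill.
    op.add_column(
        "agents",
        sa.Column("compaction_settings", sa.JSON(), nullable=True),
    )


def downgrade() -> None:
    op.drop_column("agents", "compaction_settings")
```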
Shubham Naik
4309ecf606 chore: list scheduled messages [LET-6497] (#6690)
* chore: list scheduled messages

* chore: list scheduled messages

* chore: fix type

* chore: fix

* chore: fix

---------

Co-authored-by: Shubham Naik <shub@memgpt.ai>
2025-12-15 12:02:34 -08:00
cthomas
1314e19286 feat: update system message for proxy [LET-6490] (#6714)
feat: update system message for proxy
2025-12-15 12:02:34 -08:00
Ari Webb
4d90f37f50 feat: add gpt-5.2 support (#6698) 2025-12-15 12:02:34 -08:00
jnjpng
b658c70063 test: add coverage for provider encryption without LETTA_ENCRYPTION_KEY (#6629)
Add tests to verify that providers work correctly when no encryption key
is configured. The Secret class stores values as plaintext in _enc columns
and retrieves them successfully, but this code path had no test coverage.

Co-authored-by: Letta Bot <noreply@letta.com>
2025-12-15 12:02:34 -08:00
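Editor's note: the test-coverage commit above describes the path where no LETTA_ENCRYPTION_KEY is configured and Secret stores values as plaintext in the _enc columns. A minimal pytest-style sketch of that scenario; the Secret.from_plaintext() constructor, the settings.encryption_key field, and the import paths are assumptions for illustration and may not match the repo.

```python
# Hypothetical imports; the real module paths in the repo may differ.
from letta.schemas.secret import Secret
from letta.settings import settings


def test_secret_roundtrip_without_encryption_key(monkeypatch):
    # Simulate a deployment with no LETTA_ENCRYPTION_KEY configured.
    monkeypatch.setattr(settings, "encryption_key", None, raising=False)

    secret = Secret.from_plaintext("sk-test-provider-key")

    # With no key, the value is stored as-is and must still round-trip.
    assert secret.get_plaintext() == "sk-test-provider-key"
```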
Ari Webb
25dccc911e fix: base providers won't break pods still running main (#6631)
* fix: base providers won't break pods still running main

* just stage and publish api
2025-12-15 12:02:34 -08:00
Shubham Naik
67d1c9c135 chore: autogenerate-api (#6699)
Co-authored-by: Shubham Naik <shub@memgpt.ai>
2025-12-15 12:02:34 -08:00
Sarah Wooders
a2dfa5af17 fix: reorder summarization (#6606) 2025-12-15 12:02:34 -08:00
jnjpng
17a90538ca fix: exclude common API key prefixes from encryption detection (#6624)
* fix: exclude common API key prefixes from encryption detection

Add a list of known API key prefixes (OpenAI, Anthropic, GitHub, AWS,
Slack, etc.) to prevent is_encrypted() from incorrectly identifying
plaintext credentials as encrypted values.

* update

* test
2025-12-15 12:02:34 -08:00
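Editor's note: the fix above hinges on is_encrypted() treating values that start with well-known API-key prefixes as plaintext. A minimal sketch of that heuristic, assuming a Fernet-style ciphertext marker; the prefix list below is an illustrative subset, not the one from the commit.

```python
# Prefixes of well-known plaintext credentials that must never be
# classified as ciphertext (illustrative subset).
KNOWN_API_KEY_PREFIXES = (
    "sk-",      # OpenAI / Anthropic-style keys
    "ghp_",     # GitHub personal access tokens
    "xoxb-",    # Slack bot tokens
    "AKIA",     # AWS access key IDs
)


def is_encrypted(value: str) -> bool:
    """Best-effort check for whether a stored value is ciphertext."""
    if not value:
        return False
    # A plaintext credential with a known prefix is never ciphertext,
    # even if it is long and high-entropy.
    if value.startswith(KNOWN_API_KEY_PREFIXES):
        return False
    # Assumption: encrypted values are Fernet tokens, which start with "gAAAAA".
    return value.startswith("gAAAAA")
```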
Kian Jones
15cede7281 fix: prevent db connection pool exhaustion in multi-agent tool executor (#6619)
Problem: When executing a tool that sends messages to many agents matching
tags, the code used asyncio.gather to process all agents concurrently. Each
agent processing creates database operations (run creation, message storage),
leading to N concurrent database connections.

Example: If 100 agents match the tags, 100 simultaneous database connections
are created, exhausting the connection pool and causing errors.

Root cause: asyncio.gather(*[_process_agent(...) for agent in agents])
creates all coroutines and runs them concurrently, each opening a DB session.

Solution: Process agents sequentially instead of concurrently. While this is
slower, it prevents database connection pool exhaustion. The operation is
still async, so it won't block the event loop.

Changes:
- apps/core/letta/services/tool_executor/multi_agent_tool_executor.py:
  - Replaced asyncio.gather with sequential for loop
  - Added explanatory comment about why sequential processing is needed

Impact: With 100 matching agents:
- Before: 100 concurrent DB connections (pool exhaustion)
- After: 1 DB connection at a time (no pool exhaustion)

Note: This follows the same pattern as PR #6617 which fixed a similar issue
in file attachment operations.
2025-12-15 12:02:34 -08:00
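Editor's note: the pattern described above, replacing asyncio.gather over per-agent coroutines with a sequential loop so only one DB session is open at a time, can be sketched roughly as follows. The _process_agent stub and signatures are assumptions based on the commit text, not the real service code.

```python
import asyncio


async def _process_agent(agent: str, message: str) -> str:
    # Stand-in for the real per-agent work (run creation, message storage),
    # each of which opens a database session.
    await asyncio.sleep(0)
    return f"delivered to {agent}"


async def send_to_matching_agents(agents: list[str], message: str) -> list[str]:
    # Before (sketch): N matching agents -> N concurrent DB sessions.
    #   return await asyncio.gather(*[_process_agent(a, message) for a in agents])

    # After (sketch): process agents one at a time. Slower, but at most one
    # DB connection is held, and the loop is still async so the event loop
    # is never blocked.
    results = []
    for agent in agents:
        results.append(await _process_agent(agent, message))
    return results
```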
Kian Jones
fbd89c9360 fix: replace all 'PRODUCTION' references with 'prod' for consistency (#6627)
* fix: replace all 'PRODUCTION' references with 'prod' for consistency

Problem: Codebase had 11 references to 'PRODUCTION' (uppercase) that should
use 'prod' (lowercase) for consistency with the deployment workflows and
environment normalization.

Changes across 8 files:

1. Source files (using settings.environment):
   - letta/functions/function_sets/multi_agent.py
   - letta/services/tool_manager.py
   - letta/services/tool_executor/multi_agent_tool_executor.py
   - letta/services/helpers/agent_manager_helper.py
   All checks changed from: settings.environment == "PRODUCTION"
   To: settings.environment == "prod"

2. OTEL resource configuration:
   - letta/otel/resource.py
     - Updated _normalize_environment_tag() to handle 'prod' directly
     - Removed 'PRODUCTION' -> 'prod' mapping (no longer needed)
     - Updated device.id check from _env != "PRODUCTION" to _env != "prod"

3. Test files:
   - tests/managers/conftest.py
     - Fixture parameter changed from "PRODUCTION" to "prod"
   - tests/managers/test_agent_manager.py (3 occurrences)
   - tests/managers/test_tool_manager.py (2 occurrences)
   All test checks changed to use "prod"

Result: Complete consistency across the codebase:
- All environment checks use "prod" instead of "PRODUCTION"
- Normalization function simplified (no special case for PRODUCTION)
- Tests use correct "prod" value
- Matches deployment workflow configuration from PR #6626

This completes the environment naming standardization effort.

* fix: update settings.py environment description to use 'prod' instead of 'PRODUCTION'

The field description still referenced PRODUCTION as an example value.
Updated to use lowercase 'prod' for consistency with actual usage.

Before: "Application environment (PRODUCTION, DEV, CANARY, etc. - normalized to lowercase for OTEL tags)"
After: "Application environment (prod, dev, canary, etc. - lowercase values used for OTEL tags)"
2025-12-15 12:02:34 -08:00
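Editor's note: this commit and the tracing work later in the log both rely on a small normalization step that maps the configured environment onto the lowercase tags used for OTEL/Datadog. A minimal sketch based on the mappings described in the commit messages (PRODUCTION/prod -> prod, DEV/STAGING -> dev, everything else lowercased); the "local-test" fallback for unset values is an assumption.

```python
def _normalize_environment_tag(environment: str | None) -> str:
    """Map a configured environment name onto the lowercase OTEL tag."""
    if not environment:
        return "local-test"  # assumption: fallback for unset local runs
    env = environment.strip().lower()
    if env in ("production", "prod"):
        return "prod"
    if env in ("dev", "staging"):
        return "dev"
    return env  # e.g. "canary" passes through as-is
```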
Kian Jones
08ccc8b399 fix: prevent db connection pool exhaustion in file status checks (#6620)
Problem: When listing files with status checking enabled, the code used
asyncio.gather to check and update status for all files concurrently. Each
status check may update the file in the database (e.g., for timeouts or
embedding completion), leading to N concurrent database connections.

Example: Listing 100 files with status checking creates 100 simultaneous
database update operations, exhausting the connection pool.

Root cause: asyncio.gather(*[check_and_update_file_status(f) for f in files])
processes all files concurrently, each potentially creating DB updates.

Solution: Check and update file status sequentially instead of concurrently.
While this is slower, it prevents database connection pool exhaustion when
listing many files.

Changes:
- apps/core/letta/services/file_manager.py:
  - Replaced asyncio.gather with sequential for loop
  - Added explanatory comment about db pool exhaustion prevention

Impact: With 100 files:
- Before: Up to 100 concurrent DB connections (pool exhaustion)
- After: 1 DB connection at a time (no pool exhaustion)

Note: This follows the same pattern as PR #6617 and #6619 which fixed
similar issues in file attachment and multi-agent tool execution.
2025-12-15 12:02:34 -08:00
Kian Jones
1a2e0aa8b7 fix: prevent db connection pool exhaustion in MCP server manager (#6622)
Problem: When creating an MCP server with many tools, the code used two
asyncio.gather calls - one for tool creation and one for mapping creation.
Each operation involves database INSERT/UPDATE, leading to 2N concurrent
database connections.

Example: An MCP server with 50 tools creates 50 + 50 = 100 simultaneous
database connections (tools + mappings), severely exhausting the pool.

Root cause:
1. asyncio.gather(*[create_mcp_tool_async(...) for tool in tools])
2. asyncio.gather(*[create_mcp_tool_mapping(...) for tool in results])
Both process operations concurrently, each opening a DB session.

Solution: Process tool creation and mapping sequentially in a single loop.
Create each tool, then immediately create its mapping if successful. This:
- Reduces connection count from 2N to 1
- Maintains proper error handling per tool
- Prevents database connection pool exhaustion

Changes:
- apps/core/letta/services/mcp_server_manager.py:
  - Replaced two asyncio.gather calls with single sequential loop
  - Create mapping immediately after each successful tool creation
  - Maintained return_exceptions=True behavior with try/except
  - Added explanatory comment about db pool exhaustion prevention

Impact: With 50 MCP tools:
- Before: 100 concurrent DB connections (50 tools + 50 mappings, pool exhaustion)
- After: 1 DB connection at a time (no pool exhaustion)

Note: This follows the same pattern as PR #6617, #6619, #6620, and #6621
which fixed similar issues throughout the codebase.
2025-12-15 12:02:34 -08:00
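Editor's note: the refactor above folds the two asyncio.gather calls (tool creation, then mapping creation) into one sequential loop that creates each tool and, on success, immediately creates its mapping. A rough, self-contained sketch of that shape; the helper names follow the commit message but their signatures and the stub bodies are assumptions.

```python
import asyncio
from types import SimpleNamespace


async def create_mcp_tool_async(server_id: str, tool: dict):
    # Stand-in for the real upsert, which opens a DB session.
    await asyncio.sleep(0)
    return SimpleNamespace(id=f"{server_id}:{tool['name']}")


async def create_mcp_tool_mapping(server_id: str, tool_id: str) -> None:
    # Stand-in for the real mapping insert, also a DB session.
    await asyncio.sleep(0)


async def register_mcp_server_tools(server_id: str, tools: list[dict]) -> list[object]:
    results: list[object] = []
    # Sequential on purpose: one DB session at a time instead of the old
    # 2N concurrent sessions (N tool inserts + N mapping inserts).
    for tool in tools:
        try:
            created = await create_mcp_tool_async(server_id, tool)
        except Exception as exc:
            # Mirrors gather(..., return_exceptions=True): record and keep going.
            results.append(exc)
            continue
        # Create the mapping immediately after a successful tool creation.
        await create_mcp_tool_mapping(server_id, created.id)
        results.append(created)
    return results
```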
Kian Jones
43aa97b7d2 fix: prevent db connection pool exhaustion in MCP tool creation (#6621)
Problem: When creating an MCP server with many tools, the code used
asyncio.gather to create all tools concurrently. Each tool creation involves
database operations (INSERT with upsert logic), leading to N concurrent
database connections.

Example: An MCP server with 50 tools creates 50 simultaneous database
connections during server creation, exhausting the connection pool.

Root cause: asyncio.gather(*[create_mcp_tool_async(...) for tool in tools])
processes all tool creations concurrently, each opening a DB session.

Solution: Create tools sequentially instead of concurrently. While this takes
longer for server creation, it prevents database connection pool exhaustion
and maintains error handling by catching exceptions per tool.

Changes:
- apps/core/letta/services/mcp_manager.py:
  - Replaced asyncio.gather with sequential for loop
  - Maintained return_exceptions=True behavior with try/except
  - Added explanatory comment about db pool exhaustion prevention

Impact: With 50 MCP tools:
- Before: 50 concurrent DB connections (pool exhaustion)
- After: 1 DB connection at a time (no pool exhaustion)

Note: This follows the same pattern as PR #6617, #6619, and #6620 which
fixed similar issues in file operations, multi-agent execution, and file
status checks.
2025-12-15 12:02:34 -08:00
cthomas
0d77b373e6 fix: remove concurrent db writes for file upload (#6617) 2025-12-15 12:02:34 -08:00
jnjpng
3221ed8a14 fix: update base provider to only handle _enc fields (#6591)
* base

* update

* another pass

* fix

* generate

* fix test

* don't set on create

* last fixes

---------

Co-authored-by: Letta Bot <noreply@letta.com>
2025-12-15 12:02:34 -08:00
Shubham Naik
99126c6283 feat: add delete scheduled message handler [LET-6496] (#6589)
* feat: add delete scheduled message handler

* chore: scheduled messages

* chore: scheduled messages

* chore: update sources

---------

Co-authored-by: Shubham Naik <shub@memgpt.ai>
2025-12-15 12:02:34 -08:00
Sarah Wooders
c8fa77a01f feat: cleanup cancellation code and add more logging (#6588) 2025-12-15 12:02:34 -08:00
Sarah Wooders
70c57c5072 fix: various patches to summarizer (#6597) 2025-12-15 12:02:34 -08:00
Charles Packer
1c30ad6991 fix(core): patch anthropic context caching busting (#6516)
Co-authored-by: Sarah Wooders <sarahwooders@gmail.com>
2025-12-15 12:02:34 -08:00
Sarah Wooders
8440e319e2 Revert "feat: enable provider models persistence" (#6590)
Revert "feat: enable provider models persistence (#6193)"

This reverts commit 9682aff32640a6ee8cf71a6f18c9fa7cda25c40e.
2025-12-15 12:02:34 -08:00
Sarah Wooders
bbd52e291c feat: refactor summarization and message persistence code [LET-6464] (#6561) 2025-12-15 12:02:34 -08:00
Sarah Wooders
b23722e4a1 fix: also cleanup on asyncio cancel (#6586) 2025-12-15 12:02:34 -08:00
Shubham Naik
b2cae07556 Shub/let 6495 create base code for scheduling [LET-6495] (#6581)
* feat: create base code for scheduling

* feat: create base code for scheduling

* feat: create base code for scheduling

* feat: create base code for scheduling

* chore: redeploy

* chore: redeploy

* chore: userid

---------

Co-authored-by: Shubham Naik <shub@memgpt.ai>
2025-12-15 12:02:34 -08:00
Sarah Wooders
821549817d chore: add error logging if run updates are invalid (#6582) 2025-12-15 12:02:34 -08:00
Sarah Wooders
fca5774795 feat: store run errors on streaming (#6573) 2025-12-15 12:02:34 -08:00
Ari Webb
848a73125c feat: enable provider models persistence (#6193)
* Revert "fix test"

This reverts commit 5126815f23cefb4edad3e3bf9e7083209dcc7bf1.

* fix server and better test

* test fix, get api key for base and byok?

* set letta default endpoint

* try to fix timeout for test

* fix for letta api key

* Delete apps/core/tests/sdk_v1/conftest.py

* Update utils.py

* clean up a few issues

* fix filtering on list_llm_models

* soft delete models with provider

* add one more test

* fix ci

* add timeout

* band aid for letta embedding provider

* info instead of error logs when creating models
2025-12-15 12:02:34 -08:00
Ari Webb
b4af037c19 feat: default preserve filesystem for migration [LET-6366] (#6475)
* feat: default preserve filesystem for migration

* add button on frontend

* stage and publish api and add to templates test

* fix test

* stage and publish api

* agents inherit folders from templates

* sync sources on template update

* don't preserve sources on af upload

* fix test
2025-12-15 12:02:34 -08:00
Kian Jones
3422508d42 feat: add OpenTelemetry distributed tracing to cloud-api and web (#6549)
* feat: add OpenTelemetry distributed tracing to letta-web

Enables end-to-end distributed tracing from letta-web through memgpt-server
using OpenTelemetry. Traces are exported via OTLP to Datadog APM for
monitoring request latency across services.

Key changes:
- Install OTEL packages: @opentelemetry/sdk-node, auto-instrumentations-node
- Create apps/web/src/lib/tracing.ts with full OTEL configuration
- Initialize tracing in instrumentation.ts (before any other imports)
- Add OTEL packages to next.config.js serverExternalPackages
- Add OTEL environment variables to deployment configs:
  - OTEL_EXPORTER_OTLP_ENDPOINT (e.g., http://datadog-agent:4317)
  - OTEL_SERVICE_NAME (letta-web)
  - OTEL_ENABLED (true in production)

Features enabled:
- Automatic HTTP/fetch instrumentation with trace context propagation
- Service metadata (name, version, environment)
- Trace correlation with logs (getCurrentTraceId helper)
- Graceful shutdown handling
- Health check endpoint filtering

Configuration:
- Traces sent to OTLP endpoint (Datadog agent)
- W3C Trace Context propagation for distributed tracing
- BatchSpanProcessor for efficient trace export
- Debug logging in development environment

GitHub variables to set:
- OTEL_EXPORTER_OTLP_ENDPOINT (e.g., http://datadog-agent:4317)
- OTEL_ENABLED (true)

* feat: add OpenTelemetry distributed tracing to cloud-api

Completes end-to-end distributed tracing across the full request chain:
letta-web → cloud-api → memgpt-server (core)

All three services now export traces via OTLP to Datadog APM.

Key changes:
- Install OTEL packages in cloud-api
- Create apps/cloud-api/src/instrument-otel.ts with full OTEL configuration
- Initialize OTEL tracing in main.ts (before Sentry)
- Add OTEL environment variables to deployment configs:
  - OTEL_EXPORTER_OTLP_ENDPOINT (e.g., http://datadog-agent:4317)
  - OTEL_SERVICE_NAME (cloud-api)
  - OTEL_ENABLED (true in production)
  - GIT_HASH (for service version)

Features enabled:
- Automatic HTTP/Express instrumentation
- Trace context propagation (W3C Trace Context)
- Service metadata (name, version, environment)
- Trace correlation with logs (getCurrentTraceId helper)
- Health check endpoint filtering

Configuration:
- Traces sent to OTLP endpoint (Datadog agent)
- Seamless trace propagation through the full request chain
- BatchSpanProcessor for efficient trace export

Complete trace flow:
1. letta-web receives request, starts root span
2. letta-web calls cloud-api, propagates trace context
3. cloud-api calls memgpt-server, propagates trace context
4. All spans linked by trace ID, visible as single trace in Datadog

* fix: prevent duplicate OTEL SDK initialization and handle array headers

Fixes identified by Cursor bugbot:

1. Added initialization guard to prevent duplicate SDK initialization
   - Added isInitialized flag to prevent multiple SDK instances
   - Prevents duplicate SIGTERM handlers from being registered
   - Prevents resource leaks from lost SDK references

2. Fixed array header value handling
   - HTTP headers can be string | string[] | undefined
   - Now properly handles array case by taking first element
   - Prevents passing arrays to span.setAttribute() which expects strings

3. Verified OTEL dependencies are correctly installed
   - Packages are in root package.json (monorepo structure)
   - Available to all workspace packages (web, cloud-api)
   - Bugbot false positive - dependencies ARE present

Applied fixes to both:
- apps/web/src/lib/tracing.ts
- apps/cloud-api/src/instrument-otel.ts

* fix: handle SIGTERM promise rejections and unify initialization pattern

Fixes identified by Cursor bugbot:

1. Fixed unhandled promise rejection in SIGTERM handlers
   - Changed from async arrow function to sync with .catch()
   - Prevents unhandled promise rejections during shutdown
   - Logs errors if OTLP endpoint is unreachable during shutdown
   - Applied to both web and cloud-api

2. Unified initialization pattern across services
   - Removed auto-initialization from cloud-api instrument-otel.ts
   - Now explicitly calls initializeTracing() in main.ts
   - Matches web pattern (explicit call in instrumentation.ts)
   - Reduces confusion and maintains consistency

Both services now follow the same pattern:
- Import tracing module
- Explicitly call initializeTracing()
- Guard against duplicate initialization with isInitialized flag

Before (cloud-api):
  import './instrument-otel'; // Auto-initializes

After (cloud-api):
  import { initializeTracing } from './instrument-otel';
  initializeTracing(); // Explicit call

SIGTERM handler before:
  process.on('SIGTERM', async () => {
    await shutdownTracing(); // Unhandled rejection!
  });

SIGTERM handler after:
  process.on('SIGTERM', () => {
    shutdownTracing().catch((error) => {
      console.error('Error during OTEL shutdown:', error);
    });
  });

* feat: add environment differentiation for distributed tracing

Enables proper environment filtering in Datadog APM by introducing LETTA_ENV
to distinguish between production, staging, canary, and development.

Problem:
- NODE_ENV is always 'production' or 'development'
- No way to differentiate staging, canary, etc. in Datadog
- All traces appeared under no environment or same environment
- Couldn't test with staging traces

Solution:
- Added LETTA_ENV variable (production, staging, canary, development)
- Set deployment.environment attribute for Datadog APM filtering
- Updated all deployment configs (workflows, justfile)
- Falls back to NODE_ENV if LETTA_ENV not set

Changes:
1. Updated tracing code (web + cloud-api):
   - Use LETTA_ENV for environment name
   - Set SEMRESATTRS_DEPLOYMENT_ENVIRONMENT (resolves to deployment.environment)
   - Fallback: LETTA_ENV → NODE_ENV → 'development'

2. Updated deployment configs:
   - .github/workflows/deploy-web.yml: LETTA_ENV=production
   - .github/workflows/deploy-cloud-api.yml: LETTA_ENV=production
   - justfile: LETTA_ENV with default to production

3. Added comprehensive documentation:
   - OTEL_TRACING.md with full setup guide
   - How to view environments in Datadog APM
   - How to test with staging environment
   - Dashboard query examples
   - Troubleshooting guide

Usage:
# Production
LETTA_ENV=production

# Staging
LETTA_ENV=staging

# Local dev
LETTA_ENV=development

Datadog APM now shows:
- env:production (main traffic)
- env:staging (staging deployments)
- env:canary (canary deployments)
- env:development (local testing)

View in Datadog:
APM → Services → Filter by env dropdown → Select production/staging/etc.

* fix: prevent OTEL SDK double shutdown and error handler failures

Fixes identified by Cursor bugbot:

1. SDK double shutdown prevention
   - Set sdk = null after successful shutdown
   - Set isInitialized = false to allow re-initialization
   - Even on shutdown error, mark as shutdown to prevent retry
   - Prevents errors when shutdownTracing() called multiple times
   - Applied to both web and cloud-api

2. Error handler using console.error directly (web only)
   - Replaced dynamic require('./logger') with console.error
   - Logger module may not be loaded during early initialization
   - This code runs in Next.js instrumentation.ts before modules load
   - Prevents masking original OTEL errors with logger failures
   - Cloud-api already correctly used console.error

Before (bug #1):
  await sdk.shutdown();
  // sdk still references shutdown SDK
  // Next call to shutdownTracing() tries to shutdown again

After (bug #1):
  await sdk.shutdown();
  sdk = null; //  Prevent double shutdown
  isInitialized = false; //  Allow re-init

Before (bug #2 - web):
  const { logger } = require('./logger'); //  May fail during init
  logger.error('Failed to initialize OTEL', errorInfo);

After (bug #2 - web):
  console.error('Failed to initialize OTEL:', error); //  Always works

Scenarios protected:
- Multiple SIGTERM signals
- Explicit shutdownTracing() calls
- Logger initialization failures
- Circular dependencies during early init

* feat: add environment differentiation to core and staging deployments

Enables proper environment filtering in Datadog APM for memgpt-server (core)
and staging deployments by adding deployment.environment resource attribute.

Problem:
- Core traces didn't show environment in Datadog APM
- Staging workflow had no OTEL configuration
- Couldn't differentiate staging vs production core traces

Solution:
1. Updated core OTEL resource to include deployment.environment
   - Added deployment.environment attribute in resource.py
   - Uses settings.environment which maps to LETTA_ENVIRONMENT env var
   - Applied .lower() for consistency with web/cloud-api

2. Added LETTA_ENV to staging workflow
   - nightly-staging-deploy-test.yaml: LETTA_ENV=staging
   - Added OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_ENABLED vars
   - Traces from staging will show env:staging in Datadog

3. Added LETTA_ENV to production core workflow
   - deploy-core.yml: LETTA_ENV=production
   - Added OTEL configuration at workflow level
   - Traces from production will show env:production

4. Updated justfile for core deployments
   - Set LETTA_ENVIRONMENT from LETTA_ENV with default to production
   - Maps to settings.environment field (env_prefix="letta_")

Environment mapping:
- Web/Cloud-API: Use LETTA_ENV directly
- Core: Use LETTA_ENVIRONMENT (Pydantic with letta_ prefix)
- Both map to deployment.environment resource attribute

Now all services properly tag traces with environment:
 letta-web: deployment.environment set
 cloud-api: deployment.environment set
 memgpt-server: deployment.environment set

View in Datadog:
APM → Services → Filter by env:production or env:staging

* refactor: unify environment variable to LETTA_ENV across all services

Simplifies environment configuration by using LETTA_ENV consistently across
all three services (web, cloud-api, and core) instead of having core use
LETTA_ENVIRONMENT.

Problem:
- Core used LETTA_ENVIRONMENT (due to Pydantic env_prefix)
- Web and cloud-api used LETTA_ENV
- Confusing to have two different variable names
- Justfile had to map LETTA_ENV → LETTA_ENVIRONMENT

Solution:
- Added validation_alias to core settings.py
- environment field now reads from LETTA_ENV directly
- Falls back to letta_environment for backwards compatibility
- Updated justfile to set LETTA_ENV for core (not LETTA_ENVIRONMENT)
- Updated documentation to clarify consistent naming

Changes:
1. apps/core/letta/settings.py
   - Added validation_alias=AliasChoices("LETTA_ENV", "letta_environment")
   - Prioritizes LETTA_ENV, falls back to letta_environment
   - Updated description to include all environment values

2. justfile
   - Changed --set secrets.LETTA_ENVIRONMENT to --set secrets.LETTA_ENV
   - Now consistent with web and cloud-api deployments

3. OTEL_TRACING.md
   - Added note that all services use LETTA_ENV consistently
   - Fixed trailing whitespace

Before:
- Web: LETTA_ENV
- Cloud-API: LETTA_ENV
- Core: LETTA_ENVIRONMENT 

After:
- Web: LETTA_ENV
- Cloud-API: LETTA_ENV
- Core: LETTA_ENV 

All services now use the same environment variable name!

* refactor: standardize on LETTA_ENVIRONMENT across all services

Unifies environment variable naming to use LETTA_ENVIRONMENT consistently
across all three services (web, cloud-api, and core).

Problem:
- Previous commit tried to use LETTA_ENV everywhere
- Core already uses Pydantic with env_prefix="letta_"
- Better to standardize on LETTA_ENVIRONMENT to match core conventions

Solution:
- All services now read from LETTA_ENVIRONMENT
- Web: process.env.LETTA_ENVIRONMENT
- Cloud-API: process.env.LETTA_ENVIRONMENT
- Core: settings.environment (reads LETTA_ENVIRONMENT via Pydantic prefix)

Changes:
1. apps/web/src/lib/tracing.ts
   - Changed LETTA_ENV → LETTA_ENVIRONMENT

2. apps/cloud-api/src/instrument-otel.ts
   - Changed LETTA_ENV → LETTA_ENVIRONMENT

3. apps/core/letta/settings.py
   - Removed validation_alias (not needed)
   - Uses standard Pydantic env_prefix behavior

4. All workflow files updated:
   - deploy-web.yml: LETTA_ENVIRONMENT=production
   - deploy-cloud-api.yml: LETTA_ENVIRONMENT=production
   - deploy-core.yml: LETTA_ENVIRONMENT=production
   - nightly-staging-deploy-test.yaml: LETTA_ENVIRONMENT=staging
   - stage-web.yaml: LETTA_ENVIRONMENT=staging
   - stage-cloud-api.yaml: LETTA_ENVIRONMENT=staging (added OTEL config)
   - stage-core.yaml: LETTA_ENVIRONMENT=staging (added OTEL config)

5. justfile
   - Updated all LETTA_ENV → LETTA_ENVIRONMENT
   - Web: --set env.LETTA_ENVIRONMENT
   - Cloud-API: --set env.LETTA_ENVIRONMENT
   - Core: --set secrets.LETTA_ENVIRONMENT

6. OTEL_TRACING.md
   - All references updated to LETTA_ENVIRONMENT

Final state:
 Web: LETTA_ENVIRONMENT
 Cloud-API: LETTA_ENVIRONMENT
 Core: LETTA_ENVIRONMENT (via letta_ prefix)

All services use the same variable name with proper Pydantic conventions!

* feat: implement split OTEL architecture (Option A)

Implements Option A: Web and cloud-api send traces directly to Datadog Agent,
while core keeps its existing OTEL sidecar (exports to ClickHouse + Datadog).

Architecture:
- letta-web → Datadog Agent (OTLP:4317) → Datadog APM
- cloud-api → Datadog Agent (OTLP:4317) → Datadog APM
- memgpt-server → OTEL Sidecar → ClickHouse + Datadog (unchanged)

Rationale:
- Core has existing production sidecar setup (exports to ClickHouse for analytics)
- Web/cloud-api don't need ClickHouse export, only APM
- Simpler: Direct to Datadog Agent is sufficient
- Minimal changes to core (already working)
- Traces still link end-to-end via W3C Trace Context propagation

Changes:

1. Helm Charts - Added OTEL config defaults:
   - helm/letta-web/values.yaml: Added OTEL env vars
   - helm/cloud-api/values.yaml: Added OTEL env vars
   - Default: OTEL_ENABLED="false", override in production
   - Endpoint: http://datadog-agent:4317

2. Production Workflows - Direct to Datadog Agent:
   - deploy-web.yml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - deploy-cloud-api.yml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - deploy-core.yml: Removed OTEL vars (keep existing setup)
   - OTEL_ENABLED="true", LETTA_ENVIRONMENT=production

3. Staging Workflows - Direct to Datadog Agent:
   - stage-web.yaml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - stage-cloud-api.yaml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - stage-core.yaml: Removed OTEL vars (keep existing setup)
   - nightly-staging-deploy-test.yaml: Removed OTEL vars
   - OTEL_ENABLED="true", LETTA_ENVIRONMENT=staging

4. Justfile:
   - Removed LETTA_ENVIRONMENT from core deployment (keep unchanged)
   - Web/cloud-api already correctly pass OTEL vars from workflows

5. Documentation:
   - Completely rewrote OTEL_TRACING.md
   - Added architecture diagrams explaining split setup
   - Added Datadog Agent prerequisites
   - Added troubleshooting for split architecture
   - Explained why we chose this approach

Prerequisites (must verify before deploying):
- Datadog Agent deployed with service name: datadog-agent
- OTLP receiver enabled on port 4317
- If different service name/namespace, update workflows

Next Steps:
- Verify datadog-agent service exists in cluster
- Verify OTLP receiver is enabled on Datadog agent
- Deploy and test trace propagation across services

* refactor: shorten environment names to prod and dev

Changes LETTA_ENVIRONMENT values from 'production' to 'prod' and
'development' to 'dev' for consistency and brevity.

Changes:
1. Workflows:
   - deploy-web.yml: production → prod
   - deploy-cloud-api.yml: production → prod

2. Helm charts:
   - letta-web/values.yaml: development → dev
   - cloud-api/values.yaml: development → dev

3. Justfile:
   - Default values: production → prod

4. Code:
   - apps/web/src/lib/tracing.ts: Fallback 'development' → 'dev'
   - apps/cloud-api/src/instrument-otel.ts: Fallback 'development' → 'dev'
   - apps/core/letta/settings.py: Updated description

5. Documentation:
   - OTEL_TRACING.md: Updated all examples and table

Environment values:
- prod (was production)
- staging (unchanged)
- canary (unchanged)
- dev (was development)

* refactor: align environment names with codebase patterns

Changes staging to 'dev' and local development to 'local-test' to match
existing codebase conventions (like test_temporal_metrics_local.py).

Rationale:
- 'dev' for staging matches consistent pattern across codebase
- 'local-test' for local development follows test naming convention
- Clearer distinction between deployed staging and local testing

Environment values:
- prod (production)
- dev (staging/dev cluster)
- canary (canary deployments)
- local-test (local development)

Changes:
1. Staging workflows:
   - stage-web.yaml: staging → dev
   - stage-cloud-api.yaml: staging → dev

2. Helm chart defaults (for local):
   - letta-web/values.yaml: dev → local-test
   - cloud-api/values.yaml: dev → local-test

3. Code fallbacks:
   - apps/web/src/lib/tracing.ts: 'dev' → 'local-test'
   - apps/cloud-api/src/instrument-otel.ts: 'dev' → 'local-test'
   - apps/core/letta/settings.py: Updated description

4. Documentation:
   - OTEL_TRACING.md: Updated table, examples, and all references
   - Clarified dev = staging cluster, local-test = local development

Datadog APM filters:
- env:prod (production)
- env:dev (staging cluster)
- env:canary (canary)
- env:local-test (local development)

* fix: update environment checks for lowercase values and add missing configs

Fixes 4 bugs identified by Cursor bugbot:

1. Case-sensitive environment checks (5 locations)
   - Updated all checks from "PRODUCTION" to case-insensitive "prod"
   - Fixed in: resource.py, multi_agent.py, tool_manager.py,
     multi_agent_tool_executor.py, agent_manager_helper.py
   - Now properly filters local-only tools in production
   - Prevents exposing debug tools in production

2. Device ID leak in production
   - Fixed resource.py to use case-insensitive check
   - Now correctly excludes device.id (MAC address) in production
   - Only adds device.id when env is not "prod"

3. Missing @opentelemetry/sdk-trace-base in Next.js externals
   - Added to serverExternalPackages in next.config.js
   - Prevents webpack bundling issues with native dependencies
   - Package is directly imported for BatchSpanProcessor

4. Missing NEXT_PUBLIC_GIT_HASH in stage-web workflow
   - Added NEXT_PUBLIC_GIT_HASH: ${{ github.sha }}
   - Now matches stage-cloud-api.yaml pattern
   - Staging traces will show correct version instead of 'unknown'
   - Enables correlation of traces with specific deployments

Changes:
- apps/core/letta/otel/resource.py: Case-insensitive check, add device.id only if not prod
- apps/core/letta/functions/function_sets/multi_agent.py: Case-insensitive prod check
- apps/core/letta/services/tool_manager.py: Case-insensitive prod check
- apps/core/letta/services/tool_executor/multi_agent_tool_executor.py: Case-insensitive prod check
- apps/core/letta/services/helpers/agent_manager_helper.py: Case-insensitive prod check
- apps/web/next.config.js: Added @opentelemetry/sdk-trace-base to externals
- .github/workflows/stage-web.yaml: Added NEXT_PUBLIC_GIT_HASH

All checks now use: settings.environment.lower() == "prod"
This matches our new convention: prod/dev/canary/local-test

Also includes: distributed-tracing skill (created in /skill session)

* refactor: keep core PRODUCTION but normalize OTEL tags to prod

Changes approach to maintain backward compatibility with core business logic
while standardizing OTEL environment tags.

Previous approach:
- Changed all "PRODUCTION" checks to lowercase "prod"
- Would break existing core business logic expectations

New approach:
- Core continues using "PRODUCTION" (uppercase) for business logic
- OTEL resource.py normalizes environment to lowercase abbreviated tags
- Web/cloud-api use "prod" directly (they don't have business logic checks)

Changes:

1. Reverted business logic checks to use "PRODUCTION" (uppercase):
   - multi_agent.py: Check for "PRODUCTION" to block tools
   - tool_manager.py: Check for "PRODUCTION" to filter local-only tools
   - multi_agent_tool_executor.py: Check for "PRODUCTION" to block tools
   - agent_manager_helper.py: Check for "PRODUCTION" to filter tools

2. Added environment normalization for OTEL tags:
   - resource.py: New _normalize_environment_tag() function
   - Maps PRODUCTION → prod, DEV/STAGING → dev
   - Other values (CANARY, etc.) converted to lowercase
   - Device ID check reverted to != "PRODUCTION"

3. Updated core deployments to set PRODUCTION:
   - deploy-core.yml: LETTA_ENVIRONMENT=PRODUCTION
   - stage-core.yaml: LETTA_ENVIRONMENT=DEV
   - justfile: Added LETTA_ENVIRONMENT with default PRODUCTION

4. Updated settings description:
   - Clarifies values are uppercase (PRODUCTION, DEV)
   - Notes normalization to lowercase for OTEL tags

Result:
- Core business logic: Uses "PRODUCTION" (unchanged, backward compatible)
- OTEL Datadog tags: Shows "prod" (normalized, consistent with web/cloud-api)
- Web/cloud-api: Continue using "prod" directly (no change needed)
- Device ID properly excluded in PRODUCTION environments

* fix: correct Python FastAPI instrumentation and environment normalization

Fixes 3 bugs identified by Cursor bugbot in distributed-tracing skill:

1. Python import typo (line 50)
   - Was: from opentelemetry.instrumentation.fastapi import FastAPIInstrumentatio
   - Now: from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
   - Missing final 'n' in Instrumentatio
   - Correct class name is FastAPIInstrumentor (with 'or' suffix)

2. Wrong class name usage (line 151)
   - Was: FastAPIInstrumentation.instrument_app()
   - Now: FastAPIInstrumentor.instrument_app()
   - Fixed to match correct OpenTelemetry API

3. Environment tag inconsistency
   - Problem: Python template used .lower() which converts PRODUCTION -> production
   - But resource.py normalizes PRODUCTION -> prod
   - Would create inconsistent tags: 'production' vs 'prod' in Datadog

   Solution:
   - Added _normalize_environment_tag() function to Python template
   - Matches resource.py normalization logic
   - PRODUCTION -> prod, DEV/STAGING -> dev, others lowercase
   - Updated comments in workflows to clarify normalization happens in code

Changes:
- .skills/distributed-tracing/templates/python-fastapi-tracing.py:
  - Fixed import: FastAPIInstrumentor (not FastAPIInstrumentatio)
  - Fixed usage: FastAPIInstrumentor.instrument_app()
  - Added _normalize_environment_tag() function
  - Updated environment handling to use normalization
  - Updated docstring to clarify PRODUCTION/DEV -> prod/dev mapping

- .github/workflows/deploy-core.yml:
  - Clarified comment: _normalize_environment_tag() converts to "prod"

- .github/workflows/stage-core.yaml:
  - Clarified comment: _normalize_environment_tag() converts to "dev"

Result:
All services now consistently show 'prod' (not 'production') in Datadog APM,
enabling proper filtering and correlation across distributed traces.

* fix: add Datadog config to staging workflows and fix justfile backslash

Fixes 3 issues found in staging deployment logs:

1. Missing backslash in justfile (line 134)
   Problem: LETTA_ENVIRONMENT line missing backslash caused all subsequent
   helm --set flags to be ignored, including OTEL_EXPORTER_OTLP_ENDPOINT
   Result: letta-web and cloud-api logs showed "OTEL_EXPORTER_OTLP_ENDPOINT not set"

   Fixed:
   --set env.LETTA_ENVIRONMENT=${LETTA_ENVIRONMENT:-prod} \  # Added backslash

2. Missing Datadog vars in staging workflows
   Problem: stage-web.yaml, stage-cloud-api.yaml, stage-core.yaml didn't set
   DD_SITE, DD_API_KEY, DD_LOGS_INJECTION, etc.

   For web/cloud-api:
   - Added to top-level env section so justfile can use them

   For core:
   - Added to top-level env section
   - Added to Deploy step env section (so justfile can pass to helm)
   - core OTEL collector config reads these from environment

   Result: core logs showed "exporters::datadog: api.key is not set"

3. Wrong environment tag in staging (secondary issue)
   Problem: letta-web logs showed 'dd.env":"production"' in staging
   Cause: Missing backslash broke LETTA_ENVIRONMENT, defaulted to prod
   Fixed: Backslash fix ensures LETTA_ENVIRONMENT=dev is set

Changes:
- justfile: Fixed missing backslash on LETTA_ENVIRONMENT line
- .github/workflows/stage-web.yaml: Added DD_* vars to env
- .github/workflows/stage-cloud-api.yaml: Added DD_* vars to env
- .github/workflows/stage-core.yaml: Added DD_* vars to env and Deploy step

After this fix:
- Web/cloud-api will send traces to Datadog Agent via OTLP
- Core OTEL collector will export traces to both ClickHouse and Datadog
- All staging traces will show env:dev tag (not env:production)

* fix: move OTEL config from prod helm to dev helm values

Problem: OTEL configuration was added to production helm values files
(helm/letta-web/values.yaml and helm/cloud-api/values.yaml) but these
are for production deployments. Staging deployments use the dev helm
values (helm/dev/<service>/values.yaml).

Changes:
- Removed OTEL vars from helm/letta-web/values.yaml (prod)
- Removed OTEL vars from helm/cloud-api/values.yaml (prod)
- Added OTEL vars to helm/dev/letta-web/values.yaml (staging)
- Added OTEL vars to helm/dev/cloud-api/values.yaml (staging)

Dev helm values now include:
  OTEL_ENABLED: "true"
  OTEL_SERVICE_NAME: "letta-web" or "cloud-api"
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://datadog-agent.default.svc.cluster.local:4317"
  LETTA_ENVIRONMENT: "dev"

Note: Production deployments override these via workflow env vars, so
prod helm values don't need OTEL config. Dev/staging deployments use
these helm values as defaults.

* remove generated doc

* secrets in dev

* totally unrelated changes to tf for runner sizing and scaling

* feat: add DD_ENV tags to staging helm for log correlation

Problem: Logs show 'dd.env":"production"' instead of 'dd.env":"dev"'
in staging because Datadog's logger injection uses DD_ENV, DD_SERVICE,
and DD_VERSION environment variables for tagging.

Changes:
- Added DD_ENV, DD_SERVICE, DD_VERSION to helm/dev/letta-web/values.yaml
- Added DD_ENV, DD_SERVICE, DD_VERSION to helm/dev/cloud-api/values.yaml

Values:
  DD_ENV: "dev"
  DD_SERVICE: "letta-web" or "cloud-api"
  DD_VERSION: "dev"

This ensures:
- Logs show correct env:dev tag in Datadog
- Traces and logs are properly correlated
- Consistent tagging across OTEL traces and DD logs

* feat: enable OTLP receiver in Datadog Agent configurations

Added OpenTelemetry Protocol (OTLP) receiver to Datadog Agent for both
dev and prod environments to support distributed tracing from services
using OpenTelemetry SDKs.

Changes:
- helm/dev/datadog/datadog-agent.yaml: Added otlp.receiver configuration
- helm/datadog/datadog-agent.yaml: Added otlp.receiver configuration

OTLP Configuration:
  otlp:
    receiver:
      protocols:
        grpc:
          enabled: true
          endpoint: "0.0.0.0:4317"
        http:
          enabled: true
          endpoint: "0.0.0.0:4318"

This enables:
- Web/cloud-api services to send traces via OTLP (port 4317)
- Core OTEL collector to export to Datadog via OTLP (port 4317)
- Alternative HTTP endpoint for OTLP (port 4318)

When applied, the Datadog Agent service will expose:
- Port 4317/TCP - OTLP gRPC (for traces)
- Port 4318/TCP - OTLP HTTP (for traces)
- Port 8126/TCP - Native Datadog APM (existing)
- Port 8125/UDP - DogStatsD (existing)

Apply with:
  kubectl apply -f helm/dev/datadog/datadog-agent.yaml     # staging
  kubectl apply -f helm/datadog/datadog-agent.yaml         # production

* feat: use git hash as DD_VERSION for all services

Changed from static version strings to using git commit hash as the
version tag in Datadog APM for better version tracking and correlation.

Changes:

1. Workflows - Set DD_VERSION to github.sha:
   - .github/workflows/stage-web.yaml: Added DD_VERSION: ${{ github.sha }}
   - .github/workflows/stage-cloud-api.yaml: Added DD_VERSION: ${{ github.sha }}
   - .github/workflows/stage-core.yaml: Added DD_VERSION: ${{ github.sha }}
     (both top-level env and Deploy step env)

2. Justfile - Pass DD_VERSION to helm:
   - deploy-web: Added --set env.DD_VERSION=${DD_VERSION:-unknown}
   - deploy-cloud-api: Added --set env.DD_VERSION=${DD_VERSION:-unknown}
   - deploy-core: Added --set secrets.DD_VERSION=${DD_VERSION:-unknown}

3. Helm dev values - Remove hardcoded version:
   - helm/dev/letta-web/values.yaml: Removed DD_VERSION: "dev"
   - helm/dev/cloud-api/values.yaml: Removed DD_VERSION: "dev"
   - Added comments that DD_VERSION is set via workflow

Result:
- Traces in Datadog will show version as git commit SHA (e.g., "abc123def")
- Can correlate traces with specific deployments/commits
- Consistent with internal versioning strategy (git hash, not semver)
- Defaults to "unknown" if DD_VERSION not set

Example trace tags after deployment:
  env:dev
  service:letta-web
  version:7eafc5b0c12345...

* feat: add DD_VERSION to production workflows

Added DD_VERSION to production deployment workflows for consistent version
tracking across staging and production environments.

Changes:
- .github/workflows/deploy-web.yml: Added DD_VERSION: ${{ github.sha }}
- .github/workflows/deploy-core.yml: Added DD_VERSION: ${{ github.sha }}

Note: deploy-cloud-api.yml doesn't have DD config yet, will add when
cloud-api gets OTEL enabled in production.

Context:
This was partially flagged by bugbot - it noted that NEXT_PUBLIC_GIT_HASH
was missing from prod, but that was incorrect (line 53 already has it).
However, DD_VERSION was indeed missing and needed for Datadog log
correlation.

Result:
- Production logs will show version tag matching git commit SHA
- Consistent with staging configuration
- Better trace/log correlation in Datadog APM

Staging already has DD_VERSION (added in commit fb1a3eea0)

* feat: add DD tags to memgpt-server dev helm for APM correlation

Problem: memgpt-server logs show up in Datadog but traces don't appear
properly in APM UI because DD_ENV, DD_SERVICE, DD_SITE tags were missing.

The service was using native Datadog agent instrumentation (via
LETTA_TELEMETRY_ENABLE_DATADOG) but without proper unified service tagging,
traces weren't being correlated correctly in the APM interface.

Changes:
- helm/dev/memgpt-server/values.yaml:
  - Added DD_ENV: "dev"
  - Added DD_SERVICE: "memgpt-server"
  - Added DD_SITE: "us5.datadoghq.com"
  - Added comment that DD_VERSION comes from workflow

Existing configuration:
- DD_VERSION already passed via stage-core.yaml (line 215) and justfile (line 272)
- DD_API_KEY already in secretsProvider (line 194)
- LETTA_TELEMETRY_ENABLE_DATADOG: "true" (enables native DD agent)
- LETTA_TELEMETRY_DATADOG_AGENT_HOST/PORT (routes to DD cluster agent)

Result:
After redeployment, memgpt-server traces will show in Datadog APM with:
- env:dev
- service:memgpt-server
- version:<git-hash>
- Proper correlation with logs

* refactor: use image tag for DD_VERSION instead of separate env var

Changed from passing DD_VERSION separately to deriving it from the
image.tag that's already set (which contains the git hash).

This is cleaner because:
- Image tag is already set to git hash via TAG env var
- Removes redundant DD_VERSION from workflows (6 locations)
- Single source of truth for version (the deployed image tag)
- Simpler configuration

Changes:

Workflows (removed DD_VERSION):
- .github/workflows/stage-web.yaml
- .github/workflows/stage-cloud-api.yaml
- .github/workflows/stage-core.yaml (2 locations)
- .github/workflows/deploy-web.yml
- .github/workflows/deploy-core.yml

Justfile (use {{TAG}} instead of ${DD_VERSION}):
- deploy-web: --set env.DD_VERSION={{TAG}}
- deploy-cloud-api: --set env.DD_VERSION={{TAG}}
- deploy-core: --set secrets.DD_VERSION={{TAG}}

Helm values (updated comments):
- helm/dev/letta-web/values.yaml
- helm/dev/cloud-api/values.yaml
- helm/dev/memgpt-server/values.yaml
- Changed from "set via workflow" to "set from image.tag by justfile"

Flow:
1. Workflow sets TAG=${{ github.sha }}
2. Workflow calls justfile with TAG env var
3. Justfile sets image.tag={{TAG}} and DD_VERSION={{TAG}}
4. Both use same git hash value

Example:
  image.tag: abc123def
  DD_VERSION: abc123def
  Both from TAG env var set to github.sha

* feat: add Datadog native tracer (dd-trace) to cloud-api for APM

Problem: cloud-api traces weren't appearing in Datadog APM despite OTEL
being configured. Investigation revealed letta-web uses dd-trace (Datadog's
native tracer) in addition to OTEL, and those traces show up perfectly.

Analysis:
- letta-web: Uses BOTH OTEL + dd-trace → traces visible in APM ✓
- cloud-api: Uses ONLY OTEL → traces NOT visible in APM ✗

Root cause: While OTEL *should* work, dd-trace provides better integration
with Datadog's APM backend and is proven to work in production.

Solution: Add dd-trace initialization to cloud-api, matching letta-web's
dual-tracing approach (OTEL + dd-trace).

Changes:
- apps/cloud-api/src/instrument-otel.ts:
  - Added dd-trace initialization after OTEL setup
  - Checks for DD_API_KEY env var (already configured in helm)
  - Enables logInjection, runtimeMetrics, and profiling
  - Graceful fallback if dd-trace fails to initialize

Dependencies:
- dd-trace@^5.31.0 already available in root package.json

Configuration (already set in helm):
- DD_API_KEY: From secretsProvider ✓
- DD_ENV: "dev" ✓
- DD_SERVICE: "cloud-api" ✓
- DD_LOGS_INJECTION: From workflow ✓

Expected result:
After deployment, cloud-api traces will appear in Datadog APM alongside
letta-web and letta-server, with proper env:dev service:cloud-api tags.

* tweak vars in staging

* fix: initialize Datadog tracer for memgpt-server APM traces

Problem: memgpt-server (letta-server) shows up in Datadog APM with env:null
instead of env:dev, and traces weren't being properly captured.

Root cause: The code was only initializing the Datadog Profiler (for CPU/memory
profiling), but NOT the Tracer (for distributed tracing/APM).

Analysis:
- Profiler: Records performance metrics (CPU, memory) - WAS initialized ✓
- Tracer: Records distributed traces/spans for APM - NOT initialized ✗

The existing code (line 248-256) did:
  from ddtrace.profiling import Profiler  # Only profiler!
  profiler = Profiler(...)
  profiler.start()
  # No tracer initialization!

This explains why:
- letta-server appears in Datadog with env:null (profiling data sent without proper tags)
- Traces don't show proper service/env correlation
- APM service map is incomplete

Solution: Initialize the Datadog tracer with ddtrace.patch_all() to:
1. Auto-instrument FastAPI, HTTP clients, database calls, etc.
2. Send proper distributed traces to Datadog APM
3. Use the DD_ENV, DD_SERVICE env vars already set in helm

Changes:
- apps/core/letta/server/rest_api/app.py:
  - Added import ddtrace
  - Added ddtrace.patch_all() to auto-instrument all libraries
  - Added logging for tracer initialization

Configuration (already set in helm):
- DD_ENV: "dev" ✓
- DD_SERVICE: "memgpt-server" ✓
- DD_SITE: "us5.datadoghq.com" ✓
- DD_VERSION: From image.tag ✓
- DD_AGENT_HOST/PORT: Set by code from settings ✓

Expected result:
After redeployment, letta-server will:
- Show as env:dev (not env:null) in Datadog APM
- Send proper distributed traces with full context
- Appear correctly in service maps and trace explorer

* fix: add dd-trace dependency to cloud-api package.json

Problem: cloud-api Docker image doesn't include dd-trace, causing
"Cannot find module 'dd-trace'" error at runtime.

Root cause: dd-trace is in root package.json but not in cloud-api's
package.json, so it's not included in the Docker build.

Solution: Add dd-trace@^5.31.0 to cloud-api dependencies.

Changes:
- apps/cloud-api/package.json: Added dd-trace dependency

* fix: mark dd-trace as external in cloud-api esbuild config

Problem: esbuild fails when trying to bundle dd-trace because it attempts
to bundle optional GraphQL plugin dependencies that aren't installed.

Error:
  Could not resolve "graphql/language/visitor"
  Could not resolve "graphql/language/printer"
  Could not resolve "graphql/utilities"

Root cause: dd-trace has optional plugins for various frameworks (GraphQL,
MongoDB, etc.) that it loads conditionally at runtime. esbuild tries to
statically analyze and bundle all requires, including these optional deps.

Solution: Add dd-trace to the externals list so it's loaded at runtime
instead of being bundled. This is the standard approach for native modules
and packages with optional dependencies.

Changes:
- apps/cloud-api/esbuild.config.js: Added 'dd-trace' to externals array

Result:
- Build succeeds ✓
- dd-trace loads at runtime with only the plugins it needs ✓
- No GraphQL dependency required ✓

* add dd-trace

* fix: increase cloud-api memory and make dd-trace profiling configurable

Problem: cloud-api pods crash looping with out of memory errors when
dd-trace profiling is enabled:
  FATAL ERROR: JavaScript heap out of memory
  current_heap_limit=268435456 (268MB in 512Mi total)

Root cause: dd-trace profiling is memory-intensive (50-100MB+ overhead)
and the original 512Mi limit was too tight.

Solution: Two-part fix:
1. Increase memory limits: 512Mi → 1Gi (gives profiling room to breathe)
2. Make profiling configurable via DD_PROFILING_ENABLED env var

Changes:

helm/dev/cloud-api/values.yaml:
- resources.limits.memory: 512Mi → 1Gi
- resources.requests.memory: 512Mi → 1Gi
- Added DD_PROFILING_ENABLED: "true"

apps/cloud-api/src/instrument-otel.ts:
- Read DD_PROFILING_ENABLED env var
- Pass to tracer.init({ profiling: profilingEnabled })
- Log profiling status on initialization

Benefits:
✓ Profiling enabled by default (CPU/heap flame graphs in Datadog)
✓ Can disable via env var if needed (set to "false")
✓ More headroom prevents OOM crashes (1Gi vs 512Mi)
✓ Configurable per environment

Memory breakdown with profiling:
- App baseline: ~300-400MB
- dd-trace profiling: ~50-100MB
- Buffer/headroom: ~500MB
- Total: 1Gi (comfortable margin)
2025-12-15 12:02:34 -08:00
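Editor's note: among the many changes in the tracing commit above, the memgpt-server fix boils down to initializing the Datadog tracer (not just the profiler) so traces carry the DD_ENV/DD_SERVICE tags set in helm. A minimal sketch of that initialization, assuming ddtrace is installed; the wrapper function and its enabled flag are illustrative, not the repo's actual code.

```python
import logging

logger = logging.getLogger(__name__)


def init_datadog_tracing(enabled: bool) -> None:
    """Sketch of the tracer bring-up described in the commit message."""
    if not enabled:
        return
    try:
        import ddtrace

        # Auto-instrument FastAPI, HTTP clients, database drivers, etc., so
        # distributed traces reach the Datadog agent tagged with the env,
        # service, and version provided by the deployment.
        ddtrace.patch_all()
        logger.info("Datadog tracer initialized (patch_all)")
    except Exception:
        # Tracing must never take the server down; fall back to a no-op.
        logger.exception("Failed to initialize Datadog tracer")
```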
jnjpng
c48cf021cb fix: set api key encrypted secret for providers in memory (#6571)
base

Co-authored-by: Letta Bot <noreply@letta.com>
2025-12-15 12:02:34 -08:00
jnjpng
2536942be2 fix: combined tool manager improvements - tracing and redundant fetches (#6570)
* fix: combined tool manager improvements - tracing and redundant fetches

This PR combines improvements from #6530 and #6535:

- Add tracer import to enable proper tracing spans
- Improve update check logic to verify actual field changes before updating
- Return current_tool directly when no update is needed (avoids redundant fetch)
- Add structured tracing spans to update_tool_by_id_async for better observability
- Fix decorator order for better error handling (raise_on_invalid_id before trace_method)
- Remove unnecessary tracing spans in create_or_update_tool_async

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* revert: remove tracing spans from update_tool_by_id_async

Remove the tracer span additions from update_tool_by_id_async while keeping
all other improvements (decorator order fix, redundant fetch removal, and
improved update check logic).

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

---------

Co-authored-by: Letta Bot <noreply@letta.com>
2025-12-15 12:02:34 -08:00
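Editor's note: the "redundant fetch" improvement above amounts to diffing the requested update against the current tool and returning the current row untouched when nothing actually changes. A minimal sketch of that check; the Tool fields shown are hypothetical stand-ins for the real schema.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    # Hypothetical subset of the real tool schema.
    id: str
    name: str
    description: str
    source_code: str


def apply_tool_update(current_tool: Tool, updates: dict) -> Tool:
    # Only treat fields whose value actually differs as real changes.
    changed = {
        field: value
        for field, value in updates.items()
        if getattr(current_tool, field, None) != value
    }
    if not changed:
        # No-op update: return the row we already have instead of writing
        # to the DB and re-fetching it.
        return current_tool
    for field, value in changed.items():
        setattr(current_tool, field, value)
    return current_tool  # the real service would persist this
```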
Devansh Jain
d1536df6f6 chore: Update deepseek client for v3.2 models (#6556)
* support for v3.2 models

* streaming + context window fix

* fix for no assistant text from deepseek
2025-12-15 12:02:34 -08:00
cthomas
a7c0bad857 fix: unbound var in summarization [LET-6484] (#6568)
* fix: unbound var in summarization

* fix indentation
2025-12-15 12:02:34 -08:00
jnjpng
fd14657e84 fix: prevent false positive in Secret.get_plaintext() for plaintext values (#6566)
When a Secret is created from plaintext (was_encrypted=False), the
is_encrypted() heuristic can incorrectly identify long API keys as
encrypted. This causes get_plaintext() to return None when no encryption
key is available, even though the value was explicitly stored as plaintext.

Fix: Check was_encrypted flag before trusting is_encrypted() heuristic.
If was_encrypted=False, trust the cached plaintext value.

This is a port of https://github.com/letta-ai/letta/pull/3078 to letta-cloud.

👾 Generated with [Letta Code](https://letta.com)

Co-authored-by: Letta Bot <noreply@letta.com>
2025-12-15 12:02:34 -08:00
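Editor's note: the guard described above can be sketched as: if the Secret was built from plaintext (was_encrypted=False), return the cached value directly and never consult the is_encrypted() heuristic. The class layout and helper names beyond what the commit message states are assumptions.

```python
class Secret:
    """Illustrative sketch of the plaintext fast-path described above."""

    def __init__(self, value: str, was_encrypted: bool, encryption_key: str | None = None):
        self._value = value
        self._was_encrypted = was_encrypted
        self._encryption_key = encryption_key

    def get_plaintext(self) -> str | None:
        # Fast path: the value was explicitly stored as plaintext, so trust it
        # regardless of what the is_encrypted() heuristic would say about a
        # long, high-entropy API key.
        if not self._was_encrypted:
            return self._value
        if self._encryption_key is None:
            # Genuinely encrypted value but no key available to decrypt it.
            return None
        return self._decrypt(self._value)

    def _decrypt(self, ciphertext: str) -> str:
        raise NotImplementedError("decryption elided in this sketch")
```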
Cameron
8c616a2093 fix: add context prompt to sleeptime agent user message (#6564)
* fix: add context prompt to sleeptime agent user message

Previously the sleeptime agent received only the raw conversation
transcript with no context, causing identity confusion where the
agent would believe it was the primary agent.

Now includes a pre-prompt that:
- Uses "sleeptime agent" terminology explicitly
- Clarifies the agent is NOT the primary agent
- Explains message labels (assistant = primary agent)
- States the agent has no prior turns in the transcript
- Describes the memory management role

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore: remove redundant sleeptime pre-prompt line

* chore: add memory_persona reference to sleeptime pre-prompt

* chore: wrap sleeptime pre-prompt in system-reminder tags

* chore: rename transcript to messages in sleeptime pre-prompt

---------

Co-authored-by: Letta <noreply@letta.com>
2025-12-15 12:02:34 -08:00
Cameron
a56c6571d2 fix: update fetch_webpage docstring to reflect actual implementation (#6503)
The docstring incorrectly stated that fetch_webpage uses Jina AI reader.
Updated to accurately describe the actual implementation which uses:
1. Exa API (if EXA_API_KEY is available)
2. Trafilatura (fallback)
3. Readability + html2text (final fallback)

🐾 Generated with [Letta Code](https://letta.com)

Co-authored-by: Letta <noreply@letta.com>
2025-12-15 12:02:34 -08:00
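Editor's note: the corrected docstring above describes a three-stage fallback: Exa (when EXA_API_KEY is set), then Trafilatura, then Readability plus html2text. A rough sketch of that chain, assuming the requests, trafilatura, readability-lxml, and html2text packages; the Exa step is left behind a hypothetical helper because its client API is not shown in the commit.

```python
import os

import html2text
import requests
import trafilatura
from readability import Document


def fetch_webpage(url: str) -> str:
    """Return page content as text, mirroring the fallback order above."""
    if os.getenv("EXA_API_KEY"):
        try:
            return _fetch_via_exa(url)  # hypothetical helper wrapping the Exa API
        except Exception:
            pass  # fall through to local extraction

    html = requests.get(url, timeout=30).text

    # Second choice: Trafilatura's article extraction.
    extracted = trafilatura.extract(html)
    if extracted:
        return extracted

    # Final fallback: Readability isolates the main content, html2text flattens it.
    return html2text.html2text(Document(html).summary())


def _fetch_via_exa(url: str) -> str:
    raise NotImplementedError("Exa-backed retrieval elided in this sketch")
```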
Kian Jones
86fbd39a16 feat: add dd instrumentation to web (#6531)
* add dd instrumentation to web

* instrument web fully

* omit dd

* add error handling for dd-trace initialization

* use logger instead of console in dd-trace error handling

* exception replay

* fix dd-trace native module bundling and error serialization
2025-12-15 12:02:34 -08:00
Kevin Lin
1ca9df0626 feat: Add memory_apply_patch to base tools (#6491)
add memory apply patch
2025-12-15 12:02:34 -08:00