Kian Jones 3422508d42 feat: add OpenTelemetry distributed tracing to cloud-api and web (#6549)

* feat: add OpenTelemetry distributed tracing to letta-web

Enables end-to-end distributed tracing from letta-web through memgpt-server
using OpenTelemetry. Traces are exported via OTLP to Datadog APM for
monitoring request latency across services.

Key changes:
- Install OTEL packages: @opentelemetry/sdk-node, auto-instrumentations-node
- Create apps/web/src/lib/tracing.ts with full OTEL configuration
- Initialize tracing in instrumentation.ts (before any other imports)
- Add OTEL packages to next.config.js serverExternalPackages
- Add OTEL environment variables to deployment configs:
  - OTEL_EXPORTER_OTLP_ENDPOINT (e.g., http://datadog-agent:4317)
  - OTEL_SERVICE_NAME (letta-web)
  - OTEL_ENABLED (true in production)

Features enabled:
- Automatic HTTP/fetch instrumentation with trace context propagation
- Service metadata (name, version, environment)
- Trace correlation with logs (getCurrentTraceId helper)
- Graceful shutdown handling
- Health check endpoint filtering

Configuration:
- Traces sent to OTLP endpoint (Datadog agent)
- W3C Trace Context propagation for distributed tracing
- BatchSpanProcessor for efficient trace export
- Debug logging in development environment

GitHub variables to set:
- OTEL_EXPORTER_OTLP_ENDPOINT (e.g., http://datadog-agent:4317)
- OTEL_ENABLED (true)
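
A minimal sketch of how the three variables above might be read at startup. The function name readTracingConfig, the TracingConfig shape, and the rule that tracing is only enabled when OTEL_ENABLED is exactly "true" and an endpoint is present are assumptions for illustration, not the contents of tracing.ts.

```typescript
// Hypothetical config shape; only the three env var names come from the commit message.
interface TracingConfig {
  enabled: boolean;
  endpoint: string | undefined;
  serviceName: string;
}

// Takes the environment as a parameter so it can be exercised without touching process.env.
function readTracingConfig(
  env: Record<string, string | undefined>,
  defaultServiceName: string,
): TracingConfig {
  const endpoint = env["OTEL_EXPORTER_OTLP_ENDPOINT"];
  return {
    // Assumption: require both an explicit opt-in and a configured endpoint.
    enabled: env["OTEL_ENABLED"] === "true" && Boolean(endpoint),
    endpoint,
    // Assumption: fall back to a caller-supplied name when OTEL_SERVICE_NAME is unset.
    serviceName: env["OTEL_SERVICE_NAME"] || defaultServiceName,
  };
}
```

In a service this would be called once with process.env, e.g. readTracingConfig(process.env, "letta-web"), before deciding whether to start the SDK.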

* feat: add OpenTelemetry distributed tracing to cloud-api

Completes end-to-end distributed tracing across the full request chain:
letta-web → cloud-api → memgpt-server (core)

All three services now export traces via OTLP to Datadog APM.

Key changes:
- Install OTEL packages in cloud-api
- Create apps/cloud-api/src/instrument-otel.ts with full OTEL configuration
- Initialize OTEL tracing in main.ts (before Sentry)
- Add OTEL environment variables to deployment configs:
  - OTEL_EXPORTER_OTLP_ENDPOINT (e.g., http://datadog-agent:4317)
  - OTEL_SERVICE_NAME (cloud-api)
  - OTEL_ENABLED (true in production)
  - GIT_HASH (for service version)

Features enabled:
- Automatic HTTP/Express instrumentation
- Trace context propagation (W3C Trace Context)
- Service metadata (name, version, environment)
- Trace correlation with logs (getCurrentTraceId helper)
- Health check endpoint filtering
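
Health check filtering can be sketched as a small predicate. The path list below is hypothetical (the commit message does not name the actual endpoints), and isHealthCheck is an illustrative name. A predicate like this could plausibly be wired into the HTTP instrumentation's request-ignore hook, but that wiring is not shown here.

```typescript
// Hypothetical health-check paths; the real list is not given in the commit message.
const HEALTH_CHECK_PATHS: string[] = ["/health", "/healthz", "/api/health"];

// Returns true when the request URL (path plus optional query string) targets a health endpoint.
function isHealthCheck(url: string | undefined): boolean {
  if (!url) return false;
  // Strip the query string so "/health?probe=1" still matches "/health".
  const queryStart = url.indexOf("?");
  const path = queryStart === -1 ? url : url.slice(0, queryStart);
  return HEALTH_CHECK_PATHS.indexOf(path) !== -1;
}
```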

Configuration:
- Traces sent to OTLP endpoint (Datadog agent)
- Seamless trace propagation through the full request chain
- BatchSpanProcessor for efficient trace export

Complete trace flow:
1. letta-web receives request, starts root span
2. letta-web calls cloud-api, propagates trace context
3. cloud-api calls memgpt-server, propagates trace context
4. All spans linked by trace ID, visible as single trace in Datadog
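
The trace flow above relies on the W3C traceparent header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The sketch below illustrates the idea only: the OTEL SDK does this automatically, and the names parseTraceparent, formatTraceparent, propagate, and randomHex are made up for this example. Math.random is used for brevity and is not how a real SDK generates IDs.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex chars, shared by every span in the trace
  spanId: string;  // 16 lowercase hex chars, unique per span
  flags: string;   // 2 hex chars, "01" means sampled
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Parse an incoming traceparent header; returns null if absent or malformed.
function parseTraceparent(header: string | undefined): TraceContext | null {
  const match = header ? TRACEPARENT.exec(header) : null;
  if (!match) return null;
  const traceId = match[1];
  const spanId = match[2];
  const flags = match[3];
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, flags };
}

function formatTraceparent(ctx: TraceContext): string {
  return "00-" + ctx.traceId + "-" + ctx.spanId + "-" + ctx.flags;
}

// Illustration-only random hex generator (not cryptographically strong).
function randomHex(length: number): string {
  let out = "";
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Build the header a service sends downstream: keep the incoming trace ID
// (or start a new trace) and mint a fresh span ID for this hop.
function propagate(incoming: string | undefined): string {
  const parent = parseTraceparent(incoming);
  return formatTraceparent({
    traceId: parent ? parent.traceId : randomHex(32),
    spanId: randomHex(16),
    flags: parent ? parent.flags : "01",
  });
}
```

Calling propagate(undefined) models letta-web starting a root span; feeding its result into propagate again models cloud-api forwarding the same trace ID to memgpt-server.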

* fix: prevent duplicate OTEL SDK initialization and handle array headers

Fixes identified by Cursor bugbot:

1. Added initialization guard to prevent duplicate SDK initialization
   - Added isInitialized flag to prevent multiple SDK instances
   - Prevents duplicate SIGTERM handlers from being registered
   - Prevents resource leaks from lost SDK references

2. Fixed array header value handling
   - HTTP headers can be string | string[] | undefined
   - Now properly handles array case by taking first element
   - Prevents passing arrays to span.setAttribute() which expects strings

3. Verified OTEL dependencies are correctly installed
   - Packages are in root package.json (monorepo structure)
   - Available to all workspace packages (web, cloud-api)
   - Bugbot false positive - dependencies ARE present

Applied fixes to both:
- apps/web/src/lib/tracing.ts
- apps/cloud-api/src/instrument-otel.ts
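
A minimal sketch of the array-header handling from fix 2. The helper name firstHeaderValue is hypothetical; the actual code in tracing.ts and instrument-otel.ts may be structured differently.

```typescript
// Node's HTTP headers may arrive as a string, an array of strings, or be absent.
type HeaderValue = string | string[] | undefined;

// Collapse a header to a single string by taking the first element of an array,
// so the result is always safe to pass to an API that expects a string.
function firstHeaderValue(value: HeaderValue): string | undefined {
  return Array.isArray(value) ? value[0] : value;
}
```

A caller would then set the span attribute only when the result is defined, which avoids handing an array to span.setAttribute().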

* fix: handle SIGTERM promise rejections and unify initialization pattern

Fixes identified by Cursor bugbot:

1. Fixed unhandled promise rejection in SIGTERM handlers
   - Changed from async arrow function to sync with .catch()
   - Prevents unhandled promise rejections during shutdown
   - Logs errors if OTLP endpoint is unreachable during shutdown
   - Applied to both web and cloud-api

2. Unified initialization pattern across services
   - Removed auto-initialization from cloud-api instrument-otel.ts
   - Now explicitly calls initializeTracing() in main.ts
   - Matches web pattern (explicit call in instrumentation.ts)
   - Reduces confusion and maintains consistency

Both services now follow the same pattern:
- Import tracing module
- Explicitly call initializeTracing()
- Guard against duplicate initialization with isInitialized flag

Before (cloud-api):
  import './instrument-otel'; // Auto-initializes

After (cloud-api):
  import { initializeTracing } from './instrument-otel';
  initializeTracing(); // Explicit call

SIGTERM handler before:
  process.on('SIGTERM', async () => {
    await shutdownTracing(); // Unhandled rejection!
  });

SIGTERM handler after:
  process.on('SIGTERM', () => {
    shutdownTracing().catch((error) => {
      console.error('Error during OTEL shutdown:', error);
    });
  });

* feat: add environment differentiation for distributed tracing

Enables proper environment filtering in Datadog APM by introducing LETTA_ENV
to distinguish between production, staging, canary, and development.

Problem:
- NODE_ENV is always 'production' or 'development'
- No way to differentiate staging, canary, etc. in Datadog
- All traces appeared with no environment or under a single shared environment
- Couldn't test with staging traces

Solution:
- Added LETTA_ENV variable (production, staging, canary, development)
- Set deployment.environment attribute for Datadog APM filtering
- Updated all deployment configs (workflows, justfile)
- Falls back to NODE_ENV if LETTA_ENV not set

Changes:
1. Updated tracing code (web + cloud-api):
   - Use LETTA_ENV for environment name
   - Set SEMRESATTRS_DEPLOYMENT_ENVIRONMENT (resolves to deployment.environment)
   - Fallback: LETTA_ENV → NODE_ENV → 'development'

2. Updated deployment configs:
   - .github/workflows/deploy-web.yml: LETTA_ENV=production
   - .github/workflows/deploy-cloud-api.yml: LETTA_ENV=production
   - justfile: LETTA_ENV with default to production

3. Added comprehensive documentation:
   - OTEL_TRACING.md with full setup guide
   - How to view environments in Datadog APM
   - How to test with staging environment
   - Dashboard query examples
   - Troubleshooting guide

Usage:
# Production
LETTA_ENV=production

# Staging
LETTA_ENV=staging

# Local dev
LETTA_ENV=development

Datadog APM now shows:
- env:production (main traffic)
- env:staging (staging deployments)
- env:canary (canary deployments)
- env:development (local testing)

View in Datadog:
APM → Services → Filter by env dropdown → Select production/staging/etc.

* fix: prevent OTEL SDK double shutdown and error handler failures

Fixes identified by Cursor bugbot:

1. SDK double shutdown prevention
   - Set sdk = null after successful shutdown
   - Set isInitialized = false to allow re-initialization
   - Even on shutdown error, mark as shutdown to prevent retry
   - Prevents errors when shutdownTracing() called multiple times
   - Applied to both web and cloud-api

2. Error handler using console.error directly (web only)
   - Replaced dynamic require('./logger') with console.error
   - Logger module may not be loaded during early initialization
   - This code runs in Next.js instrumentation.ts before modules load
   - Prevents masking original OTEL errors with logger failures
   - Cloud-api already correctly used console.error

Before (bug #1):
  await sdk.shutdown();
  // sdk still references shutdown SDK
  // Next call to shutdownTracing() tries to shutdown again

After (bug #1):
  await sdk.shutdown();
  sdk = null; // ✅ Prevent double shutdown
  isInitialized = false; // ✅ Allow re-init

Before (bug #2 - web):
  const { logger } = require('./logger'); // ❌ May fail during init
  logger.error('Failed to initialize OTEL', errorInfo);

After (bug #2 - web):
  console.error('Failed to initialize OTEL:', error); // ✅ Always works

Scenarios protected:
- Multiple SIGTERM signals
- Explicit shutdownTracing() calls
- Logger initialization failures
- Circular dependencies during early init
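
The initialization guard and double-shutdown protection can be sketched together. This is an illustration under assumptions: TracingSdk is a stand-in interface with only start() and shutdown(), and the createSdk factory parameter is hypothetical, added so the sketch runs without the OTEL packages installed.

```typescript
// Stand-in for the SDK object; only the two methods the guard relies on.
interface TracingSdk {
  start(): void;
  shutdown(): Promise<void>;
}

// Module-level state, mirroring the sdk reference and isInitialized flag described above.
let sdk: TracingSdk | null = null;
let isInitialized = false;

// Returns true if this call started the SDK, false if it was already running.
function initializeTracing(createSdk: () => TracingSdk): boolean {
  if (isInitialized) return false; // guard against duplicate initialization
  sdk = createSdk();
  sdk.start();
  isInitialized = true;
  return true;
}

// Safe to call any number of times; only the first call reaches sdk.shutdown().
async function shutdownTracing(): Promise<void> {
  if (!sdk) return; // already shut down, or never initialized
  try {
    await sdk.shutdown();
  } finally {
    // Clear state even when shutdown rejects, so a later call cannot retry
    // against a dead SDK, and re-initialization stays possible.
    sdk = null;
    isInitialized = false;
  }
}
```

A rejection from sdk.shutdown() still propagates to the caller, which is why the SIGTERM handler attaches .catch() rather than awaiting inside an async callback.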

* feat: add environment differentiation to core and staging deployments

Enables proper environment filtering in Datadog APM for memgpt-server (core)
and staging deployments by adding deployment.environment resource attribute.

Problem:
- Core traces didn't show environment in Datadog APM
- Staging workflow had no OTEL configuration
- Couldn't differentiate staging vs production core traces

Solution:
1. Updated core OTEL resource to include deployment.environment
   - Added deployment.environment attribute in resource.py
   - Uses settings.environment which maps to LETTA_ENVIRONMENT env var
   - Applied .lower() for consistency with web/cloud-api

2. Added LETTA_ENV to staging workflow
   - nightly-staging-deploy-test.yaml: LETTA_ENV=staging
   - Added OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_ENABLED vars
   - Traces from staging will show env:staging in Datadog

3. Added LETTA_ENV to production core workflow
   - deploy-core.yml: LETTA_ENV=production
   - Added OTEL configuration at workflow level
   - Traces from production will show env:production

4. Updated justfile for core deployments
   - Set LETTA_ENVIRONMENT from LETTA_ENV with default to production
   - Maps to settings.environment field (env_prefix="letta_")

Environment mapping:
- Web/Cloud-API: Use LETTA_ENV directly
- Core: Use LETTA_ENVIRONMENT (Pydantic with letta_ prefix)
- Both map to deployment.environment resource attribute

Now all services properly tag traces with environment:
✅ letta-web: deployment.environment set
✅ cloud-api: deployment.environment set
✅ memgpt-server: deployment.environment set

View in Datadog:
APM → Services → Filter by env:production or env:staging

* refactor: unify environment variable to LETTA_ENV across all services

Simplifies environment configuration by using LETTA_ENV consistently across
all three services (web, cloud-api, and core) instead of having core use
LETTA_ENVIRONMENT.

Problem:
- Core used LETTA_ENVIRONMENT (due to Pydantic env_prefix)
- Web and cloud-api used LETTA_ENV
- Confusing to have two different variable names
- Justfile had to map LETTA_ENV → LETTA_ENVIRONMENT

Solution:
- Added validation_alias to core settings.py
- environment field now reads from LETTA_ENV directly
- Falls back to letta_environment for backwards compatibility
- Updated justfile to set LETTA_ENV for core (not LETTA_ENVIRONMENT)
- Updated documentation to clarify consistent naming

Changes:
1. apps/core/letta/settings.py
   - Added validation_alias=AliasChoices("LETTA_ENV", "letta_environment")
   - Prioritizes LETTA_ENV, falls back to letta_environment
   - Updated description to include all environment values

2. justfile
   - Changed --set secrets.LETTA_ENVIRONMENT to --set secrets.LETTA_ENV
   - Now consistent with web and cloud-api deployments

3. OTEL_TRACING.md
   - Added note that all services use LETTA_ENV consistently
   - Fixed trailing whitespace

Before:
- Web: LETTA_ENV
- Cloud-API: LETTA_ENV
- Core: LETTA_ENVIRONMENT ❌

After:
- Web: LETTA_ENV
- Cloud-API: LETTA_ENV
- Core: LETTA_ENV ✅

All services now use the same environment variable name!

* refactor: standardize on LETTA_ENVIRONMENT across all services

Unifies environment variable naming to use LETTA_ENVIRONMENT consistently
across all three services (web, cloud-api, and core).

Problem:
- Previous commit tried to use LETTA_ENV everywhere
- Core already uses Pydantic with env_prefix="letta_"
- Better to standardize on LETTA_ENVIRONMENT to match core conventions

Solution:
- All services now read from LETTA_ENVIRONMENT
- Web: process.env.LETTA_ENVIRONMENT
- Cloud-API: process.env.LETTA_ENVIRONMENT
- Core: settings.environment (reads LETTA_ENVIRONMENT via Pydantic prefix)

Changes:
1. apps/web/src/lib/tracing.ts
   - Changed LETTA_ENV → LETTA_ENVIRONMENT

2. apps/cloud-api/src/instrument-otel.ts
   - Changed LETTA_ENV → LETTA_ENVIRONMENT

3. apps/core/letta/settings.py
   - Removed validation_alias (not needed)
   - Uses standard Pydantic env_prefix behavior

4. All workflow files updated:
   - deploy-web.yml: LETTA_ENVIRONMENT=production
   - deploy-cloud-api.yml: LETTA_ENVIRONMENT=production
   - deploy-core.yml: LETTA_ENVIRONMENT=production
   - nightly-staging-deploy-test.yaml: LETTA_ENVIRONMENT=staging
   - stage-web.yaml: LETTA_ENVIRONMENT=staging
   - stage-cloud-api.yaml: LETTA_ENVIRONMENT=staging (added OTEL config)
   - stage-core.yaml: LETTA_ENVIRONMENT=staging (added OTEL config)

5. justfile
   - Updated all LETTA_ENV → LETTA_ENVIRONMENT
   - Web: --set env.LETTA_ENVIRONMENT
   - Cloud-API: --set env.LETTA_ENVIRONMENT
   - Core: --set secrets.LETTA_ENVIRONMENT

6. OTEL_TRACING.md
   - All references updated to LETTA_ENVIRONMENT

Final state:
✅ Web: LETTA_ENVIRONMENT
✅ Cloud-API: LETTA_ENVIRONMENT
✅ Core: LETTA_ENVIRONMENT (via letta_ prefix)

All services use the same variable name with proper Pydantic conventions!

* feat: implement split OTEL architecture (Option A)

Implements Option A: Web and cloud-api send traces directly to Datadog Agent,
while core keeps its existing OTEL sidecar (exports to ClickHouse + Datadog).

Architecture:
- letta-web → Datadog Agent (OTLP:4317) → Datadog APM
- cloud-api → Datadog Agent (OTLP:4317) → Datadog APM
- memgpt-server → OTEL Sidecar → ClickHouse + Datadog (unchanged)

Rationale:
- Core has existing production sidecar setup (exports to ClickHouse for analytics)
- Web/cloud-api don't need ClickHouse export, only APM
- Simpler: Direct to Datadog Agent is sufficient
- Minimal changes to core (already working)
- Traces still link end-to-end via W3C Trace Context propagation

Changes:

1. Helm Charts - Added OTEL config defaults:
   - helm/letta-web/values.yaml: Added OTEL env vars
   - helm/cloud-api/values.yaml: Added OTEL env vars
   - Default: OTEL_ENABLED="false", override in production
   - Endpoint: http://datadog-agent:4317

2. Production Workflows - Direct to Datadog Agent:
   - deploy-web.yml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - deploy-cloud-api.yml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - deploy-core.yml: Removed OTEL vars (keep existing setup)
   - OTEL_ENABLED="true", LETTA_ENVIRONMENT=production

3. Staging Workflows - Direct to Datadog Agent:
   - stage-web.yaml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - stage-cloud-api.yaml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - stage-core.yaml: Removed OTEL vars (keep existing setup)
   - nightly-staging-deploy-test.yaml: Removed OTEL vars
   - OTEL_ENABLED="true", LETTA_ENVIRONMENT=staging

4. Justfile:
   - Removed LETTA_ENVIRONMENT from core deployment (keep unchanged)
   - Web/cloud-api already correctly pass OTEL vars from workflows

5. Documentation:
   - Completely rewrote OTEL_TRACING.md
   - Added architecture diagrams explaining split setup
   - Added Datadog Agent prerequisites
   - Added troubleshooting for split architecture
   - Explained why we chose this approach

Prerequisites (must verify before deploying):
- Datadog Agent deployed with service name: datadog-agent
- OTLP receiver enabled on port 4317
- If different service name/namespace, update workflows

Next Steps:
- Verify datadog-agent service exists in cluster
- Verify OTLP receiver is enabled on Datadog agent
- Deploy and test trace propagation across services

* refactor: shorten environment names to prod and dev

Changes LETTA_ENVIRONMENT values from 'production' to 'prod' and
'development' to 'dev' for consistency and brevity.

Changes:
1. Workflows:
   - deploy-web.yml: production → prod
   - deploy-cloud-api.yml: production → prod

2. Helm charts:
   - letta-web/values.yaml: development → dev
   - cloud-api/values.yaml: development → dev

3. Justfile:
   - Default values: production → prod

4. Code:
   - apps/web/src/lib/tracing.ts: Fallback 'development' → 'dev'
   - apps/cloud-api/src/instrument-otel.ts: Fallback 'development' → 'dev'
   - apps/core/letta/settings.py: Updated description

5. Documentation:
   - OTEL_TRACING.md: Updated all examples and table

Environment values:
- prod (was production)
- staging (unchanged)
- canary (unchanged)
- dev (was development)

* refactor: align environment names with codebase patterns

Changes staging to 'dev' and local development to 'local-test' to match
existing codebase conventions (like test_temporal_metrics_local.py).

Rationale:
- 'dev' for staging matches consistent pattern across codebase
- 'local-test' for local development follows test naming convention
- Clearer distinction between deployed staging and local testing

Environment values:
- prod (production)
- dev (staging/dev cluster)
- canary (canary deployments)
- local-test (local development)

Changes:
1. Staging workflows:
   - stage-web.yaml: staging → dev
   - stage-cloud-api.yaml: staging → dev

2. Helm chart defaults (for local):
   - letta-web/values.yaml: dev → local-test
   - cloud-api/values.yaml: dev → local-test

3. Code fallbacks:
   - apps/web/src/lib/tracing.ts: 'dev' → 'local-test'
   - apps/cloud-api/src/instrument-otel.ts: 'dev' → 'local-test'
   - apps/core/letta/settings.py: Updated description

4. Documentation:
   - OTEL_TRACING.md: Updated table, examples, and all references
   - Clarified dev = staging cluster, local-test = local development

Datadog APM filters:
- env:prod (production)
- env:dev (staging cluster)
- env:canary (canary)
- env:local-test (local development)

* fix: update environment checks for lowercase values and add missing configs

Fixes 4 bugs identified by Cursor bugbot:

1. Case-sensitive environment checks (5 locations)
   - Updated all checks from "PRODUCTION" to case-insensitive "prod"
   - Fixed in: resource.py, multi_agent.py, tool_manager.py,
     multi_agent_tool_executor.py, agent_manager_helper.py
   - Now properly filters local-only tools in production
   - Prevents exposing debug tools in production

2. Device ID leak in production
   - Fixed resource.py to use case-insensitive check
   - Now correctly excludes device.id (MAC address) in production
   - Only adds device.id when env is not "prod"

3. Missing @opentelemetry/sdk-trace-base in Next.js externals
   - Added to serverExternalPackages in next.config.js
   - Prevents webpack bundling issues with native dependencies
   - Package is directly imported for BatchSpanProcessor

4. Missing NEXT_PUBLIC_GIT_HASH in stage-web workflow
   - Added NEXT_PUBLIC_GIT_HASH: ${{ github.sha }}
   - Now matches stage-cloud-api.yaml pattern
   - Staging traces will show correct version instead of 'unknown'
   - Enables correlation of traces with specific deployments

Changes:
- apps/core/letta/otel/resource.py: Case-insensitive check, add device.id only if not prod
- apps/core/letta/functions/function_sets/multi_agent.py: Case-insensitive prod check
- apps/core/letta/services/tool_manager.py: Case-insensitive prod check
- apps/core/letta/services/tool_executor/multi_agent_tool_executor.py: Case-insensitive prod check
- apps/core/letta/services/helpers/agent_manager_helper.py: Case-insensitive prod check
- apps/web/next.config.js: Added @opentelemetry/sdk-trace-base to externals
- .github/workflows/stage-web.yaml: Added NEXT_PUBLIC_GIT_HASH

All checks now use: settings.environment.lower() == "prod"
This matches our new convention: prod/dev/canary/local-test

Also includes: distributed-tracing skill (created in /skill session)

* refactor: keep core PRODUCTION but normalize OTEL tags to prod

Changes approach to maintain backward compatibility with core business logic
while standardizing OTEL environment tags.

Previous approach:
- Changed all "PRODUCTION" checks to lowercase "prod"
- Would break existing core business logic expectations

New approach:
- Core continues using "PRODUCTION" (uppercase) for business logic
- OTEL resource.py normalizes environment to lowercase abbreviated tags
- Web/cloud-api use "prod" directly (they don't have business logic checks)

Changes:

1. Reverted business logic checks to use "PRODUCTION" (uppercase):
   - multi_agent.py: Check for "PRODUCTION" to block tools
   - tool_manager.py: Check for "PRODUCTION" to filter local-only tools
   - multi_agent_tool_executor.py: Check for "PRODUCTION" to block tools
   - agent_manager_helper.py: Check for "PRODUCTION" to filter tools

2. Added environment normalization for OTEL tags:
   - resource.py: New _normalize_environment_tag() function
   - Maps PRODUCTION → prod, DEV/STAGING → dev
   - Other values (CANARY, etc.) converted to lowercase
   - Device ID check reverted to != "PRODUCTION"

3. Updated core deployments to set PRODUCTION:
   - deploy-core.yml: LETTA_ENVIRONMENT=PRODUCTION
   - stage-core.yaml: LETTA_ENVIRONMENT=DEV
   - justfile: Added LETTA_ENVIRONMENT with default PRODUCTION

4. Updated settings description:
   - Clarifies values are uppercase (PRODUCTION, DEV)
   - Notes normalization to lowercase for OTEL tags

Result:
- Core business logic: Uses "PRODUCTION" (unchanged, backward compatible)
- OTEL Datadog tags: Shows "prod" (normalized, consistent with web/cloud-api)
- Web/cloud-api: Continue using "prod" directly (no change needed)
- Device ID properly excluded in PRODUCTION environments
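
The tag mapping can be illustrated as follows. The real _normalize_environment_tag() lives in Python in resource.py; this TypeScript rendering only demonstrates the rule stated above, and matching the input case-insensitively is an assumption.

```typescript
// Map core's uppercase environment names to the short lowercase tags used in Datadog.
function normalizeEnvironmentTag(environment: string): string {
  const upper = environment.toUpperCase(); // assumption: input case is not significant
  if (upper === "PRODUCTION") return "prod";
  if (upper === "DEV" || upper === "STAGING") return "dev";
  // Any other value (CANARY, etc.) is simply lowercased.
  return environment.toLowerCase();
}
```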

* fix: correct Python FastAPI instrumentation and environment normalization

Fixes 3 bugs identified by Cursor bugbot in distributed-tracing skill:

1. Python import typo (line 50)
   - Was: from opentelemetry.instrumentation.fastapi import FastAPIInstrumentatio
   - Now: from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
   - The old import named a truncated, nonexistent identifier (Instrumentatio)
   - Correct class name is FastAPIInstrumentor (with 'or' suffix)

2. Wrong class name usage (line 151)
   - Was: FastAPIInstrumentation.instrument_app()
   - Now: FastAPIInstrumentor.instrument_app()
   - Fixed to match correct OpenTelemetry API

3. Environment tag inconsistency
   - Problem: Python template used .lower() which converts PRODUCTION -> production
   - But resource.py normalizes PRODUCTION -> prod
   - Would create inconsistent tags: 'production' vs 'prod' in Datadog

   Solution:
   - Added _normalize_environment_tag() function to Python template
   - Matches resource.py normalization logic
   - PRODUCTION -> prod, DEV/STAGING -> dev, others lowercase
   - Updated comments in workflows to clarify normalization happens in code

Changes:
- .skills/distributed-tracing/templates/python-fastapi-tracing.py:
  - Fixed import: FastAPIInstrumentor (not FastAPIInstrumentatio)
  - Fixed usage: FastAPIInstrumentor.instrument_app()
  - Added _normalize_environment_tag() function
  - Updated environment handling to use normalization
  - Updated docstring to clarify PRODUCTION/DEV -> prod/dev mapping

- .github/workflows/deploy-core.yml:
  - Clarified comment: _normalize_environment_tag() converts to "prod"

- .github/workflows/stage-core.yaml:
  - Clarified comment: _normalize_environment_tag() converts to "dev"

Result:
All services now consistently show 'prod' (not 'production') in Datadog APM,
enabling proper filtering and correlation across distributed traces.

* fix: add Datadog config to staging workflows and fix justfile backslash

Fixes 3 issues found in staging deployment logs:

1. Missing backslash in justfile (line 134)
   Problem: LETTA_ENVIRONMENT line missing backslash caused all subsequent
   helm --set flags to be ignored, including OTEL_EXPORTER_OTLP_ENDPOINT
   Result: letta-web and cloud-api logs showed "OTEL_EXPORTER_OTLP_ENDPOINT not set"

   Fixed:
   --set env.LETTA_ENVIRONMENT=${LETTA_ENVIRONMENT:-prod} \  # Added backslash

2. Missing Datadog vars in staging workflows
   Problem: stage-web.yaml, stage-cloud-api.yaml, stage-core.yaml didn't set
   DD_SITE, DD_API_KEY, DD_LOGS_INJECTION, etc.

   For web/cloud-api:
   - Added to top-level env section so justfile can use them

   For core:
   - Added to top-level env section
   - Added to Deploy step env section (so justfile can pass to helm)
   - core OTEL collector config reads these from environment

   Result: core logs showed "exporters::datadog: api.key is not set"

3. Wrong environment tag in staging (secondary issue)
   Problem: letta-web logs showed "dd.env":"production" in staging
   Cause: Missing backslash broke LETTA_ENVIRONMENT, defaulted to prod
   Fixed: Backslash fix ensures LETTA_ENVIRONMENT=dev is set

Changes:
- justfile: Fixed missing backslash on LETTA_ENVIRONMENT line
- .github/workflows/stage-web.yaml: Added DD_* vars to env
- .github/workflows/stage-cloud-api.yaml: Added DD_* vars to env
- .github/workflows/stage-core.yaml: Added DD_* vars to env and Deploy step

After this fix:
- Web/cloud-api will send traces to Datadog Agent via OTLP
- Core OTEL collector will export traces to both ClickHouse and Datadog
- All staging traces will show env:dev tag (not env:production)

* fix: move OTEL config from prod helm to dev helm values

Problem: OTEL configuration was added to production helm values files
(helm/letta-web/values.yaml and helm/cloud-api/values.yaml) but these
are for production deployments. Staging deployments use the dev helm
values (helm/dev/<service>/values.yaml).

Changes:
- Removed OTEL vars from helm/letta-web/values.yaml (prod)
- Removed OTEL vars from helm/cloud-api/values.yaml (prod)
- Added OTEL vars to helm/dev/letta-web/values.yaml (staging)
- Added OTEL vars to helm/dev/cloud-api/values.yaml (staging)

Dev helm values now include:
  OTEL_ENABLED: "true"
  OTEL_SERVICE_NAME: "letta-web" or "cloud-api"
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://datadog-agent.default.svc.cluster.local:4317"
  LETTA_ENVIRONMENT: "dev"

Note: Production deployments override these via workflow env vars, so
prod helm values don't need OTEL config. Dev/staging deployments use
these helm values as defaults.

* remove generated doc

* secrets in dev

* totally unrelated changes to tf for runner sizing and scaling

* feat: add DD_ENV tags to staging helm for log correlation

Problem: Logs show "dd.env":"production" instead of "dd.env":"dev"
in staging because Datadog's logger injection uses DD_ENV, DD_SERVICE,
and DD_VERSION environment variables for tagging.

Changes:
- Added DD_ENV, DD_SERVICE, DD_VERSION to helm/dev/letta-web/values.yaml
- Added DD_ENV, DD_SERVICE, DD_VERSION to helm/dev/cloud-api/values.yaml

Values:
  DD_ENV: "dev"
  DD_SERVICE: "letta-web" or "cloud-api"
  DD_VERSION: "dev"

This ensures:
- Logs show correct env:dev tag in Datadog
- Traces and logs are properly correlated
- Consistent tagging across OTEL traces and DD logs

* feat: enable OTLP receiver in Datadog Agent configurations

Added OpenTelemetry Protocol (OTLP) receiver to Datadog Agent for both
dev and prod environments to support distributed tracing from services
using OpenTelemetry SDKs.

Changes:
- helm/dev/datadog/datadog-agent.yaml: Added otlp.receiver configuration
- helm/datadog/datadog-agent.yaml: Added otlp.receiver configuration

OTLP Configuration:
  otlp:
    receiver:
      protocols:
        grpc:
          enabled: true
          endpoint: "0.0.0.0:4317"
        http:
          enabled: true
          endpoint: "0.0.0.0:4318"

This enables:
- Web/cloud-api services to send traces via OTLP (port 4317)
- Core OTEL collector to export to Datadog via OTLP (port 4317)
- Alternative HTTP endpoint for OTLP (port 4318)

When applied, the Datadog Agent service will expose:
- Port 4317/TCP - OTLP gRPC (for traces)
- Port 4318/TCP - OTLP HTTP (for traces)
- Port 8126/TCP - Native Datadog APM (existing)
- Port 8125/UDP - DogStatsD (existing)

Apply with:
  kubectl apply -f helm/dev/datadog/datadog-agent.yaml     # staging
  kubectl apply -f helm/datadog/datadog-agent.yaml         # production

* feat: use git hash as DD_VERSION for all services

Changed from static version strings to using git commit hash as the
version tag in Datadog APM for better version tracking and correlation.

Changes:

1. Workflows - Set DD_VERSION to github.sha:
   - .github/workflows/stage-web.yaml: Added DD_VERSION: ${{ github.sha }}
   - .github/workflows/stage-cloud-api.yaml: Added DD_VERSION: ${{ github.sha }}
   - .github/workflows/stage-core.yaml: Added DD_VERSION: ${{ github.sha }}
     (both top-level env and Deploy step env)

2. Justfile - Pass DD_VERSION to helm:
   - deploy-web: Added --set env.DD_VERSION=${DD_VERSION:-unknown}
   - deploy-cloud-api: Added --set env.DD_VERSION=${DD_VERSION:-unknown}
   - deploy-core: Added --set secrets.DD_VERSION=${DD_VERSION:-unknown}

3. Helm dev values - Remove hardcoded version:
   - helm/dev/letta-web/values.yaml: Removed DD_VERSION: "dev"
   - helm/dev/cloud-api/values.yaml: Removed DD_VERSION: "dev"
   - Added comments that DD_VERSION is set via workflow

Result:
- Traces in Datadog will show version as git commit SHA (e.g., "abc123def")
- Can correlate traces with specific deployments/commits
- Consistent with internal versioning strategy (git hash, not semver)
- Defaults to "unknown" if DD_VERSION not set

Example trace tags after deployment:
  env:dev
  service:letta-web
  version:7eafc5b0c12345...

* feat: add DD_VERSION to production workflows

Added DD_VERSION to production deployment workflows for consistent version
tracking across staging and production environments.

Changes:
- .github/workflows/deploy-web.yml: Added DD_VERSION: ${{ github.sha }}
- .github/workflows/deploy-core.yml: Added DD_VERSION: ${{ github.sha }}

Note: deploy-cloud-api.yml doesn't have DD config yet, will add when
cloud-api gets OTEL enabled in production.

Context:
This was partially flagged by bugbot - it noted that NEXT_PUBLIC_GIT_HASH
was missing from prod, but that was incorrect (line 53 already has it).
However, DD_VERSION was indeed missing and needed for Datadog log
correlation.

Result:
- Production logs will show version tag matching git commit SHA
- Consistent with staging configuration
- Better trace/log correlation in Datadog APM

Staging already has DD_VERSION (added in commit fb1a3eea0)

* feat: add DD tags to memgpt-server dev helm for APM correlation

Problem: memgpt-server logs show up in Datadog but traces don't appear
properly in APM UI because DD_ENV, DD_SERVICE, DD_SITE tags were missing.

The service was using native Datadog agent instrumentation (via
LETTA_TELEMETRY_ENABLE_DATADOG) but without proper unified service tagging,
traces weren't being correlated correctly in the APM interface.

Changes:
- helm/dev/memgpt-server/values.yaml:
  - Added DD_ENV: "dev"
  - Added DD_SERVICE: "memgpt-server"
  - Added DD_SITE: "us5.datadoghq.com"
  - Added comment that DD_VERSION comes from workflow

Existing configuration:
- DD_VERSION already passed via stage-core.yaml (line 215) and justfile (line 272)
- DD_API_KEY already in secretsProvider (line 194)
- LETTA_TELEMETRY_ENABLE_DATADOG: "true" (enables native DD agent)
- LETTA_TELEMETRY_DATADOG_AGENT_HOST/PORT (routes to DD cluster agent)

Result:
After redeployment, memgpt-server traces will show in Datadog APM with:
- env:dev
- service:memgpt-server
- version:<git-hash>
- Proper correlation with logs

* refactor: use image tag for DD_VERSION instead of separate env var

Changed from passing DD_VERSION separately to deriving it from the
image.tag that's already set (which contains the git hash).

This is cleaner because:
- Image tag is already set to git hash via TAG env var
- Removes redundant DD_VERSION from workflows (6 locations)
- Single source of truth for version (the deployed image tag)
- Simpler configuration

Changes:

Workflows (removed DD_VERSION):
- .github/workflows/stage-web.yaml
- .github/workflows/stage-cloud-api.yaml
- .github/workflows/stage-core.yaml (2 locations)
- .github/workflows/deploy-web.yml
- .github/workflows/deploy-core.yml

Justfile (use {{TAG}} instead of ${DD_VERSION}):
- deploy-web: --set env.DD_VERSION={{TAG}}
- deploy-cloud-api: --set env.DD_VERSION={{TAG}}
- deploy-core: --set secrets.DD_VERSION={{TAG}}

Helm values (updated comments):
- helm/dev/letta-web/values.yaml
- helm/dev/cloud-api/values.yaml
- helm/dev/memgpt-server/values.yaml
- Changed from "set via workflow" to "set from image.tag by justfile"

Flow:
1. Workflow sets TAG=${{ github.sha }}
2. Workflow calls justfile with TAG env var
3. Justfile sets image.tag={{TAG}} and DD_VERSION={{TAG}}
4. Both use same git hash value

Example:
  image.tag: abc123def
  DD_VERSION: abc123def
  Both from TAG env var set to github.sha
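
Sketch of steps 3 and 4 of the flow, in Python rather than the justfile itself. The helper name `helm_version_flags` and its `version_key` parameter are hypothetical; only `image.tag`, `DD_VERSION`, and the `TAG` variable come from the notes above.

```python
import os


def helm_version_flags(tag: str, version_key: str = "env.DD_VERSION") -> list[str]:
    """Return helm --set flags that pin image.tag and DD_VERSION to one value."""
    if not tag:
        raise ValueError("TAG must be a non-empty git hash")
    return ["--set", f"image.tag={tag}", "--set", f"{version_key}={tag}"]


# TAG is set by the workflow to the commit SHA; the fallback is only for local runs.
tag = os.environ.get("TAG", "abc123def")
print(" ".join(helm_version_flags(tag)))
```

For deploy-core the same helper would be called with `version_key="secrets.DD_VERSION"`, matching the justfile entry listed above.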

* feat: add Datadog native tracer (dd-trace) to cloud-api for APM

Problem: cloud-api traces weren't appearing in Datadog APM despite OTEL
being configured. Investigation revealed letta-web uses dd-trace (Datadog's
native tracer) in addition to OTEL, and those traces show up perfectly.

Analysis:
- letta-web: Uses BOTH OTEL + dd-trace → traces visible in APM ✓
- cloud-api: Uses ONLY OTEL → traces NOT visible in APM ✗

Root cause: While OTEL *should* work, dd-trace provides better integration
with Datadog's APM backend and is proven to work in production.

Solution: Add dd-trace initialization to cloud-api, matching letta-web's
dual-tracing approach (OTEL + dd-trace).

Changes:
- apps/cloud-api/src/instrument-otel.ts:
  - Added dd-trace initialization after OTEL setup
  - Checks for DD_API_KEY env var (already configured in helm)
  - Enables logInjection, runtimeMetrics, and profiling
  - Graceful fallback if dd-trace fails to initialize

Dependencies:
- dd-trace@^5.31.0 already available in root package.json

Configuration (already set in helm):
- DD_API_KEY: From secretsProvider ✓
- DD_ENV: "dev" ✓
- DD_SERVICE: "cloud-api" ✓
- DD_LOGS_INJECTION: From workflow ✓

Expected result:
After deployment, cloud-api traces will appear in Datadog APM alongside
letta-web and letta-server, with proper env:dev service:cloud-api tags.

* tweak vars in staging

* fix: initialize Datadog tracer for memgpt-server APM traces

Problem: memgpt-server (letta-server) shows up in Datadog APM with env:null
instead of env:dev, and traces weren't being properly captured.

Root cause: The code was only initializing the Datadog Profiler (for CPU/memory
profiling), but NOT the Tracer (for distributed tracing/APM).

Analysis:
- Profiler: Records performance metrics (CPU, memory) - WAS initialized ✓
- Tracer: Records distributed traces/spans for APM - NOT initialized ✗

The existing code (lines 248-256) did:
  from ddtrace.profiling import Profiler  # Only profiler!
  profiler = Profiler(...)
  profiler.start()
  # No tracer initialization!

This explains why:
- letta-server appears in Datadog with env:null (profiling data sent without proper tags)
- Traces don't show proper service/env correlation
- APM service map is incomplete

Solution: Initialize the Datadog tracer with ddtrace.patch_all() to:
1. Auto-instrument FastAPI, HTTP clients, database calls, etc.
2. Send proper distributed traces to Datadog APM
3. Use the DD_ENV, DD_SERVICE env vars already set in helm

Changes:
- apps/core/letta/server/rest_api/app.py:
  - Added import ddtrace
  - Added ddtrace.patch_all() to auto-instrument all libraries
  - Added logging for tracer initialization

Configuration (already set in helm):
- DD_ENV: "dev" ✓
- DD_SERVICE: "memgpt-server" ✓
- DD_SITE: "us5.datadoghq.com" ✓
- DD_VERSION: From image.tag ✓
- DD_AGENT_HOST/PORT: Set by code from settings ✓

Expected result:
After redeployment, letta-server will:
- Show as env:dev (not env:null) in Datadog APM
- Send proper distributed traces with full context
- Appear correctly in service maps and trace explorer
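
Sketch of the initialization described above, assuming a guarded import so the server still starts when ddtrace is absent. The function name `init_datadog_tracer` and the `enabled` flag are hypothetical and not the actual code in app.py; `ddtrace.patch_all()` is the call named in this commit and should be checked against the installed ddtrace version.

```python
import logging
import os

logger = logging.getLogger(__name__)


def init_datadog_tracer(enabled: bool) -> bool:
    """Enable ddtrace auto-instrumentation; return True only if it was applied."""
    if not enabled:
        return False
    try:
        import ddtrace  # imported lazily so the app still starts without the package
    except ImportError:
        logger.warning("ddtrace is not installed; Datadog tracing disabled")
        return False
    # patch_all() instruments supported libraries (web framework, HTTP clients, DB drivers).
    # DD_ENV, DD_SERVICE and DD_VERSION are read from the environment by ddtrace itself.
    ddtrace.patch_all()
    logger.info(
        "Datadog tracer initialized (env=%s, service=%s)",
        os.getenv("DD_ENV"),
        os.getenv("DD_SERVICE"),
    )
    return True
```

Instrumentation generally needs to run before the instrumented libraries are imported, so a call like this belongs as early in startup as possible.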

* fix: add dd-trace dependency to cloud-api package.json

Problem: cloud-api Docker image doesn't include dd-trace, causing
"Cannot find module 'dd-trace'" error at runtime.

Root cause: dd-trace is in root package.json but not in cloud-api's
package.json, so it's not included in the Docker build.

Solution: Add dd-trace@^5.31.0 to cloud-api dependencies.

Changes:
- apps/cloud-api/package.json: Added dd-trace dependency

* fix: mark dd-trace as external in cloud-api esbuild config

Problem: esbuild fails when trying to bundle dd-trace because it attempts
to bundle optional GraphQL plugin dependencies that aren't installed.

Error:
  Could not resolve "graphql/language/visitor"
  Could not resolve "graphql/language/printer"
  Could not resolve "graphql/utilities"

Root cause: dd-trace has optional plugins for various frameworks (GraphQL,
MongoDB, etc.) that it loads conditionally at runtime. esbuild tries to
statically analyze and bundle all requires, including these optional deps.

Solution: Add dd-trace to the externals list so it's loaded at runtime
instead of being bundled. This is the standard approach for native modules
and packages with optional dependencies.

Changes:
- apps/cloud-api/esbuild.config.js: Added 'dd-trace' to externals array

Result:
- Build succeeds ✓
- dd-trace loads at runtime with only the plugins it needs ✓
- No GraphQL dependency required ✓

* add dd-trace

* fix: increase cloud-api memory and make dd-trace profiling configurable

Problem: cloud-api pods crash-loop with out-of-memory errors when
dd-trace profiling is enabled:
  FATAL ERROR: JavaScript heap out of memory
  current_heap_limit=268435456 (256 MiB, about 268 MB, of the 512Mi container limit)

Root cause: dd-trace profiling is memory-intensive (50-100MB+ overhead)
and the original 512Mi limit was too tight.

Solution: Two-part fix:
1. Increase memory limits: 512Mi → 1Gi (gives profiling room to breathe)
2. Make profiling configurable via DD_PROFILING_ENABLED env var

Changes:

helm/dev/cloud-api/values.yaml:
- resources.limits.memory: 512Mi → 1Gi
- resources.requests.memory: 512Mi → 1Gi
- Added DD_PROFILING_ENABLED: "true"

apps/cloud-api/src/instrument-otel.ts:
- Read DD_PROFILING_ENABLED env var
- Pass to tracer.init({ profiling: profilingEnabled })
- Log profiling status on initialization

Benefits:
✓ Profiling enabled by default (CPU/heap flame graphs in Datadog)
✓ Can disable via env var if needed (set to "false")
✓ More headroom prevents OOM crashes (1Gi vs 512Mi)
✓ Configurable per environment
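
Python analogue of the flag handling, for illustration only (the real change is in the TypeScript file instrument-otel.ts). The helper name `profiling_enabled` is hypothetical, and it assumes an unset variable means "enabled" and only the literal "false" disables profiling.

```python
import os


def profiling_enabled(env=None) -> bool:
    """Profiling stays on unless DD_PROFILING_ENABLED is explicitly 'false'."""
    env = os.environ if env is None else env
    return env.get("DD_PROFILING_ENABLED", "true").strip().lower() != "false"


print(f"profiling enabled: {profiling_enabled()}")
```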

Memory breakdown with profiling:
- App baseline: ~300-400MB
- dd-trace profiling: ~50-100MB
- Buffer/headroom: ~500MB
- Total: 1Gi (comfortable margin)
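
The arithmetic behind the breakdown, as a quick check. The figures are the estimates listed above, treated as MiB for simplicity; the variable names are illustrative.

```python
MIB = 1024 * 1024

# Heap limit reported in the crash log, converted to MiB.
heap_limit_mib = 268_435_456 // MIB

# Rough budget under the new 1Gi limit.
limit_mib = 1024
app_baseline = (300, 400)        # low, high estimate
profiling_overhead = (50, 100)   # low, high estimate

used = (app_baseline[0] + profiling_overhead[0], app_baseline[1] + profiling_overhead[1])
headroom = (limit_mib - used[1], limit_mib - used[0])

print(f"old heap limit: {heap_limit_mib} MiB")
print(f"estimated usage: {used[0]}-{used[1]} MiB, headroom: {headroom[0]}-{headroom[1]} MiB")
```

Under these estimates the headroom lands at 524-674 MiB, in line with the ~500MB buffer quoted above.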

2025-12-15 12:02:34 -08:00

.github

feat: parallel tool calling in model settings [LET-6239] (#6262 )

2025-11-24 19:10:26 -08:00

.skills/db-migrations-schema-changes

feat: add .skills/db-migrations-schema-changes (#6476 )

2025-12-15 12:02:33 -08:00

alembic

feat: add project id scoping for tools backend changes (#6529 )

2025-12-15 12:02:34 -08:00

assets

merge this (#4759 )

2025-09-17 15:47:40 -07:00

certs

merge this (#4759 )

2025-09-17 15:47:40 -07:00

merge this (#4759 )

2025-09-17 15:47:40 -07:00

examples/notebooks/data

chore: remove old examples (#6255 )

2025-11-24 19:09:33 -08:00

fern

feat: add project id scoping for tools backend changes (#6529 )

2025-12-15 12:02:34 -08:00

letta

feat: add OpenTelemetry distributed tracing to cloud-api and web (#6549 )

2025-12-15 12:02:34 -08:00

otel

feat: Ship traces to datadog and add trace correlation (#6311 )

2025-11-24 19:10:26 -08:00

sandbox

merge this (#4759 )

2025-09-17 15:47:40 -07:00

scripts

merge this (#4759 )

2025-09-17 15:47:40 -07:00

tests

chore: Update deepseek client for v3.2 models (#6556 )

2025-12-15 12:02:34 -08:00

.dockerignore

merge this (#4759 )

2025-09-17 15:47:40 -07:00

.env.example

chore: officially migrate to submodule (#4502 )

2025-09-09 12:45:53 -07:00

.gitattributes

merge this (#4759 )

2025-09-17 15:47:40 -07:00

.gitignore

merge this (#4759 )

2025-09-17 15:47:40 -07:00

.pre-commit-config.yaml

merge this (#4759 )

2025-09-17 15:47:40 -07:00

alembic.ini

merge this (#4759 )

2025-09-17 15:47:40 -07:00

CITATION.cff

merge this (#4759 )

2025-09-17 15:47:40 -07:00

compose.yaml

merge this (#4759 )

2025-09-17 15:47:40 -07:00

CONTRIBUTING.md

merge this (#4759 )

2025-09-17 15:47:40 -07:00

dev-compose.yaml

fix: refactor into common uri parsing logic, fix test, and fix compose file (#5261 )

2025-10-24 15:10:35 -07:00

development.compose.yml

merge this (#4759 )

2025-09-17 15:47:40 -07:00

docker-compose-vllm.yaml

merge this (#4759 )

2025-09-17 15:47:40 -07:00

Dockerfile

fix: Implement architecture-specific OTEL installation logic (#3061 )

2025-11-28 16:17:01 -08:00

init.sql

merge this (#4759 )

2025-09-17 15:47:40 -07:00

LICENSE

merge this (#4759 )

2025-09-17 15:47:40 -07:00

nginx.conf

merge this (#4759 )

2025-09-17 15:47:40 -07:00

package-lock.json

merge this (#4759 )

2025-09-17 15:47:40 -07:00

PRIVACY.md

merge this (#4759 )

2025-09-17 15:47:40 -07:00

project.json

merge this (#4759 )

2025-09-17 15:47:40 -07:00

pyproject.toml

feat: support programmatic tool execution (cloud only) (#6441 )

2025-12-15 12:02:19 -08:00

README.md

Updated readme with actual argument (#3083 )

2025-11-27 00:56:48 -08:00

TERMS.md

merge this (#4759 )

2025-09-17 15:47:40 -07:00

test_watchdog_hang.py

Add lightweight event loop watchdog monitoring (#6209 )

2025-11-24 19:09:33 -08:00

uv.lock

feat: support programmatic tool execution (cloud only) (#6441 )

2025-12-15 12:02:19 -08:00

WEBHOOK_SETUP.md

feat: support webhooks for step completions (#5904 )

2025-11-13 15:36:50 -08:00

Letta (formerly MemGPT)

Letta is the platform for building stateful agents: open AI with advanced memory that can learn and self-improve over time.

Quicklinks:

Developer Documentation: Learn how to create agents that learn using Python / TypeScript
Agent Development Environment (ADE): A no-code UI for building stateful agents
Letta Desktop: A fully-local version of the ADE, available on macOS and Windows
Letta Cloud: The fastest way to try Letta, with agents running in the cloud

Get started

One-Shot ✨ Vibecoding ⚡️ Prompts

Or install the Letta SDK (available for both Python and TypeScript):

Python SDK

pip install letta-client

TypeScript / Node.js SDK

npm install @letta-ai/letta-client

Simple Hello World example

In the example below, we'll create a stateful agent with two memory blocks, one for itself (the persona block), and one for the human. We'll initialize the human memory block with incorrect information, then correct the agent in our first message, which will trigger the agent to update its own memory with a tool call.

To run the examples, you'll need to get a LETTA_API_KEY from Letta Cloud, or run your own self-hosted server (see our guide)

Python

from letta_client import Letta
import os

# Connect to Letta Cloud (get your API key at https://app.letta.com/api-keys)
client = Letta(api_key=os.getenv("LETTA_API_KEY"))
# client = Letta(base_url="http://localhost:8283", embedding="openai/text-embedding-3-small")  # if self-hosting, set base_url and embedding

agent_state = client.agents.create(
    model="openai/gpt-4.1",
    memory_blocks=[
        {
          "label": "human",
          "value": "The human's name is Chad. They like vibe coding."
        },
        {
          "label": "persona",
          "value": "My name is Sam, a helpful assistant."
        }
    ],
    tools=["web_search", "run_code"]
)

print(agent_state.id)
# agent-d9be...0846

response = client.agents.messages.create(
    agent_id=agent_state.id,
    messages=[
        {
            "role": "user",
            "content": "Hey, nice to meet you, my name is Brad."
        }
    ]
)

# the agent will think, then edit its memory using a tool
for message in response.messages:
    print(message)

TypeScript / Node.js

import { LettaClient } from '@letta-ai/letta-client'

// Connect to Letta Cloud (get your API key at https://app.letta.com/api-keys)
const client = new LettaClient({ token: process.env.LETTA_API_KEY });
// const client = new LettaClient({ baseUrl: "http://localhost:8283", embedding: "openai/text-embedding-3-small" });  // if self-hosting

const agentState = await client.agents.create({
    model: "openai/gpt-4.1",
    memoryBlocks: [
        {
          label: "human",
          value: "The human's name is Chad. They like vibe coding."
        },
        {
          label: "persona",
          value: "My name is Sam, a helpful assistant."
        }
    ],
    tools: ["web_search", "run_code"]
});

console.log(agentState.id);
// agent-d9be...0846

const response = await client.agents.messages.create(
    agentState.id, {
        messages: [
            {
                role: "user",
                content: "Hey, nice to meet you, my name is Brad."
            }
        ]
    }
);

// the agent will think, then edit its memory using a tool
for (const message of response.messages) {
    console.log(message);
}

Core concepts in Letta:

Letta is made by the creators of MemGPT, a research paper that introduced the concept of the "LLM Operating System" for memory management. The core concepts in Letta for designing stateful agents follow the MemGPT LLM OS principles:

Memory Hierarchy: Agents have self-editing memory that is split between in-context memory and out-of-context memory
Memory Blocks: The agent's in-context memory is composed of persistent editable memory blocks
Agentic Context Engineering: Agents control the context window by using tools to edit, delete, or search for memory
Perpetual Self-Improving Agents: Every "agent" is a single entity that has a perpetual (infinite) message history

Multi-agent shared memory (full guide)

A single memory block can be attached to multiple agents, allowing for extremely powerful multi-agent shared memory setups. For example, you can create two agents that have their own independent memory blocks in addition to a shared memory block.

Python

# create a shared memory block
shared_block = client.blocks.create(
    label="organization",
    description="Shared information between all agents within the organization.",
    value="Nothing here yet, we should update this over time."
)

# create a supervisor agent
supervisor_agent = client.agents.create(
    model="anthropic/claude-3-5-sonnet-20241022",
    # blocks created for this agent
    memory_blocks=[{"label": "persona", "value": "I am a supervisor"}],
    # pre-existing shared block that is "attached" to this agent
    block_ids=[shared_block.id],
)

# create a worker agent
worker_agent = client.agents.create(
    model="openai/gpt-4.1-mini",
    # blocks created for this agent
    memory_blocks=[{"label": "persona", "value": "I am a worker"}],
    # pre-existing shared block that is "attached" to this agent
    block_ids=[shared_block.id],
)

TypeScript / Node.js

// create a shared memory block
const sharedBlock = await client.blocks.create({
    label: "organization",
    description: "Shared information between all agents within the organization.",
    value: "Nothing here yet, we should update this over time."
});

// create a supervisor agent
const supervisorAgent = await client.agents.create({
    model: "anthropic/claude-3-5-sonnet-20241022",
    // blocks created for this agent
    memoryBlocks: [{ label: "persona", value: "I am a supervisor" }],
    // pre-existing shared block that is "attached" to this agent
    blockIds: [sharedBlock.id]
});

// create a worker agent
const workerAgent = await client.agents.create({
    model: "openai/gpt-4.1-mini",
    // blocks created for this agent
    memoryBlocks: [{ label: "persona", value: "I am a worker" }],
    // pre-existing shared block that is "attached" to this agent
    blockIds: [sharedBlock.id]
});

Sleep-time agents (full guide)

In Letta, you can create special sleep-time agents that share the memory of your primary agents, but run in the background (like an agent's "subconscious"). You can think of sleep-time agents as a special form of multi-agent architecture.

To enable sleep-time agents for your agent, set the enable_sleeptime flag to true when creating your agent. This will automatically create a sleep-time agent in addition to your main agent which will handle the memory editing, instead of your primary agent.

Python

agent_state = client.agents.create(
    ...
    enable_sleeptime=True,  # <- enable this flag to create a sleep-time agent
)

TypeScript / Node.js

const agentState = await client.agents.create({
    ...
    enableSleeptime: true  // <- enable this flag to create a sleep-time agent
});

Saving and sharing agents with Agent File (`.af`) (full guide)

In Letta, all agent data is persisted to disk (Postgres or SQLite), and can be easily imported and exported using the open source Agent File (.af) file format. You can use Agent File to checkpoint your agents, as well as move your agents (and their complete state/memories) between different Letta servers, e.g. between self-hosted Letta and Letta Cloud.

Python

# Import your .af file from any location
agent_state = client.agents.import_agent_serialized(file=open("/path/to/agent/file.af", "rb"))

print(f"Imported agent: {agent_state.id}")

# Export your agent into a serialized schema object (which you can write to a file)
schema = client.agents.export_agent_serialized(agent_id="<AGENT_ID>")

TypeScript / Node.js

import { readFileSync } from 'fs';
import { Blob } from 'buffer';

// Import your .af file from any location
const file = new Blob([readFileSync('/path/to/agent/file.af')])
const agentState = await client.agents.importAgentSerialized(file, {})

console.log(`Imported agent: ${agentState.id}`);

// Export your agent into a serialized schema object (which you can write to a file)
const schema = await client.agents.exportAgentSerialized("<AGENT_ID>");

Model Context Protocol (MCP) and custom tools (full guide)

Letta has rich support for MCP tools (Letta acts as an MCP client), as well as custom Python tools. MCP servers can be easily added within the Agent Development Environment (ADE) tool manager UI, as well as via the SDK:

Python

# List tools from an MCP server
tools = client.tools.list_mcp_tools_by_server(mcp_server_name="weather-server")

# Add a specific tool from the MCP server
tool = client.tools.add_mcp_tool(
    mcp_server_name="weather-server",
    mcp_tool_name="get_weather"
)

# Create agent with MCP tool attached
agent_state = client.agents.create(
    model="openai/gpt-4o-mini",
    tool_ids=[tool.id]
)

# Or attach the tool to an existing agent
client.agents.tools.attach(
    agent_id=agent_state.id,
    tool_id=tool.id
)

# Use the agent with MCP tools
response = client.agents.messages.create(
    agent_id=agent_state.id,
    messages=[
        {
            "role": "user",
            "content": "Use the weather tool to check the forecast"
        }
    ]
)

TypeScript / Node.js

// List tools from an MCP server
const tools = await client.tools.listMcpToolsByServer("weather-server");

// Add a specific tool from the MCP server
const tool = await client.tools.addMcpTool("weather-server", "get_weather");

// Create agent with MCP tool
const agentState = await client.agents.create({
    model: "openai/gpt-4o-mini",
    toolIds: [tool.id]
});

// Use the agent with MCP tools
const response = await client.agents.messages.create(agentState.id, {
    messages: [
        {
            role: "user",
            content: "Use the weather tool to check the forecast"
        }
    ]
});

Filesystem (full guide)

Letta’s filesystem allows you to easily connect your agents to external files, for example: research papers, reports, medical records, or any other data in common text formats (.pdf, .txt, .md, .json, etc). Once you attach a folder to an agent, the agent will be able to use filesystem tools (open_file, grep_file, search_file) to browse the files and search for information.

Python

import time  # used by the polling loop below

# create the folder (embeddings managed automatically by Letta Cloud)
folder = client.folders.create(
    name="my_folder"
)

# upload a file into the folder
job = client.folders.files.upload(
    folder_id=folder.id,
    file=open("my_file.txt", "rb")
)

# wait until the job is completed
while True:
    job = client.jobs.retrieve(job.id)
    if job.status == "completed":
        break
    elif job.status == "failed":
        raise ValueError(f"Job failed: {job.metadata}")
    print(f"Job status: {job.status}")
    time.sleep(1)

# once you attach a folder to an agent, the agent can see all files in it
client.agents.folders.attach(agent_id=agent_state.id, folder_id=folder.id)

response = client.agents.messages.create(
    agent_id=agent_state.id,
    messages=[
        {
            "role": "user",
            "content": "What data is inside of my_file.txt?"
        }
    ]
)

for message in response.messages:
    print(message)
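
The polling loop above never exits if the job stays in a non-terminal state. Here is a bounded variant, assuming the same "completed" and "failed" status strings; `wait_for_job` and its parameters are hypothetical helpers, not part of the Letta SDK.

```python
import time
from typing import Callable


def wait_for_job(
    fetch_status: Callable[[], str],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status until it returns 'completed'; raise on 'failed' or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status == "completed":
            return status
        if status == "failed":
            raise RuntimeError("job failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout_s}s")
        sleep(interval_s)
```

With the client from the example above, this could be called as `wait_for_job(lambda: client.jobs.retrieve(job.id).status)`.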

TypeScript / Node.js

import { createReadStream } from 'fs';

// create the folder (embeddings managed automatically by Letta Cloud)
const folder = await client.folders.create({
    name: "my_folder"
});

// upload a file into the folder
const uploadJob = await client.folders.files.upload(
    createReadStream("my_file.txt"),
    folder.id,
);
console.log("file uploaded")

// wait until the job is completed
while (true) {
    const job = await client.jobs.retrieve(uploadJob.id);
    if (job.status === "completed") {
        break;
    } else if (job.status === "failed") {
        throw new Error(`Job failed: ${job.metadata}`);
    }
    console.log(`Job status: ${job.status}`);
    await new Promise((resolve) => setTimeout(resolve, 1000));
}

// list files in the folder
const files = await client.folders.files.list(folder.id);
console.log(`Files in folder: ${files}`);

// list passages in the folder
const passages = await client.folders.passages.list(folder.id);
console.log(`Passages in folder: ${passages}`);

// once you attach a folder to an agent, the agent can see all files in it
await client.agents.folders.attach(agentState.id, folder.id);

const response = await client.agents.messages.create(
    agentState.id, {
        messages: [
            {
                role: "user",
                content: "What data is inside of my_file.txt?"
            }
        ]
    }
);

for (const message of response.messages) {
    console.log(message);
}

Long-running agents (full guide)

When agents need to execute multiple tool calls or perform complex operations (like deep research, data analysis, or multi-step workflows), processing time can vary significantly. Letta supports both a background mode (with resumable streaming) as well as an async mode (with polling) to enable robust long-running agent executions.

Python

stream = client.agents.messages.create_stream(
    agent_id=agent_state.id,
    messages=[
      {
        "role": "user",
        "content": "Run comprehensive analysis on this dataset"
      }
    ],
    stream_tokens=True,
    background=True,
)
run_id = None
last_seq_id = None
for chunk in stream:
    if hasattr(chunk, "run_id") and hasattr(chunk, "seq_id"):
        run_id = chunk.run_id       # Save this to reconnect if your connection drops
        last_seq_id = chunk.seq_id  # Save this as your resumption point for cursor-based pagination
    print(chunk)

# If disconnected, resume from last received seq_id:
for chunk in client.runs.stream(run_id, starting_after=last_seq_id):
    print(chunk)

TypeScript / Node.js

const stream = await client.agents.messages.createStream({
    agentId: agentState.id,
    requestBody: {
        messages: [
            {
                role: "user",
                content: "Run comprehensive analysis on this dataset"
            }
        ],
        streamTokens: true,
        background: true,
    }
});

let runId = null;
let lastSeqId = null;
for await (const chunk of stream) {
    if (chunk.run_id && chunk.seq_id) {
        runId = chunk.run_id;      // Save this to reconnect if your connection drops
        lastSeqId = chunk.seq_id; // Save this as your resumption point for cursor-based pagination
    }
    console.log(chunk);
}

// If disconnected, resume from last received seq_id
for await (const chunk of client.runs.stream(runId, {startingAfter: lastSeqId})) {
    console.log(chunk);
}

Using local models

Letta is model agnostic and supports using local model providers such as Ollama and LM Studio. You can also easily swap models inside an agent after the agent has been created, by modifying the agent state with the new model provider via the SDK or in the ADE.

Development (only needed if you need to modify the server code)

Note: this repository contains the source code for the core Letta service (API server), not the client SDKs. The client SDKs can be found here: Python, TypeScript.

To install the Letta server from source, fork the repo, clone your fork, then use uv to install from inside the main directory:

cd letta
uv sync --all-extras

To run the Letta server from source, use uv run:

uv run letta server

Contributing

Letta is an open source project built by over a hundred contributors. There are many ways to get involved in the Letta OSS project!

Join the Discord: Chat with the Letta devs and other AI developers.
Chat on our forum: If you're not into Discord, check out our developer forum.
Follow our socials: Twitter/X, LinkedIn, YouTube

Legal notices: By using Letta and related Letta services (such as the Letta endpoint or hosted service), you are agreeing to our privacy policy and terms of service.
