Commit Graph

147 Commits

Author SHA1 Message Date
Kevin Lin
a11ba9710c feat(core): increase Gemini timeout to 10 minutes (#9714) 2026-03-03 18:34:02 -08:00
github-actions[bot]
bf80de214d feat: change default context window from 32000 to 128000 (#9673)
* feat: change default context window from 32000 to 128000

Update DEFAULT_CONTEXT_WINDOW and global_max_context_window_limit from
32000 to 128000. Also update all .af (agent files), cypress test
fixtures, and integration tests to use the new default.

Closes #9672

Co-authored-by: Sarah Wooders <sarahwooders@users.noreply.github.com>

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix(core): update conversation manager tests for auto-created system message

create_conversation now auto-creates a system message at position 0
(from #9508), but the test assertions weren't updated. Adjust expected
message counts and ordering to account for the initial system message.

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix(core): fix mock Anthropic models.list() to return async iterable, not coroutine

The real Anthropic SDK's models.list() returns an AsyncPage (with __aiter__)
directly, but the mock used `async def list()` which returns a coroutine.
The code does `async for model in client.models.list()` which needs an
async iterable, not a coroutine. Fix by making list() a regular method.
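The coroutine-vs-async-iterable distinction behind this fix can be reproduced in a few lines. The class and model names below are illustrative stand-ins, not the real SDK or the repository's test code:

```python
import asyncio

class FakeAsyncPage:
    """Stand-in for the SDK's AsyncPage: an async iterable, not a coroutine."""
    def __init__(self, models):
        self._models = models

    def __aiter__(self):
        return self._iterate()

    async def _iterate(self):
        for model in self._models:
            yield model

class FakeModels:
    # Correct: a *regular* method returning an async iterable, mirroring the
    # real SDK. Declaring this `async def list()` would return a coroutine,
    # and `async for model in client.models.list()` would then fail with
    # "'coroutine' object is not async iterable".
    def list(self):
        return FakeAsyncPage(["model-a", "model-b"])

async def main():
    return [model async for model in FakeModels().list()]

print(asyncio.run(main()))  # ['model-a', 'model-b']
```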

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

---------

Co-authored-by: letta-code <248085862+letta-code@users.noreply.github.com>
Co-authored-by: Letta <noreply@letta.com>
Co-authored-by: Sarah Wooders <sarahwooders@gmail.com>
2026-03-03 18:34:01 -08:00
cthomas
3cdd64dc24 chore: update keepalive interval 50->20 (#9538)
* chore: update keepalive interval 50->20

* update comment
2026-02-24 10:55:11 -08:00
Devansh Jain
39ddda81cc feat: add Anthropic Sonnet 4.6 (#9408) 2026-02-24 10:55:11 -08:00
cthomas
126d8830b8 feat: set memfs env vars in deploy wf (#9318) 2026-02-24 10:52:06 -08:00
cthomas
0bdd555f33 feat: add memfs-py service (#9315)
* feat: add memfs-py service

* add tf for bucket access and secrets v2 access

* feat(memfs): add helm charts, deploy workflow, and bug fixes

- Add dev helm chart (helm/dev/memfs-py/) with CSI secrets pattern
- Update prod helm chart with CSI secrets and correct service account
- Add GitHub Actions deploy workflow
- Change port from 8284 to 8285 to avoid conflict with core's dulwich sidecar
- Fix chunked transfer encoding issue (strip HTTP_TRANSFER_ENCODING header)
- Fix timestamp parsing to handle both ISO and HTTP date formats
- Fix get_head_sha to raise FileNotFoundError on 404

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

---------

Co-authored-by: Kian Jones <kian@letta.com>
Co-authored-by: Letta <noreply@letta.com>
2026-02-24 10:52:06 -08:00
Devansh Jain
644f7b9d5d chore: Add Opus 4.6 with 1M context window [OPUS-46] (#9301)
opus 4.6 1M version
2026-02-24 10:52:06 -08:00
Sarah Wooders
50a60c1393 feat: git smart HTTP for agent memory repos (#9257)
* feat(core): add git-backed memory repos and block manager

Introduce a GCS-backed git repository per agent as the source of truth for core
memory blocks. Add a GitEnabledBlockManager that writes block updates to git and
syncs values back into Postgres as a cache.

Default newly-created memory repos to the `main` branch.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* feat(core): serve memory repos over git smart HTTP

Run dulwich's WSGI HTTPGitApplication on a local sidecar port and proxy
/v1/git/* through FastAPI to support git clone/fetch/push directly against
GCS-backed memory repos.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix(core): create memory repos on demand and stabilize git HTTP

- Ensure MemoryRepoManager creates the git repo on first write (instead of 500ing)
  and avoids rewriting history by only auto-creating on FileNotFoundError.
- Simplify dulwich-thread async execution and auto-create empty repos on first
  git clone.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix(core): make dulwich optional for CI installs

Guard dulwich imports in the git smart HTTP router so the core server can boot
(and CI tests can run) without installing the memory-repo extra.
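A common shape for such a guard is below; the flag name and error message are illustrative, not the router's actual code:

```python
# Sketch of the optional-dependency guard described above.
try:
    from dulwich.web import HTTPGitApplication  # provided by the optional extra
    GIT_HTTP_AVAILABLE = True
except ImportError:
    HTTPGitApplication = None
    GIT_HTTP_AVAILABLE = False

def get_git_http_app():
    """Fail lazily, at first use, instead of at import time."""
    if not GIT_HTTP_AVAILABLE:
        raise RuntimeError(
            "git smart HTTP requires dulwich; install the optional extra to enable it"
        )
    return HTTPGitApplication  # instantiated with a backend in the real router

print(GIT_HTTP_AVAILABLE in (True, False))  # True
```

This way the server boots whether or not the extra is installed, and only requests that actually need git HTTP hit the error.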

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix(core): guard git HTTP WSGI init when dulwich missing

Avoid instantiating dulwich's HTTPGitApplication at import time when dulwich
isn't installed (common in CI installs).

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix(core): avoid masking send_message errors in finally

Initialize `result` before the agent loop so error paths (e.g. approval
validation) don't raise UnboundLocalError in the run-tracking finally block.
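The pattern is easy to demonstrate in isolation; `run_agent_step` and its statuses are illustrative, not the actual agent loop:

```python
def run_agent_step(fail: bool = False) -> str:
    # Initialize before the loop so error paths never leave `result` unbound.
    # Without this line, an exception raised before the first assignment would
    # make the `finally` block itself raise UnboundLocalError, masking the
    # original error.
    result = None
    try:
        if fail:
            raise ValueError("approval validation failed")
        result = "ok"
    finally:
        # run-tracking cleanup can now safely inspect `result` on every path
        status = "error" if result is None else result
    return status

print(run_agent_step())  # ok
```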

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix(core): stop event loop watchdog on FastAPI shutdown

Ensure the EventLoopWatchdog thread is stopped during FastAPI lifespan
shutdown to avoid daemon threads logging during interpreter teardown (seen in CI
unit tests).

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore(core): remove send_*_message_to_agent from SyncServer

Drop send_message_to_agent and send_group_message_to_agent from SyncServer and
route internal fire-and-forget messaging through send_messages helpers instead.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix(core): backfill git memory repo when tag added

When an agent is updated to include the git-memory-enabled tag, ensure the
git-backed memory repo is created and initialized from the agent's current
blocks. Also support configuring the memory repo object store via
LETTA_OBJECT_STORE_URI.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix(core): preserve block tags on git-enabled updates

When updating a block for a git-memory-enabled agent, keep block tags in sync
with PostgreSQL (tags are not currently stored in the git repo).

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore(core): remove git-state legacy shims

- Rename optional dependency extra from memory-repo to git-state
- Drop legacy object-store env aliases and unused region config
- Simplify memory repo metadata to a single canonical format
- Remove unused repo-cache invalidation helper

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix(core): keep PR scope for git-backed blocks

- Revert unrelated change in fire-and-forget multi-agent send helper
- Route agent block updates-by-label through injected block manager only when needed

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

---------

Co-authored-by: Letta <noreply@letta.com>
2026-02-24 10:52:06 -08:00
Kian Jones
74dc6225ba feat: add support for YAML config file (#8999)
* feat: add simplified YAML config file support

Simple hierarchical YAML config that maps to environment variables:

```yaml
letta:
    telemetry:
        enable_datadog: true
    pg_host: localhost
```

Maps to:
- LETTA_TELEMETRY_ENABLE_DATADOG=true
- LETTA_PG_HOST=localhost

Config loaded at settings startup. Environment variables take precedence.
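A minimal sketch of that mapping (the real loader lives in letta/settings.py; this flattener, its function names, and the boolean handling are assumptions for illustration):

```python
import os

def flatten_config(node: dict, prefix: str = "") -> dict:
    """Flatten nested config dicts into ENV_VAR-style keys."""
    out = {}
    for key, value in node.items():
        name = f"{prefix}_{key}".upper() if prefix else key.upper()
        if isinstance(value, dict):
            out.update(flatten_config(value, name))
        else:
            out[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return out

def apply_config(config: dict) -> dict:
    env = flatten_config(config)
    # real environment variables take precedence over file values
    return {k: os.environ.get(k, v) for k, v in env.items()}

config = {"letta": {"telemetry": {"enable_datadog": True}, "pg_host": "localhost"}}
print(apply_config(config))
```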

Closes #8997

🤖 Generated with [Letta Code](https://letta.com)

Co-authored-by: Kian Jones <kianjones9@users.noreply.github.com>

* Rename config files from 'config.yaml' to 'conf.yaml'

* add conf.yaml

* modifications (#9199)

* feat: add simplified YAML config file support

Simple hierarchical YAML config that maps to environment variables:

```yaml
letta:
    telemetry:
        enable_datadog: true
    pg_host: localhost
```

Maps to:
- LETTA_TELEMETRY_ENABLE_DATADOG=true
- LETTA_PG_HOST=localhost

Config loaded at settings startup. Environment variables take precedence.

Closes #8997

🤖 Generated with [Letta Code](https://letta.com)

Co-authored-by: Kian Jones <kianjones9@users.noreply.github.com>

* Rename config files from 'config.yaml' to 'conf.yaml'

* add conf.yaml

* fixes?

---------

Co-authored-by: letta-code <248085862+letta-code@users.noreply.github.com>
Co-authored-by: Kian Jones <kianjones9@users.noreply.github.com>

---------

Co-authored-by: letta-code <248085862+letta-code@users.noreply.github.com>
Co-authored-by: Kian Jones <kianjones9@users.noreply.github.com>
2026-02-24 10:52:06 -08:00
Sarah Wooders
4096b30cd7 feat: log LLM traces to clickhouse (#9111)
* feat: add non-streaming option for conversation messages

- Add ConversationMessageRequest with stream=True default (backwards compatible)
- stream=true (default): SSE streaming via StreamingService
- stream=false: JSON response via AgentLoop.load().step()

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore: regenerate API schema for ConversationMessageRequest

* feat: add direct ClickHouse storage for raw LLM traces

    Adds ability to store raw LLM request/response payloads directly in ClickHouse,
    bypassing OTEL span attribute size limits. This enables debugging and analytics
    on large LLM payloads (>10MB system prompts, large tool schemas, etc.).

    New files:
    - letta/schemas/llm_raw_trace.py: Pydantic schema with ClickHouse row helper
    - letta/services/llm_raw_trace_writer.py: Async batching writer (fire-and-forget)
    - letta/services/llm_raw_trace_reader.py: Reader with query methods
    - scripts/sql/clickhouse/llm_raw_traces.ddl: Production table DDL
    - scripts/sql/clickhouse/llm_raw_traces_local.ddl: Local dev DDL
    - apps/core/clickhouse-init.sql: Local dev initialization

    Modified:
    - letta/settings.py: Added 4 settings (store_llm_raw_traces, ttl, batch_size, flush_interval)
    - letta/llm_api/llm_client_base.py: Integration into request_async_with_telemetry
    - compose.yaml: Added ClickHouse service for local dev
    - justfile: Added clickhouse, clickhouse-cli, clickhouse-traces commands

    Feature disabled by default (LETTA_STORE_LLM_RAW_TRACES=false).
    Uses ZSTD(3) compression for 10-30x reduction on JSON payloads.

    🤖 Generated with [Letta Code](https://letta.com)

    Co-Authored-By: Letta <noreply@letta.com>

* fix: address code review feedback for LLM raw traces

Fixes based on code review feedback:

1. Fix ClickHouse endpoint parsing - default to secure=False for raw host:port
   inputs (was defaulting to HTTPS which breaks local dev)

2. Make raw trace writes truly fire-and-forget - use asyncio.create_task()
   instead of awaiting, so JSON serialization doesn't block request path

3. Add bounded queue (maxsize=10000) - prevents unbounded memory growth
   under load. Drops traces with warning if queue is full.

4. Fix deprecated asyncio usage - get_running_loop() instead of get_event_loop()

5. Add org_id fallback - use _telemetry_org_id if actor doesn't have it

6. Remove unused imports - json import in reader
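Fixes 2 and 3 above combine into a small pattern worth showing; the class and field names here are illustrative, not the writer's actual code:

```python
import asyncio

class TraceQueue:
    """Sketch of the bounded fire-and-forget write path."""
    def __init__(self, maxsize: int = 10_000):
        self._queue: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self.dropped = 0

    def write(self, trace: dict) -> None:
        # put_nowait keeps the request path non-blocking; a full queue drops
        # the trace (counted here, logged as a warning in the real writer).
        # A background task created with asyncio.create_task() drains the
        # queue, so serialization never blocks the caller.
        try:
            self._queue.put_nowait(trace)
        except asyncio.QueueFull:
            self.dropped += 1

async def main():
    q = TraceQueue(maxsize=2)
    for i in range(3):
        q.write({"id": i})
    return q._queue.qsize(), q.dropped

print(asyncio.run(main()))  # (2, 1)
```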

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: add missing asyncio import and simplify JSON serialization

- Add missing 'import asyncio' that was causing 'name asyncio is not defined' error
- Remove unnecessary clean_double_escapes() function - the JSON is stored correctly,
  the clickhouse-client CLI was just adding extra escaping when displaying
- Update just clickhouse-trace to use Python client for correct JSON output

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* test: add clickhouse raw trace integration test

* test: simplify clickhouse trace assertions

* refactor: centralize usage parsing and stream error traces

Use per-client usage helpers for raw trace extraction and ensure streaming errors log requests with error metadata.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* test: exercise provider usage parsing live

Make live OpenAI/Anthropic/Gemini requests with credential gating and validate Anthropic cache usage mapping when present.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* test: fix usage parsing tests to pass

- Use GoogleAIClient with GEMINI_API_KEY instead of GoogleVertexClient
- Update model to gemini-2.0-flash (1.5-flash deprecated in v1beta)
- Add tools=[] for Gemini/Anthropic build_request_data

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* refactor: extract_usage_statistics returns LettaUsageStatistics

Standardize on LettaUsageStatistics as the canonical usage format returned by client helpers. Inline UsageStatistics construction for ChatCompletionResponse where needed.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* feat: add is_byok and llm_config_json columns to ClickHouse traces

Extend llm_raw_traces table with:
- is_byok (UInt8): Track BYOK vs base provider usage for billing analytics
- llm_config_json (String, ZSTD): Store full LLM config for debugging and analysis

This enables queries like:
- BYOK usage breakdown by provider/model
- Config parameter analysis (temperature, max_tokens, etc.)
- Debugging specific request configurations

* feat: add tests for error traces, llm_config_json, and cache tokens

- Update llm_raw_trace_reader.py to query new columns (is_byok,
  cached_input_tokens, cache_write_tokens, reasoning_tokens, llm_config_json)
- Add test_error_trace_stored_in_clickhouse to verify error fields
- Add test_cache_tokens_stored_for_anthropic to verify cache token storage
- Update existing tests to verify llm_config_json is stored correctly
- Make llm_config required in log_provider_trace_async()
- Simplify provider extraction to use provider_name directly

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* ci: add ClickHouse integration tests to CI pipeline

- Add use-clickhouse option to reusable-test-workflow.yml
- Add ClickHouse service container with otel database
- Add schema initialization step using clickhouse-init.sql
- Add ClickHouse env vars (CLICKHOUSE_ENDPOINT, etc.)
- Add separate clickhouse-integration-tests job running
  integration_test_clickhouse_llm_raw_traces.py

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* refactor: simplify provider and org_id extraction in raw trace writer

- Use model_endpoint_type.value for provider (not provider_name)
- Simplify org_id to just self.actor.organization_id (actor is always pydantic)

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* refactor: simplify LLMRawTraceWriter with _enabled flag

- Check ClickHouse env vars once at init, set _enabled flag
- Early return in write_async/flush_async if not enabled
- Remove ValueError raises (never used)
- Simplify _get_client (no validation needed since already checked)

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: add LLMRawTraceWriter shutdown to FastAPI lifespan

Properly flush pending traces on graceful shutdown via lifespan
instead of relying only on atexit handler.

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* feat: add agent_tags column to ClickHouse traces

Store agent tags as Array(String) for filtering/analytics by tag.

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* cleanup

* fix(ci): fix ClickHouse schema initialization in CI

- Create database separately before loading SQL file
- Remove CREATE DATABASE from SQL file (handled in CI step)
- Add verification step to confirm table was created
- Use -sf flag for curl to fail on HTTP errors

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* refactor: simplify LLM trace writer with ClickHouse async_insert

- Use ClickHouse async_insert for server-side batching instead of manual queue/flush loop
- Sync cloud DDL schema with clickhouse-init.sql (add missing columns)
- Remove redundant llm_raw_traces_local.ddl
- Remove unused batch_size/flush_interval settings
- Update tests for simplified writer

Key changes:
- async_insert=1, wait_for_async_insert=1 for reliable server-side batching
- Simple per-trace retry with exponential backoff (max 3 retries)
- ~150 lines removed from writer
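The per-trace retry described above can be sketched as follows; `insert_with_retry` and the delay values are illustrative assumptions, not the writer's actual implementation:

```python
import time

def insert_with_retry(insert_fn, row, max_retries: int = 3, base_delay: float = 0.01):
    """Per-trace retry with exponential backoff (sketch)."""
    for attempt in range(max_retries + 1):
        try:
            return insert_fn(row)
        except ConnectionError:
            if attempt == max_retries:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, 0.04s

attempts = []
def flaky_insert(row):
    # fails twice, then succeeds -- simulating a transient ClickHouse error
    attempts.append(row)
    if len(attempts) < 3:
        raise ConnectionError("transient ClickHouse error")
    return "inserted"

print(insert_with_retry(flaky_insert, {"trace_id": 1}))  # inserted
```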

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* refactor: consolidate ClickHouse direct writes into TelemetryManager backend

- Add clickhouse_direct backend to provider_trace_backends
- Remove duplicate ClickHouse write logic from llm_client_base.py
- Configure via LETTA_TELEMETRY_PROVIDER_TRACE_BACKEND=postgres,clickhouse_direct

The clickhouse_direct backend:
- Converts ProviderTrace to LLMRawTrace
- Extracts usage stats from response JSON
- Writes via LLMRawTraceWriter with async_insert

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* refactor: address PR review comments and fix llm_config bug

Review comment fixes:
- Rename clickhouse_direct -> clickhouse_analytics (clearer purpose)
- Remove ClickHouse from OSS compose.yaml, create separate compose.clickhouse.yaml
- Delete redundant scripts/test_llm_raw_traces.py (use pytest tests)
- Remove unused llm_raw_traces_ttl_days setting (TTL handled in DDL)
- Fix socket description leak in telemetry_manager docstring
- Add cloud-only comment to clickhouse-init.sql
- Update justfile to use separate compose file

Bug fix:
- Fix llm_config not being passed to ProviderTrace in telemetry
- Now correctly populates provider, model, is_byok for all LLM calls
- Affects both request_async_with_telemetry and log_provider_trace_async

DDL optimizations:
- Add secondary indexes (bloom_filter for agent_id, model, step_id)
- Add minmax indexes for is_byok, is_error
- Change model and error_type to LowCardinality for faster GROUP BY

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* refactor: rename llm_raw_traces -> llm_traces

Address review feedback that "raw" is misleading since we denormalize fields.

Renames:
- Table: llm_raw_traces -> llm_traces
- Schema: LLMRawTrace -> LLMTrace
- Files: llm_raw_trace_{reader,writer}.py -> llm_trace_{reader,writer}.py
- Setting: store_llm_raw_traces -> store_llm_traces

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: update workflow references to llm_traces

Missed renaming table name in CI workflow files.

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: update clickhouse_direct -> clickhouse_analytics in docstring

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore: remove inaccurate OTEL size limit comments

The 4MB limit is our own truncation logic, not an OTEL protocol limit.
The real benefit is denormalized columns for analytics queries.

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore: remove local ClickHouse dev setup (cloud-only feature)

- Delete clickhouse-init.sql and compose.clickhouse.yaml
- Remove local clickhouse just commands
- Update CI to use cloud DDL with MergeTree for testing

clickhouse_analytics is a cloud-only feature. For local dev, use postgres backend.

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: restore compose.yaml to match main

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* refactor: merge clickhouse_analytics into clickhouse backend

Per review feedback - having two separate backends was confusing.

Now the clickhouse backend:
- Writes to llm_traces table (denormalized for cost analytics)
- Reads from OTEL traces table (will cut over to llm_traces later)

Config: LETTA_TELEMETRY_PROVIDER_TRACE_BACKEND=postgres,clickhouse

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: correct path to DDL file in CI workflow

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore: add provider index to DDL for faster filtering

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: configure telemetry backend in clickhouse tests

Tests need to set telemetry_settings.provider_trace_backends to include
'clickhouse', otherwise traces are routed to default postgres backend.

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: set provider_trace_backend field, not property

provider_trace_backends is a computed property, need to set the
underlying provider_trace_backend string field instead.
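The field-versus-computed-property distinction looks roughly like this; the class below is an illustrative shape, not the actual settings object:

```python
class TelemetrySettings:
    """Illustrative: a comma-separated string field exposed through a
    read-only computed property."""
    def __init__(self, provider_trace_backend: str = "postgres"):
        self.provider_trace_backend = provider_trace_backend  # the settable field

    @property
    def provider_trace_backends(self) -> list:
        # computed from the string field; assigning to this property raises
        # AttributeError, which is why tests must set the field above instead
        return [b.strip() for b in self.provider_trace_backend.split(",") if b.strip()]

settings = TelemetrySettings()
settings.provider_trace_backend = "postgres,clickhouse"
print(settings.provider_trace_backends)  # ['postgres', 'clickhouse']
```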

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: error trace test and error_type extraction

- Add TelemetryManager to error trace test so traces get written
- Fix error_type extraction to check top-level before nested error dict

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: use provider_trace.id for trace correlation across backends

- Pass provider_trace.id to LLMTrace instead of auto-generating
- Log warning if ID is missing (shouldn't happen, helps debug)
- Fallback to new UUID only if not set

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: trace ID correlation and concurrency issues

- Strip "provider_trace-" prefix from ID for UUID storage in ClickHouse
- Add asyncio.Lock to serialize writes (clickhouse_connect not thread-safe)
- Fix Anthropic prompt_tokens to include cached tokens for cost analytics
- Log warning if provider_trace.id is missing
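The prefix stripping and lock-serialized writes can be sketched together; the names and the in-memory "insert" are illustrative, not the actual ClickHouse writer:

```python
import asyncio

PREFIX = "provider_trace-"

def to_clickhouse_uuid(trace_id: str) -> str:
    """Strip the application-level prefix so the remainder fits a UUID column."""
    return trace_id.removeprefix(PREFIX)

class SerializedWriter:
    """Writes guarded by an asyncio.Lock, since the underlying client
    is not safe for concurrent use (sketch)."""
    def __init__(self):
        self._lock = asyncio.Lock()
        self.rows = []

    async def insert(self, row: dict) -> None:
        async with self._lock:  # one insert in flight at a time
            self.rows.append(row)

async def main():
    writer = SerializedWriter()
    ids = [f"provider_trace-{i:04d}" for i in range(3)]
    await asyncio.gather(*(writer.insert({"id": to_clickhouse_uuid(i)}) for i in ids))
    return sorted(r["id"] for r in writer.rows)

print(asyncio.run(main()))  # ['0000', '0001', '0002']
```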

🤖 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

---------

Co-authored-by: Letta <noreply@letta.com>
Co-authored-by: Caren Thomas <carenthomas@gmail.com>
2026-02-24 10:52:06 -08:00
Kian Jones
c1a02fa180 feat: add metadata-only provider trace storage option (#9155)
* feat: add metadata-only provider trace storage option

Add support for writing provider traces to a lightweight metadata-only
table (~1.5GB) instead of the full table (~725GB) since request/response
JSON is now stored in GCS.

- Add `LETTA_TELEMETRY_PROVIDER_TRACE_PG_METADATA_ONLY` setting
- Create `provider_trace_metadata` table via alembic migration
- Conditionally write to new table when flag is enabled
- Include backfill script for migrating existing data

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore: regenerate API spec and SDK

* fix: use composite PK (created_at, id) for provider_trace_metadata

Aligns with GCS partitioning structure (raw/date=YYYY-MM-DD/{id}.json.gz)
and enables efficient date-range queries via the B-tree index.

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* amendments

* fix: add bulk data copy to migration

Copy existing provider_traces metadata in-migration instead of separate
backfill script. Creates indexes after bulk insert for better performance.

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: remove data copy from migration, create empty table only

Old data stays in provider_traces, new writes go to provider_trace_metadata
when flag is enabled. Full traces are in GCS anyway.

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: address PR comments

- Remove GCS mention from ProviderTraceMetadata docstring
- Move metadata object creation outside session context

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: reads always use full provider_traces table

The metadata_only flag should only control writes. Reads always go to
the full table to avoid returning ProviderTraceMetadata where
ProviderTrace is expected.

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* feat: enable metadata-only provider trace writes in prod

Add LETTA_TELEMETRY_PROVIDER_TRACE_PG_METADATA_ONLY=true to all
Helm values (memgpt-server and lettuce-py, prod and dev).

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

---------

Co-authored-by: Letta <noreply@letta.com>
2026-01-29 12:44:04 -08:00
Sarah Wooders
adab8cd9b5 feat: add MiniMax provider support (#9095)
* feat: add MiniMax provider support

Add MiniMax as a new LLM provider using their Anthropic-compatible API.

Key implementation details:
- Uses standard messages API (not beta) - MiniMax supports thinking blocks natively
- Base URL: https://api.minimax.io/anthropic
- Models: MiniMax-M2.1, MiniMax-M2.1-lightning, MiniMax-M2 (all 200K context, 128K output)
- Temperature clamped to valid range (0.0, 1.0]
- All M2.x models treated as reasoning models (support interleaved thinking)
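The clamp into the half-open range (0.0, 1.0] might look like this; the epsilon floor is our assumption for excluding exactly 0.0, and the real client may handle the lower bound differently:

```python
def clamp_temperature(value: float, floor: float = 1e-6, ceiling: float = 1.0) -> float:
    """Clamp a requested temperature into MiniMax's accepted range (sketch)."""
    # floor is a small positive epsilon because 0.0 itself is out of range
    return min(max(value, floor), ceiling)

print(clamp_temperature(0.0), clamp_temperature(0.7), clamp_temperature(2.0))
```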

Files added:
- letta/schemas/providers/minimax.py - MiniMax provider schema
- letta/llm_api/minimax_client.py - Client extending AnthropicClient
- tests/test_minimax_client.py - Unit tests (13 tests)
- tests/model_settings/minimax-m2.1.json - Integration test config

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore: regenerate API spec with MiniMax provider

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore: use MiniMax-M2.1-lightning for CI tests

Switch to the faster/cheaper lightning model variant for integration tests.

🐾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* chore: add MINIMAX_API_KEY to deploy-core command

Co-authored-by: Sarah Wooders <sarahwooders@users.noreply.github.com>

* chore: regenerate web openapi spec with MiniMax provider

Co-authored-by: Sarah Wooders <sarahwooders@users.noreply.github.com>

🐾 Generated with [Letta Code](https://letta.com)

---------

Co-authored-by: Letta <noreply@letta.com>
Co-authored-by: letta-code <248085862+letta-code@users.noreply.github.com>
Co-authored-by: Sarah Wooders <sarahwooders@users.noreply.github.com>
2026-01-29 12:44:04 -08:00
Charles Packer
238894eebd fix(core): disable MCP stdio servers by default (#8969)
* fix(core): disable MCP stdio servers by default

Stdio MCP servers spawn local processes on the host, which is not
suitable for multi-tenant or shared server deployments. This change:

- Changes `mcp_disable_stdio` default from False to True
- Enforces the setting in `get_mcp_client()` and `create_mcp_server_from_config()`
- Users running local/single-user deployments can set MCP_DISABLE_STDIO=false
  to enable stdio-based MCP servers (e.g., for npx/uvx tools)
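The default-deny enforcement might be sketched like this; the function bodies and error message are illustrative, not the code in `get_mcp_client()`:

```python
import os

def mcp_stdio_disabled() -> bool:
    # stdio MCP servers spawn processes on the host, so default them off;
    # only an explicit MCP_DISABLE_STDIO=false re-enables them
    return os.environ.get("MCP_DISABLE_STDIO", "true").lower() != "false"

def get_mcp_client(server_type: str) -> str:
    if server_type == "stdio" and mcp_stdio_disabled():
        raise RuntimeError(
            "stdio MCP servers are disabled; set MCP_DISABLE_STDIO=false "
            "for local single-user deployments"
        )
    return f"{server_type}-client"  # placeholder for the real client object

print(get_mcp_client("sse"))  # sse-client
```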

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* update ci

* push

---------

Co-authored-by: Letta <noreply@letta.com>
Co-authored-by: jnjpng <jin@letta.com>
Co-authored-by: Letta Bot <jinjpeng@gmail.com>
2026-01-29 12:43:53 -08:00
Devansh Jain
dfa6ee0c23 feat: add SGLang support (#8838)
* add sglang support

* add tests

* normalize base url

* cleanup

* chore: regenerate autogenerated API files for sglang support
2026-01-29 12:43:51 -08:00
Ari Webb
9dbf428c1f feat: enable bedrock for anthropic models (#8847)
* feat: enable bedrock for anthropic models

* parallel tool calls in ade

* attempt add to ci

* update tests

* add env vars

* hardcode region

* get it working

* debugging

* add bedrock extra

* default env var [skip ci]

* run ci

* reasoner model update

* secrets

* clean up log

* clean up
2026-01-19 15:54:44 -08:00
Kian Jones
2ee28c3264 feat: add telemetry source identifier (#8918)
* add telemetry source

* add source to provider trace
2026-01-19 15:54:44 -08:00
Kian Jones
9418ab9815 feat: add provider trace backend abstraction for multi-backend telemetry (#8814)
* feat: add provider trace backend abstraction for multi-backend telemetry

Introduces a pluggable backend system for provider traces:
- Base class with async/sync create and read interfaces
- PostgreSQL backend (existing behavior)
- ClickHouse backend (via OTEL instrumentation)
- Socket backend (writes to Unix socket for crouton sidecar)
- Factory for instantiating backends from config

Refactors TelemetryManager to use backends with support for:
- Multi-backend writes (concurrent via asyncio.gather)
- Primary backend for reads (first in config list)
- Graceful error handling per backend

Config: LETTA_TELEMETRY_PROVIDER_TRACE_BACKEND (comma-separated)
Example: "postgres,socket" for dual-write to Postgres and crouton
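The multi-backend concurrent write with per-backend error isolation can be sketched as follows; `RecordingBackend` and the method names are illustrative, not the repository's backend classes:

```python
import asyncio

class RecordingBackend:
    """Minimal backend conforming to an async create interface."""
    def __init__(self, name: str, fail: bool = False):
        self.name, self.fail, self.writes = name, fail, []

    async def create(self, trace: dict) -> None:
        if self.fail:
            raise ConnectionError(self.name)
        self.writes.append(trace)

async def write_trace(backends: list, trace: dict) -> list:
    # concurrent writes via asyncio.gather; return_exceptions=True means one
    # failing backend never prevents the others from persisting the trace
    return await asyncio.gather(
        *(b.create(trace) for b in backends), return_exceptions=True
    )

async def main():
    backends = [RecordingBackend("postgres"), RecordingBackend("socket", fail=True)]
    results = await write_trace(backends, {"step": 1})
    succeeded = len(backends[0].writes)
    failed = sum(isinstance(r, Exception) for r in results)
    return succeeded, failed

print(asyncio.run(main()))  # (1, 1)
```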

🐙 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* feat: add protocol version to socket backend records

Adds PROTOCOL_VERSION constant to socket backend:
- Included in every telemetry record sent to crouton
- Must match ProtocolVersion in apps/crouton/main.go
- Enables crouton to detect and reject incompatible messages

🐙 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* fix: remove organization_id from ProviderTraceCreate calls

The organization_id is now handled via the actor parameter in the
telemetry manager, not through ProviderTraceCreate schema. This fixes
validation errors after changing ProviderTraceCreate to inherit from
BaseProviderTrace which forbids extra fields.

🐙 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta <noreply@letta.com>

* consolidate provider trace

* add clickhouse-connect to fix bug on main lmao

* auto-generated SDK changes, deployment details, ClickHouse prefix bug fix, and added fields to the runs trace return API

* auto-generated SDK changes, deployment details, ClickHouse prefix bug fix, and added fields to the runs trace return API

* consolidate provider trace

* consolidate provider trace bug fix

---------

Co-authored-by: Letta <noreply@letta.com>
2026-01-19 15:54:43 -08:00
Kian Jones
2368efd027 fix: add missing use_clickhouse_for_provider_traces setting (#8799)
PR #8682 added code that references settings.use_clickhouse_for_provider_traces
but never added the field to Settings, causing AttributeError in prod.

🤖 Generated with [Letta Code](https://letta.com)

Co-authored-by: Letta <noreply@letta.com>
2026-01-19 15:54:42 -08:00
Ari Webb
cd45212acb feat: add zai provider support (#7626)
* feat: add zai provider support

* add zai_api_key secret to deploy-core

* add to justfile

* add testing, provider integration skill

* enable zai key

* fix zai test

* clean up skill a little

* small changes
2026-01-12 10:57:19 -08:00
Kevin Lin
03a41f8e8d chore: Increase LLM streaming timeout [LET-6562] (#7080)
increase
2025-12-17 17:31:02 -08:00
Kian Jones
fbd89c9360 fix: replace all 'PRODUCTION' references with 'prod' for consistency (#6627)
* fix: replace all 'PRODUCTION' references with 'prod' for consistency

Problem: Codebase had 11 references to 'PRODUCTION' (uppercase) that should
use 'prod' (lowercase) for consistency with the deployment workflows and
environment normalization.

Changes across 8 files:

1. Source files (using settings.environment):
   - letta/functions/function_sets/multi_agent.py
   - letta/services/tool_manager.py
   - letta/services/tool_executor/multi_agent_tool_executor.py
   - letta/services/helpers/agent_manager_helper.py
   All checks changed from: settings.environment == "PRODUCTION"
   To: settings.environment == "prod"

2. OTEL resource configuration:
   - letta/otel/resource.py
     - Updated _normalize_environment_tag() to handle 'prod' directly
     - Removed 'PRODUCTION' -> 'prod' mapping (no longer needed)
     - Updated device.id check from _env != "PRODUCTION" to _env != "prod"

3. Test files:
   - tests/managers/conftest.py
     - Fixture parameter changed from "PRODUCTION" to "prod"
   - tests/managers/test_agent_manager.py (3 occurrences)
   - tests/managers/test_tool_manager.py (2 occurrences)
   All test checks changed to use "prod"

Result: Complete consistency across the codebase:
- All environment checks use "prod" instead of "PRODUCTION"
- Normalization function simplified (no special case for PRODUCTION)
- Tests use correct "prod" value
- Matches deployment workflow configuration from PR #6626

This completes the environment naming standardization effort.

* fix: update settings.py environment description to use 'prod' instead of 'PRODUCTION'

The field description still referenced PRODUCTION as an example value.
Updated to use lowercase 'prod' for consistency with actual usage.

Before: "Application environment (PRODUCTION, DEV, CANARY, etc. - normalized to lowercase for OTEL tags)"
After: "Application environment (prod, dev, canary, etc. - lowercase values used for OTEL tags)"
2025-12-15 12:02:34 -08:00
Kian Jones
3422508d42 feat: add OpenTelemetry distributed tracing to cloud-api and web (#6549)
* feat: add OpenTelemetry distributed tracing to letta-web

Enables end-to-end distributed tracing from letta-web through memgpt-server
using OpenTelemetry. Traces are exported via OTLP to Datadog APM for
monitoring request latency across services.

Key changes:
- Install OTEL packages: @opentelemetry/sdk-node, auto-instrumentations-node
- Create apps/web/src/lib/tracing.ts with full OTEL configuration
- Initialize tracing in instrumentation.ts (before any other imports)
- Add OTEL packages to next.config.js serverExternalPackages
- Add OTEL environment variables to deployment configs:
  - OTEL_EXPORTER_OTLP_ENDPOINT (e.g., http://datadog-agent:4317)
  - OTEL_SERVICE_NAME (letta-web)
  - OTEL_ENABLED (true in production)

Features enabled:
- Automatic HTTP/fetch instrumentation with trace context propagation
- Service metadata (name, version, environment)
- Trace correlation with logs (getCurrentTraceId helper)
- Graceful shutdown handling
- Health check endpoint filtering

Configuration:
- Traces sent to OTLP endpoint (Datadog agent)
- W3C Trace Context propagation for distributed tracing
- BatchSpanProcessor for efficient trace export
- Debug logging in development environment

GitHub variables to set:
- OTEL_EXPORTER_OTLP_ENDPOINT (e.g., http://datadog-agent:4317)
- OTEL_ENABLED (true)

* feat: add OpenTelemetry distributed tracing to cloud-api

Completes end-to-end distributed tracing across the full request chain:
letta-web → cloud-api → memgpt-server (core)

All three services now export traces via OTLP to Datadog APM.

Key changes:
- Install OTEL packages in cloud-api
- Create apps/cloud-api/src/instrument-otel.ts with full OTEL configuration
- Initialize OTEL tracing in main.ts (before Sentry)
- Add OTEL environment variables to deployment configs:
  - OTEL_EXPORTER_OTLP_ENDPOINT (e.g., http://datadog-agent:4317)
  - OTEL_SERVICE_NAME (cloud-api)
  - OTEL_ENABLED (true in production)
  - GIT_HASH (for service version)

Features enabled:
- Automatic HTTP/Express instrumentation
- Trace context propagation (W3C Trace Context)
- Service metadata (name, version, environment)
- Trace correlation with logs (getCurrentTraceId helper)
- Health check endpoint filtering

Configuration:
- Traces sent to OTLP endpoint (Datadog agent)
- Seamless trace propagation through the full request chain
- BatchSpanProcessor for efficient trace export

Complete trace flow:
1. letta-web receives request, starts root span
2. letta-web calls cloud-api, propagates trace context
3. cloud-api calls memgpt-server, propagates trace context
4. All spans linked by trace ID, visible as single trace in Datadog

* fix: prevent duplicate OTEL SDK initialization and handle array headers

Fixes identified by Cursor bugbot:

1. Added initialization guard to prevent duplicate SDK initialization
   - Added isInitialized flag to prevent multiple SDK instances
   - Prevents duplicate SIGTERM handlers from being registered
   - Prevents resource leaks from lost SDK references

2. Fixed array header value handling
   - HTTP headers can be string | string[] | undefined
   - Now properly handles array case by taking first element
   - Prevents passing arrays to span.setAttribute() which expects strings

3. Verified OTEL dependencies are correctly installed
   - Packages are in root package.json (monorepo structure)
   - Available to all workspace packages (web, cloud-api)
   - Bugbot false positive - dependencies ARE present

Applied fixes to both:
- apps/web/src/lib/tracing.ts
- apps/cloud-api/src/instrument-otel.ts
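
The two fixes above — an idempotent initialization guard and first-element handling for array header values — can be sketched as follows. The real code is TypeScript in the files listed; this Python version only mirrors the patterns, and the function names are illustrative:

```python
_initialized = False

def initialize_tracing() -> bool:
    """Idempotent init: the second and later calls are no-ops, so shutdown
    handlers get registered exactly once and no SDK reference is leaked."""
    global _initialized
    if _initialized:
        return False  # already initialized; skip duplicate setup
    _initialized = True
    # ... create SDK instance, register SIGTERM handler once ...
    return True

def first_header_value(value):
    """HTTP header values may be a string, a list of strings, or absent;
    span attributes need a plain string, so take the first list element."""
    if isinstance(value, list):
        return value[0] if value else None
    return value

assert initialize_tracing() is True
assert initialize_tracing() is False      # guard prevents double init
assert first_header_value(["a", "b"]) == "a"
assert first_header_value("x") == "x"
assert first_header_value(None) is None
```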

* fix: handle SIGTERM promise rejections and unify initialization pattern

Fixes identified by Cursor bugbot:

1. Fixed unhandled promise rejection in SIGTERM handlers
   - Changed from async arrow function to sync with .catch()
   - Prevents unhandled promise rejections during shutdown
   - Logs errors if OTLP endpoint is unreachable during shutdown
   - Applied to both web and cloud-api

2. Unified initialization pattern across services
   - Removed auto-initialization from cloud-api instrument-otel.ts
   - Now explicitly calls initializeTracing() in main.ts
   - Matches web pattern (explicit call in instrumentation.ts)
   - Reduces confusion and maintains consistency

Both services now follow the same pattern:
- Import tracing module
- Explicitly call initializeTracing()
- Guard against duplicate initialization with isInitialized flag

Before (cloud-api):
  import './instrument-otel'; // Auto-initializes

After (cloud-api):
  import { initializeTracing } from './instrument-otel';
  initializeTracing(); // Explicit call

SIGTERM handler before:
  process.on('SIGTERM', async () => {
    await shutdownTracing(); // Unhandled rejection!
  });

SIGTERM handler after:
  process.on('SIGTERM', () => {
    shutdownTracing().catch((error) => {
      console.error('Error during OTEL shutdown:', error);
    });
  });

* feat: add environment differentiation for distributed tracing

Enables proper environment filtering in Datadog APM by introducing LETTA_ENV
to distinguish between production, staging, canary, and development.

Problem:
- NODE_ENV is always 'production' or 'development'
- No way to differentiate staging, canary, etc. in Datadog
- All traces appeared under no environment or same environment
- Couldn't test with staging traces

Solution:
- Added LETTA_ENV variable (production, staging, canary, development)
- Set deployment.environment attribute for Datadog APM filtering
- Updated all deployment configs (workflows, justfile)
- Falls back to NODE_ENV if LETTA_ENV not set

Changes:
1. Updated tracing code (web + cloud-api):
   - Use LETTA_ENV for environment name
   - Set SEMRESATTRS_DEPLOYMENT_ENVIRONMENT (resolves to deployment.environment)
   - Fallback: LETTA_ENV → NODE_ENV → 'development'

2. Updated deployment configs:
   - .github/workflows/deploy-web.yml: LETTA_ENV=production
   - .github/workflows/deploy-cloud-api.yml: LETTA_ENV=production
   - justfile: LETTA_ENV with default to production

3. Added comprehensive documentation:
   - OTEL_TRACING.md with full setup guide
   - How to view environments in Datadog APM
   - How to test with staging environment
   - Dashboard query examples
   - Troubleshooting guide

Usage:
# Production
LETTA_ENV=production

# Staging
LETTA_ENV=staging

# Local dev
LETTA_ENV=development

Datadog APM now shows:
- env:production (main traffic)
- env:staging (staging deployments)
- env:canary (canary deployments)
- env:development (local testing)

View in Datadog:
APM → Services → Filter by env dropdown → Select production/staging/etc.
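
The fallback chain above (LETTA_ENV → NODE_ENV → 'development') is a first-non-empty lookup. A minimal Python sketch, with a dict standing in for process.env (the real code is TypeScript; the function name is illustrative):

```python
def resolve_environment(env: dict) -> str:
    """Resolve the deployment environment name for trace tagging:
    prefer LETTA_ENV, fall back to NODE_ENV, then to 'development'."""
    return env.get("LETTA_ENV") or env.get("NODE_ENV") or "development"

# LETTA_ENV wins even when NODE_ENV is set (e.g. staging on a prod build).
assert resolve_environment({"LETTA_ENV": "staging", "NODE_ENV": "production"}) == "staging"
assert resolve_environment({"NODE_ENV": "production"}) == "production"
assert resolve_environment({}) == "development"
```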

* fix: prevent OTEL SDK double shutdown and error handler failures

Fixes identified by Cursor bugbot:

1. SDK double shutdown prevention
   - Set sdk = null after successful shutdown
   - Set isInitialized = false to allow re-initialization
   - Even on shutdown error, mark as shutdown to prevent retry
   - Prevents errors when shutdownTracing() called multiple times
   - Applied to both web and cloud-api

2. Error handler using console.error directly (web only)
   - Replaced dynamic require('./logger') with console.error
   - Logger module may not be loaded during early initialization
   - This code runs in Next.js instrumentation.ts before modules load
   - Prevents masking original OTEL errors with logger failures
   - Cloud-api already correctly used console.error

Before (bug #1):
  await sdk.shutdown();
  // sdk still references shutdown SDK
  // Next call to shutdownTracing() tries to shutdown again

After (bug #1):
  await sdk.shutdown();
  sdk = null; // Prevent double shutdown
  isInitialized = false; // Allow re-init

Before (bug #2 - web):
  const { logger } = require('./logger'); // May fail during init
  logger.error('Failed to initialize OTEL', errorInfo);

After (bug #2 - web):
  console.error('Failed to initialize OTEL:', error); // Always works

Scenarios protected:
- Multiple SIGTERM signals
- Explicit shutdownTracing() calls
- Logger initialization failures
- Circular dependencies during early init

* feat: add environment differentiation to core and staging deployments

Enables proper environment filtering in Datadog APM for memgpt-server (core)
and staging deployments by adding deployment.environment resource attribute.

Problem:
- Core traces didn't show environment in Datadog APM
- Staging workflow had no OTEL configuration
- Couldn't differentiate staging vs production core traces

Solution:
1. Updated core OTEL resource to include deployment.environment
   - Added deployment.environment attribute in resource.py
   - Uses settings.environment which maps to LETTA_ENVIRONMENT env var
   - Applied .lower() for consistency with web/cloud-api

2. Added LETTA_ENV to staging workflow
   - nightly-staging-deploy-test.yaml: LETTA_ENV=staging
   - Added OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_ENABLED vars
   - Traces from staging will show env:staging in Datadog

3. Added LETTA_ENV to production core workflow
   - deploy-core.yml: LETTA_ENV=production
   - Added OTEL configuration at workflow level
   - Traces from production will show env:production

4. Updated justfile for core deployments
   - Set LETTA_ENVIRONMENT from LETTA_ENV with default to production
   - Maps to settings.environment field (env_prefix="letta_")

Environment mapping:
- Web/Cloud-API: Use LETTA_ENV directly
- Core: Use LETTA_ENVIRONMENT (Pydantic with letta_ prefix)
- Both map to deployment.environment resource attribute

Now all services properly tag traces with environment:
- letta-web: deployment.environment set
- cloud-api: deployment.environment set
- memgpt-server: deployment.environment set

View in Datadog:
APM → Services → Filter by env:production or env:staging

* refactor: unify environment variable to LETTA_ENV across all services

Simplifies environment configuration by using LETTA_ENV consistently across
all three services (web, cloud-api, and core) instead of having core use
LETTA_ENVIRONMENT.

Problem:
- Core used LETTA_ENVIRONMENT (due to Pydantic env_prefix)
- Web and cloud-api used LETTA_ENV
- Confusing to have two different variable names
- Justfile had to map LETTA_ENV → LETTA_ENVIRONMENT

Solution:
- Added validation_alias to core settings.py
- environment field now reads from LETTA_ENV directly
- Falls back to letta_environment for backwards compatibility
- Updated justfile to set LETTA_ENV for core (not LETTA_ENVIRONMENT)
- Updated documentation to clarify consistent naming

Changes:
1. apps/core/letta/settings.py
   - Added validation_alias=AliasChoices("LETTA_ENV", "letta_environment")
   - Prioritizes LETTA_ENV, falls back to letta_environment
   - Updated description to include all environment values

2. justfile
   - Changed --set secrets.LETTA_ENVIRONMENT to --set secrets.LETTA_ENV
   - Now consistent with web and cloud-api deployments

3. OTEL_TRACING.md
   - Added note that all services use LETTA_ENV consistently
   - Fixed trailing whitespace

Before:
- Web: LETTA_ENV
- Cloud-API: LETTA_ENV
- Core: LETTA_ENVIRONMENT 

After:
- Web: LETTA_ENV
- Cloud-API: LETTA_ENV
- Core: LETTA_ENV 

All services now use the same environment variable name!

* refactor: standardize on LETTA_ENVIRONMENT across all services

Unifies environment variable naming to use LETTA_ENVIRONMENT consistently
across all three services (web, cloud-api, and core).

Problem:
- Previous commit tried to use LETTA_ENV everywhere
- Core already uses Pydantic with env_prefix="letta_"
- Better to standardize on LETTA_ENVIRONMENT to match core conventions

Solution:
- All services now read from LETTA_ENVIRONMENT
- Web: process.env.LETTA_ENVIRONMENT
- Cloud-API: process.env.LETTA_ENVIRONMENT
- Core: settings.environment (reads LETTA_ENVIRONMENT via Pydantic prefix)

Changes:
1. apps/web/src/lib/tracing.ts
   - Changed LETTA_ENV → LETTA_ENVIRONMENT

2. apps/cloud-api/src/instrument-otel.ts
   - Changed LETTA_ENV → LETTA_ENVIRONMENT

3. apps/core/letta/settings.py
   - Removed validation_alias (not needed)
   - Uses standard Pydantic env_prefix behavior

4. All workflow files updated:
   - deploy-web.yml: LETTA_ENVIRONMENT=production
   - deploy-cloud-api.yml: LETTA_ENVIRONMENT=production
   - deploy-core.yml: LETTA_ENVIRONMENT=production
   - nightly-staging-deploy-test.yaml: LETTA_ENVIRONMENT=staging
   - stage-web.yaml: LETTA_ENVIRONMENT=staging
   - stage-cloud-api.yaml: LETTA_ENVIRONMENT=staging (added OTEL config)
   - stage-core.yaml: LETTA_ENVIRONMENT=staging (added OTEL config)

5. justfile
   - Updated all LETTA_ENV → LETTA_ENVIRONMENT
   - Web: --set env.LETTA_ENVIRONMENT
   - Cloud-API: --set env.LETTA_ENVIRONMENT
   - Core: --set secrets.LETTA_ENVIRONMENT

6. OTEL_TRACING.md
   - All references updated to LETTA_ENVIRONMENT

Final state:
- Web: LETTA_ENVIRONMENT
- Cloud-API: LETTA_ENVIRONMENT
- Core: LETTA_ENVIRONMENT (via letta_ prefix)

All services use the same variable name with proper Pydantic conventions!

* feat: implement split OTEL architecture (Option A)

Implements Option A: Web and cloud-api send traces directly to Datadog Agent,
while core keeps its existing OTEL sidecar (exports to ClickHouse + Datadog).

Architecture:
- letta-web → Datadog Agent (OTLP:4317) → Datadog APM
- cloud-api → Datadog Agent (OTLP:4317) → Datadog APM
- memgpt-server → OTEL Sidecar → ClickHouse + Datadog (unchanged)

Rationale:
- Core has existing production sidecar setup (exports to ClickHouse for analytics)
- Web/cloud-api don't need ClickHouse export, only APM
- Simpler: Direct to Datadog Agent is sufficient
- Minimal changes to core (already working)
- Traces still link end-to-end via W3C Trace Context propagation

Changes:

1. Helm Charts - Added OTEL config defaults:
   - helm/letta-web/values.yaml: Added OTEL env vars
   - helm/cloud-api/values.yaml: Added OTEL env vars
   - Default: OTEL_ENABLED="false", override in production
   - Endpoint: http://datadog-agent:4317

2. Production Workflows - Direct to Datadog Agent:
   - deploy-web.yml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - deploy-cloud-api.yml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - deploy-core.yml: Removed OTEL vars (keep existing setup)
   - OTEL_ENABLED="true", LETTA_ENVIRONMENT=production

3. Staging Workflows - Direct to Datadog Agent:
   - stage-web.yaml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - stage-cloud-api.yaml: Set OTEL_EXPORTER_OTLP_ENDPOINT to datadog-agent
   - stage-core.yaml: Removed OTEL vars (keep existing setup)
   - nightly-staging-deploy-test.yaml: Removed OTEL vars
   - OTEL_ENABLED="true", LETTA_ENVIRONMENT=staging

4. Justfile:
   - Removed LETTA_ENVIRONMENT from core deployment (keep unchanged)
   - Web/cloud-api already correctly pass OTEL vars from workflows

5. Documentation:
   - Completely rewrote OTEL_TRACING.md
   - Added architecture diagrams explaining split setup
   - Added Datadog Agent prerequisites
   - Added troubleshooting for split architecture
   - Explained why we chose this approach

Prerequisites (must verify before deploying):
- Datadog Agent deployed with service name: datadog-agent
- OTLP receiver enabled on port 4317
- If different service name/namespace, update workflows

Next Steps:
- Verify datadog-agent service exists in cluster
- Verify OTLP receiver is enabled on Datadog agent
- Deploy and test trace propagation across services

* refactor: shorten environment names to prod and dev

Changes LETTA_ENVIRONMENT values from 'production' to 'prod' and
'development' to 'dev' for consistency and brevity.

Changes:
1. Workflows:
   - deploy-web.yml: production → prod
   - deploy-cloud-api.yml: production → prod

2. Helm charts:
   - letta-web/values.yaml: development → dev
   - cloud-api/values.yaml: development → dev

3. Justfile:
   - Default values: production → prod

4. Code:
   - apps/web/src/lib/tracing.ts: Fallback 'development' → 'dev'
   - apps/cloud-api/src/instrument-otel.ts: Fallback 'development' → 'dev'
   - apps/core/letta/settings.py: Updated description

5. Documentation:
   - OTEL_TRACING.md: Updated all examples and table

Environment values:
- prod (was production)
- staging (unchanged)
- canary (unchanged)
- dev (was development)

* refactor: align environment names with codebase patterns

Changes staging to 'dev' and local development to 'local-test' to match
existing codebase conventions (like test_temporal_metrics_local.py).

Rationale:
- 'dev' for staging matches consistent pattern across codebase
- 'local-test' for local development follows test naming convention
- Clearer distinction between deployed staging and local testing

Environment values:
- prod (production)
- dev (staging/dev cluster)
- canary (canary deployments)
- local-test (local development)

Changes:
1. Staging workflows:
   - stage-web.yaml: staging → dev
   - stage-cloud-api.yaml: staging → dev

2. Helm chart defaults (for local):
   - letta-web/values.yaml: dev → local-test
   - cloud-api/values.yaml: dev → local-test

3. Code fallbacks:
   - apps/web/src/lib/tracing.ts: 'dev' → 'local-test'
   - apps/cloud-api/src/instrument-otel.ts: 'dev' → 'local-test'
   - apps/core/letta/settings.py: Updated description

4. Documentation:
   - OTEL_TRACING.md: Updated table, examples, and all references
   - Clarified dev = staging cluster, local-test = local development

Datadog APM filters:
- env:prod (production)
- env:dev (staging cluster)
- env:canary (canary)
- env:local-test (local development)

* fix: update environment checks for lowercase values and add missing configs

Fixes 4 bugs identified by Cursor bugbot:

1. Case-sensitive environment checks (5 locations)
   - Updated all checks from "PRODUCTION" to case-insensitive "prod"
   - Fixed in: resource.py, multi_agent.py, tool_manager.py,
     multi_agent_tool_executor.py, agent_manager_helper.py
   - Now properly filters local-only tools in production
   - Prevents exposing debug tools in production

2. Device ID leak in production
   - Fixed resource.py to use case-insensitive check
   - Now correctly excludes device.id (MAC address) in production
   - Only adds device.id when env is not "prod"

3. Missing @opentelemetry/sdk-trace-base in Next.js externals
   - Added to serverExternalPackages in next.config.js
   - Prevents webpack bundling issues with native dependencies
   - Package is directly imported for BatchSpanProcessor

4. Missing NEXT_PUBLIC_GIT_HASH in stage-web workflow
   - Added NEXT_PUBLIC_GIT_HASH: ${{ github.sha }}
   - Now matches stage-cloud-api.yaml pattern
   - Staging traces will show correct version instead of 'unknown'
   - Enables correlation of traces with specific deployments

Changes:
- apps/core/letta/otel/resource.py: Case-insensitive check, add device.id only if not prod
- apps/core/letta/functions/function_sets/multi_agent.py: Case-insensitive prod check
- apps/core/letta/services/tool_manager.py: Case-insensitive prod check
- apps/core/letta/services/tool_executor/multi_agent_tool_executor.py: Case-insensitive prod check
- apps/core/letta/services/helpers/agent_manager_helper.py: Case-insensitive prod check
- apps/web/next.config.js: Added @opentelemetry/sdk-trace-base to externals
- .github/workflows/stage-web.yaml: Added NEXT_PUBLIC_GIT_HASH

All checks now use: settings.environment.lower() == "prod"
This matches our new convention: prod/dev/canary/local-test

Also includes: distributed-tracing skill (created in /skill session)

* refactor: keep core PRODUCTION but normalize OTEL tags to prod

Changes approach to maintain backward compatibility with core business logic
while standardizing OTEL environment tags.

Previous approach:
- Changed all "PRODUCTION" checks to lowercase "prod"
- Would break existing core business logic expectations

New approach:
- Core continues using "PRODUCTION" (uppercase) for business logic
- OTEL resource.py normalizes environment to lowercase abbreviated tags
- Web/cloud-api use "prod" directly (they don't have business logic checks)

Changes:

1. Reverted business logic checks to use "PRODUCTION" (uppercase):
   - multi_agent.py: Check for "PRODUCTION" to block tools
   - tool_manager.py: Check for "PRODUCTION" to filter local-only tools
   - multi_agent_tool_executor.py: Check for "PRODUCTION" to block tools
   - agent_manager_helper.py: Check for "PRODUCTION" to filter tools

2. Added environment normalization for OTEL tags:
   - resource.py: New _normalize_environment_tag() function
   - Maps PRODUCTION → prod, DEV/STAGING → dev
   - Other values (CANARY, etc.) converted to lowercase
   - Device ID check reverted to != "PRODUCTION"

3. Updated core deployments to set PRODUCTION:
   - deploy-core.yml: LETTA_ENVIRONMENT=PRODUCTION
   - stage-core.yaml: LETTA_ENVIRONMENT=DEV
   - justfile: Added LETTA_ENVIRONMENT with default PRODUCTION

4. Updated settings description:
   - Clarifies values are uppercase (PRODUCTION, DEV)
   - Notes normalization to lowercase for OTEL tags

Result:
- Core business logic: Uses "PRODUCTION" (unchanged, backward compatible)
- OTEL Datadog tags: Shows "prod" (normalized, consistent with web/cloud-api)
- Web/cloud-api: Continue using "prod" directly (no change needed)
- Device ID properly excluded in PRODUCTION environments

* fix: correct Python FastAPI instrumentation and environment normalization

Fixes 3 bugs identified by Cursor bugbot in distributed-tracing skill:

1. Python import typo (line 50)
   - Was: from opentelemetry.instrumentation.fastapi import FastAPIInstrumentatio
   - Now: from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
   - Missing final 'n' in Instrumentatio
   - Correct class name is FastAPIInstrumentor (with 'or' suffix)

2. Wrong class name usage (line 151)
   - Was: FastAPIInstrumentation.instrument_app()
   - Now: FastAPIInstrumentor.instrument_app()
   - Fixed to match correct OpenTelemetry API

3. Environment tag inconsistency
   - Problem: Python template used .lower() which converts PRODUCTION -> production
   - But resource.py normalizes PRODUCTION -> prod
   - Would create inconsistent tags: 'production' vs 'prod' in Datadog

   Solution:
   - Added _normalize_environment_tag() function to Python template
   - Matches resource.py normalization logic
   - PRODUCTION -> prod, DEV/STAGING -> dev, others lowercase
   - Updated comments in workflows to clarify normalization happens in code

Changes:
- .skills/distributed-tracing/templates/python-fastapi-tracing.py:
  - Fixed import: FastAPIInstrumentor (not FastAPIInstrumentatio)
  - Fixed usage: FastAPIInstrumentor.instrument_app()
  - Added _normalize_environment_tag() function
  - Updated environment handling to use normalization
  - Updated docstring to clarify PRODUCTION/DEV -> prod/dev mapping

- .github/workflows/deploy-core.yml:
  - Clarified comment: _normalize_environment_tag() converts to "prod"

- .github/workflows/stage-core.yaml:
  - Clarified comment: _normalize_environment_tag() converts to "dev"

Result:
All services now consistently show 'prod' (not 'production') in Datadog APM,
enabling proper filtering and correlation across distributed traces.

* fix: add Datadog config to staging workflows and fix justfile backslash

Fixes 3 issues found in staging deployment logs:

1. Missing backslash in justfile (line 134)
   Problem: LETTA_ENVIRONMENT line missing backslash caused all subsequent
   helm --set flags to be ignored, including OTEL_EXPORTER_OTLP_ENDPOINT
   Result: letta-web and cloud-api logs showed "OTEL_EXPORTER_OTLP_ENDPOINT not set"

   Fixed:
   --set env.LETTA_ENVIRONMENT=${LETTA_ENVIRONMENT:-prod} \  # Added backslash

2. Missing Datadog vars in staging workflows
   Problem: stage-web.yaml, stage-cloud-api.yaml, stage-core.yaml didn't set
   DD_SITE, DD_API_KEY, DD_LOGS_INJECTION, etc.

   For web/cloud-api:
   - Added to top-level env section so justfile can use them

   For core:
   - Added to top-level env section
   - Added to Deploy step env section (so justfile can pass to helm)
   - core OTEL collector config reads these from environment

   Result: core logs showed "exporters::datadog: api.key is not set"

3. Wrong environment tag in staging (secondary issue)
   Problem: letta-web logs showed 'dd.env":"production"' in staging
   Cause: Missing backslash broke LETTA_ENVIRONMENT, defaulted to prod
   Fixed: Backslash fix ensures LETTA_ENVIRONMENT=dev is set

Changes:
- justfile: Fixed missing backslash on LETTA_ENVIRONMENT line
- .github/workflows/stage-web.yaml: Added DD_* vars to env
- .github/workflows/stage-cloud-api.yaml: Added DD_* vars to env
- .github/workflows/stage-core.yaml: Added DD_* vars to env and Deploy step

After this fix:
- Web/cloud-api will send traces to Datadog Agent via OTLP
- Core OTEL collector will export traces to both ClickHouse and Datadog
- All staging traces will show env:dev tag (not env:production)

* fix: move OTEL config from prod helm to dev helm values

Problem: OTEL configuration was added to production helm values files
(helm/letta-web/values.yaml and helm/cloud-api/values.yaml) but these
are for production deployments. Staging deployments use the dev helm
values (helm/dev/<service>/values.yaml).

Changes:
- Removed OTEL vars from helm/letta-web/values.yaml (prod)
- Removed OTEL vars from helm/cloud-api/values.yaml (prod)
- Added OTEL vars to helm/dev/letta-web/values.yaml (staging)
- Added OTEL vars to helm/dev/cloud-api/values.yaml (staging)

Dev helm values now include:
  OTEL_ENABLED: "true"
  OTEL_SERVICE_NAME: "letta-web" or "cloud-api"
  OTEL_EXPORTER_OTLP_ENDPOINT: "http://datadog-agent.default.svc.cluster.local:4317"
  LETTA_ENVIRONMENT: "dev"

Note: Production deployments override these via workflow env vars, so
prod helm values don't need OTEL config. Dev/staging deployments use
these helm values as defaults.

* remove generated doc

* secrets in dev

* totally unrelated changes to tf for runner sizing and scaling

* feat: add DD_ENV tags to staging helm for log correlation

Problem: Logs show 'dd.env":"production"' instead of 'dd.env":"dev"'
in staging because Datadog's logger injection uses DD_ENV, DD_SERVICE,
and DD_VERSION environment variables for tagging.

Changes:
- Added DD_ENV, DD_SERVICE, DD_VERSION to helm/dev/letta-web/values.yaml
- Added DD_ENV, DD_SERVICE, DD_VERSION to helm/dev/cloud-api/values.yaml

Values:
  DD_ENV: "dev"
  DD_SERVICE: "letta-web" or "cloud-api"
  DD_VERSION: "dev"

This ensures:
- Logs show correct env:dev tag in Datadog
- Traces and logs are properly correlated
- Consistent tagging across OTEL traces and DD logs

* feat: enable OTLP receiver in Datadog Agent configurations

Added OpenTelemetry Protocol (OTLP) receiver to Datadog Agent for both
dev and prod environments to support distributed tracing from services
using OpenTelemetry SDKs.

Changes:
- helm/dev/datadog/datadog-agent.yaml: Added otlp.receiver configuration
- helm/datadog/datadog-agent.yaml: Added otlp.receiver configuration

OTLP Configuration:
  otlp:
    receiver:
      protocols:
        grpc:
          enabled: true
          endpoint: "0.0.0.0:4317"
        http:
          enabled: true
          endpoint: "0.0.0.0:4318"

This enables:
- Web/cloud-api services to send traces via OTLP (port 4317)
- Core OTEL collector to export to Datadog via OTLP (port 4317)
- Alternative HTTP endpoint for OTLP (port 4318)

When applied, the Datadog Agent service will expose:
- Port 4317/TCP - OTLP gRPC (for traces)
- Port 4318/TCP - OTLP HTTP (for traces)
- Port 8126/TCP - Native Datadog APM (existing)
- Port 8125/UDP - DogStatsD (existing)

Apply with:
  kubectl apply -f helm/dev/datadog/datadog-agent.yaml     # staging
  kubectl apply -f helm/datadog/datadog-agent.yaml         # production

* feat: use git hash as DD_VERSION for all services

Changed from static version strings to using git commit hash as the
version tag in Datadog APM for better version tracking and correlation.

Changes:

1. Workflows - Set DD_VERSION to github.sha:
   - .github/workflows/stage-web.yaml: Added DD_VERSION: ${{ github.sha }}
   - .github/workflows/stage-cloud-api.yaml: Added DD_VERSION: ${{ github.sha }}
   - .github/workflows/stage-core.yaml: Added DD_VERSION: ${{ github.sha }}
     (both top-level env and Deploy step env)

2. Justfile - Pass DD_VERSION to helm:
   - deploy-web: Added --set env.DD_VERSION=${DD_VERSION:-unknown}
   - deploy-cloud-api: Added --set env.DD_VERSION=${DD_VERSION:-unknown}
   - deploy-core: Added --set secrets.DD_VERSION=${DD_VERSION:-unknown}

3. Helm dev values - Remove hardcoded version:
   - helm/dev/letta-web/values.yaml: Removed DD_VERSION: "dev"
   - helm/dev/cloud-api/values.yaml: Removed DD_VERSION: "dev"
   - Added comments that DD_VERSION is set via workflow

Result:
- Traces in Datadog will show version as git commit SHA (e.g., "abc123def")
- Can correlate traces with specific deployments/commits
- Consistent with internal versioning strategy (git hash, not semver)
- Defaults to "unknown" if DD_VERSION not set

Example trace tags after deployment:
  env:dev
  service:letta-web
  version:7eafc5b0c12345...

* feat: add DD_VERSION to production workflows

Added DD_VERSION to production deployment workflows for consistent version
tracking across staging and production environments.

Changes:
- .github/workflows/deploy-web.yml: Added DD_VERSION: ${{ github.sha }}
- .github/workflows/deploy-core.yml: Added DD_VERSION: ${{ github.sha }}

Note: deploy-cloud-api.yml doesn't have DD config yet, will add when
cloud-api gets OTEL enabled in production.

Context:
This was partially flagged by bugbot - it noted that NEXT_PUBLIC_GIT_HASH
was missing from prod, but that was incorrect (line 53 already has it).
However, DD_VERSION was indeed missing and needed for Datadog log
correlation.

Result:
- Production logs will show version tag matching git commit SHA
- Consistent with staging configuration
- Better trace/log correlation in Datadog APM

Staging already has DD_VERSION (added in commit fb1a3eea0)

* feat: add DD tags to memgpt-server dev helm for APM correlation

Problem: memgpt-server logs show up in Datadog but traces don't appear
properly in APM UI because DD_ENV, DD_SERVICE, DD_SITE tags were missing.

The service was using native Datadog agent instrumentation (via
LETTA_TELEMETRY_ENABLE_DATADOG) but without proper unified service tagging,
traces weren't being correlated correctly in the APM interface.

Changes:
- helm/dev/memgpt-server/values.yaml:
  - Added DD_ENV: "dev"
  - Added DD_SERVICE: "memgpt-server"
  - Added DD_SITE: "us5.datadoghq.com"
  - Added comment that DD_VERSION comes from workflow

Existing configuration:
- DD_VERSION already passed via stage-core.yaml (line 215) and justfile (line 272)
- DD_API_KEY already in secretsProvider (line 194)
- LETTA_TELEMETRY_ENABLE_DATADOG: "true" (enables native DD agent)
- LETTA_TELEMETRY_DATADOG_AGENT_HOST/PORT (routes to DD cluster agent)

Result:
After redeployment, memgpt-server traces will show in Datadog APM with:
- env:dev
- service:memgpt-server
- version:<git-hash>
- Proper correlation with logs

* refactor: use image tag for DD_VERSION instead of separate env var

Changed from passing DD_VERSION separately to deriving it from the
image.tag that's already set (which contains the git hash).

This is cleaner because:
- Image tag is already set to git hash via TAG env var
- Removes redundant DD_VERSION from workflows (6 locations)
- Single source of truth for version (the deployed image tag)
- Simpler configuration

Changes:

Workflows (removed DD_VERSION):
- .github/workflows/stage-web.yaml
- .github/workflows/stage-cloud-api.yaml
- .github/workflows/stage-core.yaml (2 locations)
- .github/workflows/deploy-web.yml
- .github/workflows/deploy-core.yml

Justfile (use {{TAG}} instead of ${DD_VERSION}):
- deploy-web: --set env.DD_VERSION={{TAG}}
- deploy-cloud-api: --set env.DD_VERSION={{TAG}}
- deploy-core: --set secrets.DD_VERSION={{TAG}}

Helm values (updated comments):
- helm/dev/letta-web/values.yaml
- helm/dev/cloud-api/values.yaml
- helm/dev/memgpt-server/values.yaml
- Changed from "set via workflow" to "set from image.tag by justfile"

Flow:
1. Workflow sets TAG=${{ github.sha }}
2. Workflow calls justfile with TAG env var
3. Justfile sets image.tag={{TAG}} and DD_VERSION={{TAG}}
4. Both use same git hash value

Example:
  image.tag: abc123def
  DD_VERSION: abc123def
  Both from TAG env var set to github.sha

* feat: add Datadog native tracer (dd-trace) to cloud-api for APM

Problem: cloud-api traces weren't appearing in Datadog APM despite OTEL
being configured. Investigation revealed letta-web uses dd-trace (Datadog's
native tracer) in addition to OTEL, and those traces show up perfectly.

Analysis:
- letta-web: Uses BOTH OTEL + dd-trace → traces visible in APM ✓
- cloud-api: Uses ONLY OTEL → traces NOT visible in APM ✗

Root cause: While OTEL *should* work, dd-trace provides better integration
with Datadog's APM backend and is proven to work in production.

Solution: Add dd-trace initialization to cloud-api, matching letta-web's
dual-tracing approach (OTEL + dd-trace).

Changes:
- apps/cloud-api/src/instrument-otel.ts:
  - Added dd-trace initialization after OTEL setup
  - Checks for DD_API_KEY env var (already configured in helm)
  - Enables logInjection, runtimeMetrics, and profiling
  - Graceful fallback if dd-trace fails to initialize

Dependencies:
- dd-trace@^5.31.0 already available in root package.json

Configuration (already set in helm):
- DD_API_KEY: From secretsProvider ✓
- DD_ENV: "dev" ✓
- DD_SERVICE: "cloud-api" ✓
- DD_LOGS_INJECTION: From workflow ✓

Expected result:
After deployment, cloud-api traces will appear in Datadog APM alongside
letta-web and letta-server, with proper env:dev service:cloud-api tags.
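
The guarded initialization described above can be sketched roughly as follows. This is an assumed shape, not the actual contents of apps/cloud-api/src/instrument-otel.ts; the helper name `initDatadogTracer` is illustrative, and the option set mirrors the logInjection/runtimeMetrics/profiling flags named in this commit message.

```javascript
// Sketch: start Datadog's native tracer alongside OTEL, but only when
// DD_API_KEY is configured, and fall back gracefully if dd-trace cannot
// be loaded. Helper name is illustrative, not the real code.
function initDatadogTracer(env) {
  if (!env.DD_API_KEY) return null; // no Datadog config -> OTEL only
  try {
    const tracer = require('dd-trace');
    tracer.init({
      logInjection: true,   // correlate logs with traces
      runtimeMetrics: true, // Node.js runtime metrics
      profiling: true,      // CPU/heap profiles in Datadog
    });
    return tracer;
  } catch (err) {
    console.warn('dd-trace init failed, continuing with OTEL only:', err.message);
    return null;
  }
}
```

Keeping the dd-trace require inside the try block means a missing or broken module degrades to OTEL-only tracing instead of crashing the process at startup.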

* tweak vars in staging

* fix: initialize Datadog tracer for memgpt-server APM traces

Problem: memgpt-server (letta-server) shows up in Datadog APM with env:null
instead of env:dev, and traces weren't being properly captured.

Root cause: The code was only initializing the Datadog Profiler (for CPU/memory
profiling), but NOT the Tracer (for distributed tracing/APM).

Analysis:
- Profiler: Records performance metrics (CPU, memory) - WAS initialized ✓
- Tracer: Records distributed traces/spans for APM - NOT initialized ✗

The existing code (lines 248-256) did:
  from ddtrace.profiling import Profiler  # Only profiler!
  profiler = Profiler(...)
  profiler.start()
  # No tracer initialization!

This explains why:
- letta-server appears in Datadog with env:null (profiling data sent without proper tags)
- Traces don't show proper service/env correlation
- APM service map is incomplete

Solution: Initialize the Datadog tracer with ddtrace.patch_all() to:
1. Auto-instrument FastAPI, HTTP clients, database calls, etc.
2. Send proper distributed traces to Datadog APM
3. Use the DD_ENV, DD_SERVICE env vars already set in helm

Changes:
- apps/core/letta/server/rest_api/app.py:
  - Added import ddtrace
  - Added ddtrace.patch_all() to auto-instrument all libraries
  - Added logging for tracer initialization

Configuration (already set in helm):
- DD_ENV: "dev" ✓
- DD_SERVICE: "memgpt-server" ✓
- DD_SITE: "us5.datadoghq.com" ✓
- DD_VERSION: From image.tag ✓
- DD_AGENT_HOST/PORT: Set by code from settings ✓

Expected result:
After redeployment, letta-server will:
- Show as env:dev (not env:null) in Datadog APM
- Send proper distributed traces with full context
- Appear correctly in service maps and trace explorer

* fix: add dd-trace dependency to cloud-api package.json

Problem: cloud-api Docker image doesn't include dd-trace, causing
"Cannot find module 'dd-trace'" error at runtime.

Root cause: dd-trace is in root package.json but not in cloud-api's
package.json, so it's not included in the Docker build.

Solution: Add dd-trace@^5.31.0 to cloud-api dependencies.

Changes:
- apps/cloud-api/package.json: Added dd-trace dependency

* fix: mark dd-trace as external in cloud-api esbuild config

Problem: esbuild fails when trying to bundle dd-trace because it attempts
to bundle optional GraphQL plugin dependencies that aren't installed.

Error:
  Could not resolve "graphql/language/visitor"
  Could not resolve "graphql/language/printer"
  Could not resolve "graphql/utilities"

Root cause: dd-trace has optional plugins for various frameworks (GraphQL,
MongoDB, etc.) that it loads conditionally at runtime. esbuild tries to
statically analyze and bundle all requires, including these optional deps.

Solution: Add dd-trace to the externals list so it's loaded at runtime
instead of being bundled. This is the standard approach for native modules
and packages with optional dependencies.

Changes:
- apps/cloud-api/esbuild.config.js: Added 'dd-trace' to externals array

Result:
- Build succeeds ✓
- dd-trace loads at runtime with only the plugins it needs ✓
- No GraphQL dependency required ✓
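
The change amounts to roughly the following in the build config. This is a sketch: the entry point and output paths are placeholders, and only the `external` entry reflects the actual fix described above.

```javascript
// esbuild.config.js (sketch) — entryPoints/outfile are placeholder values.
// The relevant change is listing 'dd-trace' in `external` so esbuild leaves
// its require() calls alone and Node resolves the package at runtime,
// skipping the optional GraphQL/MongoDB plugin requires that broke the build.
const esbuild = require('esbuild');

esbuild.build({
  entryPoints: ['src/index.ts'], // placeholder
  bundle: true,
  platform: 'node',
  outfile: 'dist/index.js',      // placeholder
  external: ['dd-trace'],        // do not bundle: optional plugins break esbuild
}).catch(() => process.exit(1));
```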

* add dd-trace

* fix: increase cloud-api memory and make dd-trace profiling configurable

Problem: cloud-api pods crash-loop with out-of-memory errors when
dd-trace profiling is enabled:
  FATAL ERROR: JavaScript heap out of memory
  current_heap_limit=268435456 (268MB in 512Mi total)

Root cause: dd-trace profiling is memory-intensive (50-100MB+ overhead)
and the original 512Mi limit was too tight.

Solution: Two-part fix:
1. Increase memory limits: 512Mi → 1Gi (gives profiling room to breathe)
2. Make profiling configurable via DD_PROFILING_ENABLED env var

Changes:

helm/dev/cloud-api/values.yaml:
- resources.limits.memory: 512Mi → 1Gi
- resources.requests.memory: 512Mi → 1Gi
- Added DD_PROFILING_ENABLED: "true"

apps/cloud-api/src/instrument-otel.ts:
- Read DD_PROFILING_ENABLED env var
- Pass to tracer.init({ profiling: profilingEnabled })
- Log profiling status on initialization
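
The env-var handling described above can be sketched as a small helper (hypothetical name and placement; the real logic sits in apps/cloud-api/src/instrument-otel.ts):

```javascript
// Sketch: profiling defaults to enabled and is switched off only by an
// explicit DD_PROFILING_ENABLED="false". Helper name is illustrative.
function isProfilingEnabled(env) {
  return (env.DD_PROFILING_ENABLED || 'true').toLowerCase() !== 'false';
}

// The resulting boolean is then passed through to
// tracer.init({ profiling: isProfilingEnabled(process.env), ... }).
```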

Benefits:
✓ Profiling enabled by default (CPU/heap flame graphs in Datadog)
✓ Can disable via env var if needed (set to "false")
✓ More headroom prevents OOM crashes (1Gi vs 512Mi)
✓ Configurable per environment

Memory breakdown with profiling:
- App baseline: ~300-400MB
- dd-trace profiling: ~50-100MB
- Buffer/headroom: ~500MB
- Total: 1Gi (comfortable margin)
2025-12-15 12:02:34 -08:00
Ari Webb
3e02f12dfd feat: add tool embedding and search [LET-6333] (#6398)
* feat: add tool embedding and search

* fix ci

* add env variable for embedding tools

---------

Co-authored-by: Ari Webb <ari@letta.com>
2025-11-26 14:39:40 -08:00
Kian Jones
94c2921711 chore: walk back some temporary debugging stuff (#6332)
* first pass

* uv lock
2025-11-24 19:10:27 -08:00
Kian Jones
7ccaa2a33a feat: add profiling enablement flag (#6306)
* add flag and settings tweaks

* add to deploy pipeline

* letta error message
2025-11-24 19:10:26 -08:00
Ari Webb
474f8c1f89 fix: in cloud don't give default actor [LET-6184] (#6218)
* fix: in cloud don't give default actor

* set in justfile

---------

Co-authored-by: Ari Webb <ari@letta.com>
2025-11-24 19:09:33 -08:00
Sarah Wooders
5730f69ecf feat: modal tool execution - NO FEATURE FLAGS USES MODAL [LET-4357] (#5120)
* initial commit

* add delay to deploy

* fix tests

* add tests

* passing tests

* cleanup

* and use modal

* working on modal

* gate on tool metadata

* agent state

* cleanup

---------

Co-authored-by: Letta Bot <noreply@letta.com>
2025-11-13 15:36:56 -08:00
Matthew Zhou
8df78e9429 feat: Move file upload to temporal [LET-6089] (#6024)
* Finish writing temporal upload file activity

* Remove prints

* Rewrite content re-use
2025-11-13 15:36:55 -08:00
Sarah Wooders
fd7c8193fe feat: remove chunking for archival memory [LET-6080] (#5997)
* feat: remove chunking for archival memory

* add error and tests
2025-11-13 15:36:55 -08:00
Kian Jones
704d3b2d79 chore: refactor not to use warnings.warn (#5730)
* refactor not to use warnings.warn

* temp circular import fix, maybe unnecessary/bad

* fix Deprecation warning

* fix deprecation warning and mcp thing?

* revert changes to mcp server test

* fix deprecation warning
2025-10-24 15:14:31 -07:00
Kian Jones
705bb9d958 fix: support default empty (#5713)
support default empty
2025-10-24 15:14:20 -07:00
Kian Jones
1577a261d8 feat: add profiling and structured logging (#5690)
* test dd build

* dd agent in cluster

* quick poc

* refactor and add logging

* remove tracing etc.

* add changes to otel logging config

* refactor to accept my feedback

* finishing touches
2025-10-24 15:14:20 -07:00
Sarah Wooders
305bb8c8f7 feat: inject letta_client and agent_id into local sandbox (#5192) 2025-10-24 15:12:11 -07:00
Sarah Wooders
1edb21a652 feat: add letta_agent_v1 flag for docker image and some patches (#5081)
* feat: add letta_agent_v1 flag for docker image and some patches

* update build
2025-10-09 15:25:21 -07:00
Kian Jones
ade992d5ca refactor: remove letta prefix from sonnet (#5216)
* Refactor: Rename ANTHROPIC_SONNET_1M setting alias

Co-authored-by: kian <kian@letta.com>

* Refactor: Rename ANTHROPIC_SONNET_1M env var

Co-authored-by: kian <kian@letta.com>

---------

Co-authored-by: Cursor Agent <cursoragent@cursor.com>
2025-10-07 17:50:50 -07:00
cthomas
391ecdba6d feat: clean up settings file (#5210) 2025-10-07 17:50:50 -07:00
Sarah Wooders
e07a589796 chore: rm composio (#5151) 2025-10-07 17:50:49 -07:00
Charles Packer
07a687880f feat(core): add sonnet 1m support [LET-4620] (#5152)
feat(core): add sonnet 1m support
2025-10-07 17:50:49 -07:00
Sarah Wooders
300c32456e feat: add composite message index and reduce pool timeout (#5156) 2025-10-07 17:50:49 -07:00
Charles Packer
811b3e6cb6 feat: allow customizing the handle base for openrouter and for vllm [LET-4609] (#5114)
* feat: allow setting VLLM_HANDLE_BASE

* feat: same thing for openrouter
2025-10-07 17:50:49 -07:00
Charles Packer
9edc7f4d64 feat: add OpenRouterProvider (#4848)
* feat: initial add of OpenRouter provider; doesn't work yet because the API key and headers aren't passed through

* fix: working
2025-10-07 17:50:45 -07:00
Sarah Wooders
3b9d59d618 chore: revert db_max_concurrent_sessions (#4878) 2025-10-07 17:50:45 -07:00
Sarah Wooders
4df0a27eb0 chore: remove sync db (#4873) 2025-10-07 17:50:45 -07:00
Matthew Zhou
0ced006375 chore: set db_max_concurrent_sessions default to 48 (#4876)
set default to 48
2025-10-07 17:50:45 -07:00
Matthew Zhou
b27dec5ca5 chore: Bump pg pool timeout (#4870)
Bump pg pool timeout
2025-10-07 17:50:45 -07:00
Matthew Zhou
8395ec429a feat: Add flag for TLS (#4865)
Add flag for TLS
2025-10-07 17:50:45 -07:00
cthomas
ef9bac78ec chore: cleanup experimental flag (#4818) 2025-10-07 17:50:44 -07:00
cthomas
ee39b2bff2 feat: ensure temporal is hit through fastapi (#4784)
feat: ensure temporal works end-to-end
2025-10-07 17:50:44 -07:00
cthomas
992f94da4b feat: integrate temporal into letta (#4766)
* feat: integrate temporal into letta

* use fire and forget, set up cancellation and job status checking
2025-10-07 17:50:43 -07:00
Kian Jones
b8e9a80d93 merge this (#4759)
* wait I forgot to commit locally

* cp the entire core directory and then rm the .git subdir
2025-09-17 15:47:40 -07:00