Initial commit: Community ADE foundation

- Project structure: docs/, src/, tests/, proto/
- Research synthesis: Letta vs commercial ADEs
- Architecture: Redis Streams queue design
- Phase 1 orchestration design
- Execution plan and project state tracking
- Working subagent system (manager.ts fixes)

This is the foundation for a Community ADE built on Letta's
stateful agent architecture with git-native MemFS.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta Code <noreply@letta.com>
Ani (Annie Tunturi)
2026-03-18 10:30:20 -04:00
commit 00382055c6
9 changed files with 2970 additions and 0 deletions


# Redis Task Queue Architecture for Letta Community ADE
## Executive Summary
This document outlines the architecture for replacing the in-memory `QueueRuntime` with a Redis-backed persistent task queue. The design prioritizes durability, horizontal scalability, and reliable task execution while maintaining compatibility with the existing Task tool and subagent spawning workflows.
**Key Decisions:**
- Use **Redis Streams** (not Sorted Sets) for the primary task queue to leverage consumer groups and at-least-once delivery guarantees
- Hybrid approach: Streams for queue semantics, Sorted Sets for scheduling/delays, Hashes for task state
- Stateless workers with heartbeat-based liveness detection
- Exponential backoff with jitter for retry logic
---
## 1. Redis Data Structures
### 1.1 Primary Queue: Redis Stream
```
Key: ade:queue:tasks
Type: Stream
Purpose: Main task ingestion and distribution
```
**Why Streams over Sorted Sets?**
| Feature | Sorted Sets | Redis Streams |
|---------|-------------|---------------|
| Ordering | Score-based (can have ties) | Strict temporal (millisecond ID) |
| Consumer Groups | Manual implementation | Built-in XREADGROUP |
| Delivery Semantics | At-most-once (easy) / At-least-once (complex) | At-least-once with ACK |
| Pending Tracking | Manual | Built-in XPENDING |
| Claim/Retry | Custom Lua scripts | Built-in XCLAIM/XAUTOCLAIM |
| Message Visibility | Immediate to all | Consumer-group isolated |
Streams provide the exact semantics needed for reliable task processing without custom Lua scripting.
**Stream Entries:**
```
XADD ade:queue:tasks * taskId <uuid> payload <json> priority <int>
```
### 1.2 Delayed Tasks: Sorted Set
```
Key: ade:queue:delayed
Type: Sorted Set (ZSET)
Score: scheduled execution timestamp (ms)
Member: taskId
```
Used for:
- Tasks with explicit `runAfter` timestamps
- Retry scheduling with exponential backoff
- Rate-limited task release
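As a sketch (timestamp and task ID are illustrative), the full lifecycle of a delayed task is three plain ZSET operations:
```
ZADD ade:queue:delayed 1760000030000 task-123
ZRANGEBYSCORE ade:queue:delayed -inf 1760000031000 LIMIT 0 100
ZREM ade:queue:delayed task-123
```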
### 1.3 Task State Storage: Redis Hash
```
Key: ade:task:{taskId}
Type: Hash
Fields:
- id: string (UUID v4)
- status: pending|running|completed|failed|cancelled
- payload: JSON (task arguments)
- createdAt: timestamp (ms)
- startedAt: timestamp (ms)
- completedAt: timestamp (ms)
- workerId: string (nullable)
- attemptCount: integer
- maxAttempts: integer (default: 3)
- error: string (last error message)
- result: JSON (completed task result)
- parentTaskId: string (nullable, for task chains)
- subagentId: string (link to subagent state)
- priority: integer (0-9, default 5)
- kind: message|task_notification|approval_result|overlay_action
TTL: 7 days (configurable cleanup for completed/failed tasks)
```
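Because Redis hashes store only flat string fields, nested values such as `payload` must be serialized on write. A minimal sketch, assuming a hypothetical `TaskRecord` subset of the fields above (not the real codebase types):

```typescript
// Flatten a task into the string-only fields HSET requires.
interface TaskRecord {
  id: string;
  status: string;
  payload: unknown;
  createdAt: number;
  attemptCount: number;
  maxAttempts: number;
  priority: number;
}

function toHashFields(task: TaskRecord): Record<string, string> {
  return {
    id: task.id,
    status: task.status,
    payload: JSON.stringify(task.payload), // nested data is JSON-encoded
    createdAt: String(task.createdAt),
    attemptCount: String(task.attemptCount),
    maxAttempts: String(task.maxAttempts),
    priority: String(task.priority),
  };
}
```

On read, the inverse applies: numeric fields come back as strings and must be parsed (as `handleTaskFailure` does below with `parseInt`).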
### 1.4 Worker Registry: Redis Hash + Sorted Set
```
Key: ade:workers:active
Type: Hash
Fields per worker:
- {workerId}: JSON { hostname, pid, startedAt, lastHeartbeat, version }
Key: ade:workers:heartbeat
Type: Sorted Set
Score: last heartbeat timestamp
Member: workerId
```
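Liveness then reduces to a pure comparison against the heartbeat scores. A sketch of the check (function name is illustrative; the 30 s timeout matches section 3):

```typescript
// Given (workerId, lastBeat) pairs from the heartbeat ZSET, return the
// workers whose last heartbeat is older than the timeout.
function deadWorkers(
  heartbeats: Array<{ workerId: string; lastBeat: number }>,
  now: number,
  timeoutMs = 30_000,
): string[] {
  return heartbeats
    .filter((h) => now - h.lastBeat > timeoutMs)
    .map((h) => h.workerId);
}
```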
### 1.5 Consumer Group State
```
Stream Consumer Group: ade:queue:tasks
Group Name: ade-workers
Consumer Name: {workerId} (unique per process)
```
Redis Streams automatically track:
- Pending messages per consumer (XPENDING)
- Delivery count per message
- Idle time since last read
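For example, the group's backlog can be inspected directly from redis-cli: the first form returns a summary (count, min/max entry ID, per-consumer totals), the second lists up to 10 entries with consumer, idle time, and delivery count:
```
XPENDING ade:queue:tasks ade-workers
XPENDING ade:queue:tasks ade-workers - + 10
```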
---
## 2. Task Entity Schema
### 2.1 TypeScript Interface
```typescript
// src/queue/redis/types.ts
export type TaskStatus =
  | "pending"    // Enqueued, not yet claimed
  | "running"    // Claimed by worker, processing
  | "completed"  // Successfully finished
  | "failed"     // Exhausted all retries
  | "cancelled"; // Explicitly cancelled

export type TaskKind =
  | "message"
  | "task_notification"
  | "approval_result"
  | "overlay_action";

export interface TaskPayload {
  // Task identification
  id: string; // UUID v4
  kind: TaskKind;

  // Execution context
  agentId?: string;
  conversationId?: string;
  clientMessageId?: string;

  // Content (varies by kind)
  content?: unknown; // For "message" kind
  text?: string;     // For notification/approval/overlay

  // Subagent execution params (for task_notification)
  subagentType?: string;
  prompt?: string;
  model?: string;
  existingAgentId?: string;
  existingConversationId?: string;
  maxTurns?: number;
  description?: string; // Human-readable label (notifications/UI)
  subagentId?: string;  // Link to subagent state store
  outputFile?: string;  // Background transcript path
  silent?: boolean;     // Suppress completion notification

  // Scheduling
  priority: number;  // 0-9, lower = higher priority
  runAfter?: number; // Timestamp ms (for delayed tasks)

  // Retry configuration
  maxAttempts: number;
  backoffMultiplier: number; // Default: 2
  maxBackoffMs: number;      // Default: 300000 (5 min)

  // Metadata
  enqueuedAt: number;
  source: "user" | "system" | "hook";
}

export interface TaskState extends TaskPayload {
  status: TaskStatus;
  workerId?: string;
  attemptCount: number;
  startedAt?: number;
  completedAt?: number;
  error?: string;
  result?: unknown;

  // Coalescing support (from QueueRuntime)
  isCoalescable: boolean;
  scopeKey?: string; // For grouping coalescable items
}
```
### 2.2 State Transitions
```
          ┌─────────────┐
          │   PENDING   │◄──────────────────┐
          │  (queued)   │                   │
          └──────┬──────┘                   │
                 │ claim                    │ retry
                 ▼                          │ (with delay)
          ┌─────────────┐                   │
 ┌────────│   RUNNING   │───────────────────┘
 │        │  (claimed)  │  fail (retryable)
 │        └──────┬──────┘
 │ complete      │ fail (final)
 │               ▼
 │        ┌─────────────┐
 │        │   FAILED    │
 │        │ (exhausted) │
 │        └─────────────┘
 ▼
┌─────────────┐
│  COMPLETED  │
└─────────────┘
```
(CANCELLED, reachable from PENDING or RUNNING via explicit cancellation, is omitted for clarity.)
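These transitions can be enforced with a small transition table; a sketch (local names are illustrative, not the codebase's):

```typescript
type Status = "pending" | "running" | "completed" | "failed" | "cancelled";

// Legal next-states per the diagram; terminal states allow none.
const ALLOWED: Record<Status, Status[]> = {
  pending:   ["running", "cancelled"],                      // claim, or cancel
  running:   ["completed", "failed", "pending", "cancelled"], // done, final fail, retry, cancel
  completed: [],
  failed:    [],
  cancelled: [],
};

function canTransition(from: Status, to: Status): boolean {
  return ALLOWED[from].includes(to);
}
```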
---
## 3. Worker Pool Registration and Heartbeat
### 3.1 Worker Lifecycle
```typescript
// src/queue/redis/worker.ts
class TaskWorker {
  private workerId!: string;
  private redis: RedisClient;
  private isRunning: boolean = false;
  private heartbeatInterval?: NodeJS.Timeout;
  private claimInterval?: NodeJS.Timeout;

  // Config
  private readonly HEARTBEAT_INTERVAL_MS = 5000;
  private readonly HEARTBEAT_TIMEOUT_MS = 30000;
  private readonly CLAIM_BATCH_SIZE = 10;
  private readonly PROCESSING_TIMEOUT_MS = 300000; // 5 min

  async start(): Promise<void> {
    this.workerId = generateWorkerId(); // {hostname}:{pid}:{uuid}

    // Register in worker registry
    await this.redis.hSet("ade:workers:active", this.workerId, JSON.stringify({
      hostname: os.hostname(),
      pid: process.pid,
      startedAt: Date.now(),
      lastHeartbeat: Date.now(),
      version: process.env.npm_package_version || "unknown"
    }));

    // Create consumer group on the stream (idempotent)
    try {
      await this.redis.xGroupCreate("ade:queue:tasks", "ade-workers", "$", {
        MKSTREAM: true
      });
    } catch (err) {
      // BUSYGROUP means the group already exists; rethrow anything else
      if (!String(err).includes("BUSYGROUP")) throw err;
    }

    this.isRunning = true;
    this.startHeartbeat();
    this.startClaimLoop();
  }

  async stop(): Promise<void> {
    this.isRunning = false;
    clearInterval(this.heartbeatInterval);
    clearInterval(this.claimInterval);

    // Release pending tasks back to queue
    await this.releasePendingTasks();

    // Deregister
    await this.redis.hDel("ade:workers:active", this.workerId);
    await this.redis.zRem("ade:workers:heartbeat", this.workerId);
  }

  private startHeartbeat(): void {
    this.heartbeatInterval = setInterval(async () => {
      await this.redis.zAdd("ade:workers:heartbeat", {
        score: Date.now(),
        value: this.workerId
      });
      // Re-read the registration so only lastHeartbeat is bumped
      const raw = await this.redis.hGet("ade:workers:active", this.workerId);
      const currentInfo = raw ? JSON.parse(raw) : {};
      await this.redis.hSet("ade:workers:active", this.workerId, JSON.stringify({
        ...currentInfo,
        lastHeartbeat: Date.now()
      }));
    }, this.HEARTBEAT_INTERVAL_MS);
  }
}
```
### 3.2 Dead Worker Detection
```typescript
// src/queue/redis/orchestrator.ts (singleton, per-deployment)
class QueueOrchestrator {
  async detectAndReclaimDeadWorkerTasks(): Promise<number> {
    const now = Date.now();
    const cutoff = now - this.HEARTBEAT_TIMEOUT_MS;

    // Find dead workers
    const deadWorkers = await this.redis.zRangeByScore(
      "ade:workers:heartbeat",
      "-inf",
      cutoff
    );

    let reclaimedCount = 0;
    for (const workerId of deadWorkers) {
      // Find pending tasks for this worker using XPENDING
      const pending = await this.redis.xPendingRange(
        "ade:queue:tasks",
        "ade-workers",
        "-",
        "+",
        this.CLAIM_BATCH_SIZE
      );

      for (const item of pending) {
        if (item.consumer === workerId && item.idle > this.PROCESSING_TIMEOUT_MS) {
          // Use XAUTOCLAIM to atomically take ownership of the stuck entry
          await this.redis.xAutoClaim(
            "ade:queue:tasks",
            "ade-workers",
            "orchestrator", // consumer name for cleanup
            this.PROCESSING_TIMEOUT_MS,
            item.id,
            { COUNT: 1 }
          );

          // ACK to remove the entry from the pending list, then
          // re-add to the delayed queue for retry
          await this.redis.xAck("ade:queue:tasks", "ade-workers", item.id);
          await this.scheduleRetry(item.id);
          reclaimedCount++;
        }
      }

      // Clean up dead worker registration
      await this.redis.hDel("ade:workers:active", workerId);
      await this.redis.zRem("ade:workers:heartbeat", workerId);
    }
    return reclaimedCount;
  }
}
```
---
## 4. Retry Logic with Exponential Backoff
### 4.1 Backoff Calculation
```typescript
// src/queue/redis/retry.ts
interface RetryConfig {
  attempt: number;      // 0-indexed (0 = first retry)
  baseDelayMs: number;  // Default: 1000
  multiplier: number;   // Default: 2
  maxDelayMs: number;   // Default: 300000 (5 min)
  jitterFactor: number; // Default: 0.1 (10% randomization)
}

function calculateRetryDelay(config: RetryConfig): number {
  // Exponential backoff: base * (multiplier ^ attempt)
  const exponentialDelay = config.baseDelayMs *
    Math.pow(config.multiplier, config.attempt);

  // Cap at max
  const cappedDelay = Math.min(exponentialDelay, config.maxDelayMs);

  // Add jitter to prevent thundering herd: ±jitterFactor
  const jitter = cappedDelay * config.jitterFactor * (Math.random() * 2 - 1);
  return Math.floor(cappedDelay + jitter);
}

// Examples with defaults:
// Attempt 0 (first retry): ~1000ms ±100ms
// Attempt 1: ~2000ms ±200ms
// Attempt 2: ~4000ms ±400ms
// Attempt 3: ~8000ms ±800ms
// Attempt 4: ~16000ms ±1600ms
// ...up to max 300000ms (5 min)
```
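Because jitter is ±10% of the capped delay, the delay for attempt `n` always lands in a predictable window, which makes the backoff easy to assert in tests. A hypothetical helper (not part of the design) for computing that window:

```typescript
// Bounds for calculateRetryDelay with the default config: the result lies in
// [0.9 * capped, 1.1 * capped], where capped = min(1000 * 2^n, 300000).
// Integer arithmetic avoids float rounding for these inputs.
function expectedBounds(attempt: number): [number, number] {
  const capped = Math.min(1000 * Math.pow(2, attempt), 300_000);
  return [(capped * 9) / 10, (capped * 11) / 10];
}
```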
### 4.2 Retry Flow
```typescript
async function handleTaskFailure(
  taskId: string,
  streamEntryId: string, // stream entry ID (distinct from taskId; required for XACK)
  error: Error,
  workerId: string
): Promise<void> {
  const taskKey = `ade:task:${taskId}`;
  const task = await redis.hGetAll(taskKey);
  const attemptCount = parseInt(task.attemptCount) + 1;
  const maxAttempts = parseInt(task.maxAttempts);

  if (attemptCount >= maxAttempts) {
    // Final failure - mark as failed
    await redis.hSet(taskKey, {
      status: "failed",
      error: error.message,
      completedAt: Date.now(),
      attemptCount: attemptCount.toString()
    });

    // Publish failure event for observers
    await redis.publish("ade:events:task-failed", JSON.stringify({
      taskId,
      error: error.message,
      totalAttempts: attemptCount
    }));

    // ACK the stream entry to remove it from the pending list
    await redis.xAck("ade:queue:tasks", "ade-workers", streamEntryId);
  } else {
    // Schedule retry
    const delay = calculateRetryDelay({
      attempt: attemptCount,
      baseDelayMs: 1000,
      multiplier: 2,
      maxDelayMs: 300000,
      jitterFactor: 0.1
    });
    const runAfter = Date.now() + delay;

    // Update task state
    await redis.hSet(taskKey, {
      status: "pending",
      attemptCount: attemptCount.toString(),
      error: error.message,
      workerId: "" // Clear worker assignment
    });

    // Add to delayed queue
    await redis.zAdd("ade:queue:delayed", {
      score: runAfter,
      value: taskId
    });

    // ACK the stream entry to remove it from stream pending
    await redis.xAck("ade:queue:tasks", "ade-workers", streamEntryId);
  }
}
```
### 4.3 Delayed Task Promoter
```typescript
// Runs periodically (every 1 second) to move due tasks from delayed set to stream
async function promoteDelayedTasks(): Promise<number> {
  const now = Date.now();

  // Fetch due tasks, then remove them. Note: ZRANGEBYSCORE + ZREM is not
  // atomic; run a single promoter, or wrap the pair in a Lua script or
  // MULTI/EXEC if multiple promoter instances are deployed.
  const dueTasks = await redis.zRangeByScore(
    "ade:queue:delayed",
    "-inf",
    now,
    { LIMIT: { offset: 0, count: 100 } }
  );
  if (dueTasks.length === 0) return 0;

  // Remove from delayed queue
  await redis.zRem("ade:queue:delayed", dueTasks);

  // Re-add to stream for processing
  for (const taskId of dueTasks) {
    const task = await redis.hGetAll(`ade:task:${taskId}`);
    await redis.xAdd("ade:queue:tasks", "*", {
      taskId,
      payload: task.payload,
      priority: task.priority
    });
  }
  return dueTasks.length;
}
```
---
## 5. Integration with Existing Task.ts
### 5.1 Adapter Pattern
```typescript
// src/queue/redis/adapter.ts
import {
  QueueRuntime,
  QueueRuntimeOptions,
  QueueItem,
  QueueBlockedReason,
  DequeuedBatch
} from "../queueRuntime";
import { RedisQueue } from "./queue";

/**
 * Redis-backed implementation of the QueueRuntime interface.
 * Allows drop-in replacement of the in-memory queue.
 */
export class RedisQueueAdapter implements QueueRuntime {
  private redisQueue: RedisQueue;
  private localBatchBuffer: Map<string, QueueItem> = new Map();

  constructor(redisUrl: string, options?: QueueRuntimeOptions) {
    this.redisQueue = new RedisQueue(redisUrl, {
      ...options,
      onTaskCompleted: this.handleTaskCompleted.bind(this),
      onTaskFailed: this.handleTaskFailed.bind(this)
    });
  }

  async enqueue(input: Omit<QueueItem, "id" | "enqueuedAt">): Promise<QueueItem | null> {
    // Map QueueItem to TaskPayload
    const taskId = generateUUID();
    const enqueuedAt = Date.now();

    const payload: TaskPayload = {
      id: taskId,
      kind: input.kind,
      agentId: input.agentId,
      conversationId: input.conversationId,
      clientMessageId: input.clientMessageId,
      text: (input as any).text,
      content: (input as any).content,
      priority: 5, // Default priority
      maxAttempts: 3,
      backoffMultiplier: 2,
      maxBackoffMs: 300000,
      enqueuedAt,
      source: "user",
      isCoalescable: isCoalescable(input.kind)
    };

    const success = await this.redisQueue.enqueue(payload);
    if (!success) return null;

    return {
      ...input,
      id: taskId,
      enqueuedAt
    } as QueueItem;
  }

  async tryDequeue(blockedReason: QueueBlockedReason | null): Promise<DequeuedBatch | null> {
    if (blockedReason !== null) {
      // Emit blocked event if needed (preserving QueueRuntime behavior)
      return null;
    }

    // Claim batch from Redis
    const batch = await this.redisQueue.claimBatch({
      consumerId: this.workerId,
      batchSize: this.getCoalescingBatchSize(),
      coalescingWindowMs: 50 // Small window for coalescing
    });
    if (!batch || batch.length === 0) return null;

    // Map back to QueueItem format
    const items: QueueItem[] = batch.map(task => this.mapTaskToQueueItem(task));
    return {
      batchId: generateBatchId(),
      items,
      mergedCount: items.length,
      queueLenAfter: await this.redisQueue.getQueueLength()
    };
  }

  // ... other QueueRuntime methods
}
```
### 5.2 Task.ts Integration Points
**Current Flow (Task.ts line 403+):**
```typescript
// Background task spawning
const { taskId, outputFile, subagentId } = spawnBackgroundSubagentTask({
  subagentType: subagent_type,
  prompt,
  description,
  model,
  toolCallId,
  existingAgentId: args.agent_id,
  existingConversationId: args.conversation_id,
  maxTurns: args.max_turns,
});
```
**Proposed Redis Integration:**
```typescript
// New: Redis-backed task queue integration
interface TaskQueueEnqueueOptions {
  subagentType: string;
  prompt: string;
  description: string;
  model?: string;
  toolCallId?: string;
  existingAgentId?: string;
  existingConversationId?: string;
  maxTurns?: number;
  priority?: number;
  runInBackground?: boolean;
}

// In Task.ts - replace spawnBackgroundSubagentTask with:
export async function enqueueSubagentTask(
  args: TaskQueueEnqueueOptions,
  queue: RedisQueue
): Promise<TaskEnqueueResult> {
  const taskId = generateTaskId();
  const subagentId = generateSubagentId();

  // Register in subagent state store (for UI)
  registerSubagent(subagentId, args.subagentType, args.description, args.toolCallId, true);
  const outputFile = createBackgroundOutputFile(taskId);

  // Create task payload
  const payload: TaskPayload = {
    id: taskId,
    kind: "task_notification",
    subagentType: args.subagentType,
    prompt: args.prompt,
    description: args.description,
    model: args.model,
    existingAgentId: args.existingAgentId,
    existingConversationId: args.existingConversationId,
    maxTurns: args.maxTurns,
    subagentId,
    outputFile,
    priority: args.priority ?? 5,
    maxAttempts: 3,
    backoffMultiplier: 2,
    maxBackoffMs: 300000,
    enqueuedAt: Date.now(),
    source: "user",
    isCoalescable: false // Task notifications are not coalescable
  };

  // Enqueue to Redis
  await queue.enqueue(payload);
  return { taskId, outputFile, subagentId };
}
```
### 5.3 Worker Implementation for Subagents
```typescript
// src/queue/redis/subagent-worker.ts
class SubagentTaskWorker extends TaskWorker {
  protected async processTask(task: TaskState): Promise<void> {
    // Update subagent state to "running"
    updateSubagent(task.subagentId!, { status: "running" });

    try {
      // Execute subagent (existing manager.ts logic)
      const result = await spawnSubagent(
        task.subagentType!,
        task.prompt!,
        task.model,
        task.subagentId!,
        undefined, // signal - handled via task cancellation
        task.existingAgentId,
        task.existingConversationId,
        task.maxTurns
      );

      // Write transcript
      writeTaskTranscriptResult(task.outputFile!, result, "");

      // Complete subagent state
      completeSubagent(task.subagentId!, {
        success: result.success,
        error: result.error,
        totalTokens: result.totalTokens
      });

      // Send notification if not silent
      if (!task.silent) {
        const notification = formatTaskNotification({
          taskId: task.id,
          status: result.success ? "completed" : "failed",
          summary: `Agent "${task.description}" ${result.success ? "completed" : "failed"}`,
          result: result.success ? result.report : result.error,
          outputFile: task.outputFile!
        });

        // Add to message queue for parent agent
        addToMessageQueue({
          kind: "task_notification",
          text: notification
        });
      }

      // Mark task completed
      await this.completeTask(task.id, result);
    } catch (error) {
      const errorMessage = error instanceof Error ? error.message : String(error);

      // Update subagent state
      completeSubagent(task.subagentId!, { success: false, error: errorMessage });

      // Fail task (triggers retry logic)
      await this.failTask(task.id, new Error(errorMessage));
    }
  }
}
```
---
## 6. Operational Considerations
### 6.1 Redis Configuration
```yaml
# Recommended Redis config for task queue
maxmemory: 1gb
maxmemory-policy: noeviction  # Never silently evict queue data; rely on TTLs and XTRIM instead
# Persistence (for durability)
appendonly: yes
appendfsync: everysec
# Stream trimming (prevent unbounded growth)
# Set via XTRIM or MAXLEN on XADD
```
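Approximate trimming can also be applied at write time; the `~` lets Redis trim lazily in whole macro-nodes, which is much cheaper than exact trimming:
```
XADD ade:queue:tasks MAXLEN ~ 100000 * taskId <uuid> payload <json> priority <int>
```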
### 6.2 Key Patterns and Cleanup
| Key Pattern | Type | TTL | Cleanup Strategy |
|-------------|------|-----|------------------|
| `ade:queue:tasks` | Stream | - | XTRIM by MAXLEN (keep 100k) |
| `ade:queue:delayed` | ZSET | - | Processed by promoter |
| `ade:task:{id}` | Hash | 7 days | Expire completed/failed |
| `ade:workers:active` | Hash | - | On worker deregistration |
| `ade:workers:heartbeat` | ZSET | - | On worker timeout |
### 6.3 Monitoring Metrics
```typescript
// Metrics to expose via Prometheus/StatsD
interface QueueMetrics {
  // Queue depth
  "ade_queue_pending_total": number;   // XPENDING count
  "ade_queue_delayed_total": number;   // ZCARD ade:queue:delayed
  "ade_queue_stream_length": number;   // XLEN ade:queue:tasks

  // Throughput
  "ade_tasks_enqueued_rate": number;   // XADD rate
  "ade_tasks_completed_rate": number;  // Completion rate
  "ade_tasks_failed_rate": number;     // Failure rate

  // Worker health
  "ade_workers_active_total": number;  // HLEN ade:workers:active
  "ade_workers_dead_total": number;    // Detected dead workers

  // Processing
  "ade_task_duration_ms": Histogram;   // Time from claim to complete
  "ade_task_wait_ms": Histogram;       // Time from enqueue to claim
  "ade_task_attempts": Histogram;      // Distribution of retry counts
}
```
### 6.4 Failure Modes
| Scenario | Handling |
|----------|----------|
| Redis unavailable | Tasks fail immediately; caller responsible for retry |
| Worker crash | Tasks reclaimed via heartbeat timeout (30s) |
| Poison message | Max retries (3), then moved to a dead-letter queue (DLQ) |
| Slow task | Processing timeout (5 min) triggers requeue |
| Duplicate task | UUID task IDs allow deduplication; handlers should still be idempotent under at-least-once delivery |
---
## 7. Migration Strategy
### Phase 1: Dual-Write (Week 1)
- Implement RedisQueueAdapter
- Write to both in-memory and Redis queues
- Read from in-memory only (Redis for validation)
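A sketch of the Phase 1 wiring, under a simplified `SimpleQueue` interface (illustrative, not the real QueueRuntime): writes fan out to both queues, reads stay on the primary.

```typescript
interface SimpleQueue<T> {
  enqueue(item: T): Promise<void>;
  dequeue(): Promise<T | undefined>;
}

// Dual-write wrapper: the shadow (Redis) queue receives every write for
// validation, but must never break the main path if it fails.
class DualWriteQueue<T> implements SimpleQueue<T> {
  constructor(
    private primary: SimpleQueue<T>, // in-memory QueueRuntime
    private shadow: SimpleQueue<T>,  // Redis-backed queue, validation only
  ) {}

  async enqueue(item: T): Promise<void> {
    await this.primary.enqueue(item);
    try {
      await this.shadow.enqueue(item);
    } catch (err) {
      console.warn("shadow enqueue failed", err); // log, don't propagate
    }
  }

  async dequeue(): Promise<T | undefined> {
    return this.primary.dequeue(); // reads stay on the primary in Phase 1
  }
}
```

Phase 2 then adds a comparing read path on top of the same shape.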
### Phase 2: Shadow Mode (Week 2)
- Read from both queues
- Compare results, log discrepancies
- Fix any edge cases
### Phase 3: Cutover (Week 3)
- Switch reads to Redis
- Keep in-memory as fallback
- Monitor for 1 week
### Phase 4: Cleanup (Week 4)
- Remove in-memory queue code
- Full Redis dependency
---
## 8. Implementation Checklist
- [ ] Redis client configuration (ioredis or node-redis)
- [ ] Task entity schema and serialization
- [ ] Stream consumer group setup
- [ ] Worker registration and heartbeat
- [ ] Task claim and processing loop
- [ ] Retry logic with exponential backoff
- [ ] Delayed task promotion
- [ ] Dead worker detection and reclamation
- [ ] QueueRuntime adapter implementation
- [ ] Task.ts integration
- [ ] Subagent state synchronization
- [ ] Metrics and monitoring
- [ ] Error handling and DLQ
- [ ] Tests (unit, integration, load)
- [ ] Documentation
---
## 9. Appendix: Redis Commands Reference
| Operation | Command | Complexity |
|-----------|---------|------------|
| Enqueue task | `XADD` | O(1) |
| Claim tasks | `XREADGROUP` | O(N) N=count |
| Ack completion | `XACK` | O(1) |
| Get pending | `XPENDING` | O(1) |
| Claim pending | `XCLAIM` / `XAUTOCLAIM` | O(log N) |
| Delay task | `ZADD` delayed | O(log N) |
| Promote delayed | `ZRANGEBYSCORE` + `ZREM` + `XADD` | O(log N + M) |
| Register worker | `HSET` + `ZADD` | O(1) |
| Heartbeat | `ZADD` | O(log N) |
| Detect dead | `ZRANGEBYSCORE` | O(log N + M) |