Phase 1: Orchestration Layer Design
Date: March 18, 2026
Architect: Researcher subagent
Goal: Design a persistent task queue system for the Community ADE
1. Core Data Model
```python
from __future__ import annotations  # TaskStatus is defined below

from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import List, Optional
from uuid import UUID

@dataclass
class Task:
    id: UUID                               # Unique task identifier
    subagent_type: str                     # "researcher", "coder", etc.
    prompt: str                            # User prompt to subagent
    status: TaskStatus                     # pending/running/completed/failed/cancelled
    user_id: str                           # Task owner
    system_prompt: Optional[str] = None    # Override default system prompt
    model: Optional[str] = None            # Override default model

    # State tracking
    priority: int = 100                    # Lower = higher priority
    created_at: datetime = field(default_factory=datetime.utcnow)
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None

    # Execution tracking
    agent_id: Optional[UUID] = None        # Assigned worker agent
    retry_count: int = 0
    max_retries: int = 3

    # Results
    result: Optional[dict] = None          # Success result
    error: Optional[str] = None            # Failure message
    exit_code: Optional[int] = None        # Subprocess exit code

    # Metadata
    tags: List[str] = field(default_factory=list)  # For filtering/grouping
    parent_task: Optional[UUID] = None     # For task chains
```
TaskStatus Enum
```python
class TaskStatus(Enum):
    PENDING = "pending"        # Waiting for worker
    RUNNING = "running"        # Assigned to worker
    COMPLETED = "completed"    # Success
    FAILED = "failed"          # Permanent failure (max retries)
    CANCELLED = "cancelled"    # User cancelled
    STALLED = "stalled"        # Worker crashed, needs recovery
```
2. State Machine
```
                    +-----------+
                    |  PENDING  |
                    +-----+-----+
                          | dequeue()
                          v
 +--------+   fail()  +-----------+  success  +-----------+
 | FAILED |<----------+  RUNNING  +---------->| COMPLETED |
 +--------+  (max     +-----+-----+           +-----------+
      |      retries)    |     |
      | retry()          |     | stall detected
      v                  |     v
 (re-enters         cancel()  +-----------+   reassigned
  RUNNING)               |    |  STALLED  +--> (re-enters RUNNING)
                         v    +-----------+
                  +-----------+
                  | CANCELLED |   (reachable from any state)
                  +-----------+
```
Transitions
- PENDING → RUNNING: Worker dequeues task
- RUNNING → COMPLETED: Subagent succeeds
- RUNNING → FAILED: Subagent fails, max retries reached
- RUNNING → STALLED: Worker heartbeat timeout
- STALLED → RUNNING: Reassigned to new worker
- FAILED → RUNNING: Manual retry triggered
- Any → CANCELLED: User cancellation
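The transition list above can be encoded as data so a queue implementation rejects illegal moves early. A minimal sketch; `ALLOWED` and `can_transition` are illustrative names, not part of the design, and `TaskStatus` mirrors the enum from section 1:

```python
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    STALLED = "stalled"

# Legal moves from the transition list; CANCELLED is handled separately
# because it is reachable from any state.
ALLOWED = {
    TaskStatus.PENDING:   {TaskStatus.RUNNING},
    TaskStatus.RUNNING:   {TaskStatus.COMPLETED, TaskStatus.FAILED, TaskStatus.STALLED},
    TaskStatus.STALLED:   {TaskStatus.RUNNING},
    TaskStatus.FAILED:    {TaskStatus.RUNNING},   # manual retry()
    TaskStatus.COMPLETED: set(),
    TaskStatus.CANCELLED: set(),
}

def can_transition(src: TaskStatus, dst: TaskStatus) -> bool:
    """True if src -> dst is a legal move in the state machine."""
    if dst is TaskStatus.CANCELLED:
        return True  # Any -> CANCELLED (user cancellation)
    return dst in ALLOWED[src]
```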
3. Redis Data Structures
| Purpose | Structure | Key Pattern |
|---|---|---|
| Task payload | Hash | task:{task_id} |
| Pending queue | Sorted Set (by priority) | queue:pending |
| Running set | Set | queue:running |
| Worker registry | Hash | worker:{agent_id} |
| Status index | Set per status | status:{status} |
| User tasks | Set | user:{user_id}:tasks |
Example Redis Operations
```
# Enqueue (pending)
ZADD queue:pending {priority} {task_id}
HSET task:{task_id} status pending created_at {timestamp} ...
SADD status:pending {task_id}

# Dequeue (atomic, optimistic locking)
WATCH queue:pending
task_id = ZRANGE queue:pending 0 0          # peek at highest-priority task
MULTI
ZREM queue:pending {task_id}
ZADD queue:running {now} {task_id}
HSET task:{task_id} status running agent_id {worker} started_at {now}
SMOVE status:pending status:running {task_id}
EXEC                                        # aborts if queue:pending changed

# Complete
ZREM queue:running {task_id}
SMOVE status:running status:completed {task_id}
HSET task:{task_id} status completed result {...} completed_at {now}

# Fail with retry
HINCRBY task:{task_id} retry_count 1
ZADD queue:pending {priority} {task_id}     # Re-enqueue
SMOVE status:running status:pending {task_id}
HSET task:{task_id} status pending error {...}

# Stall recovery (cron job)
ZRANGEBYSCORE queue:running -inf {now - heartbeat_threshold}
# For each stale task:
ZREM queue:running {task_id}
SMOVE status:running status:stalled {task_id}
ZADD queue:pending {priority} {task_id}
```

Note: the dequeue peeks with ZRANGE and removes inside MULTI rather than calling ZPOPMIN before EXEC; popping outside the transaction would lose the task if the transaction aborts. Stall recovery scans `queue:running` (a sorted set) by score, so a plain SMEMBERS does not apply there.
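Before the Redis integration lands, the write sequence that marks a task running can be kept unit-testable by rendering it as plain command tuples. A sketch assuming the key patterns in the table above; `dequeue_commands` is a hypothetical helper, not part of the design:

```python
import time
from typing import Optional

def dequeue_commands(task_id: str, worker_id: str,
                     now: Optional[float] = None) -> list:
    """Render the dequeue transaction's write commands as tuples,
    mirroring the key patterns task:{id}, queue:running, status:*."""
    ts = now if now is not None else time.time()
    return [
        ("ZADD", "queue:running", ts, task_id),
        ("HSET", f"task:{task_id}", "status", "running",
         "agent_id", worker_id, "started_at", ts),
        ("SMOVE", "status:pending", "status:running", task_id),
    ]
```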
4. Key API Methods
```python
class TaskQueue:
    # Core operations
    async def enqueue(self, task: Task) -> UUID: ...
    async def dequeue(self, worker_id: UUID, timeout_ms: int = 5000) -> Optional[Task]: ...
    async def complete(self, task_id: UUID, result: dict) -> None: ...
    async def fail(self, task_id: UUID, error: str, retryable: bool = True) -> None: ...
    async def cancel(self, task_id: UUID) -> None: ...

    # Management
    async def retry(self, task_id: UUID) -> None: ...                     # Manual retry
    async def requeue_stalled(self, max_age_ms: int = 60000) -> int: ...  # Recover crashed
    async def get_status(self, task_id: UUID) -> TaskStatus: ...
    async def list_by_user(self, user_id: str,
                           status: Optional[str] = None) -> List[TaskSummary]: ...

    # Worker management
    async def register_worker(self, agent_id: UUID, capacity: int) -> None: ...
    async def heartbeat(self, agent_id: UUID) -> None: ...
    async def unregister_worker(self, agent_id: UUID, reason: str) -> None: ...
```
5. Integration with Existing Task Tool
Current Flow
```
Task tool → immediate subprocess spawn → wait → return result
```
New Flow (with persistence)
```
Task tool → enqueue() → return task_id (immediate)
                ↓
Background worker → dequeue() → spawn subprocess → complete()/fail()
                ↓
Caller polls / gets notification when the task completes
```
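The "caller polls" step might look like the sketch below, assuming a `queue` object exposing `get_status()` from section 4 (statuses shown as strings for brevity); `wait_for_task` and its parameters are illustrative assumptions:

```python
import asyncio
import time

TERMINAL = {"completed", "failed", "cancelled"}

async def wait_for_task(queue, task_id, poll_interval: float = 1.0,
                        timeout: float = 300.0):
    """Poll get_status() until the task reaches a terminal state or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = await queue.get_status(task_id)
        if status in TERMINAL:
            return status
        await asyncio.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} not finished after {timeout}s")
```

Webhook notifications (Phase 1d) would replace this polling loop for long-running tasks.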
Changes to Task Tool Schema
```python
class TaskTool:
    async def run(
        self,
        prompt: str,
        subagent_type: str,
        # ... existing args ...
        persist: bool = False,                # NEW: enqueue instead of immediate run
        priority: int = 100,                  # NEW
        tags: Optional[List[str]] = None,     # NEW
    ) -> TaskResult:
        if persist:
            task_id = await self.queue.enqueue(...)
            return TaskResult(task_id=task_id, status="pending")
        else:
            # Legacy: immediate execution
            ...
```
Worker Agent Integration
Worker subscribes to queue:
```python
async def worker_loop(agent_id: UUID):
    while running:
        task = await queue.dequeue(agent_id, timeout_ms=5000)
        if task:
            # Spawn the subagent as a subprocess
            proc = await asyncio.create_subprocess_exec(
                "letta", "run-agent", f"--task-id={task.id}",
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            # Monitor and wait for completion
            stdout, stderr = await proc.communicate()
            # Update the queue based on the subprocess result
            if proc.returncode == 0:
                await queue.complete(task.id, parse_result(stdout))
            else:
                await queue.fail(task.id, stderr.decode(), retryable=True)
```
6. Implementation Phases
Phase 1a: In-Memory Prototype (Week 1)
- Python `asyncio.Queue` for pending tasks
- In-memory dict for task storage
- Single worker process
- No Redis dependency
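A minimal sketch of what the Phase 1a prototype could look like, using `asyncio.PriorityQueue` so the priority field is honored; the class and method shapes are assumptions, with task records simplified to dicts:

```python
import asyncio
import itertools
from typing import Optional

class InMemoryTaskQueue:
    """Phase 1a prototype: no Redis, single process, tasks held in a dict."""

    def __init__(self) -> None:
        self._pending: asyncio.PriorityQueue = asyncio.PriorityQueue()
        self._tasks: dict[str, dict] = {}      # task_id -> task record
        self._seq = itertools.count()          # tie-breaker for equal priorities

    async def enqueue(self, task_id: str, prompt: str, priority: int = 100) -> str:
        self._tasks[task_id] = {"id": task_id, "prompt": prompt,
                                "status": "pending", "priority": priority}
        # Lower priority number = dequeued first, matching the Task dataclass
        await self._pending.put((priority, next(self._seq), task_id))
        return task_id

    async def dequeue(self, worker_id: str) -> Optional[dict]:
        _, _, task_id = await self._pending.get()
        record = self._tasks[task_id]
        record.update(status="running", agent_id=worker_id)
        return record

    async def complete(self, task_id: str, result: dict) -> None:
        self._tasks[task_id].update(status="completed", result=result)
```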
Phase 1b: Redis Integration (Week 2)
- Replace queue with Redis
- Add task persistence
- Implement retry logic
- Add stall recovery
Phase 1c: Worker Pool (Week 3-4)
- Multiple worker processes
- Worker heartbeat monitoring
- Task assignment logic
- Graceful shutdown handling
Phase 1d: API & CLI (Week 5-6)
- REST API for task management
- CLI commands for queue inspection
- Task status dashboard endpoint
- Webhook notifications
Phase 1e: Integration (Week 7-8)
- Modify Task tool to use queue
- Add persistence flag
- Maintain backward compatibility
- Migration path for existing code
7. Retry Logic with Exponential Backoff
```python
async def retry_with_backoff(task_id: UUID):
    task = await queue.get(task_id)
    if task.retry_count >= task.max_retries:
        await queue.fail(task_id, "Max retries exceeded", retryable=False)
        return
    # Exponential backoff: 2^retry_count seconds
    delay = min(2 ** task.retry_count, 300)  # Cap at 5 minutes
    await asyncio.sleep(delay)
    # Re-enqueue with the same priority
    await queue.enqueue(task, priority=task.priority)
```
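The schedule this produces, isolated as a pure function for clarity (`backoff_delay` is an illustrative name): retries wait 2 s, 4 s, 8 s, and so on, up to the 300 s cap.

```python
def backoff_delay(retry_count: int, cap: int = 300) -> int:
    # Exponential backoff: 2^retry_count seconds, capped (5 minutes by default)
    return min(2 ** retry_count, cap)
```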
8. Error Handling Strategy
| Error Type | Retry? | Action |
|---|---|---|
| Subagent crash | Yes | Increment retry, requeue |
| Syntax error in code | No | Fail immediately |
| Timeout | Yes | Retry with longer timeout |
| API rate limit | Yes | Retry with exponential backoff |
| Out of memory | No | Fail, alert admin |
| Redis connection lost | Yes | Reconnect, retry operation |
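The table maps naturally onto a lookup that a worker can consult when classifying failures. A sketch; the category keys and `Decision` tuple are illustrative assumptions, and unknown categories default to non-retryable (a choice not stated in the table):

```python
from typing import NamedTuple

class Decision(NamedTuple):
    retryable: bool
    action: str

# Rows mirror the error-handling table above
RETRY_POLICY = {
    "subagent_crash":  Decision(True,  "increment retry, requeue"),
    "syntax_error":    Decision(False, "fail immediately"),
    "timeout":         Decision(True,  "retry with longer timeout"),
    "rate_limit":      Decision(True,  "retry with exponential backoff"),
    "out_of_memory":   Decision(False, "fail, alert admin"),
    "redis_conn_lost": Decision(True,  "reconnect, retry operation"),
}

def should_retry(category: str) -> bool:
    # Unknown categories default to non-retryable (an assumption)
    return RETRY_POLICY.get(category, Decision(False, "fail")).retryable
```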
Next Steps
- Implement in-memory prototype (Week 1)
- Add Redis persistence (Week 2)
- Build worker pool (Week 3-4)
- Integrate with Task tool (Week 7-8)
- Write tests for queue durability (ongoing)
Design by Researcher subagent, March 18, 2026