# Phase 1 Execution Plan: Orchestration Layer

**Date:** March 18, 2026
**Status:** Ready for Implementation
**Estimated Duration:** 6 weeks
**Owner:** TBD

---

## Overview

This document provides actionable implementation guidance for Phase 1 of the Community ADE, based on synthesized research from commercial tools (Intent, Warp) and open-source alternatives (Aider, Cline, Agno).

---

## Key Research Insights

### 1. Letta's Competitive Position

**✅ Strongest Open-Source Position:**

- No competitor combines: stateful agents + hierarchical memory + git-native persistence + subagent orchestration
- Aider has git integration but no agent memory
- Cline is session-based with no persistence
- Agno lacks Letta's memory architecture

**⚠️ Commercial Tools Lead in UX:**

- Warp: terminal-native with rich context (@file, images)
- Intent: specification-driven development
- Both have web dashboards; Letta needs one

### 2. Technical Pattern Validation

**Redis + Workers (Selected for Phase 1):**

- ✅ Proven pattern (Celery uses Redis under the hood)
- ✅ Simpler than Temporal for our use case
- ✅ More control over the data model
- ⚠️ Temporal deferred to Phase 2 evaluation

**React + FastAPI (Selected for Phase 2):**

- ✅ Industry standard
- ✅ shadcn/ui provides accessible components
- ✅ TanStack Query for caching/real-time sync

---

## Phase 1 Scope

### Goals

1. Replace in-process Task execution with a persistent queue
2. Ensure tasks survive agent restarts
3. Support 5+ concurrent workers
4. Maintain backward compatibility

### Out of Scope (Phase 2+)

- Web dashboard (Phase 2)
- Temporal workflows (Phase 2 evaluation)
- GitHub integration (Phase 3)
- Computer Use (Phase 4)

---

## Implementation Breakdown

### Week 1: In-Memory Prototype

**Deliverables:**

- [ ] `TaskQueue` class backed by `asyncio.Queue`
- [ ] Task dataclass with all fields
- [ ] Worker process skeleton
- [ ] Basic enqueue/dequeue/complete/fail operations

**Testing:**

```python
# Test: task survives worker crash
# Test: concurrent task execution
# Test: priority ordering
```

**Code Structure:**

```
letta_ade/
├── __init__.py
├── queue/
│   ├── __init__.py
│   ├── models.py        # Task dataclass, enums
│   ├── memory_queue.py  # Week 1 implementation
│   └── base.py          # Abstract base class
└── worker/
    ├── __init__.py
    └── runner.py        # Worker process logic
```

### Week 2: Redis Integration

**Deliverables:**

- [ ] Redis connection manager
- [ ] Task serialization (JSON/pickle)
- [ ] Atomic dequeue with WATCH/MULTI/EXEC
- [ ] Status tracking (one Set per status)

**Redis Schema:**

```redis
# Task storage
HSET task:{uuid} field value ...
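# Example with illustrative values: the hash fields mirror the Task
# dataclass, so a freshly enqueued task might be stored as
HSET task:{uuid} status pending priority 100 subagent_type coder retry_count 0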
# Priority queue (pending)
ZADD queue:pending {priority} {task_id}

# Running tasks
ZADD queue:running {started_at} {task_id}

# Status index
SADD status:pending {task_id}
SADD status:running {task_id}
SADD status:completed {task_id}
SADD status:failed {task_id}

# User index
SADD user:{user_id}:tasks {task_id}
```

**Dependencies:**

```toml
[dependencies]
redis = { version = "^5.0", extras = ["hiredis"] }
```

### Week 3-4: Worker Pool + Heartbeat

**Deliverables:**

- [ ] Multiple worker processes
- [ ] Worker heartbeat (every 30s)
- [ ] Stall detection (2x heartbeat timeout)
- [ ] Graceful shutdown handling
- [ ] Worker capacity management

**Worker Logic:**

```python
async def worker_loop(agent_id: UUID, queue: TaskQueue):
    # `running` is a flag cleared by the graceful-shutdown handler
    while running:
        # Send heartbeat
        await queue.heartbeat(agent_id)

        # Try to get a task (5s timeout)
        task = await queue.dequeue(agent_id, timeout_ms=5000)
        if task:
            # Spawn subagent process
            proc = await asyncio.create_subprocess_exec(
                "letta", "run-agent", f"--task-id={task.id}",
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            # Wait for completion
            stdout, stderr = await proc.communicate()

            # Update queue
            if proc.returncode == 0:
                await queue.complete(task.id, parse_result(stdout))
            else:
                await queue.fail(task.id, stderr.decode())

        # Brief pause to prevent a tight loop
        await asyncio.sleep(0.1)
```

**Stall Recovery (Cron job):**

```python
async def recover_stalled_tasks(queue: TaskQueue, max_age: timedelta):
    """Requeue tasks from crashed workers."""
    stalled = await queue.find_stalled(max_age)
    for task_id in stalled:
        await queue.requeue(task_id)
```

### Week 5: API Layer

**Deliverables:**

- [ ] FastAPI application structure
- [ ] REST endpoints (CRUD for tasks)
- [ ] WebSocket endpoint for real-time updates
- [ ] Authentication middleware

**REST Endpoints:**

```python
@app.post("/tasks")
async def create_task(task: TaskCreate) -> TaskResponse:
    """Enqueue a new task."""
    task_id = await queue.enqueue(task)
    return TaskResponse(task_id=task_id, status="pending")

@app.get("/tasks/{task_id}")
async def get_task(task_id: UUID) -> Task:
    """Get task status and result."""
    return await queue.get(task_id)

@app.get("/tasks")
async def list_tasks(
    user_id: str,
    status: Optional[TaskStatus] = None,
) -> List[TaskSummary]:
    """List tasks with optional filtering."""
    return await queue.list_by_user(user_id, status)

@app.post("/tasks/{task_id}/cancel")
async def cancel_task(task_id: UUID):
    """Cancel a pending or running task."""
    await queue.cancel(task_id)

@app.post("/tasks/{task_id}/retry")
async def retry_task(task_id: UUID):
    """Retry a failed task."""
    await queue.retry(task_id)
```

**WebSocket:**

```python
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()

    # Subscribe to Redis pub/sub for updates
    pubsub = redis.pubsub()
    await pubsub.subscribe("task_updates")
    async for message in pubsub.listen():
        if message["type"] == "message":
            await websocket.send_json(message["data"])
```

### Week 6: Task Tool Integration

**Deliverables:**

- [ ] Modify the existing Task tool to use the queue
- [ ] `persist` flag for backward compatibility
- [ ] Polling support for task completion
- [ ] Migration guide for existing code

**Modified Task Tool:**

```python
class TaskTool:
    async def run(
        self,
        prompt: str,
        subagent_type: str,
        # ... existing args ...
        persist: bool = False,  # NEW
        priority: int = 100,    # NEW
        wait: bool = False,     # NEW
        timeout: int = 300,     # NEW
    ) -> TaskResult:
        if persist:
            # Enqueue and optionally wait
            task_id = await self.queue.enqueue(...)
            if wait:
                # Poll for completion
                result = await self._wait_for_task(task_id, timeout)
                return result
            else:
                # Return immediately with the task_id
                return TaskResult(task_id=task_id, status="pending")
        else:
            # Legacy immediate execution
            return await self._execute_immediately(...)
```

---

## Technical Specifications

### Task Data Model

```python
@dataclass
class Task:
    # Required fields (no defaults) must precede defaulted fields
    id: UUID
    subagent_type: str
    prompt: str
    user_id: str
    status: TaskStatus
    created_at: datetime
    system_prompt: Optional[str] = None
    model: Optional[str] = None

    # State
    priority: int = 100
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None

    # Execution
    agent_id: Optional[UUID] = None
    retry_count: int = 0
    max_retries: int = 3

    # Results
    result: Optional[dict] = None
    error: Optional[str] = None
    exit_code: Optional[int] = None

    # Metadata
    tags: List[str] = field(default_factory=list)
    parent_task: Optional[UUID] = None

    # Cost tracking (NEW)
    input_tokens: int = 0
    output_tokens: int = 0
    estimated_cost: float = 0.0
```

### Retry Logic

```python
async def retry_with_backoff(task: Task) -> bool:
    if task.retry_count >= task.max_retries:
        return False  # Permanent failure

    # Exponential backoff: 2^retry_count seconds, capped at 5 min
    delay = min(2 ** task.retry_count, 300)
    await asyncio.sleep(delay)

    task.retry_count += 1
    # Re-enqueue with the same priority
    await queue.enqueue(task, priority=task.priority)
    return True
```

### Error Classification

| Error | Retry? | Action |
|-------|--------|--------|
| Subagent crash | Yes | Requeue with backoff |
| Syntax error | No | Fail immediately |
| API rate limit | Yes | Exponential backoff |
| Out of memory | No | Alert admin, fail |
| Redis connection | Yes | Reconnect, retry |
| Timeout | Yes | Retry with a longer timeout |

---

## Testing Strategy

### Unit Tests

```python
# test_queue.py
def test_enqueue_creates_pending_task(): ...
def test_dequeue_removes_from_pending(): ...
def test_complete_moves_to_completed(): ...
def test_fail_triggers_retry(): ...
def test_max_retries_exceeded(): ...
def test_cancel_stops_running_task(): ...
```

### Integration Tests

```python
# test_worker.py
async def test_worker_processes_task(): ...
async def test_worker_handles_failure(): ...
async def test_worker_heartbeat(): ...
async def test_stall_recovery(): ...
```

### Durability Tests

```python
# test_durability.py
async def test_tasks_survive_restart():
    """Enqueue tasks, restart Redis, verify tasks persist."""

async def test_worker_crash_recovery():
    """Kill a worker mid-task, verify the task is requeued."""

async def test_concurrent_workers():
    """5 workers, 20 tasks, verify all complete."""
```

---

## Dependencies

### Required

```toml
redis = { version = "^5.0", extras = ["hiredis"] }
fastapi = "^0.115"
websockets = "^13.0"
pydantic = "^2.0"
```

### Development

```toml
pytest = "^8.0"
pytest-asyncio = "^0.24"
httpx = "^0.27"  # For the FastAPI test client
```

### Infrastructure

- Redis 7.0+ (local or cloud)
- Python 3.11+

---

## Migration Guide

### For Existing Task Tool Users

**Before:**

```python
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
)
# Blocks until complete
```

**After (backward compatible):**

```python
# Same behavior (immediate execution)
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=False,  # default
)
```

**New (persistent):**

```python
# Fire-and-forget
task_id = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=True,
)

# Wait for completion
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=True,
    wait=True,
    timeout=600,
)
```

---

## Success Criteria

| Metric | Target | Measurement |
|--------|--------|-------------|
| Task durability | 100% | Tasks never lost on restart |
| Throughput | 10 tasks/min | With 3 workers |
| Latency | <100ms | Enqueue → pending |
| Recovery time | <60s | Worker crash → requeue |
| API uptime | 99.9% | Health check endpoint |
| Backward compat | 100% | Existing tests pass |

---

## Risk Mitigation

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Redis complexity | Low | Medium | Start with simple ops |
| Worker pool bugs | Medium | High | Extensive testing |
| Performance issues | Low | Medium | Load testing in Week 5 |
| Migration breakage | Low | High | Full test suite |

---

## Handoff to Phase 2

**Phase 2 Prereqs:**

- [ ] All Phase 1 success criteria met
- [ ] API documentation complete
- [ ] WebSocket tested with a simple client
- [ ] Cost tracking working

**Phase 2 Inputs:**

- Task queue API (REST + WebSocket)
- Task data model
- Worker management API
- Redis schema

---

## Appendix: Quick Reference

### Redis Commands Cheat Sheet

```bash
# Start Redis
docker run -d -p 6379:6379 redis:7-alpine

# Monitor
redis-cli monitor

# Inspect keys
redis-cli KEYS "task:*"
redis-cli HGETALL task:abc-123

# Clear queue
redis-cli FLUSHDB
```

### Development Commands

```bash
# Start a worker
python -m letta_ade.worker.runner --agent-id worker-1

# Start the API
uvicorn letta_ade.api:app --reload

# Run tests
pytest tests/ -v --tb=short

# Integration tests
pytest tests/integration/ -v
```

---

*Ready for implementation. Questions? See community-ade-research-synthesis-2026-03-18.md for full context.*
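---

## Appendix: Task Serialization Sketch

As a supplement to the Week 2 serialization deliverable (`HSET task:{uuid} field value`), the sketch below shows one way to round-trip a task through the flat string mapping that Redis hashes require. It uses a simplified subset of the Task model; `to_redis_hash` and `from_redis_hash` are illustrative names, not part of any existing Letta API.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional
from uuid import UUID, uuid4


@dataclass
class Task:
    # Simplified subset of the Phase 1 Task model
    id: UUID
    subagent_type: str
    prompt: str
    status: str = "pending"
    priority: int = 100
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    tags: List[str] = field(default_factory=list)
    result: Optional[dict] = None


def to_redis_hash(task: Task) -> dict:
    """Flatten a Task into the string field/value pairs that HSET expects."""
    return {
        "id": str(task.id),
        "subagent_type": task.subagent_type,
        "prompt": task.prompt,
        "status": task.status,
        "priority": str(task.priority),
        "created_at": task.created_at.isoformat(),
        "tags": json.dumps(task.tags),      # nested values go through JSON
        "result": json.dumps(task.result),
    }


def from_redis_hash(fields: dict) -> Task:
    """Rebuild a Task from a Redis hash (e.g. the output of HGETALL)."""
    return Task(
        id=UUID(fields["id"]),
        subagent_type=fields["subagent_type"],
        prompt=fields["prompt"],
        status=fields["status"],
        priority=int(fields["priority"]),
        created_at=datetime.fromisoformat(fields["created_at"]),
        tags=json.loads(fields["tags"]),
        result=json.loads(fields["result"]),
    )


task = Task(id=uuid4(), subagent_type="coder", prompt="Create a React component")
assert from_redis_hash(to_redis_hash(task)) == task
```

One reason to prefer JSON over pickle here: ISO-formatted datetimes and JSON-encoded lists keep `HGETALL` output human-readable in `redis-cli`, and avoid executing untrusted bytes on deserialization.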