Initial commit: Community ADE foundation

- Project structure: docs/, src/, tests/, proto/
- Research synthesis: Letta vs commercial ADEs
- Architecture: Redis Streams queue design
- Phase 1 orchestration design
- Execution plan and project state tracking
- Working subagent system (manager.ts fixes)

This is the foundation for a Community ADE built on Letta's
stateful agent architecture with git-native MemFS.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta Code <noreply@letta.com>
Ani (Annie Tunturi)
2026-03-18 10:30:20 -04:00
commit 00382055c6
9 changed files with 2970 additions and 0 deletions

# Phase 1 Execution Plan: Orchestration Layer
**Date:** March 18, 2026
**Status:** Ready for Implementation
**Estimated Duration:** 6 weeks
**Owner:** TBD
---
## Overview
This document provides actionable implementation guidance for Phase 1 of the Community ADE, based on synthesized research from commercial tools (Intent, Warp) and open-source alternatives (Aider, Cline, Agno).
---
## Key Research Insights
### 1. Letta's Competitive Position
**✅ Strongest Open-Source Position:**
- No competitor combines: stateful agents + hierarchical memory + git-native persistence + subagent orchestration
- Aider has git integration but no agent memory
- Cline is session-based with no persistence
- Agno lacks Letta's memory architecture
**⚠️ Commercial Tools Lead in UX:**
- Warp: Terminal-native with rich context (@file, images)
- Intent: Specification-driven development
- Both have web dashboards; Letta needs one
### 2. Technical Pattern Validation
**Redis + Workers (Selected for Phase 1):**
- ✅ Proven pattern (Celery uses Redis under the hood)
- ✅ Simpler than Temporal for our use case
- ✅ More control over data model
- ⚠️ Temporal deferred to Phase 2 evaluation
**React + FastAPI (Selected for Phase 2):**
- ✅ Industry standard
- ✅ shadcn/ui provides accessible components
- ✅ TanStack Query for caching/real-time sync
---
## Phase 1 Scope
### Goals
1. Replace in-process Task execution with persistent queue
2. Ensure tasks survive agent restarts
3. Support 5+ concurrent workers
4. Maintain backward compatibility
### Out of Scope (Phase 2+)
- Web dashboard (Phase 2)
- Temporal workflows (Phase 2 evaluation)
- GitHub integration (Phase 3)
- Computer Use (Phase 4)
---
## Implementation Breakdown
### Week 1: In-Memory Prototype
**Deliverables:**
- [ ] `TaskQueue` class with asyncio.Queue
- [ ] Task dataclass with all fields
- [ ] Worker process skeleton
- [ ] Basic enqueue/dequeue/complete/fail operations
**Testing:**
```python
# Test: Task survives worker crash
# Test: Concurrent task execution
# Test: Priority ordering
```
**Code Structure:**
```
letta_ade/
├── __init__.py
├── queue/
│   ├── __init__.py
│   ├── models.py        # Task dataclass, enums
│   ├── memory_queue.py  # Week 1 implementation
│   └── base.py          # Abstract base class
└── worker/
    ├── __init__.py
    └── runner.py        # Worker process logic
```
### Week 2: Redis Integration
**Deliverables:**
- [ ] Redis connection manager
- [ ] Task serialization (JSON/pickle)
- [ ] Atomic dequeue with WATCH/MULTI/EXEC
- [ ] Status tracking (Sets per status)
**Redis Schema:**
```redis
# Task storage
HSET task:{uuid} field value ...
# Priority queue (pending)
ZADD queue:pending {priority} {task_id}
# Running tasks
ZADD queue:running {started_at} {task_id}
# Status index
SADD status:pending {task_id}
SADD status:running {task_id}
SADD status:completed {task_id}
SADD status:failed {task_id}
# User index
SADD user:{user_id}:tasks {task_id}
```
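Because `HSET` only stores flat string fields, nested values need explicit serialization. A hypothetical round-trip helper (the choice of JSON-encoded fields is an assumption):

```python
import json
from typing import Any

# Fields serialized as JSON because Redis hashes only store flat strings
_JSON_FIELDS = {"result", "tags"}


def task_to_hash(task: dict[str, Any]) -> dict[str, str]:
    """Flatten a task dict into the str->str mapping HSET expects."""
    out: dict[str, str] = {}
    for key, value in task.items():
        if value is None:
            continue  # skip unset optionals; an absent field reads back as None
        out[key] = json.dumps(value) if key in _JSON_FIELDS else str(value)
    return out


def hash_to_task(raw: dict[str, str]) -> dict[str, Any]:
    """Inverse of task_to_hash for the JSON-encoded fields."""
    return {
        key: json.loads(value) if key in _JSON_FIELDS else value
        for key, value in raw.items()
    }
```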
**Dependencies:**
```toml
[tool.poetry.dependencies]
redis = { version = "^5.0", extras = ["hiredis"] }
```
### Week 3-4: Worker Pool + Heartbeat
**Deliverables:**
- [ ] Multiple worker processes
- [ ] Worker heartbeat (every 30s)
- [ ] Stall detection (2x heartbeat timeout)
- [ ] Graceful shutdown handling
- [ ] Worker capacity management
**Worker Logic:**
```python
async def worker_loop(agent_id: UUID, queue: TaskQueue, stop: asyncio.Event):
    while not stop.is_set():
        # Send heartbeat so the stall detector knows we are alive
        await queue.heartbeat(agent_id)

        # Try to get a task (5s timeout)
        task = await queue.dequeue(agent_id, timeout_ms=5000)
        if task:
            # Spawn subagent process
            proc = await asyncio.create_subprocess_exec(
                "letta", "run-agent",
                f"--task-id={task.id}",
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            # Wait for completion
            stdout, stderr = await proc.communicate()

            # Update queue
            if proc.returncode == 0:
                await queue.complete(task.id, parse_result(stdout))
            else:
                await queue.fail(task.id, stderr.decode())

        # Brief pause to prevent a tight loop
        await asyncio.sleep(0.1)
```
**Stall Recovery (Cron job):**
```python
async def recover_stalled_tasks(queue: TaskQueue, max_age: timedelta):
    """Requeue tasks from crashed workers."""
    stalled = await queue.find_stalled(max_age)
    for task_id in stalled:
        await queue.requeue(task_id)
```
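The stall check itself reduces to a timestamp comparison. A pure helper (the heartbeat-map shape is an assumption) that flags workers past 2× the heartbeat interval:

```python
from datetime import datetime, timedelta


def find_stalled_workers(
    heartbeats: dict[str, datetime],
    now: datetime,
    heartbeat_interval: timedelta = timedelta(seconds=30),
) -> list[str]:
    """Workers whose last heartbeat is older than 2x the interval are stalled."""
    cutoff = now - 2 * heartbeat_interval
    return [worker for worker, last in heartbeats.items() if last < cutoff]
```

Keeping this logic pure makes the 2× timeout rule unit-testable without a Redis instance or a running worker pool.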
### Week 5: API Layer
**Deliverables:**
- [ ] FastAPI application structure
- [ ] REST endpoints (CRUD for tasks)
- [ ] WebSocket endpoint for real-time updates
- [ ] Authentication middleware
**REST Endpoints:**
```python
@app.post("/tasks")
async def create_task(task: TaskCreate) -> TaskResponse:
    """Enqueue a new task."""
    task_id = await queue.enqueue(task)
    return TaskResponse(task_id=task_id, status="pending")


@app.get("/tasks/{task_id}")
async def get_task(task_id: UUID) -> Task:
    """Get task status and result."""
    return await queue.get(task_id)


@app.get("/tasks")
async def list_tasks(
    user_id: str,
    status: Optional[TaskStatus] = None,
) -> List[TaskSummary]:
    """List tasks with optional filtering."""
    return await queue.list_by_user(user_id, status)


@app.post("/tasks/{task_id}/cancel")
async def cancel_task(task_id: UUID):
    """Cancel a pending or running task."""
    await queue.cancel(task_id)


@app.post("/tasks/{task_id}/retry")
async def retry_task(task_id: UUID):
    """Retry a failed task."""
    await queue.retry(task_id)
```
**WebSocket:**
```python
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    # Subscribe to Redis pub/sub for updates (requires the redis.asyncio client;
    # the sync client cannot be iterated with `async for`)
    pubsub = redis.pubsub()
    await pubsub.subscribe("task_updates")
    async for message in pubsub.listen():
        if message["type"] == "message":
            await websocket.send_json(message["data"])
```
### Week 6: Task Tool Integration
**Deliverables:**
- [ ] Modify existing Task tool to use queue
- [ ] `persist` flag for backward compatibility
- [ ] Polling support for task completion
- [ ] Migration guide for existing code
**Modified Task Tool:**
```python
class TaskTool:
    async def run(
        self,
        prompt: str,
        subagent_type: str,
        # ... existing args ...
        persist: bool = False,  # NEW
        priority: int = 100,    # NEW
        wait: bool = False,     # NEW
        timeout: int = 300,     # NEW
    ) -> TaskResult:
        if persist:
            # Enqueue and optionally wait
            task_id = await self.queue.enqueue(...)
            if wait:
                # Poll for completion
                result = await self._wait_for_task(task_id, timeout)
                return result
            else:
                # Return immediately with task_id
                return TaskResult(task_id=task_id, status="pending")
        else:
            # Legacy immediate execution
            return await self._execute_immediately(...)
```
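The `_wait_for_task` polling step can be sketched as a standalone coroutine; the `get_status` callable here is a stand-in for a queue lookup, and the terminal-status set is an assumption:

```python
import asyncio
from typing import Awaitable, Callable

TERMINAL = {"completed", "failed", "cancelled"}


async def wait_for_task(
    get_status: Callable[[str], Awaitable[str]],
    task_id: str,
    timeout: float = 300.0,
    poll_interval: float = 1.0,
) -> str:
    """Poll until the task reaches a terminal status or the timeout expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await get_status(task_id)
        if status in TERMINAL:
            return status
        if loop.time() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout}s")
        await asyncio.sleep(poll_interval)
```

Polling keeps Week 6 simple; the Phase 2 dashboard can switch to the WebSocket channel for push-based completion instead.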
---
## Technical Specifications
### Task Data Model
```python
@dataclass
class Task:
    # Identity (required fields first: dataclasses reject a field without a
    # default after one that has a default)
    id: UUID
    subagent_type: str
    prompt: str
    user_id: str
    system_prompt: Optional[str] = None
    model: Optional[str] = None
    # State
    status: TaskStatus = TaskStatus.PENDING
    priority: int = 100
    created_at: datetime = field(default_factory=datetime.utcnow)
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    # Execution
    agent_id: Optional[UUID] = None
    retry_count: int = 0
    max_retries: int = 3
    # Results
    result: Optional[dict] = None
    error: Optional[str] = None
    exit_code: Optional[int] = None
    # Metadata
    tags: List[str] = field(default_factory=list)
    parent_task: Optional[UUID] = None
    # Cost tracking (NEW)
    input_tokens: int = 0
    output_tokens: int = 0
    estimated_cost: float = 0.0
```
### Retry Logic
```python
async def retry_with_backoff(queue: TaskQueue, task: Task) -> bool:
    if task.retry_count >= task.max_retries:
        return False  # Permanent failure

    # Exponential backoff: 2^retry_count seconds, capped at 5 min
    delay = min(2 ** task.retry_count, 300)
    await asyncio.sleep(delay)

    task.retry_count += 1
    # Re-enqueue with the same priority
    await queue.enqueue(task, priority=task.priority)
    return True
```
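For reference, the capped schedule above produces delays of 1, 2, 4, ... seconds, flattening at 300:

```python
def backoff_delay(retry_count: int, cap: int = 300) -> int:
    """Delay in seconds before the Nth retry: 2^N, capped at 5 minutes."""
    return min(2 ** retry_count, cap)


# Delays for the first ten retry attempts
schedule = [backoff_delay(n) for n in range(10)]
```

With `max_retries = 3` (the dataclass default), a task spends at most 1 + 2 + 4 = 7 seconds in backoff before failing permanently.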
### Error Classification
| Error | Retry? | Action |
|-------|--------|--------|
| Subagent crash | Yes | Requeue with backoff |
| Syntax error | No | Fail immediately |
| API rate limit | Yes | Exponential backoff |
| Out of memory | No | Alert admin, fail |
| Redis connection | Yes | Reconnect, retry |
| Timeout | Yes | Retry with longer timeout |
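The classification table can be encoded as a small lookup for the failure handler to consult (the error-kind names are hypothetical identifiers, not an existing enum):

```python
# Whether each classified error is retryable, mirroring the table above
RETRYABLE = {
    "subagent_crash": True,
    "syntax_error": False,
    "rate_limit": True,
    "out_of_memory": False,
    "redis_connection": True,
    "timeout": True,
}


def should_retry(error_kind: str, retry_count: int, max_retries: int = 3) -> bool:
    """Retry only retryable errors that still have budget left."""
    if retry_count >= max_retries:
        return False
    # Unknown error kinds fail fast rather than loop indefinitely
    return RETRYABLE.get(error_kind, False)
```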
---
## Testing Strategy
### Unit Tests
```python
# test_queue.py
def test_enqueue_creates_pending_task(): ...
def test_dequeue_removes_from_pending(): ...
def test_complete_moves_to_completed(): ...
def test_fail_triggers_retry(): ...
def test_max_retries_exceeded(): ...
def test_cancel_stops_running_task(): ...
```
### Integration Tests
```python
# test_worker.py
async def test_worker_processes_task(): ...
async def test_worker_handles_failure(): ...
async def test_worker_heartbeat(): ...
async def test_stall_recovery(): ...
```
### Durability Tests
```python
# test_durability.py
async def test_tasks_survive_restart():
    """Enqueue tasks, restart Redis, verify tasks persist."""

async def test_worker_crash_recovery():
    """Kill worker mid-task, verify task requeued."""

async def test_concurrent_workers():
    """5 workers, 20 tasks, verify all complete."""
```
---
## Dependencies
### Required
```toml
redis = { version = "^5.0", extras = ["hiredis"] }
fastapi = "^0.115"
websockets = "^13.0"
pydantic = "^2.0"
```
### Development
```toml
pytest = "^8.0"
pytest-asyncio = "^0.24"
httpx = "^0.27" # For FastAPI test client
```
### Infrastructure
- Redis 7.0+ (local or cloud)
- Python 3.11+
---
## Migration Guide
### For Existing Task Tool Users
**Before:**
```python
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
)  # Blocks until complete
```
**After (backward compatible):**
```python
# Same behavior (immediate execution)
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=False,  # default
)
```
**New (persistent):**
```python
# Fire-and-forget
task_id = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=True,
)

# Wait for completion
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=True,
    wait=True,
    timeout=600,
)
```
---
## Success Criteria
| Metric | Target | Measurement |
|--------|--------|-------------|
| Task durability | 100% | Tasks never lost on restart |
| Throughput | 10 tasks/min | With 3 workers |
| Latency | <100ms | Enqueue → pending |
| Recovery time | <60s | Worker crash → requeue |
| API uptime | 99.9% | Health check endpoint |
| Backward compat | 100% | Existing tests pass |
---
## Risk Mitigation
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Redis complexity | Low | Medium | Start with simple ops |
| Worker pool bugs | Medium | High | Extensive testing |
| Performance issues | Low | Medium | Load testing Week 5 |
| Migration breakage | Low | High | Full test suite |
---
## Handoff to Phase 2
**Phase 2 Prereqs:**
- [ ] All Phase 1 success criteria met
- [ ] API documentation complete
- [ ] WebSocket tested with simple client
- [ ] Cost tracking working
**Phase 2 Inputs:**
- Task queue API (REST + WebSocket)
- Task data model
- Worker management API
- Redis schema
---
## Appendix: Quick Reference
### Redis Commands Cheat Sheet
```bash
# Start Redis
docker run -d -p 6379:6379 redis:7-alpine
# Monitor
redis-cli monitor
# Inspect keys
redis-cli KEYS "task:*"
redis-cli HGETALL task:abc-123
# Clear queue
redis-cli FLUSHDB
```
### Development Commands
```bash
# Start worker
python -m letta_ade.worker.runner --agent-id worker-1
# Start API
uvicorn letta_ade.api:app --reload
# Run tests
pytest tests/ -v --tb=short
# Integration test
pytest tests/integration/ -v
```
---
*Ready for implementation. Questions? See community-ade-research-synthesis-2026-03-18.md for full context.*