# Phase 1 Execution Plan: Orchestration Layer

**Date:** March 18, 2026
**Status:** Ready for Implementation
**Estimated Duration:** 6 weeks
**Owner:** TBD

---

## Overview

This document provides actionable implementation guidance for Phase 1 of the Community ADE, based on synthesized research from commercial tools (Intent, Warp) and open-source alternatives (Aider, Cline, Agno).

---

## Key Research Insights
### 1. Letta's Competitive Position

**✅ Strongest Open-Source Position:**
- No competitor combines stateful agents, hierarchical memory, git-native persistence, and subagent orchestration
- Aider has git integration but no agent memory
- Cline is session-based with no persistence
- Agno lacks Letta's memory architecture

**⚠️ Commercial Tools Lead in UX:**
- Warp: terminal-native with rich context (@file, images)
- Intent: specification-driven development
- Both have web dashboards; Letta needs one

### 2. Technical Pattern Validation

**Redis + Workers (Selected for Phase 1):**
- ✅ Proven pattern (Celery uses Redis under the hood)
- ✅ Simpler than Temporal for our use case
- ✅ More control over the data model
- ⚠️ Temporal deferred to a Phase 2 evaluation

**React + FastAPI (Selected for Phase 2):**
- ✅ Industry standard
- ✅ shadcn/ui provides accessible components
- ✅ TanStack Query for caching and real-time sync
---

## Phase 1 Scope

### Goals

1. Replace in-process Task execution with a persistent queue
2. Ensure tasks survive agent restarts
3. Support 5+ concurrent workers
4. Maintain backward compatibility

### Out of Scope (Phase 2+)

- Web dashboard (Phase 2)
- Temporal workflows (Phase 2 evaluation)
- GitHub integration (Phase 3)
- Computer Use (Phase 4)

---

## Implementation Breakdown
### Week 1: In-Memory Prototype

**Deliverables:**
- [ ] `TaskQueue` class backed by `asyncio.Queue`
- [ ] `Task` dataclass with all fields
- [ ] Worker process skeleton
- [ ] Basic enqueue/dequeue/complete/fail operations

**Testing:**

```python
# Test: Task survives worker crash
# Test: Concurrent task execution
# Test: Priority ordering
```

**Code Structure:**

```
letta_ade/
├── __init__.py
├── queue/
│   ├── __init__.py
│   ├── models.py        # Task dataclass, enums
│   ├── memory_queue.py  # Week 1 implementation
│   └── base.py          # Abstract base class
└── worker/
    ├── __init__.py
    └── runner.py        # Worker process logic
```
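The Week 1 deliverables can be sketched end to end in a few lines. A minimal, illustrative prototype — class and function names here are placeholders, not the final API — using `asyncio.PriorityQueue` so that lower priority values dequeue first:

```python
import asyncio
from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass(order=True)
class _Entry:
    """Priority-queue entry; only the priority participates in ordering."""
    priority: int
    task_id: UUID = field(compare=False)


class MemoryTaskQueue:
    """Week 1 in-memory queue sketch: lower priority value = dequeued first."""

    def __init__(self) -> None:
        self._pending: asyncio.PriorityQueue = asyncio.PriorityQueue()
        self._status: dict = {}

    async def enqueue(self, priority: int = 100) -> UUID:
        task_id = uuid4()
        self._status[task_id] = "pending"
        await self._pending.put(_Entry(priority, task_id))
        return task_id

    async def dequeue(self) -> UUID:
        entry = await self._pending.get()
        self._status[entry.task_id] = "running"
        return entry.task_id

    def complete(self, task_id: UUID) -> None:
        self._status[task_id] = "completed"


async def demo() -> str:
    q = MemoryTaskQueue()
    await q.enqueue(priority=200)          # low priority
    high = await q.enqueue(priority=10)    # high priority
    first = await q.dequeue()              # high-priority task comes out first
    return "high-first" if first == high else "low-first"
```

The Redis-backed implementation in Week 2 can then slot in behind the same abstract base class from `queue/base.py`.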
### Week 2: Redis Integration

**Deliverables:**
- [ ] Redis connection manager
- [ ] Task serialization (JSON/pickle)
- [ ] Atomic dequeue with WATCH/MULTI/EXEC
- [ ] Status tracking (one Set per status)

**Redis Schema:**

```redis
# Task storage
HSET task:{uuid} field value ...

# Priority queue (pending)
ZADD queue:pending {priority} {task_id}

# Running tasks
ZADD queue:running {started_at} {task_id}

# Status index
SADD status:pending {task_id}
SADD status:running {task_id}
SADD status:completed {task_id}
SADD status:failed {task_id}

# User index
SADD user:{user_id}:tasks {task_id}
```

**Dependencies:**

```toml
[dependencies]
redis = { version = "^5.0", extras = ["hiredis"] }
```
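The atomic dequeue deserves a concrete shape. A sketch of the WATCH/MULTI/EXEC pattern against the schema above, assuming the redis-py async client (`redis.asyncio`); the `WatchError` fallback only exists to keep the snippet importable where redis isn't installed:

```python
import time
from typing import Any, Optional

try:
    from redis.exceptions import WatchError
except ImportError:  # fallback so the sketch imports without the redis package
    class WatchError(Exception):
        pass


async def atomic_dequeue(r: Any) -> Optional[str]:
    """Atomically move the head of queue:pending to queue:running.

    `r` is a redis.asyncio.Redis client. This follows redis-py's
    optimistic-locking pattern: WATCH the key, read the head, buffer the
    writes after MULTI, and retry if another worker touched queue:pending
    before EXEC committed.
    """
    async with r.pipeline() as pipe:
        while True:
            try:
                await pipe.watch("queue:pending")
                head = await pipe.zrange("queue:pending", 0, 0)  # lowest score first
                if not head:
                    await pipe.unwatch()
                    return None
                task_id = head[0]
                pipe.multi()  # buffer the writes as one transaction
                pipe.zrem("queue:pending", task_id)
                pipe.zadd("queue:running", {task_id: time.time()})
                pipe.smove("status:pending", "status:running", task_id)
                await pipe.execute()
                return task_id
            except WatchError:
                continue  # lost the race to another worker; re-read and retry
```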
### Week 3-4: Worker Pool + Heartbeat

**Deliverables:**
- [ ] Multiple worker processes
- [ ] Worker heartbeat (every 30s)
- [ ] Stall detection (2x heartbeat timeout)
- [ ] Graceful shutdown handling
- [ ] Worker capacity management

**Worker Logic:**

```python
async def worker_loop(agent_id: UUID, queue: TaskQueue) -> None:
    while running:  # flag cleared by the graceful-shutdown handler
        # Send heartbeat
        await queue.heartbeat(agent_id)

        # Try to get a task (5s timeout)
        task = await queue.dequeue(agent_id, timeout_ms=5000)

        if task:
            # Spawn subagent process
            proc = await asyncio.create_subprocess_exec(
                "letta", "run-agent",
                f"--task-id={task.id}",
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )

            # Wait for completion
            stdout, stderr = await proc.communicate()

            # Update queue
            if proc.returncode == 0:
                await queue.complete(task.id, parse_result(stdout))
            else:
                await queue.fail(task.id, stderr.decode())

        # Brief pause to prevent a tight loop
        await asyncio.sleep(0.1)
```

**Stall Recovery (cron job):**

```python
async def recover_stalled_tasks(queue: TaskQueue, max_age: timedelta) -> None:
    """Requeue tasks from crashed workers."""
    stalled = await queue.find_stalled(max_age)
    for task_id in stalled:
        await queue.requeue(task_id)
```
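Stall detection itself is plain timestamp arithmetic on the heartbeats. A sketch, assuming heartbeats can be read back as a worker-id → last-seen mapping (the helper name and structure are illustrative):

```python
from datetime import datetime, timedelta

HEARTBEAT_INTERVAL = timedelta(seconds=30)
STALL_THRESHOLD = 2 * HEARTBEAT_INTERVAL  # per the plan: 2x heartbeat timeout


def find_stalled_workers(
    heartbeats: dict,
    now: datetime,
    threshold: timedelta = STALL_THRESHOLD,
) -> list:
    """Return ids of workers whose last heartbeat is older than threshold.

    `heartbeats` maps worker id -> datetime of last heartbeat. Any task
    owned by a returned worker is a candidate for requeue().
    """
    return [wid for wid, seen in heartbeats.items() if now - seen > threshold]
```

With a 30-second heartbeat, a worker is considered stalled only after 60 seconds of silence, so a single delayed heartbeat does not trigger a spurious requeue.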
### Week 5: API Layer

**Deliverables:**
- [ ] FastAPI application structure
- [ ] REST endpoints (CRUD for tasks)
- [ ] WebSocket endpoint for real-time updates
- [ ] Authentication middleware

**REST Endpoints:**

```python
@app.post("/tasks")
async def create_task(task: TaskCreate) -> TaskResponse:
    """Enqueue a new task."""
    task_id = await queue.enqueue(task)
    return TaskResponse(task_id=task_id, status="pending")


@app.get("/tasks/{task_id}")
async def get_task(task_id: UUID) -> Task:
    """Get task status and result."""
    return await queue.get(task_id)


@app.get("/tasks")
async def list_tasks(
    user_id: str,
    status: Optional[TaskStatus] = None,
) -> List[TaskSummary]:
    """List tasks with optional filtering."""
    return await queue.list_by_user(user_id, status)


@app.post("/tasks/{task_id}/cancel")
async def cancel_task(task_id: UUID):
    """Cancel a pending or running task."""
    await queue.cancel(task_id)


@app.post("/tasks/{task_id}/retry")
async def retry_task(task_id: UUID):
    """Retry a failed task."""
    await queue.retry(task_id)
```

**WebSocket:**

```python
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()

    # Subscribe to Redis pub/sub for updates (redis is an async client,
    # so subscribing must be awaited)
    pubsub = redis.pubsub()
    await pubsub.subscribe("task_updates")

    async for message in pubsub.listen():
        if message["type"] == "message":
            # Payload is already JSON-encoded by the publisher
            await websocket.send_text(message["data"])
```
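The WebSocket endpoint only consumes the `task_updates` channel; the queue side must publish on it whenever a task changes status. A sketch of that producer half (the payload shape is an assumption, not part of the spec; `redis` is an async client):

```python
import asyncio
import json
from datetime import datetime, timezone
from typing import Any


async def publish_task_update(redis: Any, task_id: str, status: str) -> None:
    """Publish a status change on the channel the WebSocket endpoint subscribes to."""
    payload = json.dumps({
        "task_id": task_id,
        "status": status,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    await redis.publish("task_updates", payload)
```

Calling this from `complete()`, `fail()`, and `cancel()` keeps every connected dashboard client in sync without polling.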
### Week 6: Task Tool Integration

**Deliverables:**
- [ ] Modify the existing Task tool to use the queue
- [ ] `persist` flag for backward compatibility
- [ ] Polling support for task completion
- [ ] Migration guide for existing code

**Modified Task Tool:**

```python
class TaskTool:
    async def run(
        self,
        prompt: str,
        subagent_type: str,
        # ... existing args ...
        persist: bool = False,   # NEW
        priority: int = 100,     # NEW
        wait: bool = False,      # NEW
        timeout: int = 300,      # NEW
    ) -> TaskResult:
        if persist:
            # Enqueue and optionally wait
            task_id = await self.queue.enqueue(...)

            if wait:
                # Poll for completion
                result = await self._wait_for_task(task_id, timeout)
                return result
            else:
                # Return immediately with task_id
                return TaskResult(task_id=task_id, status="pending")
        else:
            # Legacy immediate execution
            return await self._execute_immediately(...)
```
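`_wait_for_task` is referenced above but not specified; one plausible shape is a poll loop against a monotonic deadline (the poll interval, terminal statuses, and standalone form are assumptions):

```python
import asyncio
import time
from typing import Any

TERMINAL_STATUSES = ("completed", "failed", "cancelled")


async def wait_for_task(
    queue: Any,
    task_id: Any,
    timeout: float,
    poll_interval: float = 1.0,
) -> Any:
    """Poll the queue until the task reaches a terminal status, or time out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        task = await queue.get(task_id)
        if task.status in TERMINAL_STATUSES:
            return task
        await asyncio.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
```

Phase 2's dashboard can replace this polling with the WebSocket channel, but a poll loop keeps the Task tool dependency-free.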
---

## Technical Specifications

### Task Data Model

```python
@dataclass
class Task:
    id: UUID
    subagent_type: str
    prompt: str
    system_prompt: Optional[str] = None
    model: Optional[str] = None

    # State (every field after the first defaulted one also needs a default)
    status: TaskStatus = TaskStatus.PENDING
    priority: int = 100
    created_at: datetime = field(default_factory=datetime.utcnow)
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None

    # Execution
    agent_id: Optional[UUID] = None
    retry_count: int = 0
    max_retries: int = 3

    # Results
    result: Optional[dict] = None
    error: Optional[str] = None
    exit_code: Optional[int] = None

    # Metadata
    tags: List[str] = field(default_factory=list)
    user_id: str = ""
    parent_task: Optional[UUID] = None

    # Cost tracking (NEW)
    input_tokens: int = 0
    output_tokens: int = 0
    estimated_cost: float = 0.0
```
### Retry Logic

```python
async def retry_with_backoff(task: Task) -> bool:
    if task.retry_count >= task.max_retries:
        return False  # Permanent failure

    # Exponential backoff: 2^retry_count seconds
    delay = min(2 ** task.retry_count, 300)  # Cap at 5 min

    await asyncio.sleep(delay)
    task.retry_count += 1

    # Re-enqueue with the same priority
    await queue.enqueue(task, priority=task.priority)
    return True
```
### Error Classification

| Error | Retry? | Action |
|-------|--------|--------|
| Subagent crash | Yes | Requeue with backoff |
| Syntax error | No | Fail immediately |
| API rate limit | Yes | Exponential backoff |
| Out of memory | No | Alert admin, fail |
| Redis connection | Yes | Reconnect, retry |
| Timeout | Yes | Retry with longer timeout |
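The table translates directly into a lookup the worker can consult when a task fails. A sketch (the string error keys are an assumption; real code would more likely classify by exception type):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailurePolicy:
    retry: bool
    action: str


# One entry per row of the table above.
FAILURE_POLICIES = {
    "subagent_crash":   FailurePolicy(True,  "requeue with backoff"),
    "syntax_error":     FailurePolicy(False, "fail immediately"),
    "api_rate_limit":   FailurePolicy(True,  "exponential backoff"),
    "out_of_memory":    FailurePolicy(False, "alert admin, fail"),
    "redis_connection": FailurePolicy(True,  "reconnect, retry"),
    "timeout":          FailurePolicy(True,  "retry with longer timeout"),
}


def should_retry(error_kind: str) -> bool:
    """Unknown errors default to no retry — the conservative choice."""
    policy = FAILURE_POLICIES.get(error_kind)
    return policy.retry if policy else False
```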
---

## Testing Strategy

### Unit Tests

```python
# test_queue.py
def test_enqueue_creates_pending_task(): ...
def test_dequeue_removes_from_pending(): ...
def test_complete_moves_to_completed(): ...
def test_fail_triggers_retry(): ...
def test_max_retries_exceeded(): ...
def test_cancel_stops_running_task(): ...
```

### Integration Tests

```python
# test_worker.py
async def test_worker_processes_task(): ...
async def test_worker_handles_failure(): ...
async def test_worker_heartbeat(): ...
async def test_stall_recovery(): ...
```

### Durability Tests

```python
# test_durability.py
async def test_tasks_survive_restart():
    """Enqueue tasks, restart Redis, verify tasks persist."""

async def test_worker_crash_recovery():
    """Kill worker mid-task, verify task requeued."""

async def test_concurrent_workers():
    """5 workers, 20 tasks, verify all complete."""
```
---

## Dependencies

### Required

```toml
redis = { version = "^5.0", extras = ["hiredis"] }
fastapi = "^0.115"
websockets = "^13.0"
pydantic = "^2.0"
```

### Development

```toml
pytest = "^8.0"
pytest-asyncio = "^0.24"
httpx = "^0.27"  # For FastAPI test client
```

### Infrastructure

- Redis 7.0+ (local or cloud)
- Python 3.11+
---

## Migration Guide

### For Existing Task Tool Users

**Before:**

```python
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder"
)  # Blocks until complete
```

**After (backward compatible):**

```python
# Same behavior (immediate execution)
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=False  # default
)
```

**New (persistent):**

```python
# Fire-and-forget
task_id = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=True
)

# Wait for completion
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=True,
    wait=True,
    timeout=600
)
```
---

## Success Criteria

| Metric | Target | Measurement |
|--------|--------|-------------|
| Task durability | 100% | Tasks never lost on restart |
| Throughput | 10 tasks/min | With 3 workers |
| Latency | <100ms | Enqueue → pending |
| Recovery time | <60s | Worker crash → requeue |
| API uptime | 99.9% | Health check endpoint |
| Backward compat | 100% | Existing tests pass |

---

## Risk Mitigation

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Redis complexity | Low | Medium | Start with simple ops |
| Worker pool bugs | Medium | High | Extensive testing |
| Performance issues | Low | Medium | Load testing in Week 5 |
| Migration breakage | Low | High | Full test suite |
---

## Handoff to Phase 2

**Phase 2 Prereqs:**
- [ ] All Phase 1 success criteria met
- [ ] API documentation complete
- [ ] WebSocket tested with a simple client
- [ ] Cost tracking working

**Phase 2 Inputs:**
- Task queue API (REST + WebSocket)
- Task data model
- Worker management API
- Redis schema
---

## Appendix: Quick Reference

### Redis Commands Cheat Sheet

```bash
# Start Redis
docker run -d -p 6379:6379 redis:7-alpine

# Monitor
redis-cli monitor

# Inspect keys
redis-cli KEYS "task:*"
redis-cli HGETALL task:abc-123

# Clear queue
redis-cli FLUSHDB
```

### Development Commands

```bash
# Start worker
python -m letta_ade.worker.runner --agent-id worker-1

# Start API
uvicorn letta_ade.api:app --reload

# Run tests
pytest tests/ -v --tb=short

# Integration test
pytest tests/integration/ -v
```

---

*Ready for implementation. Questions? See community-ade-research-synthesis-2026-03-18.md for full context.*