
# Phase 1: Orchestration Layer Design
**Date:** March 18, 2026
**Architect:** Researcher subagent
**Goal:** Design persistent task queue system for Community ADE
---
## 1. Core Data Model
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional
from uuid import UUID

@dataclass
class Task:
    id: UUID                                 # Unique task identifier
    subagent_type: str                       # "researcher", "coder", etc.
    prompt: str                              # User prompt to subagent
    system_prompt: Optional[str] = None      # Override default system prompt
    model: Optional[str] = None              # Override default model
    # State tracking
    status: TaskStatus = TaskStatus.PENDING  # pending/running/completed/failed/cancelled
    priority: int = 100                      # Lower = higher priority
    created_at: datetime = field(default_factory=datetime.utcnow)
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    # Execution tracking
    agent_id: Optional[UUID] = None          # Assigned worker agent
    retry_count: int = 0
    max_retries: int = 3
    # Results
    result: Optional[dict] = None            # Success result
    error: Optional[str] = None              # Failure message
    exit_code: Optional[int] = None          # Subprocess exit code
    # Metadata
    tags: List[str] = field(default_factory=list)  # For filtering/grouping
    user_id: str = ""                        # Task owner
    parent_task: Optional[UUID] = None       # For task chains
```
### TaskStatus Enum
```python
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"        # Waiting for worker
    RUNNING = "running"        # Assigned to worker
    COMPLETED = "completed"    # Success
    FAILED = "failed"          # Permanent failure (max retries)
    CANCELLED = "cancelled"    # User cancelled
    STALLED = "stalled"        # Worker crashed, needs recovery
```
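Since the task payload is later persisted as a flat Redis hash (§3), rich fields need flattening to strings before `HSET`. A hypothetical helper sketching one convention (an assumption, not fixed by this design): drop `None` fields, ISO-format datetimes, comma-join tags.

```python
from datetime import datetime, timezone

def to_redis_hash(task_fields: dict) -> dict:
    """Flatten a task's fields into Redis-hash-friendly strings.
    Convention (illustrative): None fields are omitted, datetimes
    become ISO 8601, lists are comma-joined."""
    flat = {}
    for key, value in task_fields.items():
        if value is None:
            continue  # Optional field not set; omit from the hash
        if isinstance(value, datetime):
            value = value.isoformat()
        elif isinstance(value, (list, tuple)):
            value = ",".join(map(str, value))
        flat[key] = str(value)
    return flat
```

Reading a task back would reverse these conversions field by field.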
---
## 2. State Machine
```
                  +-----------+
                  |  PENDING  |
                  +-----+-----+
                        | dequeue()
                        v
+--------+  fail()  +-----------+  success()  +-----------+
| FAILED |<---------+  RUNNING  +------------>| COMPLETED |
+---+----+  (max    +--+--+--+--+             +-----------+
    |      retries)    ^  |  ^
    |    retry()       |  |  | reassign
    +------------------+  |  |
        stall detected    v  |
                      +------+--+
                      | STALLED |
                      +---------+

     (any non-terminal state) --cancel()--> CANCELLED
```
### Transitions
- `PENDING → RUNNING`: Worker dequeues task
- `RUNNING → COMPLETED`: Subagent succeeds
- `RUNNING → FAILED`: Subagent fails, max retries reached
- `RUNNING → STALLED`: Worker heartbeat timeout
- `STALLED → RUNNING`: Reassigned to new worker
- `FAILED → RUNNING`: Manual retry triggered
- `Any → CANCELLED`: User cancellation
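The transition rules above can be centralized in a lookup table so every status change is validated in one place; a sketch (restating `TaskStatus` so the snippet is self-contained):

```python
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    STALLED = "stalled"

# Legal transitions from the state machine above.
# CANCELLED is reachable from any non-terminal state.
_TRANSITIONS = {
    TaskStatus.PENDING: {TaskStatus.RUNNING, TaskStatus.CANCELLED},
    TaskStatus.RUNNING: {TaskStatus.COMPLETED, TaskStatus.FAILED,
                         TaskStatus.STALLED, TaskStatus.CANCELLED},
    TaskStatus.STALLED: {TaskStatus.RUNNING, TaskStatus.CANCELLED},
    TaskStatus.FAILED: {TaskStatus.RUNNING, TaskStatus.CANCELLED},  # manual retry
    TaskStatus.COMPLETED: set(),   # terminal
    TaskStatus.CANCELLED: set(),   # terminal
}

def transition(current: TaskStatus, target: TaskStatus) -> TaskStatus:
    """Return the new status, or raise if the transition is illegal."""
    if target not in _TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```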
---
## 3. Redis Data Structures
| Purpose | Structure | Key Pattern |
|---------|-----------|-------------|
| Task payload | Hash | `task:{task_id}` |
| Pending queue | Sorted Set (by priority) | `queue:pending` |
| Running set | Sorted Set (by start time) | `queue:running` |
| Worker registry | Hash | `worker:{agent_id}` |
| Status index | Set per status | `status:{status}` |
| User tasks | Set | `user:{user_id}:tasks` |
### Example Redis Operations
```redis
# Enqueue (pending)
ZADD queue:pending {priority} {task_id}
HSET task:{task_id} status pending created_at {timestamp} ...
SADD status:pending {task_id}

# Dequeue (atomic, optimistic locking)
WATCH queue:pending
# Peek at the highest-priority task (lowest score); popping here
# would mutate the key before the transaction starts
task_id = ZRANGE queue:pending 0 0
MULTI
ZREM queue:pending {task_id}
ZADD queue:running {now} {task_id}
HSET task:{task_id} status running agent_id {worker} started_at {now}
SMOVE status:pending status:running {task_id}
EXEC  # Aborts if queue:pending changed since WATCH; retry on failure

# Complete
ZREM queue:running {task_id}
SMOVE status:running status:completed {task_id}
HSET task:{task_id} status completed result {...} completed_at {now}

# Fail with retry
HINCRBY task:{task_id} retry_count 1
ZADD queue:pending {priority} {task_id}  # Re-enqueue
ZREM queue:running {task_id}
SMOVE status:running status:pending {task_id}
HSET task:{task_id} status pending error {...}

# Stall recovery (cron job)
# queue:running is scored by start time; find entries older than threshold
ZRANGEBYSCORE queue:running 0 {now - threshold}
# For each task whose worker heartbeat has expired:
ZREM queue:running {task_id}
SMOVE status:running status:stalled {task_id}
ZADD queue:pending {priority} {task_id}
```
---
## 4. Key API Methods
```python
class TaskQueue:
    # Core operations
    async def enqueue(self, task: Task) -> UUID: ...
    async def dequeue(self, worker_id: UUID, timeout_ms: int = 5000) -> Optional[Task]: ...
    async def complete(self, task_id: UUID, result: dict) -> None: ...
    async def fail(self, task_id: UUID, error: str, retryable: bool = True) -> None: ...
    async def cancel(self, task_id: UUID) -> None: ...

    # Management
    async def retry(self, task_id: UUID) -> None: ...                     # Manual retry
    async def requeue_stalled(self, max_age_ms: int = 60000) -> int: ...  # Recover crashed
    async def get_status(self, task_id: UUID) -> TaskStatus: ...
    async def list_by_user(self, user_id: str, status: Optional[str] = None) -> List[TaskSummary]: ...

    # Worker management
    async def register_worker(self, agent_id: UUID, capacity: int) -> None: ...
    async def heartbeat(self, agent_id: UUID) -> None: ...
    async def unregister_worker(self, agent_id: UUID, reason: str) -> None: ...
```
---
## 5. Integration with Existing Task Tool
### Current Flow
```
Task tool → immediate subprocess spawn → wait → return result
```
### New Flow (with persistence)
```
Task tool → enqueue() → return task_id (immediate)
Background worker → dequeue() → spawn subprocess → complete()/fail()
Caller polls/gets notification when task completes
```
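Since `enqueue()` returns immediately, a caller that needs the result must poll (or subscribe to a webhook, per Phase 1d). A minimal polling sketch, assuming the `get_status()` method from §4; the helper name and terminal-state set are illustrative:

```python
import asyncio

TERMINAL_STATES = {"completed", "failed", "cancelled"}

async def wait_for_task(queue, task_id, poll_interval: float = 1.0,
                        timeout: float = 300.0) -> str:
    """Poll get_status() until the task reaches a terminal state."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        status = await queue.get_status(task_id)
        if status in TERMINAL_STATES:
            return status
        await asyncio.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} still running after {timeout}s")
```

A notification channel (Redis pub/sub or webhooks) would replace this loop once Phase 1d lands.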
### Changes to Task Tool Schema
```python
class TaskTool:
    async def run(
        self,
        prompt: str,
        subagent_type: str,
        # ... existing args ...
        persist: bool = False,             # NEW: enqueue instead of immediate run
        priority: int = 100,               # NEW
        tags: Optional[List[str]] = None,  # NEW
    ) -> TaskResult:
        if persist:
            task_id = await self.queue.enqueue(...)
            return TaskResult(task_id=task_id, status="pending")
        else:
            # Legacy: immediate execution
            ...
```
### Worker Agent Integration
**Worker subscribes to queue:**
```python
async def worker_loop(agent_id: UUID):
    while running:
        task = await queue.dequeue(agent_id, timeout_ms=5000)
        if task:
            # Spawn subprocess
            proc = await asyncio.create_subprocess_exec(
                "letta", "run-agent", f"--task-id={task.id}",
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            # Monitor and wait
            stdout, stderr = await proc.communicate()
            # Update queue based on result
            if proc.returncode == 0:
                await queue.complete(task.id, parse_result(stdout))
            else:
                await queue.fail(task.id, stderr.decode(), retryable=True)
```
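The stall-recovery path in §3 depends on workers refreshing a heartbeat. A companion coroutine the worker would run alongside its dequeue loop, sketched against the §4 `heartbeat()` method (the interval is an assumption):

```python
import asyncio

async def heartbeat_loop(queue, agent_id, interval: float = 15.0):
    """Refresh this worker's heartbeat until cancelled, so the
    stall-recovery job does not reclaim its running tasks."""
    while True:
        await queue.heartbeat(agent_id)
        await asyncio.sleep(interval)
```

The worker would start it with `asyncio.create_task(...)` before entering the dequeue loop and cancel it on graceful shutdown; the interval must be well under the stall threshold.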
---
## 6. Implementation Phases
### Phase 1a: In-Memory Prototype (Week 1)
- Python `asyncio.Queue` for pending tasks
- In-memory dict for task storage
- Single worker process
- No Redis dependency
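The Phase 1a prototype fits in a few dozen lines. A sketch under those constraints; class and payload names here are illustrative, not the final §4 API:

```python
import asyncio
from uuid import UUID, uuid4

class InMemoryTaskQueue:
    """Phase 1a prototype: asyncio.PriorityQueue plus a dict.
    Single process, no persistence, no retries."""

    def __init__(self):
        self._pending = asyncio.PriorityQueue()  # (priority, tiebreak, task_id)
        self._tasks: dict = {}

    async def enqueue(self, payload: dict, priority: int = 100) -> UUID:
        task_id = uuid4()
        self._tasks[task_id] = {**payload, "status": "pending", "priority": priority}
        # task_id.int breaks priority ties deterministically
        await self._pending.put((priority, task_id.int, task_id))
        return task_id

    async def dequeue(self) -> dict:
        _, _, task_id = await self._pending.get()
        task = self._tasks[task_id]
        task["status"] = "running"
        return {"id": task_id, **task}

    async def complete(self, task_id: UUID, result: dict) -> None:
        self._tasks[task_id].update(status="completed", result=result)

    def get_status(self, task_id: UUID) -> str:
        return self._tasks[task_id]["status"]
```

Swapping `_pending` for the Redis sorted set and `_tasks` for `task:{id}` hashes is the Phase 1b step.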
### Phase 1b: Redis Integration (Week 2)
- Replace queue with Redis
- Add task persistence
- Implement retry logic
- Add stall recovery
### Phase 1c: Worker Pool (Week 3-4)
- Multiple worker processes
- Worker heartbeat monitoring
- Task assignment logic
- Graceful shutdown handling
### Phase 1d: API & CLI (Week 5-6)
- REST API for task management
- CLI commands for queue inspection
- Task status dashboard endpoint
- Webhook notifications
### Phase 1e: Integration (Week 7-8)
- Modify Task tool to use queue
- Add persistence flag
- Maintain backward compatibility
- Migration path for existing code
---
## 7. Retry Logic with Exponential Backoff
```python
async def retry_with_backoff(task_id: UUID):
    task = await queue.get(task_id)
    if task.retry_count >= task.max_retries:
        await queue.fail(task_id, "Max retries exceeded", retryable=False)
        return
    # Exponential backoff: 2^retry_count seconds
    delay = min(2 ** task.retry_count, 300)  # Cap at 5 minutes
    await asyncio.sleep(delay)
    # Re-enqueue; the task carries its own priority
    await queue.enqueue(task)
```
---
## 8. Error Handling Strategy
| Error Type | Retry? | Action |
|------------|--------|--------|
| Subagent crash | Yes | Increment retry, requeue |
| Syntax error in code | No | Fail immediately |
| Timeout | Yes | Retry with longer timeout |
| API rate limit | Yes | Retry with exponential backoff |
| Out of memory | No | Fail, alert admin |
| Redis connection lost | Yes | Reconnect, retry operation |
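This table can be collapsed into a small classifier feeding the `retryable` flag on `fail()`. A sketch; the pattern strings are illustrative guesses, and a real implementation would classify on structured exception types or exit codes rather than message text:

```python
# Substrings are illustrative, keyed to the error categories above.
FATAL_PATTERNS = ("syntaxerror", "out of memory")
RETRYABLE_PATTERNS = ("crash", "timeout", "rate limit", "connection lost")

def is_retryable(error: str) -> bool:
    """Map an error message onto the retry policy in the table above.
    Unknown errors default to non-retryable (conservative)."""
    msg = error.lower()
    if any(p in msg for p in FATAL_PATTERNS):
        return False
    return any(p in msg for p in RETRYABLE_PATTERNS)
```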
---
## Next Steps
1. **Implement in-memory prototype** (Week 1)
2. **Add Redis persistence** (Week 2)
3. **Build worker pool** (Week 3-4)
4. **Ship API & CLI** (Week 5-6)
5. **Integrate with Task tool** (Week 7-8)
6. **Write tests for queue durability** (ongoing)
---
*Design by Researcher subagent, March 18, 2026*