community-ade/docs/ade-phase1-orchestration-design.md
Ani (Annie Tunturi) 00382055c6 Initial commit: Community ADE foundation
- Project structure: docs/, src/, tests/, proto/
- Research synthesis: Letta vs commercial ADEs
- Architecture: Redis Streams queue design
- Phase 1 orchestration design
- Execution plan and project state tracking
- Working subagent system (manager.ts fixes)

This is the foundation for a Community ADE built on Letta's
stateful agent architecture with git-native MemFS.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta Code <noreply@letta.com>
2026-03-18 10:30:20 -04:00


Phase 1: Orchestration Layer Design

Date: March 18, 2026
Architect: Researcher subagent
Goal: Design a persistent task queue system for Community ADE


1. Core Data Model

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional
from uuid import UUID

@dataclass
class Task:
    id: UUID                                   # Unique task identifier
    subagent_type: str                         # "researcher", "coder", etc.
    prompt: str                                # User prompt to subagent
    user_id: str                               # Task owner
    system_prompt: Optional[str] = None        # Override default system prompt
    model: Optional[str] = None                # Override default model

    # State tracking
    status: TaskStatus = TaskStatus.PENDING    # See TaskStatus enum below
    priority: int = 100                        # Lower = higher priority
    created_at: datetime = field(default_factory=datetime.utcnow)
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None

    # Execution tracking
    agent_id: Optional[UUID] = None            # Assigned worker agent
    retry_count: int = 0
    max_retries: int = 3

    # Results
    result: Optional[dict] = None              # Success result
    error: Optional[str] = None                # Failure message
    exit_code: Optional[int] = None            # Subprocess exit code

    # Metadata
    tags: List[str] = field(default_factory=list)  # For filtering/grouping
    parent_task: Optional[UUID] = None         # For task chains
TaskStatus Enum

class TaskStatus(Enum):
    PENDING = "pending"           # Waiting for worker
    RUNNING = "running"           # Assigned to worker
    COMPLETED = "completed"       # Success
    FAILED = "failed"             # Permanent failure (max retries)
    CANCELLED = "cancelled"       # User cancelled
    STALLED = "stalled"           # Worker crashed, needs recovery

2. State Machine

                   +-----------+
                   |  PENDING  |<--- retry() / re-enqueue (from FAILED, STALLED)
                   +-----+-----+
                         | dequeue()
                         v
+--------+  fail()  +---------+  success  +-----------+
| FAILED |<---------+ RUNNING +---------->| COMPLETED |
+--------+  (max    +--+---+--+           +-----------+
            retries)   |   |
    heartbeat timeout  |   | cancel()
        +--------------+   v
        v              +-----------+
   +---------+         | CANCELLED |
   | STALLED |         +-----------+
   +---------+

Transitions

  • PENDING → RUNNING: Worker dequeues the task
  • RUNNING → COMPLETED: Subagent succeeds
  • RUNNING → FAILED: Subagent fails and max retries are exhausted
  • RUNNING → STALLED: Worker heartbeat times out
  • STALLED → PENDING: Task is re-enqueued for a new worker
  • FAILED → PENDING: Manual retry re-enqueues the task
  • Any non-terminal state → CANCELLED: User cancellation
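The transition table above can be enforced with a small guard. This is an illustrative sketch, not part of the design: the `VALID_TRANSITIONS` map and `transition` helper are assumed names, and the FAILED/STALLED edges follow the Redis operations in section 3, where both paths re-enqueue to pending.

```python
from enum import Enum

class TaskStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"
    STALLED = "stalled"

# Legal moves between states; COMPLETED and CANCELLED are terminal.
VALID_TRANSITIONS = {
    TaskStatus.PENDING: {TaskStatus.RUNNING, TaskStatus.CANCELLED},
    TaskStatus.RUNNING: {TaskStatus.COMPLETED, TaskStatus.FAILED,
                         TaskStatus.STALLED, TaskStatus.CANCELLED},
    TaskStatus.STALLED: {TaskStatus.PENDING, TaskStatus.CANCELLED},
    TaskStatus.FAILED: {TaskStatus.PENDING},  # manual retry re-enqueues
    TaskStatus.COMPLETED: set(),
    TaskStatus.CANCELLED: set(),
}

def transition(current: TaskStatus, target: TaskStatus) -> TaskStatus:
    """Return the new status, or raise if the move is illegal."""
    if target not in VALID_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target
```

Centralizing the check means every queue operation (complete, fail, cancel, stall recovery) can share one definition of legality instead of re-deriving it.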

3. Redis Data Structures

Purpose          Structure                   Key Pattern
Task payload     Hash                        task:{task_id}
Pending queue    Sorted Set (by priority)    queue:pending
Running set      Sorted Set (by start time)  queue:running
Worker registry  Hash                        worker:{agent_id}
Status index     Set per status              status:{status}
User tasks       Set                         user:{user_id}:tasks

Example Redis Operations

# Enqueue (pending)
ZADD queue:pending {priority} {task_id}
HSET task:{task_id} status pending created_at {timestamp} ...
SADD status:pending {task_id}

# Dequeue (atomic; optimistic locking, retried if EXEC aborts)
WATCH queue:pending
task_id = ZRANGE queue:pending 0 0   # Peek at the highest-priority task
MULTI
    ZREM queue:pending {task_id}
    ZADD queue:running {now} {task_id}
    HSET task:{task_id} status running agent_id {worker} started_at {now}
    SMOVE status:pending status:running {task_id}
EXEC

# Complete
ZREM queue:running {task_id}
SADD status:completed {task_id}
HSET task:{task_id} status completed result {...} completed_at {now}

# Fail with retry
HINCRBY task:{task_id} retry_count 1
ZADD queue:pending {priority} {task_id}  # Re-enqueue
SMOVE status:running status:pending {task_id}
HSET task:{task_id} status pending error {...}

# Stall recovery (periodic job)
ZRANGE queue:running 0 -1
# For each task whose worker heartbeat is older than the threshold:
ZREM queue:running {task_id}
SMOVE status:running status:pending {task_id}
ZADD queue:pending {priority} {task_id}  # Re-enqueue for a new worker
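To exercise these key movements without a live Redis server, the same flow can be modeled against a tiny in-memory stand-in. `MiniQueue` and its attribute names are hypothetical; each method's comment names the Redis commands it mirrors.

```python
import heapq
import time

class MiniQueue:
    """In-memory stand-in for the Redis key layout above (illustrative only)."""

    def __init__(self):
        self.tasks = {}       # task:{task_id} hashes
        self.pending = []     # queue:pending, heap of (priority, task_id)
        self.running = {}     # queue:running, task_id -> started_at score
        self.status = {s: set() for s in ("pending", "running", "completed", "failed")}

    def enqueue(self, task_id, priority=100):
        # ZADD queue:pending / HSET task:{task_id} / SADD status:pending
        heapq.heappush(self.pending, (priority, task_id))
        self.tasks[task_id] = {"status": "pending", "priority": priority}
        self.status["pending"].add(task_id)

    def dequeue(self, worker_id):
        # Pop the lowest score (highest priority), then the MULTI block:
        # ZREM / ZADD queue:running / HSET / SMOVE
        if not self.pending:
            return None
        _, task_id = heapq.heappop(self.pending)
        self.running[task_id] = time.time()
        self.tasks[task_id].update(status="running", agent_id=worker_id)
        self.status["pending"].discard(task_id)
        self.status["running"].add(task_id)
        return task_id

    def complete(self, task_id, result):
        # ZREM queue:running / SMOVE status:running status:completed / HSET
        self.running.pop(task_id, None)
        self.status["running"].discard(task_id)
        self.status["completed"].add(task_id)
        self.tasks[task_id].update(status="completed", result=result)
```

The heap gives the same "lower score first" ordering as ZPOPMIN on a sorted set, which is the property the priority field relies on.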

4. Key API Methods

class TaskQueue:
    # Core operations
    async def enqueue(self, task: Task) -> UUID: ...
    async def dequeue(self, worker_id: UUID, timeout_ms: int = 5000) -> Optional[Task]: ...
    async def complete(self, task_id: UUID, result: dict) -> None: ...
    async def fail(self, task_id: UUID, error: str, retryable: bool = True) -> None: ...
    async def cancel(self, task_id: UUID) -> None: ...

    # Management
    async def retry(self, task_id: UUID) -> None: ...                       # Manual retry
    async def requeue_stalled(self, max_age_ms: int = 60000) -> int: ...    # Recover crashed workers' tasks
    async def get_status(self, task_id: UUID) -> TaskStatus: ...
    async def list_by_user(self, user_id: str, status: Optional[str] = None) -> List[TaskSummary]: ...

    # Worker management
    async def register_worker(self, agent_id: UUID, capacity: int) -> None: ...
    async def heartbeat(self, agent_id: UUID) -> None: ...
    async def unregister_worker(self, agent_id: UUID, reason: str) -> None: ...

5. Integration with Existing Task Tool

Current Flow

Task tool → immediate subprocess spawn → wait → return result

New Flow (with persistence)

Task tool → enqueue() → return task_id (immediate)
                    ↓
Background worker → dequeue() → spawn subprocess → complete()/fail()
                    ↓
Caller polls/gets notification when task completes

Changes to Task Tool Schema

class TaskTool:
    async def run(
        self,
        prompt: str,
        subagent_type: str,
        # ... existing args ...
        persist: bool = False,           # NEW: enqueue instead of immediate run
        priority: int = 100,            # NEW
        tags: Optional[List[str]] = None  # NEW
    ) -> TaskResult:
        if persist:
            task_id = await self.queue.enqueue(...)
            return TaskResult(task_id=task_id, status="pending")
        else:
            # Legacy: immediate execution
            ...
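The "caller polls" step in the new flow can be sketched as a helper that polls `get_status` until the task reaches a terminal state or a deadline passes. `wait_for_task`, its parameters, and the string statuses are illustrative assumptions, not part of the design.

```python
import asyncio

TERMINAL = {"completed", "failed", "cancelled"}

async def wait_for_task(queue, task_id, poll_interval=1.0, timeout=300.0):
    """Poll until the task is terminal; raise TimeoutError past the deadline."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while loop.time() < deadline:
        status = await queue.get_status(task_id)
        if status in TERMINAL:
            return status
        await asyncio.sleep(poll_interval)
    raise TimeoutError(f"task {task_id} not terminal after {timeout}s")
```

In production a webhook or pub/sub notification (planned in Phase 1d) would replace polling, but a poll loop is the simplest correct client during Phases 1a-1c.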

Worker Agent Integration

Worker subscribes to queue:

async def worker_loop(agent_id: UUID, queue: TaskQueue, shutdown: asyncio.Event):
    while not shutdown.is_set():
        task = await queue.dequeue(agent_id, timeout_ms=5000)
        if task is None:
            continue

        # Spawn subprocess for the subagent run
        proc = await asyncio.create_subprocess_exec(
            "letta", "run-agent", f"--task-id={task.id}",
            stdout=asyncio.subprocess.PIPE,
            stderr=asyncio.subprocess.PIPE,
        )

        # Monitor and wait
        stdout, stderr = await proc.communicate()

        # Update queue based on result
        if proc.returncode == 0:
            await queue.complete(task.id, parse_result(stdout))
        else:
            await queue.fail(task.id, stderr.decode(), retryable=True)

6. Implementation Phases

Phase 1a: In-Memory Prototype (Week 1)

  • Python asyncio.Queue for pending tasks
  • In-memory dict for task storage
  • Single worker process
  • No Redis dependency
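The Phase 1a scope above fits in a few dozen lines. Here is one possible sketch using asyncio.Queue and a dict, with an injected handler coroutine standing in for the subprocess spawn; `run_prototype` and `handler` are hypothetical names chosen for testability.

```python
import asyncio

async def run_prototype(prompts, handler):
    """Enqueue one task per prompt, drain them with a single worker."""
    queue: asyncio.Queue = asyncio.Queue()  # pending tasks
    tasks = {}                              # in-memory task storage

    for i, prompt in enumerate(prompts):
        task_id = f"task-{i}"
        tasks[task_id] = {"prompt": prompt, "status": "pending"}
        await queue.put(task_id)

    async def worker():
        while not queue.empty():
            task_id = await queue.get()
            tasks[task_id]["status"] = "running"
            try:
                tasks[task_id]["result"] = await handler(tasks[task_id]["prompt"])
                tasks[task_id]["status"] = "completed"
            except Exception as exc:
                tasks[task_id]["status"] = "failed"
                tasks[task_id]["error"] = str(exc)
            queue.task_done()

    await worker()  # single worker, per Phase 1a
    return tasks
```

Because the state lives in plain Python objects, swapping the dict and asyncio.Queue for the Redis structures in section 3 is the only change needed for Phase 1b.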

Phase 1b: Redis Integration (Week 2)

  • Replace queue with Redis
  • Add task persistence
  • Implement retry logic
  • Add stall recovery

Phase 1c: Worker Pool (Week 3-4)

  • Multiple worker processes
  • Worker heartbeat monitoring
  • Task assignment logic
  • Graceful shutdown handling

Phase 1d: API & CLI (Week 5-6)

  • REST API for task management
  • CLI commands for queue inspection
  • Task status dashboard endpoint
  • Webhook notifications

Phase 1e: Integration (Week 7-8)

  • Modify Task tool to use queue
  • Add persistence flag
  • Maintain backward compatibility
  • Migration path for existing code

7. Retry Logic with Exponential Backoff

async def retry_with_backoff(task_id: UUID):
    task = await queue.get(task_id)

    if task.retry_count >= task.max_retries:
        await queue.fail(task_id, "Max retries exceeded", retryable=False)
        return

    # Exponential backoff: 2^retry_count seconds, capped at 5 minutes
    delay = min(2 ** task.retry_count, 300)
    await asyncio.sleep(delay)

    # Re-enqueue (same priority carried on the task)
    task.retry_count += 1
    await queue.enqueue(task)
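A common refinement to the fixed schedule above is jitter, so that many tasks failing at once (e.g. during a rate-limit event) don't all retry in lockstep. This "full jitter" variant is a suggestion, not part of the design; `backoff_delay` is an illustrative name.

```python
import random

def backoff_delay(retry_count: int, base: float = 2.0, cap: float = 300.0) -> float:
    """Full-jitter backoff: uniform in [0, min(cap, base ** retry_count)]."""
    return random.uniform(0.0, min(cap, base ** retry_count))
```

The worst-case wait is unchanged (still capped at 5 minutes), but the expected wait halves and retries spread evenly across the window.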

8. Error Handling Strategy

Error Type             Retry?  Action
Subagent crash         Yes     Increment retry count, requeue
Syntax error in code   No      Fail immediately
Timeout                Yes     Retry with a longer timeout
API rate limit         Yes     Retry with exponential backoff
Out of memory          No      Fail, alert admin
Redis connection lost  Yes     Reconnect, retry the operation
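The table above reduces to a lookup plus the retry-budget check from the data model. The error-type keys and the `should_retry` helper are illustrative names, not a defined wire format.

```python
# Retryability per error class, mirroring the table above.
RETRYABLE = {
    "subagent_crash": True,
    "syntax_error": False,
    "timeout": True,
    "rate_limit": True,
    "out_of_memory": False,
    "redis_connection_lost": True,
}

def should_retry(error_type: str, retry_count: int, max_retries: int = 3) -> bool:
    """Retry only known-retryable errors, and only while under the budget."""
    return RETRYABLE.get(error_type, False) and retry_count < max_retries
```

Unknown error types default to non-retryable, which keeps a misclassified permanent failure from looping through the queue.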

Next Steps

  1. Implement in-memory prototype (Week 1)
  2. Add Redis persistence (Week 2)
  3. Build worker pool (Week 3-4)
  4. Integrate with Task tool (Week 7-8)
  5. Write tests for queue durability (ongoing)

Design by Researcher subagent, March 18, 2026