Initial commit: Community ADE foundation

- Project structure: docs/, src/, tests/, proto/
- Research synthesis: Letta vs commercial ADEs
- Architecture: Redis Streams queue design
- Phase 1 orchestration design
- Execution plan and project state tracking
- Working subagent system (manager.ts fixes)

This is the foundation for a Community ADE built on Letta's
stateful agent architecture with git-native MemFS.

👾 Generated with [Letta Code](https://letta.com)

Co-Authored-By: Letta Code <noreply@letta.com>
Ani (Annie Tunturi)
2026-03-18 10:30:20 -04:00
commit 00382055c6
9 changed files with 2970 additions and 0 deletions

# Phase 1 Execution Plan: Orchestration Layer
**Date:** March 18, 2026
**Status:** Ready for Implementation
**Estimated Duration:** 6 weeks
**Owner:** TBD
---
## Overview
This document provides actionable implementation guidance for Phase 1 of the Community ADE, based on synthesized research from commercial tools (Intent, Warp) and open-source alternatives (Aider, Cline, Agno).
---
## Key Research Insights
### 1. Letta's Competitive Position
**✅ Strongest Open-Source Position:**
- No competitor combines: stateful agents + hierarchical memory + git-native persistence + subagent orchestration
- Aider has git integration but no agent memory
- Cline is session-based with no persistence
- Agno lacks Letta's memory architecture
**⚠️ Commercial Tools Lead in UX:**
- Warp: Terminal-native with rich context (@file, images)
- Intent: Specification-driven development
- Both have web dashboards; Letta needs one
### 2. Technical Pattern Validation
**Redis + Workers (Selected for Phase 1):**
- ✅ Proven pattern (Celery uses Redis under the hood)
- ✅ Simpler than Temporal for our use case
- ✅ More control over data model
- ⚠️ Temporal deferred to Phase 2 evaluation
**React + FastAPI (Selected for Phase 2):**
- ✅ Industry standard
- ✅ shadcn/ui provides accessible components
- ✅ TanStack Query for caching/real-time sync
---
## Phase 1 Scope
### Goals
1. Replace in-process Task execution with persistent queue
2. Ensure tasks survive agent restarts
3. Support 5+ concurrent workers
4. Maintain backward compatibility
### Out of Scope (Phase 2+)
- Web dashboard (Phase 2)
- Temporal workflows (Phase 2 evaluation)
- GitHub integration (Phase 3)
- Computer Use (Phase 4)
---
## Implementation Breakdown
### Week 1: In-Memory Prototype
**Deliverables:**
- [ ] `TaskQueue` class with asyncio.Queue
- [ ] Task dataclass with all fields
- [ ] Worker process skeleton
- [ ] Basic enqueue/dequeue/complete/fail operations
**Testing:**
```python
# Test: Task survives worker crash
# Test: Concurrent task execution
# Test: Priority ordering
```
**Code Structure:**
```
letta_ade/
├── __init__.py
├── queue/
│   ├── __init__.py
│   ├── models.py        # Task dataclass, enums
│   ├── memory_queue.py  # Week 1 implementation
│   └── base.py          # Abstract base class
└── worker/
    ├── __init__.py
    └── runner.py        # Worker process logic
```
### Week 2: Redis Integration
**Deliverables:**
- [ ] Redis connection manager
- [ ] Task serialization (JSON/pickle)
- [ ] Atomic dequeue with WATCH/MULTI/EXEC
- [ ] Status tracking (Sets per status)
**Redis Schema:**
```redis
# Task storage
HSET task:{uuid} field value ...
# Priority queue (pending)
ZADD queue:pending {priority} {task_id}
# Running tasks
ZADD queue:running {started_at} {task_id}
# Status index
SADD status:pending {task_id}
SADD status:running {task_id}
SADD status:completed {task_id}
SADD status:failed {task_id}
# User index
SADD user:{user_id}:tasks {task_id}
```
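Because `HSET` only stores flat string fields, nested values need explicit serialization. A hypothetical round-trip helper (the choice of JSON-encoded fields is an assumption):

```python
import json
from typing import Any

# Fields serialized as JSON because Redis hashes only store flat strings
_JSON_FIELDS = {"result", "tags"}


def task_to_hash(task: dict[str, Any]) -> dict[str, str]:
    """Flatten a task dict into the str->str mapping HSET expects."""
    out: dict[str, str] = {}
    for key, value in task.items():
        if value is None:
            continue  # skip unset optionals; an absent field reads back as None
        out[key] = json.dumps(value) if key in _JSON_FIELDS else str(value)
    return out


def hash_to_task(raw: dict[str, str]) -> dict[str, Any]:
    """Inverse of task_to_hash for the JSON-encoded fields."""
    return {
        key: json.loads(value) if key in _JSON_FIELDS else value
        for key, value in raw.items()
    }
```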
**Dependencies:**
```toml
[tool.poetry.dependencies]
redis = { version = "^5.0", extras = ["hiredis"] }
```
### Week 3-4: Worker Pool + Heartbeat
**Deliverables:**
- [ ] Multiple worker processes
- [ ] Worker heartbeat (every 30s)
- [ ] Stall detection (2x heartbeat timeout)
- [ ] Graceful shutdown handling
- [ ] Worker capacity management
**Worker Logic:**
```python
async def worker_loop(agent_id: UUID, queue: TaskQueue, stop: asyncio.Event):
    while not stop.is_set():
        # Send heartbeat so the stall detector knows we are alive
        await queue.heartbeat(agent_id)

        # Try to get a task (5s timeout)
        task = await queue.dequeue(agent_id, timeout_ms=5000)
        if task:
            # Spawn subagent process
            proc = await asyncio.create_subprocess_exec(
                "letta", "run-agent",
                f"--task-id={task.id}",
                stdout=asyncio.subprocess.PIPE,
                stderr=asyncio.subprocess.PIPE,
            )
            # Wait for completion
            stdout, stderr = await proc.communicate()

            # Update queue
            if proc.returncode == 0:
                await queue.complete(task.id, parse_result(stdout))
            else:
                await queue.fail(task.id, stderr.decode())

        # Brief pause to prevent a tight loop
        await asyncio.sleep(0.1)
```
**Stall Recovery (Cron job):**
```python
async def recover_stalled_tasks(queue: TaskQueue, max_age: timedelta):
    """Requeue tasks from crashed workers."""
    stalled = await queue.find_stalled(max_age)
    for task_id in stalled:
        await queue.requeue(task_id)
```
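The stall check itself reduces to a timestamp comparison. A pure helper (the heartbeat-map shape is an assumption) that flags workers past 2× the heartbeat interval:

```python
from datetime import datetime, timedelta


def find_stalled_workers(
    heartbeats: dict[str, datetime],
    now: datetime,
    heartbeat_interval: timedelta = timedelta(seconds=30),
) -> list[str]:
    """Workers whose last heartbeat is older than 2x the interval are stalled."""
    cutoff = now - 2 * heartbeat_interval
    return [worker for worker, last in heartbeats.items() if last < cutoff]
```

Keeping this logic pure makes the 2× timeout rule unit-testable without a Redis instance or a running worker pool.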
### Week 5: API Layer
**Deliverables:**
- [ ] FastAPI application structure
- [ ] REST endpoints (CRUD for tasks)
- [ ] WebSocket endpoint for real-time updates
- [ ] Authentication middleware
**REST Endpoints:**
```python
@app.post("/tasks")
async def create_task(task: TaskCreate) -> TaskResponse:
    """Enqueue a new task."""
    task_id = await queue.enqueue(task)
    return TaskResponse(task_id=task_id, status="pending")


@app.get("/tasks/{task_id}")
async def get_task(task_id: UUID) -> Task:
    """Get task status and result."""
    return await queue.get(task_id)


@app.get("/tasks")
async def list_tasks(
    user_id: str,
    status: Optional[TaskStatus] = None,
) -> List[TaskSummary]:
    """List tasks with optional filtering."""
    return await queue.list_by_user(user_id, status)


@app.post("/tasks/{task_id}/cancel")
async def cancel_task(task_id: UUID):
    """Cancel a pending or running task."""
    await queue.cancel(task_id)


@app.post("/tasks/{task_id}/retry")
async def retry_task(task_id: UUID):
    """Retry a failed task."""
    await queue.retry(task_id)
```
**WebSocket:**
```python
@app.websocket("/ws")
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()
    # Subscribe to Redis pub/sub for updates (requires the redis.asyncio client;
    # the sync client cannot be iterated with `async for`)
    pubsub = redis.pubsub()
    await pubsub.subscribe("task_updates")
    async for message in pubsub.listen():
        if message["type"] == "message":
            await websocket.send_json(message["data"])
```
### Week 6: Task Tool Integration
**Deliverables:**
- [ ] Modify existing Task tool to use queue
- [ ] `persist` flag for backward compatibility
- [ ] Polling support for task completion
- [ ] Migration guide for existing code
**Modified Task Tool:**
```python
class TaskTool:
    async def run(
        self,
        prompt: str,
        subagent_type: str,
        # ... existing args ...
        persist: bool = False,  # NEW
        priority: int = 100,    # NEW
        wait: bool = False,     # NEW
        timeout: int = 300,     # NEW
    ) -> TaskResult:
        if persist:
            # Enqueue and optionally wait
            task_id = await self.queue.enqueue(...)
            if wait:
                # Poll for completion
                result = await self._wait_for_task(task_id, timeout)
                return result
            else:
                # Return immediately with task_id
                return TaskResult(task_id=task_id, status="pending")
        else:
            # Legacy immediate execution
            return await self._execute_immediately(...)
```
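The `_wait_for_task` polling step can be sketched as a standalone coroutine; the `get_status` callable here is a stand-in for a queue lookup, and the terminal-status set is an assumption:

```python
import asyncio
from typing import Awaitable, Callable

TERMINAL = {"completed", "failed", "cancelled"}


async def wait_for_task(
    get_status: Callable[[str], Awaitable[str]],
    task_id: str,
    timeout: float = 300.0,
    poll_interval: float = 1.0,
) -> str:
    """Poll until the task reaches a terminal status or the timeout expires."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        status = await get_status(task_id)
        if status in TERMINAL:
            return status
        if loop.time() >= deadline:
            raise TimeoutError(f"task {task_id} still {status} after {timeout}s")
        await asyncio.sleep(poll_interval)
```

Polling keeps Week 6 simple; the Phase 2 dashboard can switch to the WebSocket channel for push-based completion instead.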
---
## Technical Specifications
### Task Data Model
```python
@dataclass
class Task:
    # Identity (required fields first: dataclasses reject a field without a
    # default after one that has a default)
    id: UUID
    subagent_type: str
    prompt: str
    user_id: str
    system_prompt: Optional[str] = None
    model: Optional[str] = None
    # State
    status: TaskStatus = TaskStatus.PENDING
    priority: int = 100
    created_at: datetime = field(default_factory=datetime.utcnow)
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    # Execution
    agent_id: Optional[UUID] = None
    retry_count: int = 0
    max_retries: int = 3
    # Results
    result: Optional[dict] = None
    error: Optional[str] = None
    exit_code: Optional[int] = None
    # Metadata
    tags: List[str] = field(default_factory=list)
    parent_task: Optional[UUID] = None
    # Cost tracking (NEW)
    input_tokens: int = 0
    output_tokens: int = 0
    estimated_cost: float = 0.0
```
### Retry Logic
```python
async def retry_with_backoff(queue: TaskQueue, task: Task) -> bool:
    if task.retry_count >= task.max_retries:
        return False  # Permanent failure

    # Exponential backoff: 2^retry_count seconds, capped at 5 min
    delay = min(2 ** task.retry_count, 300)
    await asyncio.sleep(delay)

    task.retry_count += 1
    # Re-enqueue with the same priority
    await queue.enqueue(task, priority=task.priority)
    return True
```
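For reference, the capped schedule above produces delays of 1, 2, 4, ... seconds, flattening at 300:

```python
def backoff_delay(retry_count: int, cap: int = 300) -> int:
    """Delay in seconds before the Nth retry: 2^N, capped at 5 minutes."""
    return min(2 ** retry_count, cap)


# Delays for the first ten retry attempts
schedule = [backoff_delay(n) for n in range(10)]
```

With `max_retries = 3` (the dataclass default), a task spends at most 1 + 2 + 4 = 7 seconds in backoff before failing permanently.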
### Error Classification
| Error | Retry? | Action |
|-------|--------|--------|
| Subagent crash | Yes | Requeue with backoff |
| Syntax error | No | Fail immediately |
| API rate limit | Yes | Exponential backoff |
| Out of memory | No | Alert admin, fail |
| Redis connection | Yes | Reconnect, retry |
| Timeout | Yes | Retry with longer timeout |
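The classification table can be encoded as a small lookup for the failure handler to consult (the error-kind names are hypothetical identifiers, not an existing enum):

```python
# Whether each classified error is retryable, mirroring the table above
RETRYABLE = {
    "subagent_crash": True,
    "syntax_error": False,
    "rate_limit": True,
    "out_of_memory": False,
    "redis_connection": True,
    "timeout": True,
}


def should_retry(error_kind: str, retry_count: int, max_retries: int = 3) -> bool:
    """Retry only retryable errors that still have budget left."""
    if retry_count >= max_retries:
        return False
    # Unknown error kinds fail fast rather than loop indefinitely
    return RETRYABLE.get(error_kind, False)
```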
---
## Testing Strategy
### Unit Tests
```python
# test_queue.py
def test_enqueue_creates_pending_task(): ...
def test_dequeue_removes_from_pending(): ...
def test_complete_moves_to_completed(): ...
def test_fail_triggers_retry(): ...
def test_max_retries_exceeded(): ...
def test_cancel_stops_running_task(): ...
```
### Integration Tests
```python
# test_worker.py
async def test_worker_processes_task(): ...
async def test_worker_handles_failure(): ...
async def test_worker_heartbeat(): ...
async def test_stall_recovery(): ...
```
### Durability Tests
```python
# test_durability.py
async def test_tasks_survive_restart():
    """Enqueue tasks, restart Redis, verify tasks persist."""

async def test_worker_crash_recovery():
    """Kill worker mid-task, verify task requeued."""

async def test_concurrent_workers():
    """5 workers, 20 tasks, verify all complete."""
```
---
## Dependencies
### Required
```toml
redis = { version = "^5.0", extras = ["hiredis"] }
fastapi = "^0.115"
websockets = "^13.0"
pydantic = "^2.0"
```
### Development
```toml
pytest = "^8.0"
pytest-asyncio = "^0.24"
httpx = "^0.27" # For FastAPI test client
```
### Infrastructure
- Redis 7.0+ (local or cloud)
- Python 3.11+
---
## Migration Guide
### For Existing Task Tool Users
**Before:**
```python
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
)  # Blocks until complete
```
**After (backward compatible):**
```python
# Same behavior (immediate execution)
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=False,  # default
)
```
**New (persistent):**
```python
# Fire-and-forget
task_id = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=True,
)

# Wait for completion
result = await task_tool.run(
    prompt="Create a React component",
    subagent_type="coder",
    persist=True,
    wait=True,
    timeout=600,
)
```
---
## Success Criteria
| Metric | Target | Measurement |
|--------|--------|-------------|
| Task durability | 100% | Tasks never lost on restart |
| Throughput | 10 tasks/min | With 3 workers |
| Latency | <100ms | Enqueue → pending |
| Recovery time | <60s | Worker crash → requeue |
| API uptime | 99.9% | Health check endpoint |
| Backward compat | 100% | Existing tests pass |
---
## Risk Mitigation
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Redis complexity | Low | Medium | Start with simple ops |
| Worker pool bugs | Medium | High | Extensive testing |
| Performance issues | Low | Medium | Load testing Week 5 |
| Migration breakage | Low | High | Full test suite |
---
## Handoff to Phase 2
**Phase 2 Prereqs:**
- [ ] All Phase 1 success criteria met
- [ ] API documentation complete
- [ ] WebSocket tested with simple client
- [ ] Cost tracking working
**Phase 2 Inputs:**
- Task queue API (REST + WebSocket)
- Task data model
- Worker management API
- Redis schema
---
## Appendix: Quick Reference
### Redis Commands Cheat Sheet
```bash
# Start Redis
docker run -d -p 6379:6379 redis:7-alpine
# Monitor
redis-cli monitor
# Inspect keys
redis-cli KEYS "task:*"
redis-cli HGETALL task:abc-123
# Clear queue
redis-cli FLUSHDB
```
### Development Commands
```bash
# Start worker
python -m letta_ade.worker.runner --agent-id worker-1
# Start API
uvicorn letta_ade.api:app --reload
# Run tests
pytest tests/ -v --tb=short
# Integration test
pytest tests/integration/ -v
```
---
*Ready for implementation. Questions? See community-ade-research-synthesis-2026-03-18.md for full context.*