community-ade/docs/design.md
Ani (Annie Tunturi) ce8dd84840 feat: Add approval system and agent config UI
- Omega (Kimi-K2.5): Approval system architecture
  - design.md: Full system architecture with state machines
  - api-spec.ts: Express routes + Zod schemas (33KB)
  - redis-schema.md: Redis key patterns (19KB)
  - ui-components.md: Dashboard UI specs (31KB)

- Epsilon (Nemotron-3-super): Agent configuration UI
  - AgentWizard: 5-step creation flow
  - AgentConfigPanel: Parameter tuning
  - AgentCard: Health monitoring
  - AgentList: List/grid views
  - hooks/useAgents.ts: WebSocket integration
  - types/agent.ts: TypeScript definitions

Total: 150KB new code, 22 components

👾 Generated with [Letta Code](https://letta.com)
2026-03-18 12:23:59 -04:00


# Community ADE Approval System Architecture
## Executive Summary
The Approval System provides governance and safety controls for the Community ADE platform. It introduces human-in-the-loop validation for task execution, distributed locking for resource protection, and a complete audit trail for compliance.
---
## Core Concepts
### 1. Clean Apply Locks
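**Philosophy:** A lock grants permission to attempt an operation; it does not guarantee that the operation succeeds. Locks are advisory at the storage layer, so every component checks for a valid lock before acting and the system rejects operations attempted without one.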
**Lock Hierarchy:**
```
Task Lock (lock:task:{task_id}) - Single task execution
Resource Lock (lock:resource:{resource_type}:{resource_id}) - Shared resource protection
Agent Lock (lock:agent:{agent_id}:capacity) - Agent capacity management
```
**Lock Properties:**
- **Ownership:** UUID of the lock holder
- **TTL:** 30 seconds default, extendable via heartbeats
- **Queue:** FIFO ordered waiting list for fairness
- **Metadata:** Timestamp, purpose, agent info
### 2. Approval Lifecycle State Machine
```
┌─────────────────────────────────────────────────────────────────────────────┐
│ APPROVAL LIFECYCLE │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────┐ ┌──────────┐ ┌───────────┐ ┌─────────┐ ┌─────────┐ │
│ │DRAFT │───→│ SUBMITTED│───→│ REVIEWING │───→│ APPROVED│───→│ APPLYING│ │
│ └──────┘ └──────────┘ └───────────┘ └─────────┘ └────┬────┘ │
│ │ │ │ │ │ │
│ │ │ │ │ ▼ │
│ │ │ │ │ ┌─────────┐ │
│ │ │ │ │ │COMPLETED│ │
│ │ │ │ │ └─────────┘ │
│ │ │ │ │ │
│ │ │ └──────────────┘ │
│ │ │ │ │
│ │ │ ▼ │
│ │ │ ┌─────────┐ │
│ │ │ │REJECTED │ │
│ │ │ └─────────┘ │
│ │ │ │ │
│ │ └───────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ │
│ │ CANCELLED│ │
│ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```
**State Descriptions:**
| State | Description | Permissions |
|-------|-------------|-------------|
| `DRAFT` | Task created but not submitted | Edit, Delete, Submit |
| `SUBMITTED` | Validation complete, awaiting review | None (locked) |
| `REVIEWING` | Under active review by approvers | Add comments |
| `APPROVED` | All required approvals received | Queue for apply |
| `APPLYING` | Lock acquired, executing changes | Read-only |
| `COMPLETED` | Changes successfully applied | Read-only, audit |
| `REJECTED` | Approval denied | Can resubmit as new |
| `CANCELLED` | Aborted before apply | Archive only |
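The lifecycle can be enforced with a small transition guard. The following is a minimal sketch, not the system's actual implementation: the edge list is one reading of the diagram (in particular, it assumes cancellation is allowed from `DRAFT` and `SUBMITTED` and rejection from `REVIEWING`), and the names `TRANSITIONS` and `canTransition` are hypothetical.

```typescript
type ApprovalState =
  | "DRAFT" | "SUBMITTED" | "REVIEWING" | "APPROVED"
  | "APPLYING" | "COMPLETED" | "REJECTED" | "CANCELLED";

// Assumed edge list; terminal states have no outgoing transitions.
const TRANSITIONS: Record<ApprovalState, readonly ApprovalState[]> = {
  DRAFT: ["SUBMITTED", "CANCELLED"],
  SUBMITTED: ["REVIEWING", "CANCELLED"],
  REVIEWING: ["APPROVED", "REJECTED"],
  APPROVED: ["APPLYING"],
  APPLYING: ["COMPLETED"],
  COMPLETED: [],
  REJECTED: [],
  CANCELLED: [],
};

// Returns true only when `to` is a direct successor of `from`.
function canTransition(from: ApprovalState, to: ApprovalState): boolean {
  return TRANSITIONS[from].includes(to);
}
```

Keeping the edges in one table means that an API route and a worker can share the same check instead of each encoding its own rules.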
### 3. Human Gates
**Review Policies:**
- **Auto-approve:** Tasks below risk threshold skip human review
- **Required reviewers:** Based on task type, resource scope, risk score
- **Delegation chains:** "If my manager approves, auto-approve for me"
- **Quorum rules:** N-of-M approvals required
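As an illustration of the quorum rule, here is a minimal sketch of an N-of-M check. The `Vote` shape and the `hasQuorum` name are assumptions made for this example; the real approval records may look different.

```typescript
// Hypothetical shape: one entry per reviewer decision.
interface Vote {
  reviewer: string;
  decision: "approve" | "reject";
}

// True when at least `required` distinct eligible reviewers have approved.
// Duplicate votes and votes from non-eligible users are ignored.
function hasQuorum(votes: Vote[], eligible: string[], required: number): boolean {
  const eligibleSet = new Set(eligible);
  const approvers = new Set<string>();
  for (const v of votes) {
    if (v.decision === "approve" && eligibleSet.has(v.reviewer)) {
      approvers.add(v.reviewer);
    }
  }
  return approvers.size >= required;
}
```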
**Risk Scoring:**
```typescript
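// Inputs are assumed normalized to 0-100; weights sum to 1.0, so the result is 0-100.
function riskScore(
  resource_criticality: number, change_magnitude: number,
  blast_radius: number, historical_failure_rate: number,
): number {
  return (
    resource_criticality * 0.4 +
    change_magnitude * 0.3 +
    blast_radius * 0.2 +
    historical_failure_rate * 0.1
  );
}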
```
---
## System Architecture
### Component Diagram
```
┌──────────────────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
├──────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Dashboard UI │ │ CLI Tool │ │ Webhook API │ │
│ │ (Delta-V2) │ │ (Omega-CLI) │ │ (External) │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
└───────────┼────────────────────┼────────────────────┼────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ API GATEWAY LAYER │
├──────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Express API Server (Beta Patterns) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │ │
│ │ │ Task Routes │ │ApprovalRoutes│ │ Lock Routes │ │ WebSocket │ │ │
│ │ │ │ │ │ │ │ │ Handler │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────┬────────────────────────────────────────────┘
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ REDIS LAYER │ │ POSTGRESQL │ │ EVENT BUS │
│ (Alpha) │ │ (Persistence) │ │ (WebSocket) │
├─────────────────┤ ├─────────────────┤ ├─────────────────┤
│ • Locks │ │ • Task History │ │ • approval:* │
│ • Queues │ │ • Audit Log │ │ • lock:* │
│ • Sessions │ │ • User Policies │ │ • task:* │
│ • Rate Limits │ │ • Delegations │ │ │
└────────┬────────┘ └─────────────────┘ └─────────────────┘
┌──────────────────────────────────────────────────────────────────────────────┐
│ WORKER LAYER (Gamma) │
├──────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Lock Manager │ │ Task Executor │ │ Heartbeat Mon │ │
│ │ │ │ │ │ │ │
│ │ • Acquire locks │ │ • Check locks │ │ • Watchdog │ │
│ │ • Queue waiters │ │ • Execute apply │ │ • Deadlock det │ │
│ │ • Release/clean │ │ • Rollback on │ │ • Auto-recovery │ │
│ │ │ │ failure │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
```
### Data Flow: Submit → Approve → Apply
```
┌─────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ USER │ │ SYSTEM │ │ SYSTEM │ │ WORKER │
└────┬────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │ │
│ POST /tasks │ │ │
│ {config} │ │ │
├────────────────→│ │ │
│ │ │ │
│ │──┐ │ │
│ │ │ Validate │ │
│ │ │ Calculate risk │ │
│ │ │ Generate preview│ │
│ │←─┘ │ │
│ │ │ │
│ 201 Created │ │ │
│ task_id: xyz │ │ │
│←────────────────│ │ │
│ │ │ │
│ POST /tasks/xyz │ │ │
│ /submit │ │ │
├────────────────→│ │ │
│ │ │ │
│ │──┐ │ │
│ │ │ State:SUBMITTED│ │
│ │ │ Lock resources │ │
│ │ │ (preview only) │ │
│ │←─┘ │ │
│ │ │ │
│ 202 Accepted │ │ │
│←────────────────│ │ │
│ │ │ │
│ │ approval:requested │
│ ├──────────────────→│ │
│ │ │ │
│ │ │──┐ │
│ │ │ │ Check policies │
│ │ │ │ Notify reviewers│
│ │ │←─┘ │
│ │ │ │
│ [Time passes] │ │ │
│ │ │ │
│ POST /approvals │ │ │
│ /{id}/approve │ │ │
├────────────────→│ │ │
│ │ │ │
│ │ │──┐ │
│ │ │ │ Record approval │
│ │ │ │ Check quorum │
│ │ │←─┘ │
│ │ │ │
│ 200 OK │ │ │
│←────────────────│ │ │
│ │ │ │
│ │ │ approval:approved│
│ │←──────────────────┤ │
│ │ │ │
│ │──┐ │
│ │ │ State:APPROVED │
│ │ │ Queue for apply │
│ │←─┘ │
│ │ │ │
│ │ task:approved │ │
│ ├──────────────────────────────────────→│
│ │ │ │
│ │ │ │──┐
│ │ │ │ │ Acquire apply lock
│ │ │ │ │ State:APPLYING
│ │ │ │ │ Execute changes
│ │ │ │←─┘
│ │ │ │
│ │ lock:acquired │ │
│ │←──────────────────────────────────────┤
│ │ │ │
│ │ │ │──┐
│ │ │ │ │ Apply succeeded
│ │ │ │ │ State:COMPLETED
│ │ │ │ │ Release locks
│ │ │ │←─┘
│ │ │ │
│ │ task:completed │ │
│ │←──────────────────────────────────────┤
│ │ │ │
```
---
## Lock System Deep Dive
### Lock Types
#### 1. Task Lock (Exclusive)
```
Key: lock:task:{task_id}
Value: { holder_agent_id, acquired_at, expires_at, purpose }
TTL: 30s with automatic extension on heartbeat
```
Prevents concurrent execution of the same task. Released on completion or failure.
#### 2. Resource Lock (Shared/Exclusive)
```
Key: lock:resource:{resource_type}:{resource_id}
Value: {
  mode: 'exclusive' | 'shared',
  holders: [{ agent_id, acquired_at }],
  queue: [{ agent_id, mode, requested_at }]
}
```
Allows multiple readers (shared) or single writer (exclusive). Queue ensures FIFO ordering.
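The shared/exclusive semantics with a FIFO queue can be sketched in memory as follows. This is an illustrative model only: it ignores TTLs and persistence, the function names are hypothetical, and it assumes a new reader may not jump ahead of a queued writer.

```typescript
type Mode = "shared" | "exclusive";

interface ResourceLock {
  mode: Mode | null; // null when nobody holds the lock
  holders: string[];
  queue: { agent_id: string; mode: Mode }[];
}

function newLock(): ResourceLock {
  return { mode: null, holders: [], queue: [] };
}

// Returns true if granted immediately, false if the request was queued.
function acquire(lock: ResourceLock, agentId: string, mode: Mode): boolean {
  const nobodyWaiting = lock.queue.length === 0;
  const free = lock.holders.length === 0;
  // Readers may join other readers only when nobody is waiting,
  // otherwise a queued writer could starve.
  const canShare = mode === "shared" && lock.mode === "shared";
  if (nobodyWaiting && (free || canShare)) {
    lock.mode = mode;
    lock.holders.push(agentId);
    return true;
  }
  lock.queue.push({ agent_id: agentId, mode });
  return false;
}

function release(lock: ResourceLock, agentId: string): void {
  lock.holders = lock.holders.filter((h) => h !== agentId);
  if (lock.holders.length > 0) return;
  lock.mode = null;
  // Grant strictly from the head of the queue: one writer, or a run of readers.
  while (lock.queue.length > 0) {
    const next = lock.queue[0];
    const compatible =
      lock.mode === null || (lock.mode === "shared" && next.mode === "shared");
    if (!compatible) break;
    lock.queue.shift();
    lock.mode = next.mode;
    lock.holders.push(next.agent_id);
    if (next.mode === "exclusive") break;
  }
}
```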
#### 3. Agent Capacity Lock
```
Key: lock:agent:{agent_id}:capacity
Value: { active_tasks: number, max_tasks: number }
```
Prevents agent overload. Each agent has configurable concurrency limits.
### Deadlock Detection
**Algorithm:** Wait-For Graph
```
If Agent A holds Lock X and waits for Lock Y
And Agent B holds Lock Y and waits for Lock X
→ Deadlock detected
```
**Resolution:**
1. Abort the youngest transaction, since it has done the least work and is the cheapest to retry
2. Release all held locks
3. Notify owner with `DEADLOCK_DETECTED` error
4. Auto-retry with exponential backoff
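A wait-for graph check amounts to cycle detection in a directed graph. The sketch below uses a depth-first search and then picks the youngest agent in the cycle as the victim; the input shapes (`WaitForGraph`, a `startedAt` map of start timestamps) are assumptions made for this example.

```typescript
// waitsFor.get(a) lists the agents holding locks that `a` is blocked on.
type WaitForGraph = Map<string, string[]>;

// Returns the agents on one cycle, or null if the graph is acyclic.
function findCycle(graph: WaitForGraph): string[] | null {
  const done = new Set<string>(); // fully explored, known cycle-free
  const onPath = new Set<string>(); // nodes on the current DFS path
  const path: string[] = [];

  const visit = (node: string): string[] | null => {
    if (onPath.has(node)) return path.slice(path.indexOf(node));
    if (done.has(node)) return null;
    onPath.add(node);
    path.push(node);
    for (const next of graph.get(node) ?? []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    onPath.delete(node);
    path.pop();
    done.add(node);
    return null;
  };

  for (const node of Array.from(graph.keys())) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

// Resolution step 1: the victim is the agent with the latest start time.
function pickVictim(cycle: string[], startedAt: Map<string, number>): string {
  return cycle.reduce((a, b) =>
    (startedAt.get(b) ?? 0) > (startedAt.get(a) ?? 0) ? b : a,
  );
}
```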
### Lock Heartbeat Protocol
```typescript
interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;
  ttl_extension: number; // seconds
}
// Client must send heartbeat every 10s (configurable)
// Server extends TTL on receipt
// If no heartbeat for 30s, lock auto-expires
// Expired locks trigger cleanup and notify waiters
```
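The server side of this protocol can be modelled as below. This is an in-memory sketch under stated assumptions: the `LockTable` class is hypothetical, time is passed in explicitly as epoch milliseconds to keep the example deterministic, and a production version would presumably delegate expiry to Redis key TTLs.

```typescript
interface LockRecord {
  holder: string;
  expiresAt: number; // epoch milliseconds
}

const DEFAULT_TTL_MS = 30000; // 30s default TTL from the lock properties

class LockTable {
  private locks = new Map<string, LockRecord>();

  // Grants the lock if it is free or its previous holder has expired.
  acquire(lockId: string, agentId: string, now: number): boolean {
    const current = this.locks.get(lockId);
    if (current && current.expiresAt > now) return false; // still held
    this.locks.set(lockId, { holder: agentId, expiresAt: now + DEFAULT_TTL_MS });
    return true;
  }

  // Extends the TTL only for the current, unexpired holder.
  heartbeat(lockId: string, agentId: string, now: number, ttlExtensionSec: number): boolean {
    const current = this.locks.get(lockId);
    if (!current || current.holder !== agentId || current.expiresAt <= now) {
      return false;
    }
    current.expiresAt = now + ttlExtensionSec * 1000;
    return true;
  }
}
```

Rejecting a heartbeat from an expired holder matters: once the TTL has lapsed another agent may already own the lock, so a late heartbeat must not revive the old lease.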
---
## Approval Engine
### Reviewer Assignment
```typescript
interface ReviewerPolicy {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
  risk_threshold: number;
  auto_approve_if: {
    risk_below: number;
    author_has_role: string[];
    resources_in_scope: string[];
  };
}
```
**Assignment Algorithm:**
1. Match task against all policies
2. Union all required reviewers from matching policies
3. Check for delegation chains
4. Filter out auto-approved reviewers (based on policy)
5. Calculate minimum approvals needed
6. Create approval requests
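Steps 1, 2, and 5 of the algorithm can be sketched as follows. Several details are assumptions rather than documented behavior: resource patterns support only a trailing `*` wildcard, the minimum approval count is taken as the largest `min_approvers` among matching policies, and the `PolicySubset` type uses only some `ReviewerPolicy` fields.

```typescript
interface PolicySubset {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
}

interface TaskInfo {
  type: string;
  resource: string;
}

// Assumed pattern syntax: exact match, or a prefix followed by '*'.
function matchesPattern(pattern: string, resource: string): boolean {
  return pattern.endsWith("*")
    ? resource.startsWith(pattern.slice(0, -1))
    : pattern === resource;
}

function assignReviewers(
  task: TaskInfo,
  policies: PolicySubset[],
): { roles: string[]; minApprovers: number } {
  // Step 1: match the task against all policies.
  const matching = policies.filter(
    (p) =>
      p.task_types.includes(task.type) &&
      p.resource_patterns.some((pat) => matchesPattern(pat, task.resource)),
  );
  // Step 2: union the required roles. Step 5: take the strictest minimum.
  const roles = new Set<string>();
  let minApprovers = 0;
  for (const p of matching) {
    p.required_roles.forEach((r) => roles.add(r));
    minApprovers = Math.max(minApprovers, p.min_approvers);
  }
  return { roles: Array.from(roles).sort(), minApprovers };
}
```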
### Delegation Chains
```
Alice delegates to Bob when:
- Task type is "infrastructure"
- Risk score > 50
Bob delegates to Carol when:
- Resource matches "prod-*"
Result: For prod infrastructure with high risk,
only Carol's approval is needed
```
**Resolution:** Depth-first traversal with cycle detection.
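A minimal sketch of that traversal, assuming each delegation rule carries a predicate over the task (the `Delegation` and `TaskContext` shapes are hypothetical). It returns the reviewers at the end of every applicable chain and throws if a chain loops back on itself.

```typescript
interface TaskContext {
  type: string;
  riskScore: number;
  resource: string;
}

interface Delegation {
  from: string;
  to: string;
  when: (task: TaskContext) => boolean;
}

function resolveApprovers(start: string, rules: Delegation[], task: TaskContext): string[] {
  const result = new Set<string>();
  const onPath = new Set<string>(); // reviewers on the current DFS path

  const visit = (who: string): void => {
    if (onPath.has(who)) throw new Error(`Delegation cycle at ${who}`);
    onPath.add(who);
    const applicable = rules.filter((r) => r.from === who && r.when(task));
    if (applicable.length === 0) result.add(who); // end of a chain
    for (const r of applicable) visit(r.to);
    onPath.delete(who);
  };

  visit(start);
  return Array.from(result);
}
```

With the Alice, Bob, and Carol rules from the example, a high-risk infrastructure task on a `prod-*` resource resolves to Carol alone, while a low-risk task stays with Alice.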
### Batch Operations
**Batch Approve:**
```typescript
// Request body for POST /approvals/batch
interface BatchApprovalRequest {
  approval_ids: string[];
  action: 'approve' | 'reject';
  reason?: string;
  options: {
    skip_validation: boolean;
    apply_immediately: boolean;
  };
}
```
Atomic operation: either all approvals succeed or all fail.
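One way to get all-or-nothing behavior is to validate every item before mutating any of them. The sketch below shows this against an in-memory map; the `Approval` shape and `applyBatch` name are assumptions, and a real implementation would presumably wrap the writes in a database transaction instead.

```typescript
type BatchAction = "approve" | "reject";

interface Approval {
  id: string;
  status: "pending" | "approved" | "rejected";
}

// Validates every item first, then commits. A single unknown or
// already-decided approval rejects the whole batch with no changes made.
function applyBatch(store: Map<string, Approval>, ids: string[], action: BatchAction): void {
  const targets = ids.map((id) => {
    const approval = store.get(id);
    if (!approval) throw new Error(`Unknown approval ${id}`);
    if (approval.status !== "pending") {
      throw new Error(`Approval ${id} is already ${approval.status}`);
    }
    return approval;
  });
  const status = action === "approve" ? "approved" : "rejected";
  for (const approval of targets) approval.status = status;
}
```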
---
## Error Handling & Edge Cases
### Lock Acquisition Failures
| Scenario | Response | Retry Strategy |
|----------|----------|----------------|
| Lock held by another agent | 423 Locked | Queue and wait |
| Lock expired during operation | 409 Conflict | Abort, notify, retry |
| Deadlock detected | 423 Locked (`DEADLOCK_DETECTED`) | Abort, auto-retry with backoff |
| Max queue depth exceeded | 503 Service Unavailable (queue full) | Fail fast, notify operator |
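The backoff strategy in the table could look like the following sketch. The base delay, cap, and attempt limit are illustrative values, not documented settings, and `withRetry` is a hypothetical helper.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 100, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Retries `op` while `isRetryable` accepts the error, for up to `maxAttempts` tries.
async function withRetry<T>(
  op: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  maxAttempts = 5,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await op();
    } catch (error) {
      if (!isRetryable(error) || attempt + 1 >= maxAttempts) throw error;
      await new Promise<void>((resolve) => setTimeout(resolve, backoffDelayMs(attempt)));
    }
  }
}
```

Adding random jitter to each delay is a common refinement when many agents contend for the same lock, since it keeps them from retrying in lockstep.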
### Approval Edge Cases
| Scenario | Behavior |
|----------|----------|
| Approver leaves organization | Auto-reassign to delegate or manager |
| Approval timeout (48h default) | Escalate to next level, notify on-call |
| Required reviewer unavailable | Bypass with admin override + audit |
| Task modified during review | Invalidate approvals, restart review |
| Concurrent approvals | Last write wins, notify others of resolution |
### System Degradation
| Condition | Response |
|-----------|----------|
| Redis unavailable | Queue in PostgreSQL, async recovery |
| High lock contention | Exponential backoff, circuit breaker |
| Approval queue backlog | Priority escalation, auto-approve low-risk |
| WebSocket failure | Polling fallback, queued events |
---
## Security Model
### Permission Matrix
| Action | Author | Reviewer | Admin | System |
|--------|--------|----------|-------|--------|
| Create task | ✓ | ✗ | ✓ | ✗ |
| Submit for approval | ✓ | ✗ | ✓ | ✗ |
| Approve/reject | ✗ | ✓ | ✓ | ✗ |
| Force apply | ✗ | ✗ | ✓ | ✗ |
| Cancel task | ✓ | ✗ | ✓ | ✓ |
| Override policy | ✗ | ✗ | ✓* | ✗ |
| View audit log | ✓ | ✓ | ✓ | ✓ |

*Requires secondary approval and incident ticket
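The matrix translates directly into a lookup table. In this sketch the role and action identifiers are hypothetical names for the rows and columns above, and the extra conditions on policy override are noted but not enforced.

```typescript
type Role = "author" | "reviewer" | "admin" | "system";

type PermissionAction =
  | "create_task" | "submit" | "approve_reject" | "force_apply"
  | "cancel_task" | "override_policy" | "view_audit_log";

// One entry per row of the permission matrix.
const PERMISSIONS: Record<PermissionAction, readonly Role[]> = {
  create_task: ["author", "admin"],
  submit: ["author", "admin"],
  approve_reject: ["reviewer", "admin"],
  force_apply: ["admin"],
  cancel_task: ["author", "admin", "system"],
  // Also requires secondary approval and an incident ticket (not checked here).
  override_policy: ["admin"],
  view_audit_log: ["author", "reviewer", "admin", "system"],
};

function isAllowed(role: Role, action: PermissionAction): boolean {
  return PERMISSIONS[action].includes(role);
}
```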
### Audit Requirements
Every state transition is logged with:
- Who (user ID, session)
- What (from state, to state, action)
- When (timestamp with microsecond precision)
- Where (IP, user agent, service)
- Why (reason, ticket reference)
---
## Scalability Considerations
### Horizontal Scaling
- **API Layer:** Stateless, scale via load balancer
- **Redis:** Cluster mode, hash tags for lock locality
- **PostgreSQL:** Read replicas for audit queries
- **WebSocket:** Sticky sessions or Redis pub/sub
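In Redis Cluster, when a key contains a non-empty `{...}` section, only the text between the first `{` and the next `}` is hashed to choose the slot. The sketch below shows how lock keys could use that rule so a resource's lock and its related keys land on the same node. The key builders and the `:queue` suffix are assumptions made for this example, and `hashTag` only mirrors the rule locally for illustration.

```typescript
// Wrap the resource identity in a hash tag so related keys share a slot.
function resourceLockKey(type: string, id: string): string {
  return `lock:resource:{${type}:${id}}`;
}

// Hypothetical companion key for the waiting list of the same resource.
function resourceQueueKey(type: string, id: string): string {
  return `lock:resource:{${type}:${id}}:queue`;
}

// Returns the portion of the key that Redis Cluster would hash.
function hashTag(key: string): string {
  const open = key.indexOf("{");
  if (open === -1) return key;
  const close = key.indexOf("}", open + 1);
  // An unclosed or empty tag means the whole key is hashed.
  if (close === -1 || close === open + 1) return key;
  return key.slice(open + 1, close);
}
```

Because both keys hash on `type:id`, a multi-key script or transaction touching the lock and its queue stays within one slot.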
### Performance Targets
| Metric | Target | Peak / p99 |
|--------|--------|------|
| Lock acquisition | < 10ms | < 50ms @ p99 |
| Approval latency | < 100ms | < 500ms @ p99 |
| Task throughput | 1000/min | 5000/min burst |
| Concurrent locks | 10,000 | 50,000 |
---
## Deployment Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ API Pod │ │ API Pod │ │ API Pod │ │
│ │ (3+ replicas)│ │ │ │ │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Ingress │ │
│ │ Controller │ │
│ └──────┬──────┘ │
│ │ │
├──────────────────────────┼──────────────────────────────────────┤
│ ┌───────────────────────┴───────────────────────┐ │
│ │ Redis Cluster │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Master │ │ Master │ │ Master │ │ │
│ │ │ + Repl │ │ + Repl │ │ + Repl │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ PostgreSQL (HA: Patroni) │ │
│ │ Primary + 2 Replicas │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Worker Pod │ │ Worker Pod │ │ Worker Pod │ │
│ │ (HPA: 2-20) │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
```
---
## Future Enhancements
1. **ML-Based Risk Scoring:** Train models on historical task outcomes
2. **Predictive Locking:** Pre-acquire locks based on task patterns
3. **Approval Simulation:** "What if" analysis before submitting
4. **Time-Based Policies:** Different rules for on-call hours
5. **Integration Marketplace:** Slack, PagerDuty, ServiceNow webhooks
---
## Glossary
| Term | Definition |
|------|------------|
| **Clean Apply** | Applying changes only after successful lock acquisition and approval |
| **Deadlock** | Circular wait condition between multiple lock holders |
| **Delegation Chain** | Hierarchical approval routing |
| **Lock Queue** | FIFO waiting list for lock acquisition |
| **Risk Score** | Calculated metric (0-100) indicating task danger |
| **Stash** | Saved state for potential rollback |
| **Wait-For Graph** | Data structure for deadlock detection |