community-ade/docs/design.md

# Community ADE Approval System Architecture

## Executive Summary

The Approval System provides governance and safety controls for the Community ADE platform. It introduces human-in-the-loop validation for task execution, distributed locking for resource protection, and a complete audit trail for compliance.

---

## Core Concepts

### 1. Clean Apply Locks

**Philosophy:** A lock should only grant permission to attempt an operation, not guarantee success. Locks are advisory but strictly enforced by the system.

**Lock Hierarchy:**
```
Task Lock     (lock:task:{task_id})                         - Single task execution
Resource Lock (lock:resource:{resource_type}:{resource_id}) - Shared resource protection
Agent Lock    (lock:agent:{agent_id}:capacity)              - Agent capacity management
```

**Lock Properties:**
- **Ownership:** UUID of the lock holder
- **TTL:** 30 seconds default, extendable via heartbeats
- **Queue:** FIFO ordered waiting list for fairness
- **Metadata:** Timestamp, purpose, agent info

### 2. Approval Lifecycle State Machine

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         APPROVAL LIFECYCLE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────┐    ┌──────────┐    ┌───────────┐    ┌─────────┐    ┌─────────┐  │
│   │DRAFT │───→│ SUBMITTED│───→│ REVIEWING │───→│ APPROVED│───→│ APPLYING│  │
│   └──────┘    └──────────┘    └───────────┘    └─────────┘    └────┬────┘  │
│       │              │                │              │              │      │
│       │              │                │              │              ▼      │
│       │              │                │              │         ┌─────────┐ │
│       │              │                │              │         │COMPLETED│ │
│       │              │                │              │         └─────────┘ │
│       │              │                │              │                     │
│       │              │                └──────────────┘                     │
│       │              │                       │                            │
│       │              │                       ▼                            │
│       │              │                  ┌─────────┐                        │
│       │              │                  │REJECTED │                        │
│       │              │                  └─────────┘                        │
│       │              │                       │                            │
│       │              └───────────────────────┘                            │
│       │                                                                    │
│       ▼                                                                    │
│   ┌─────────┐                                                              │
│   │CANCELLED│                                                              │
│   └─────────┘                                                              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

**State Descriptions:**

| State | Description | Permissions |
|-------|-------------|-------------|
| `DRAFT` | Task created but not submitted | Edit, Delete, Submit |
| `SUBMITTED` | Validation complete, awaiting review | None (locked) |
| `REVIEWING` | Under active review by approvers | Add comments |
| `APPROVED` | All required approvals received | Queue for apply |
| `APPLYING` | Lock acquired, executing changes | Read-only |
| `COMPLETED` | Changes successfully applied | Read-only, audit |
| `REJECTED` | Approval denied | Can resubmit as new |
| `CANCELLED` | Aborted before apply | Archive only |
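
The lifecycle can be written down as a transition table. The sketch below is one reading of the diagram, not a definitive implementation: `TRANSITIONS` and `canTransition` are hypothetical names, and the assumption that cancellation is only reachable from `DRAFT` and rejection only from `REVIEWING` is taken from how the arrows are drawn.

```typescript
type ApprovalState =
  | 'DRAFT' | 'SUBMITTED' | 'REVIEWING' | 'APPROVED'
  | 'APPLYING' | 'COMPLETED' | 'REJECTED' | 'CANCELLED';

// Assumed transition table, read off the lifecycle diagram.
const TRANSITIONS: Record<ApprovalState, ApprovalState[]> = {
  DRAFT: ['SUBMITTED', 'CANCELLED'],
  SUBMITTED: ['REVIEWING'],
  REVIEWING: ['APPROVED', 'REJECTED'],
  APPROVED: ['APPLYING'],
  APPLYING: ['COMPLETED'],
  COMPLETED: [], // terminal
  REJECTED: [],  // terminal; a resubmission is a new task
  CANCELLED: [], // terminal
};

// True if the lifecycle allows moving directly from `from` to `to`.
function canTransition(from: ApprovalState, to: ApprovalState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

Keeping the table as data means the API layer and the workers can share one source of truth for which state changes are legal.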

### 3. Human Gates

**Review Policies:**
- **Auto-approve:** Tasks below risk threshold skip human review
- **Required reviewers:** Based on task type, resource scope, risk score
- **Delegation chains:** "If my manager approves, auto-approve for me"
- **Quorum rules:** N-of-M approvals required

**Risk Scoring:**
```typescript
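// Inputs are assumed to be normalized to 0-100; the weights sum to 1.0.
const riskScore = (
  resource_criticality: number, change_magnitude: number,
  blast_radius: number, historical_failure_rate: number,
): number =>
  resource_criticality * 0.4 +
  change_magnitude * 0.3 +
  blast_radius * 0.2 +
  historical_failure_rate * 0.1; // 0-100 scale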
```
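
As a worked example, the sketch below applies the weights to one hypothetical task and feeds the result into the auto-approve gate. It assumes each input is already on a 0-100 scale; `RiskInputs`, `computeRiskScore`, `skipsHumanReview`, the sample values, and the threshold of 30 are illustrative and not part of the design.

```typescript
interface RiskInputs {
  resource_criticality: number;
  change_magnitude: number;
  blast_radius: number;
  historical_failure_rate: number;
}

// Weights from the risk formula; they sum to 1.0.
const WEIGHTS: RiskInputs = {
  resource_criticality: 0.4,
  change_magnitude: 0.3,
  blast_radius: 0.2,
  historical_failure_rate: 0.1,
};

function computeRiskScore(inputs: RiskInputs): number {
  return (
    inputs.resource_criticality * WEIGHTS.resource_criticality +
    inputs.change_magnitude * WEIGHTS.change_magnitude +
    inputs.blast_radius * WEIGHTS.blast_radius +
    inputs.historical_failure_rate * WEIGHTS.historical_failure_rate
  );
}

// Hypothetical gate: tasks strictly below the threshold skip human review.
function skipsHumanReview(score: number, threshold: number): boolean {
  return score < threshold;
}

const sampleInputs: RiskInputs = {
  resource_criticality: 80,
  change_magnitude: 50,
  blast_radius: 20,
  historical_failure_rate: 10,
};
// 80*0.4 + 50*0.3 + 20*0.2 + 10*0.1 = 32 + 15 + 4 + 1 = 52
const sampleScore = computeRiskScore(sampleInputs);
const sampleSkipsReview = skipsHumanReview(sampleScore, 30);
```

With a threshold of 30, this task scores about 52 and would go to human review.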

---

## System Architecture

### Component Diagram

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                              CLIENT LAYER                                     │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │   Dashboard UI  │  │   CLI Tool      │  │   Webhook API   │              │
│  │   (Delta-V2)    │  │   (Omega-CLI)   │  │   (External)    │              │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘              │
└───────────┼────────────────────┼────────────────────┼────────────────────────┘
            │                    │                    │
            ▼                    ▼                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                           API GATEWAY LAYER                                   │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────────────────────────┐ │
│  │                    Express API Server (Beta Patterns)                   │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐ │ │
│  │  │ Task Routes  │  │ApprovalRoutes│  │ Lock Routes  │  │  WebSocket  │ │ │
│  │  │              │  │              │  │              │  │   Handler   │ │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘  └─────────────┘ │ │
│  └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────┬────────────────────────────────────────────┘
                                  │
            ┌─────────────────────┼─────────────────────┐
            │                     │                     │
            ▼                     ▼                     ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   REDIS LAYER   │    │  POSTGRESQL     │    │   EVENT BUS     │
│   (Alpha)       │    │  (Persistence)  │    │   (WebSocket)   │
├─────────────────┤    ├─────────────────┤    ├─────────────────┤
│ • Locks         │    │ • Task History  │    │ • approval:*    │
│ • Queues        │    │ • Audit Log     │    │ • lock:*        │
│ • Sessions      │    │ • User Policies │    │ • task:*        │
│ • Rate Limits   │    │ • Delegations   │    │                 │
└────────┬────────┘    └─────────────────┘    └─────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                          WORKER LAYER (Gamma)                                 │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │  Lock Manager   │  │  Task Executor  │  │  Heartbeat Mon  │              │
│  │                 │  │                 │  │                 │              │
│  │ • Acquire locks │  │ • Check locks   │  │ • Watchdog      │              │
│  │ • Queue waiters │  │ • Execute apply │  │ • Deadlock det  │              │
│  │ • Release/clean │  │ • Rollback on   │  │ • Auto-recovery │              │
│  │                 │  │   failure       │  │                 │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
└──────────────────────────────────────────────────────────────────────────────┘
```

### Data Flow: Submit → Approve → Apply

```
┌─────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  USER   │     │ API SERVER  │     │  APPROVALS  │     │   WORKER    │
└────┬────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
     │                 │                   │                   │
     │ POST /tasks     │                   │                   │
     │ {config}        │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │──┐                │                   │
     │                 │  │ Validate       │                   │
     │                 │  │ Calculate risk │                   │
     │                 │  │ Generate preview│                  │
     │                 │←─┘                │                   │
     │                 │                   │                   │
     │ 201 Created     │                   │                   │
     │ task_id: xyz    │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │ POST /tasks/xyz │                   │                   │
     │ /submit         │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │──┐                │                   │
     │                 │  │ State:SUBMITTED│                   │
     │                 │  │ Lock resources │                   │
     │                 │  │ (preview only) │                   │
     │                 │←─┘                │                   │
     │                 │                   │                   │
     │ 202 Accepted    │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │                 │   approval:requested                  │
     │                 ├──────────────────→│                   │
     │                 │                   │                   │
     │                 │                   │──┐                │
     │                 │                   │  │ Check policies  │
     │                 │                   │  │ Notify reviewers│
     │                 │                   │←─┘                │
     │                 │                   │                   │
     │  [Time passes]  │                   │                   │
     │                 │                   │                   │
     │ POST /approvals │                   │                   │
     │ /{id}/approve   │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │                   │──┐                │
     │                 │                   │  │ Record approval │
     │                 │                   │  │ Check quorum    │
     │                 │                   │←─┘                │
     │                 │                   │                   │
     │ 200 OK          │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │                 │                   │   approval:approved│
     │                 │←──────────────────┤                   │
     │                 │                   │                   │
     │                 │──┐                                    │
     │                 │  │ State:APPROVED                     │
     │                 │  │ Queue for apply                    │
     │                 │←─┘                                    │
     │                 │                   │                   │
     │                 │   task:approved   │                   │
     │                 ├──────────────────────────────────────→│
     │                 │                   │                   │
     │                 │                   │                   │──┐
     │                 │                   │                   │  │ Acquire apply lock
     │                 │                   │                   │  │ State:APPLYING
     │                 │                   │                   │  │ Execute changes
     │                 │                   │                   │←─┘
     │                 │                   │                   │
     │                 │   lock:acquired   │                   │
     │                 │←──────────────────────────────────────┤
     │                 │                   │                   │
     │                 │                   │                   │──┐
     │                 │                   │                   │  │ Apply succeeded
     │                 │                   │                   │  │ State:COMPLETED
     │                 │                   │                   │  │ Release locks
     │                 │                   │                   │←─┘
     │                 │                   │                   │
     │                 │   task:completed  │                   │
     │                 │←──────────────────────────────────────┤
     │                 │                   │                   │
```

---

## Lock System Deep Dive

### Lock Types

#### 1. Task Lock (Exclusive)
```
Key: lock:task:{task_id}
Value: { holder_agent_id, acquired_at, expires_at, purpose }
TTL: 30s with automatic extension on heartbeat
```

Prevents concurrent execution of the same task. Released on completion or failure.
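
The acquire and release rules can be illustrated with an in-memory stand-in for the Redis key. This is a sketch, not the production path: `TaskLockStore` is a hypothetical class, the clock is passed in explicitly so the behavior is deterministic, and treating an expired lock as free is an assumption. In Redis, the acquire step is commonly expressed as `SET key value NX PX <ttl-ms>`.

```typescript
interface TaskLock {
  holder_agent_id: string;
  acquired_at: number; // ms since epoch
  expires_at: number;  // ms since epoch
  purpose: string;
}

const DEFAULT_TTL_MS = 30_000;

class TaskLockStore {
  private locks = new Map<string, TaskLock>();

  // Returns true if the lock was acquired; an expired lock counts as free.
  acquire(taskId: string, agentId: string, purpose: string, now: number,
          ttlMs: number = DEFAULT_TTL_MS): boolean {
    const key = `lock:task:${taskId}`;
    const current = this.locks.get(key);
    if (current && current.expires_at > now) return false;
    this.locks.set(key, {
      holder_agent_id: agentId,
      acquired_at: now,
      expires_at: now + ttlMs,
      purpose,
    });
    return true;
  }

  // Only the current holder may release the lock.
  release(taskId: string, agentId: string): boolean {
    const key = `lock:task:${taskId}`;
    const current = this.locks.get(key);
    if (!current || current.holder_agent_id !== agentId) return false;
    this.locks.delete(key);
    return true;
  }
}
```

Checking the holder on release prevents an agent whose lock has expired from deleting a lock that another agent has since acquired.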

#### 2. Resource Lock (Shared/Exclusive)
```
Key: lock:resource:{resource_type}:{resource_id}
Value: {
  mode: 'exclusive' | 'shared',
  holders: [{ agent_id, acquired_at }],
  queue: [{ agent_id, mode, requested_at }]
}
```

Allows multiple readers (shared) or a single writer (exclusive). The queue ensures FIFO ordering.
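
A minimal in-memory sketch of the shared/exclusive rule with a FIFO queue follows. `requestLock` and `releaseLock` are hypothetical helpers, and one policy choice is an assumption: a new shared request waits behind any queued request, so a queued writer is not starved by a stream of readers.

```typescript
type Mode = 'exclusive' | 'shared';
interface Holder { agent_id: string; acquired_at: number }
interface Waiter { agent_id: string; mode: Mode; requested_at: number }
interface ResourceLock { mode: Mode; holders: Holder[]; queue: Waiter[] }

function requestLock(lock: ResourceLock, agent_id: string, mode: Mode,
                     now: number): 'granted' | 'queued' {
  const free = lock.holders.length === 0;
  const canShare = mode === 'shared' && lock.mode === 'shared';
  // Grant only when nobody is waiting, to preserve FIFO order.
  if (lock.queue.length === 0 && (free || canShare)) {
    lock.mode = mode;
    lock.holders.push({ agent_id, acquired_at: now });
    return 'granted';
  }
  lock.queue.push({ agent_id, mode, requested_at: now });
  return 'queued';
}

// Returns the agents promoted from the queue by this release.
function releaseLock(lock: ResourceLock, agent_id: string, now: number): string[] {
  lock.holders = lock.holders.filter((h) => h.agent_id !== agent_id);
  const granted: string[] = [];
  if (lock.holders.length > 0) return granted;
  while (lock.queue.length > 0) {
    const next = lock.queue[0];
    // After the first grant, only further shared waiters may join a shared lock.
    if (granted.length > 0 && (next.mode === 'exclusive' || lock.mode === 'exclusive')) break;
    lock.queue.shift();
    lock.mode = next.mode;
    lock.holders.push({ agent_id: next.agent_id, acquired_at: now });
    granted.push(next.agent_id);
  }
  return granted;
}
```

On release, the head of the queue is promoted; if it is a shared request, consecutive shared requests behind it are promoted together.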

#### 3. Agent Capacity Lock
```
Key: lock:agent:{agent_id}:capacity
Value: { active_tasks: number, max_tasks: number }
```

Prevents agent overload. Each agent has configurable concurrency limits.
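
The capacity check amounts to a bounded counter. The sketch below uses hypothetical helpers `tryReserveSlot` and `releaseSlot` on an in-memory record; against Redis, the check and the increment would need to run as one atomic step (for example inside a Lua script), which this sketch does not show.

```typescript
interface AgentCapacity {
  active_tasks: number;
  max_tasks: number;
}

// Reserve a slot if the agent is below its concurrency limit.
function tryReserveSlot(capacity: AgentCapacity): boolean {
  if (capacity.active_tasks >= capacity.max_tasks) return false;
  capacity.active_tasks += 1;
  return true;
}

// Free a slot; never drops below zero.
function releaseSlot(capacity: AgentCapacity): void {
  capacity.active_tasks = Math.max(0, capacity.active_tasks - 1);
}
```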

### Deadlock Detection

**Algorithm:** Wait-For Graph

```
If Agent A holds Lock X and waits for Lock Y
And Agent B holds Lock Y and waits for Lock X
→ Deadlock detected
```
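
Detecting this condition is a cycle search over the wait-for graph. The sketch below assumes the graph is available as a map from each waiting agent to the agents holding the locks it waits on; `WaitForGraph` and `findDeadlock` are hypothetical names.

```typescript
// graph.get(a) = agents that hold a lock which agent `a` is waiting for
type WaitForGraph = Map<string, Set<string>>;

// Returns the agents forming a cycle, or null if there is no deadlock.
function findDeadlock(graph: WaitForGraph): string[] | null {
  const visiting = new Set<string>(); // agents on the current DFS path
  const done = new Set<string>();     // agents fully explored, known cycle-free
  const path: string[] = [];

  const visit = (agent: string): string[] | null => {
    if (done.has(agent)) return null;
    if (visiting.has(agent)) return path.slice(path.indexOf(agent));
    visiting.add(agent);
    path.push(agent);
    for (const next of graph.get(agent) ?? []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    path.pop();
    visiting.delete(agent);
    done.add(agent);
    return null;
  };

  for (const agent of graph.keys()) {
    const cycle = visit(agent);
    if (cycle) return cycle;
  }
  return null;
}
```

For the two-agent case above, the graph `A -> B, B -> A` yields the cycle `['A', 'B']`, while a chain such as `A -> B -> C` yields `null`.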

**Resolution:**
1. Abort the youngest transaction (it has done the least work, so aborting it costs the least)
2. Release all held locks
3. Notify owner with `DEADLOCK_DETECTED` error
4. Auto-retry with exponential backoff
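
Victim selection and the retry delay can be sketched as two small functions. The names `pickVictim` and `backoffDelayMs`, the 100 ms base, and the 10 s cap are assumptions for illustration; the design only specifies "youngest" and "exponential backoff".

```typescript
interface Txn {
  agent_id: string;
  started_at: number; // ms since epoch
}

// "Youngest" is taken to mean most recently started. `cycle` must be non-empty.
function pickVictim(cycle: Txn[]): Txn {
  return cycle.reduce((youngest, t) =>
    t.started_at > youngest.started_at ? t : youngest);
}

// Delay doubles with each attempt (0-based) and is capped at maxMs.
function backoffDelayMs(attempt: number, baseMs: number = 100,
                        maxMs: number = 10_000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```

In practice a random jitter is often added to the delay so that aborted agents do not retry in lockstep; it is left out here to keep the sketch deterministic.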

### Lock Heartbeat Protocol

```typescript
interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;
  ttl_extension: number; // seconds
}

// Client must send heartbeat every 10s (configurable)
// Server extends TTL on receipt
// If no heartbeat for 30s, lock auto-expires
// Expired locks trigger cleanup and notify waiters
```
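
On the server side, handling a heartbeat might look like the sketch below. `LockRecord`, `HeartbeatResult`, and `handleHeartbeat` are hypothetical names, and extending the TTL from the server's clock rather than from the message `timestamp` is an assumption made to avoid trusting client clocks.

```typescript
interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;
  ttl_extension: number; // seconds
}

interface LockRecord {
  holder_agent_id: string;
  expires_at: number; // ms since epoch
}

type HeartbeatResult = 'extended' | 'expired' | 'not_holder';

function handleHeartbeat(locks: Map<string, LockRecord>, msg: HeartbeatMessage,
                         now: number): HeartbeatResult {
  const lock = locks.get(msg.lock_id);
  if (!lock || lock.expires_at <= now) {
    // Expired or unknown: clean up. Waiters would be notified at this point.
    locks.delete(msg.lock_id);
    return 'expired';
  }
  if (lock.holder_agent_id !== msg.agent_id) return 'not_holder';
  lock.expires_at = now + msg.ttl_extension * 1000;
  return 'extended';
}
```

With a 10 s heartbeat interval and a 30 s TTL, a holder can miss two consecutive heartbeats before the lock expires.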

---

## Approval Engine

### Reviewer Assignment

```typescript
interface ReviewerPolicy {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
  risk_threshold: number;
  auto_approve_if: {
    risk_below: number;
    author_has_role: string[];
    resources_in_scope: string[];
  };
}
```

**Assignment Algorithm:**
1. Match task against all policies
2. Union all required reviewers from matching policies
3. Check for delegation chains
4. Filter out auto-approved reviewers (based on policy)
5. Calculate minimum approvals needed
6. Create approval requests
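
Steps 1, 2, and 5 of the algorithm can be sketched as below; delegation and auto-approve filtering (steps 3 and 4) are left out. Several details are assumptions: patterns are treated as simple `*` globs, a policy matches when the task type is listed and at least one resource matches, and the strictest `min_approvers` among matching policies wins. `Policy`, `Task`, `globMatch`, and `reviewRequirements` are hypothetical names.

```typescript
interface Policy {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
}

interface Task {
  type: string;
  resources: string[];
}

// Minimal glob: '*' matches any run of characters; everything else is literal.
function globMatch(pattern: string, value: string): boolean {
  const escaped = pattern
    .replace(/[.+?^${}()|[\]\\]/g, '\\$&')
    .replace(/\*/g, '.*');
  return new RegExp(`^${escaped}$`).test(value);
}

function policyMatches(policy: Policy, task: Task): boolean {
  return policy.task_types.indexOf(task.type) !== -1 &&
    task.resources.some((r) =>
      policy.resource_patterns.some((p) => globMatch(p, r)));
}

// Union of required roles and the strictest approval count.
function reviewRequirements(policies: Policy[], task: Task):
    { roles: string[]; min_approvers: number } {
  const roles = new Set<string>();
  let minApprovers = 0;
  for (const policy of policies) {
    if (!policyMatches(policy, task)) continue;
    policy.required_roles.forEach((role) => roles.add(role));
    minApprovers = Math.max(minApprovers, policy.min_approvers);
  }
  return { roles: Array.from(roles).sort(), min_approvers: minApprovers };
}
```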

### Delegation Chains

```
Alice delegates to Bob when:
  - Task type is "infrastructure"
  - Risk score > 50

Bob delegates to Carol when:
  - Resource matches "prod-*"

Result: For prod infrastructure with high risk,
        only Carol's approval is needed
```

**Resolution:** Depth-first traversal with cycle detection.
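
A simplified resolver for the example above is sketched below. It assumes each approver follows at most one applicable rule (the first match), so the traversal is a single path rather than a full depth-first search over branches; `Delegation`, `TaskContext`, and `resolveApprover` are hypothetical names.

```typescript
interface TaskContext {
  type: string;
  risk: number;
  resource: string;
}

interface Delegation {
  from: string;
  to: string;
  applies: (task: TaskContext) => boolean;
}

// Follows applicable delegations from `start`; throws if the chain loops.
function resolveApprover(start: string, rules: Delegation[],
                         task: TaskContext): string {
  const seen = new Set<string>([start]);
  let current = start;
  for (;;) {
    const rule = rules.find((r) => r.from === current && r.applies(task));
    if (!rule) return current;
    if (seen.has(rule.to)) {
      throw new Error(`Delegation cycle detected at ${rule.to}`);
    }
    seen.add(rule.to);
    current = rule.to;
  }
}

const exampleRules: Delegation[] = [
  { from: 'Alice', to: 'Bob',
    applies: (t) => t.type === 'infrastructure' && t.risk > 50 },
  { from: 'Bob', to: 'Carol',
    applies: (t) => t.resource.indexOf('prod-') === 0 },
];
```

With these rules, a high-risk infrastructure task on `prod-db` that starts at Alice resolves to Carol, while a low-risk task stays with Alice.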

### Batch Operations

**Batch Approve:**
```typescript
// Request body for POST /approvals/batch
// (the interface name is illustrative)
interface BatchApprovalRequest {
  approval_ids: string[];
  action: 'approve' | 'reject';
  reason?: string;
  options: {
    skip_validation: boolean;
    apply_immediately: boolean;
  };
}
```

Atomic operation: either all approvals succeed or all fail.
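
All-or-nothing behavior can be illustrated by applying the batch to a copy and only handing the copy back on success. This is an in-memory sketch with hypothetical names (`ApprovalStatus`, `applyBatch`); a real implementation would more likely rely on a database transaction.

```typescript
type ApprovalStatus = 'pending' | 'approved' | 'rejected';

// Returns a new map with every approval updated, or throws and changes nothing.
function applyBatch(store: Map<string, ApprovalStatus>, approvalIds: string[],
                    action: 'approve' | 'reject'): Map<string, ApprovalStatus> {
  const next = new Map(store); // work on a copy so failures leave `store` intact
  for (const id of approvalIds) {
    if (next.get(id) !== 'pending') {
      throw new Error(`Approval ${id} is not pending`);
    }
    next.set(id, action === 'approve' ? 'approved' : 'rejected');
  }
  return next; // the caller swaps this in only when no error was thrown
}
```

Because the check runs against the working copy, an ID listed twice in one batch also fails the whole batch.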

---

## Error Handling & Edge Cases

### Lock Acquisition Failures

| Scenario | Response | Retry Strategy |
|----------|----------|----------------|
| Lock held by another agent | 423 Locked | Queue and wait |
| Lock expired during operation | 409 Conflict | Abort, notify, retry |
| Deadlock detected | 423 Deadlock | Abort, auto-retry with backoff |
| Max queue depth exceeded | 503 Queue Full | Fail fast, notify operator |

### Approval Edge Cases

| Scenario | Behavior |
|----------|----------|
| Approver leaves organization | Auto-reassign to delegate or manager |
| Approval timeout (48h default) | Escalate to next level, notify on-call |
| Required reviewer unavailable | Bypass with admin override + audit |
| Task modified during review | Invalidate approvals, restart review |
| Concurrent approvals | Last write wins, notify others of resolution |

### System Degradation

| Condition | Response |
|-----------|----------|
| Redis unavailable | Queue in PostgreSQL, async recovery |
| High lock contention | Exponential backoff, circuit breaker |
| Approval queue backlog | Priority escalation, auto-approve low-risk |
| WebSocket failure | Polling fallback, queued events |

---

## Security Model

### Permission Matrix

| Action | Author | Reviewer | Admin | System |
|--------|--------|----------|-------|--------|
| Create task | ✓ | ✗ | ✓ | ✗ |
| Submit for approval | ✓ | ✗ | ✓ | ✗ |
| Approve/reject | ✗ | ✓ | ✓ | ✗ |
| Force apply | ✗ | ✗ | ✓ | ✗ |
| Cancel task | ✓ | ✗ | ✓ | ✓ |
| Override policy | ✗ | ✗ | ✓* | ✗ |
| View audit log | ✓ | ✓ | ✓ | ✓ |

*Requires secondary approval and an incident ticket
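
The matrix translates directly into a lookup table. The role and action identifiers below are hypothetical spellings of the row and column labels, and the extra conditions on policy override are only noted in a comment, not enforced.

```typescript
type Role = 'author' | 'reviewer' | 'admin' | 'system';
type Action =
  | 'create_task' | 'submit_for_approval' | 'approve_reject' | 'force_apply'
  | 'cancel_task' | 'override_policy' | 'view_audit_log';

// One entry per row of the permission matrix.
const PERMISSIONS: Record<Action, Role[]> = {
  create_task: ['author', 'admin'],
  submit_for_approval: ['author', 'admin'],
  approve_reject: ['reviewer', 'admin'],
  force_apply: ['admin'],
  cancel_task: ['author', 'admin', 'system'],
  // Also requires secondary approval and an incident ticket (not checked here).
  override_policy: ['admin'],
  view_audit_log: ['author', 'reviewer', 'admin', 'system'],
};

function isAllowed(role: Role, action: Action): boolean {
  return PERMISSIONS[action].indexOf(role) !== -1;
}
```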

### Audit Requirements

Every state transition is logged with:
- Who (user ID, session)
- What (from state, to state, action)
- When (timestamp with microsecond precision)
- Where (IP, user agent, service)
- Why (reason, ticket reference)
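
One possible shape for an audit record is sketched below. All field names are assumptions derived from the five questions above. Because JavaScript's `Date` only resolves milliseconds, the timestamp is modeled as integer microseconds since the epoch supplied by the caller.

```typescript
interface AuditEntry {
  who: { user_id: string; session_id: string };
  what: { from_state: string; to_state: string; action: string };
  when_us: number; // microseconds since epoch
  where: { ip: string; user_agent: string; service: string };
  why: { reason: string; ticket?: string };
}

// Append-only log: entries are frozen on write and never removed.
class AuditLog {
  private entries: Readonly<AuditEntry>[] = [];

  append(entry: AuditEntry): void {
    this.entries.push(Object.freeze({ ...entry }));
  }

  forUser(userId: string): Readonly<AuditEntry>[] {
    return this.entries.filter((e) => e.who.user_id === userId);
  }

  size(): number {
    return this.entries.length;
  }
}
```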

---

## Scalability Considerations

### Horizontal Scaling

- **API Layer:** Stateless, scale via load balancer
- **Redis:** Cluster mode, hash tags for lock locality
- **PostgreSQL:** Read replicas for audit queries
- **WebSocket:** Sticky sessions or Redis pub/sub
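
Hash tags deserve a brief illustration. In Redis Cluster, when a key contains `{...}` with at least one character between the braces, only that substring is hashed, so keys sharing a tag land on the same slot. The sketch below mirrors that rule; the `:queue` key is a hypothetical companion key, and the braces here are literal characters in the key, unlike the `{task_id}` placeholders used in the key templates elsewhere in this document.

```typescript
// Returns the part of the key that Redis Cluster would hash.
function hashTag(key: string): string {
  const open = key.indexOf('{');
  if (open === -1) return key;
  const close = key.indexOf('}', open + 1);
  // No closing brace, or empty braces: the whole key is hashed.
  if (close === -1 || close === open + 1) return key;
  return key.slice(open + 1, close);
}

// Both keys for a resource share the tag "{type:id}" and therefore one slot.
function resourceLockKey(type: string, id: string): string {
  return `lock:resource:{${type}:${id}}`;
}

function resourceQueueKey(type: string, id: string): string {
  return `lock:resource:{${type}:${id}}:queue`;
}
```

Keeping a lock and its queue on one slot allows multi-key operations and scripts to touch both keys in a single call.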

### Performance Targets

| Metric | Target | Peak |
|--------|--------|------|
| Lock acquisition | < 10ms | < 50ms @ p99 |
| Approval latency | < 100ms | < 500ms @ p99 |
| Task throughput | 1000/min | 5000/min burst |
| Concurrent locks | 10,000 | 50,000 |

---

## Deployment Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                        KUBERNETES CLUSTER                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │  API Pod    │  │  API Pod    │  │  API Pod    │             │
│  │(3+ replicas)│  │             │  │             │             │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘             │
│         │                │                │                     │
│         └────────────────┼────────────────┘                     │
│                          │                                      │
│                   ┌──────┴──────┐                              │
│                   │   Ingress   │                              │
│                   │  Controller │                              │
│                   └──────┬──────┘                              │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│  ┌───────────────────────┴───────────────────────┐             │
│  │              Redis Cluster                    │             │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐       │             │
│  │  │ Master  │  │ Master  │  │ Master  │       │             │
│  │  │  + Repl │  │  + Repl │  │  + Repl │       │             │
│  │  └─────────┘  └─────────┘  └─────────┘       │             │
│  └───────────────────────────────────────────────┘             │
│                                                                 │
│  ┌─────────────────────────────────────────────────┐           │
│  │         PostgreSQL (HA: Patroni)                │           │
│  │         Primary + 2 Replicas                    │           │
│  └─────────────────────────────────────────────────┘           │
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │ Worker Pod  │  │ Worker Pod  │  │ Worker Pod  │             │
│  │ (HPA: 2-20) │  │             │  │             │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## Future Enhancements

1. **ML-Based Risk Scoring:** Train models on historical task outcomes
2. **Predictive Locking:** Pre-acquire locks based on task patterns
3. **Approval Simulation:** "What if" analysis before submitting
4. **Time-Based Policies:** Different rules for on-call hours
5. **Integration Marketplace:** Slack, PagerDuty, ServiceNow webhooks

---

## Glossary

| Term | Definition |
|------|------------|
| **Clean Apply** | Applying changes only after successful lock acquisition and approval |
| **Deadlock** | Circular wait condition between multiple lock holders |
| **Delegation Chain** | Hierarchical approval routing |
| **Lock Queue** | FIFO waiting list for lock acquisition |
| **Risk Score** | Calculated metric (0-100) indicating task danger |
| **Stash** | Saved state for potential rollback |
| **Wait-For Graph** | Data structure for deadlock detection |