# Community ADE Approval System Architecture

## Executive Summary

The Approval System provides governance and safety controls for the Community ADE platform. It introduces human-in-the-loop validation for task execution, distributed locking for resource protection, and a complete audit trail for compliance.

---

## Core Concepts

### 1. Clean Apply Locks

**Philosophy:** A lock should only grant permission to attempt an operation, not guarantee success. Locks are advisory but strictly enforced by the system.

**Lock Hierarchy:**

```
Task Lock (lock:task:{task_id}) - Single task execution
Resource Lock (lock:resource:{resource_type}:{resource_id}) - Shared resource protection
Agent Lock (lock:agent:{agent_id}:capacity) - Agent capacity management
```

**Lock Properties:**

- **Ownership:** UUID of the lock holder
- **TTL:** 30 seconds default, extendable via heartbeats
- **Queue:** FIFO ordered waiting list for fairness
- **Metadata:** Timestamp, purpose, agent info

### 2. Approval Lifecycle State Machine

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                             APPROVAL LIFECYCLE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────┐    ┌──────────┐    ┌───────────┐    ┌─────────┐    ┌─────────┐    │
│  │DRAFT │───→│ SUBMITTED│───→│ REVIEWING │───→│ APPROVED│───→│ APPLYING│    │
│  └──────┘    └──────────┘    └───────────┘    └─────────┘    └────┬────┘    │
│     │              │              │                │              │         │
│     │              │              │                │              ▼         │
│     │              │              │                │         ┌─────────┐    │
│     │              │              │                │         │COMPLETED│    │
│     │              │              │                │         └─────────┘    │
│     │              │              │                │                        │
│     │              └──────┬───────┘                │                        │
│     │                     │                        │                        │
│     │                     ▼                        │                        │
│     │                ┌─────────┐                   │                        │
│     │                │REJECTED │                   │                        │
│     │                └─────────┘                   │                        │
│     │                                              │                        │
│     └──────────────────────┬───────────────────────┘                        │
│                            │                                                │
│                            ▼                                                │
│                       ┌─────────┐                                           │
│                       │CANCELLED│                                           │
│                       └─────────┘                                           │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

**State Descriptions:**

| State | Description | Permissions |
|-------|-------------|-------------|
| `DRAFT` | Task created but not submitted | Edit, Delete, Submit |
| `SUBMITTED` | Validation complete, awaiting review | None (locked) |
| `REVIEWING` | Under active review by approvers | Add comments |
| `APPROVED` | All required approvals received | Queue for apply |
| `APPLYING` | Lock acquired, executing changes | Read-only |
| `COMPLETED` | Changes successfully applied | Read-only, audit |
| `REJECTED` | Approval denied | Can resubmit as new |
| `CANCELLED` | Aborted before apply | Archive only |

### 3. Human Gates

**Review Policies:**

- **Auto-approve:** Tasks below the risk threshold skip human review
- **Required reviewers:** Based on task type, resource scope, risk score
- **Delegation chains:** "If my manager approves, auto-approve for me"
- **Quorum rules:** N-of-M approvals required

**Risk Scoring:**

```typescript
// Weighted sum; the weights total 1.0
const RiskScore = (
  resource_criticality * 0.4 +
  change_magnitude * 0.3 +
  blast_radius * 0.2 +
  historical_failure_rate * 0.1
); // 0-100 scale
```

---

## System Architecture

### Component Diagram

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                                 CLIENT LAYER                                 │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐               │
│  │  Dashboard UI   │  │    CLI Tool     │  │   Webhook API   │               │
│  │   (Delta-V2)    │  │   (Omega-CLI)   │  │   (External)    │               │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘               │
└───────────┼────────────────────┼────────────────────┼────────────────────────┘
            │                    │                    │
            ▼                    ▼                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                              API GATEWAY LAYER                               │
├──────────────────────────────────────────────────────────────────────────────┤
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │                    Express API Server (Beta Patterns)                    │ │
│ │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐   │ │
│ │  │ Task Routes  │  │ApprovalRoutes│  │ Lock Routes  │  │  WebSocket  │   │ │
│ │  │              │  │              │  │              │  │   Handler   │   │ │
│ │  └──────────────┘  └──────────────┘  └──────────────┘  └─────────────┘   │ │
│ └──────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────┬────────────────────────────────────────────┘
                                  │
            ┌─────────────────────┼─────────────────────┐
            │                     │                     │
            ▼                     ▼                     ▼
   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
   │   REDIS LAYER   │   │   POSTGRESQL    │   │    EVENT BUS    │
   │     (Alpha)     │   │  (Persistence)  │   │   (WebSocket)   │
   ├─────────────────┤   ├─────────────────┤   ├─────────────────┤
   │ • Locks         │   │ • Task History  │   │ • approval:*    │
   │ • Queues        │   │ • Audit Log     │   │ • lock:*        │
   │ • Sessions      │   │ • User Policies │   │ • task:*        │
   │ • Rate Limits   │   │ • Delegations   │   │                 │
   └────────┬────────┘   └─────────────────┘   └─────────────────┘
            │
            ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                             WORKER LAYER (Gamma)                             │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐               │
│  │  Lock Manager   │  │  Task Executor  │  │  Heartbeat Mon  │               │
│  │                 │  │                 │  │                 │               │
│  │ • Acquire locks │  │ • Check locks   │  │ • Watchdog      │               │
│  │ • Queue waiters │  │ • Execute apply │  │ • Deadlock det  │               │
│  │ • Release/clean │  │ • Rollback on   │  │ • Auto-recovery │               │
│  │                 │  │   failure       │  │                 │               │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘               │
└──────────────────────────────────────────────────────────────────────────────┘
```

### Data Flow: Submit → Approve → Apply

```
┌─────────┐     ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
│  USER   │     │   SYSTEM    │      │   SYSTEM    │      │   WORKER    │
└────┬────┘     └──────┬──────┘      └──────┬──────┘      └──────┬──────┘
     │                 │                    │                    │
     │ POST /tasks     │                    │                    │
     │ {config}        │                    │                    │
     ├────────────────→│                    │                    │
     │                 │                    │                    │
     │                 │──┐                 │                    │
     │                 │  │ Validate        │                    │
     │                 │  │ Calculate risk  │                    │
     │                 │  │ Generate preview│                    │
     │                 │←─┘                 │                    │
     │                 │                    │                    │
     │ 201 Created     │                    │                    │
     │ task_id: xyz    │                    │                    │
     │←────────────────│                    │                    │
     │                 │                    │                    │
     │ POST /tasks/xyz │                    │                    │
     │ /submit         │                    │                    │
     ├────────────────→│                    │                    │
     │                 │                    │                    │
     │                 │──┐                 │                    │
     │                 │  │ State:SUBMITTED │                    │
     │                 │  │ Lock resources  │                    │
     │                 │  │ (preview only)  │                    │
     │                 │←─┘                 │                    │
     │                 │                    │                    │
     │ 202 Accepted    │                    │                    │
     │←────────────────│                    │                    │
     │                 │                    │                    │
     │                 │ approval:requested │                    │
     │                 ├───────────────────→│                    │
     │                 │                    │                    │
     │                 │                    │──┐                 │
     │                 │                    │  │ Check policies  │
     │                 │                    │  │ Notify reviewers│
     │                 │                    │←─┘                 │
     │                 │                    │                    │
     │ [Time passes]   │                    │                    │
     │                 │                    │                    │
     │ POST /approvals │                    │                    │
     │ /{id}/approve   │                    │                    │
     ├────────────────→│                    │                    │
     │                 │                    │                    │
     │                 │                    │──┐                 │
     │                 │                    │  │ Record approval │
     │                 │                    │  │ Check quorum    │
     │                 │                    │←─┘                 │
     │                 │                    │                    │
     │ 200 OK          │                    │                    │
     │←────────────────│                    │                    │
     │                 │                    │                    │
     │                 │ approval:approved  │                    │
     │                 │←───────────────────┤                    │
     │                 │                    │                    │
     │                 │──┐                 │                    │
     │                 │  │ State:APPROVED  │                    │
     │                 │  │ Queue for apply │                    │
     │                 │←─┘                 │                    │
     │                 │                    │                    │
     │                 │ task:approved      │                    │
     │                 ├────────────────────────────────────────→│
     │                 │                    │                    │
     │                 │                    │                    │──┐
     │                 │                    │                    │  │ Acquire apply lock
     │                 │                    │                    │  │ State:APPLYING
     │                 │                    │                    │  │ Execute changes
     │                 │                    │                    │←─┘
     │                 │                    │                    │
     │                 │ lock:acquired      │                    │
     │                 │←────────────────────────────────────────┤
     │                 │                    │                    │
     │                 │                    │                    │──┐
     │                 │                    │                    │  │ Apply succeeded
     │                 │                    │                    │  │ State:COMPLETED
     │                 │                    │                    │  │ Release locks
     │                 │                    │                    │←─┘
     │                 │                    │                    │
     │                 │ task:completed     │                    │
     │                 │←────────────────────────────────────────┤
     │                 │                    │                    │
```

---

## Lock System Deep Dive

### Lock Types

#### 1. Task Lock (Exclusive)

```
Key: lock:task:{task_id}
Value: { holder_agent_id, acquired_at, expires_at, purpose }
TTL: 30s with automatic extension on heartbeat
```

Prevents concurrent execution of the same task. Released on completion or failure.

#### 2. Resource Lock (Shared/Exclusive)

```
Key: lock:resource:{resource_type}:{resource_id}
Value: {
  mode: 'exclusive' | 'shared',
  holders: [{ agent_id, acquired_at }],
  queue: [{ agent_id, mode, requested_at }]
}
```

Allows multiple readers (shared) or a single writer (exclusive). The queue ensures FIFO ordering.

#### 3. Agent Capacity Lock

```
Key: lock:agent:{agent_id}:capacity
Value: { active_tasks: number, max_tasks: number }
```

Prevents agent overload. Each agent has configurable concurrency limits.

### Deadlock Detection

**Algorithm:** Wait-For Graph

```
If Agent A holds Lock X and waits for Lock Y
And Agent B holds Lock Y and waits for Lock X
→ Deadlock detected
```

**Resolution:**

1. Abort the youngest transaction (lower cost)
2. Release all held locks
3. Notify owner with `DEADLOCK_DETECTED` error
4. Auto-retry with exponential backoff

### Lock Heartbeat Protocol

```typescript
interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;
  ttl_extension: number; // seconds
}

// Client must send heartbeat every 10s (configurable)
// Server extends TTL on receipt
// If no heartbeat for 30s, lock auto-expires
// Expired locks trigger cleanup and notify waiters
```

---

## Approval Engine

### Reviewer Assignment

```typescript
interface ReviewerPolicy {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
  risk_threshold: number;
  auto_approve_if: {
    risk_below: number;
    author_has_role: string[];
    resources_in_scope: string[];
  };
}
```

**Assignment Algorithm:**

1. Match task against all policies
2. Union all required reviewers from matching policies
3. Check for delegation chains
4. Filter out auto-approved reviewers (based on policy)
5. Calculate minimum approvals needed
6. Create approval requests

### Delegation Chains

```
Alice delegates to Bob when:
  - Task type is "infrastructure"
  - Risk score > 50

Bob delegates to Carol when:
  - Resource matches "prod-*"

Result: For prod infrastructure with high risk, only Carol's approval is needed
```

**Resolution:** Depth-first traversal with cycle detection.

### Batch Operations

**Batch Approve:**

```typescript
// Request body for POST /approvals/batch (the interface name is illustrative)
interface BatchApprovalRequest {
  approval_ids: string[];
  action: 'approve' | 'reject';
  reason?: string;
  options: {
    skip_validation: boolean;
    apply_immediately: boolean;
  };
}
```

Atomic operation: either all approvals succeed or all fail.
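The all-or-nothing behaviour of a batch can be sketched as a validate-then-commit pass. This is a minimal in-memory illustration under stated assumptions, not the platform's implementation: `ApprovalRecord`, `applyBatch`, and the rule that only approvals in `REVIEWING` can be decided are hypothetical, and a real service would more likely wrap the updates in a database transaction.

```typescript
type ApprovalState = 'REVIEWING' | 'APPROVED' | 'REJECTED';

// Hypothetical in-memory record; the real schema is not specified in this document.
interface ApprovalRecord {
  id: string;
  state: ApprovalState;
  reason?: string;
}

interface BatchRequest {
  approval_ids: string[];
  action: 'approve' | 'reject';
  reason?: string;
}

interface BatchResult {
  ok: boolean;
  errors: string[];
}

function applyBatch(store: Map<string, ApprovalRecord>, req: BatchRequest): BatchResult {
  // Phase 1: validate every id without mutating anything.
  const errors: string[] = [];
  for (const id of req.approval_ids) {
    const record = store.get(id);
    if (record === undefined) {
      errors.push(`${id}: not found`);
    } else if (record.state !== 'REVIEWING') {
      errors.push(`${id}: cannot ${req.action} from state ${record.state}`);
    }
  }
  if (errors.length > 0) {
    return { ok: false, errors }; // nothing was changed
  }

  // Phase 2: commit. Every record passed validation, so all of them are updated.
  const next: ApprovalState = req.action === 'approve' ? 'APPROVED' : 'REJECTED';
  for (const id of req.approval_ids) {
    const record = store.get(id)!;
    record.state = next;
    if (req.reason !== undefined) {
      record.reason = req.reason;
    }
  }
  return { ok: true, errors: [] };
}

// Usage: a3 is already decided, so a batch containing it must leave a1 untouched.
const approvals = new Map<string, ApprovalRecord>([
  ['a1', { id: 'a1', state: 'REVIEWING' }],
  ['a2', { id: 'a2', state: 'REVIEWING' }],
  ['a3', { id: 'a3', state: 'REJECTED' }],
]);

const failedBatch = applyBatch(approvals, { approval_ids: ['a1', 'a3'], action: 'approve' });
const a1AfterFailure = approvals.get('a1')!.state;
const okBatch = applyBatch(approvals, { approval_ids: ['a1', 'a2'], action: 'approve', reason: 'LGTM' });
```

Separating validation from mutation keeps the failure path trivial: if any approval is invalid, the function returns before touching the store, so no rollback logic is needed in this sketch.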

---

## Error Handling & Edge Cases

### Lock Acquisition Failures

| Scenario | Response | Retry Strategy |
|----------|----------|----------------|
| Lock held by another agent | 423 Locked | Queue and wait |
| Lock expired during operation | 409 Conflict | Abort, notify, retry |
| Deadlock detected | 423 Locked (`DEADLOCK_DETECTED`) | Abort, auto-retry with backoff |
| Max queue depth exceeded | 503 Service Unavailable (queue full) | Fail fast, notify operator |

### Approval Edge Cases

| Scenario | Behavior |
|----------|----------|
| Approver leaves organization | Auto-reassign to delegate or manager |
| Approval timeout (48h default) | Escalate to next level, notify on-call |
| Required reviewer unavailable | Bypass with admin override + audit |
| Task modified during review | Invalidate approvals, restart review |
| Concurrent approvals | Last write wins, notify others of resolution |

### System Degradation

| Condition | Response |
|-----------|----------|
| Redis unavailable | Queue in PostgreSQL, async recovery |
| High lock contention | Exponential backoff, circuit breaker |
| Approval queue backlog | Priority escalation, auto-approve low-risk |
| WebSocket failure | Polling fallback, queued events |

---

## Security Model

### Permission Matrix

| Action | Author | Reviewer | Admin | System |
|--------|--------|----------|-------|--------|
| Create task | ✓ | ✗ | ✓ | ✗ |
| Submit for approval | ✓ | ✗ | ✓ | ✗ |
| Approve/reject | ✗ | ✓ | ✓ | ✗ |
| Force apply | ✗ | ✗ | ✓ | ✗ |
| Cancel task | ✓ | ✗ | ✓ | ✓ |
| Override policy | ✗ | ✗ | ✓* | ✗ |
| View audit log | ✓ | ✓ | ✓ | ✓ |

*Requires secondary approval and an incident ticket

### Audit Requirements

Every state transition is logged with:

- Who (user ID, session)
- What (from state, to state, action)
- When (timestamp with microsecond precision)
- Where (IP, user agent, service)
- Why (reason, ticket reference)

---

## Scalability Considerations

### Horizontal Scaling

- **API Layer:** Stateless, scale via load balancer
- **Redis:** Cluster mode, hash tags for lock locality
- **PostgreSQL:** Read replicas for audit queries
- **WebSocket:** Sticky sessions or Redis pub/sub

### Performance Targets

| Metric | Target | Peak |
|--------|--------|------|
| Lock acquisition | < 10ms | < 50ms @ p99 |
| Approval latency | < 100ms | < 500ms @ p99 |
| Task throughput | 1000/min | 5000/min burst |
| Concurrent locks | 10,000 | 50,000 |

---

## Deployment Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                       KUBERNETES CLUSTER                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│        ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│        │   API Pod   │  │   API Pod   │  │   API Pod   │        │
│        │(3+ replicas)│  │             │  │             │        │
│        └──────┬──────┘  └──────┬──────┘  └──────┬──────┘        │
│               │                │                │               │
│               └────────────────┼────────────────┘               │
│                                │                                │
│                         ┌──────┴──────┐                         │
│                         │   Ingress   │                         │
│                         │ Controller  │                         │
│                         └──────┬──────┘                         │
│                                │                                │
├────────────────────────────────┼────────────────────────────────┤
│        ┌───────────────────────┴───────────────────────┐        │
│        │                 Redis Cluster                 │        │
│        │    ┌─────────┐   ┌─────────┐   ┌─────────┐    │        │
│        │    │ Master  │   │ Master  │   │ Master  │    │        │
│        │    │ + Repl  │   │ + Repl  │   │ + Repl  │    │        │
│        │    └─────────┘   └─────────┘   └─────────┘    │        │
│        └───────────────────────────────────────────────┘        │
│                                                                 │
│        ┌───────────────────────────────────────────────┐        │
│        │           PostgreSQL (HA: Patroni)            │        │
│        │             Primary + 2 Replicas              │        │
│        └───────────────────────────────────────────────┘        │
│                                                                 │
│        ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│        │ Worker Pod  │  │ Worker Pod  │  │ Worker Pod  │        │
│        │ (HPA: 2-20) │  │             │  │             │        │
│        └─────────────┘  └─────────────┘  └─────────────┘        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

---

## Future Enhancements

1. **ML-Based Risk Scoring:** Train models on historical task outcomes
2. **Predictive Locking:** Pre-acquire locks based on task patterns
3. **Approval Simulation:** "What if" analysis before submitting
4. **Time-Based Policies:** Different rules for on-call hours
5. **Integration Marketplace:** Slack, PagerDuty, ServiceNow webhooks

---

## Glossary

| Term | Definition |
|------|------------|
| **Clean Apply** | Applying changes only after successful lock acquisition and approval |
| **Deadlock** | Circular wait condition between multiple lock holders |
| **Delegation Chain** | Hierarchical approval routing |
| **Lock Queue** | FIFO waiting list for lock acquisition |
| **Risk Score** | Calculated metric (0-100) indicating task danger |
| **Stash** | Saved state for potential rollback |
| **Wait-For Graph** | Data structure for deadlock detection |
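
To make the **Wait-For Graph** and **Deadlock** entries concrete, the sketch below detects a circular wait with a depth-first search and then picks the youngest participant as the victim, following the resolution rule in the Deadlock Detection section. The names `WaitForGraph`, `findDeadlock`, and `pickVictim`, and the use of a start timestamp to define "youngest", are assumptions for illustration; this document does not specify how the Heartbeat Monitor implements detection.

```typescript
// An edge A -> B means "agent A is waiting for a lock currently held by agent B".
type WaitForGraph = Map<string, string[]>;

// Returns the agents that form one circular wait, or null if the graph has no cycle.
function findDeadlock(graph: WaitForGraph): string[] | null {
  const done = new Set<string>();   // agents whose outgoing edges are fully explored
  const path: string[] = [];        // agents on the current DFS path, in order
  const onPath = new Set<string>(); // same agents as `path`, for O(1) membership checks

  const visit = (agent: string): string[] | null => {
    if (onPath.has(agent)) {
      // Reached an agent already on the current path: the tail of the path is a cycle.
      return path.slice(path.indexOf(agent));
    }
    if (done.has(agent)) {
      return null;
    }
    path.push(agent);
    onPath.add(agent);
    for (const holder of graph.get(agent) ?? []) {
      const cycle = visit(holder);
      if (cycle !== null) {
        return cycle;
      }
    }
    path.pop();
    onPath.delete(agent);
    done.add(agent);
    return null;
  };

  for (const agent of Array.from(graph.keys())) {
    const cycle = visit(agent);
    if (cycle !== null) {
      return cycle;
    }
  }
  return null;
}

// Chooses the agent with the latest start time (assumed meaning of "youngest").
function pickVictim(cycle: string[], startedAt: Map<string, number>): string {
  return cycle.reduce((youngest, agent) =>
    (startedAt.get(agent) ?? 0) > (startedAt.get(youngest) ?? 0) ? agent : youngest);
}

// Usage: agent-a and agent-b wait on each other; agent-c waits on agent-a but is not in the cycle.
const waitFor: WaitForGraph = new Map<string, string[]>([
  ['agent-a', ['agent-b']],
  ['agent-b', ['agent-a']],
  ['agent-c', ['agent-a']],
]);
const startedAt = new Map<string, number>([
  ['agent-a', 100],
  ['agent-b', 200],
]);

const deadlock = findDeadlock(waitFor);
const victim = deadlock === null ? null : pickVictim(deadlock, startedAt);
```

In this example the search reports the cycle between `agent-a` and `agent-b`, and `agent-b` is chosen as the victim because it started later.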