community-ade/docs/design.md
Ani (Annie Tunturi) ce8dd84840 feat: Add approval system and agent config UI
- Omega (Kimi-K2.5): Approval system architecture
  - design.md: Full system architecture with state machines
  - api-spec.ts: Express routes + Zod schemas (33KB)
  - redis-schema.md: Redis key patterns (19KB)
  - ui-components.md: Dashboard UI specs (31KB)

- Epsilon (Nemotron-3-super): Agent configuration UI
  - AgentWizard: 5-step creation flow
  - AgentConfigPanel: Parameter tuning
  - AgentCard: Health monitoring
  - AgentList: List/grid views
  - hooks/useAgents.ts: WebSocket integration
  - types/agent.ts: TypeScript definitions

Total: 150KB new code, 22 components

👾 Generated with [Letta Code](https://letta.com)
2026-03-18 12:23:59 -04:00


Community ADE Approval System Architecture

Executive Summary

The Approval System provides governance and safety controls for the Community ADE platform. It introduces human-in-the-loop validation for task execution, distributed locking for resource protection, and a complete audit trail for compliance.


Core Concepts

1. Clean Apply Locks

Philosophy: A lock grants permission to attempt an operation; it does not guarantee that the operation succeeds. Locks are advisory in that sense, but the system enforces them strictly: no apply proceeds without first acquiring the relevant lock.

Lock Hierarchy:

Task Lock (lock:task:{task_id})       - Single task execution
Resource Lock (lock:resource:{resource_type}:{resource_id}) - Shared resource protection
Agent Lock (lock:agent:{agent_id}:capacity)      - Agent capacity management

Lock Properties:

  • Ownership: UUID of the lock holder
  • TTL: 30 seconds default, extendable via heartbeats
  • Queue: FIFO ordered waiting list for fairness
  • Metadata: Timestamp, purpose, agent info

2. Approval Lifecycle State Machine

┌─────────────────────────────────────────────────────────────────────────────┐
│                         APPROVAL LIFECYCLE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────┐    ┌──────────┐    ┌───────────┐    ┌─────────┐    ┌─────────┐  │
│   │DRAFT │───→│ SUBMITTED│───→│ REVIEWING │───→│ APPROVED│───→│ APPLYING│  │
│   └──────┘    └──────────┘    └───────────┘    └─────────┘    └────┬────┘  │
│       │              │                │              │              │      │
│       │              │                │              │              ▼      │
│       │              │                │              │         ┌─────────┐ │
│       │              │                │              │         │COMPLETED│ │
│       │              │                │              │         └─────────┘ │
│       │              │                │              │                     │
│       │              │                └──────────────┘                     │
│       │              │                       │                            │
│       │              │                       ▼                            │
│       │              │                  ┌─────────┐                        │
│       │              │                  │REJECTED │                        │
│       │              │                  └─────────┘                        │
│       │              │                       │                            │
│       │              └───────────────────────┘                            │
│       │                                                                    │
│       ▼                                                                    │
│   ┌─────────┐                                                              │
│   │CANCELLED│                                                              │
│   └─────────┘                                                              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

State Descriptions:

| State | Description | Permissions |
|-------|-------------|-------------|
| DRAFT | Task created but not submitted | Edit, Delete, Submit |
| SUBMITTED | Validation complete, awaiting review | None (locked) |
| REVIEWING | Under active review by approvers | Add comments |
| APPROVED | All required approvals received | Queue for apply |
| APPLYING | Lock acquired, executing changes | Read-only |
| COMPLETED | Changes successfully applied | Read-only, audit |
| REJECTED | Approval denied | Can resubmit as new |
| CANCELLED | Aborted before apply | Archive only |
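
The lifecycle can be encoded as a transition table. The following is a minimal sketch, not the project's implementation. The exact edges into REJECTED and CANCELLED are an assumption: the diagram is ambiguous, so this sketch allows rejection only from REVIEWING and cancellation from any state before APPLYING. The names `TRANSITIONS`, `canTransition` and `transition` are hypothetical.

```typescript
type ApprovalState =
  | "DRAFT" | "SUBMITTED" | "REVIEWING" | "APPROVED"
  | "APPLYING" | "COMPLETED" | "REJECTED" | "CANCELLED";

// Assumed transition table; terminal states map to an empty list.
const TRANSITIONS: Record<ApprovalState, readonly ApprovalState[]> = {
  DRAFT: ["SUBMITTED", "CANCELLED"],
  SUBMITTED: ["REVIEWING", "CANCELLED"],
  REVIEWING: ["APPROVED", "REJECTED", "CANCELLED"],
  APPROVED: ["APPLYING", "CANCELLED"],
  APPLYING: ["COMPLETED"],
  COMPLETED: [],
  REJECTED: [],
  CANCELLED: [],
};

function canTransition(from: ApprovalState, to: ApprovalState): boolean {
  return TRANSITIONS[from].includes(to);
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: ApprovalState, to: ApprovalState): ApprovalState {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table as data makes it easy to adjust once the intended edges are confirmed, and the same table can drive both the API validation and the dashboard's enabled actions.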

3. Human Gates

Review Policies:

  • Auto-approve: Tasks below risk threshold skip human review
  • Required reviewers: Based on task type, resource scope, risk score
  • Delegation chains: "If my manager approves, auto-approve for me"
  • Quorum rules: N-of-M approvals required
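
An N-of-M quorum rule can be checked as below. This is a sketch under assumptions: the function name and parameters are hypothetical, and it counts only distinct approvals from eligible reviewers.

```typescript
// Returns true once at least `required` distinct eligible reviewers have approved.
function quorumReached(
  eligible: string[],
  approvedBy: string[],
  required: number,
): boolean {
  const eligibleSet = new Set(eligible);
  // Duplicate approvals and approvals from non-eligible users are ignored.
  const counted = new Set(approvedBy.filter((r) => eligibleSet.has(r)));
  return counted.size >= required;
}
```

For example, with eligible reviewers `["a", "b", "c"]` and a 2-of-3 rule, approvals from `a` twice and from an outsider do not reach quorum, while approvals from `a` and `b` do.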

Risk Scoring:

RiskScore = (
  resource_criticality * 0.4 +
  change_magnitude * 0.3 +
  blast_radius * 0.2 +
  historical_failure_rate * 0.1
) // 0-100 scale
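
As a worked example of the formula, here is a minimal TypeScript sketch. It assumes each input factor is itself on a 0-100 scale (the weights sum to 1.0, so the result then stays in 0-100); the interface and function names are hypothetical, and the clamping is an added safety measure, not something the design specifies.

```typescript
interface RiskFactors {
  resourceCriticality: number; // each factor assumed to be on a 0-100 scale
  changeMagnitude: number;
  blastRadius: number;
  historicalFailureRate: number;
}

function riskScore(f: RiskFactors): number {
  // Clamp inputs so an out-of-range factor cannot push the score outside 0-100.
  const clamp = (x: number): number => Math.min(100, Math.max(0, x));
  return (
    clamp(f.resourceCriticality) * 0.4 +
    clamp(f.changeMagnitude) * 0.3 +
    clamp(f.blastRadius) * 0.2 +
    clamp(f.historicalFailureRate) * 0.1
  );
}
```

With factors of 80, 50, 20 and 10, the terms are 32 + 15 + 4 + 1, so the score is about 52.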

System Architecture

Component Diagram

┌──────────────────────────────────────────────────────────────────────────────┐
│                              CLIENT LAYER                                     │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │   Dashboard UI  │  │   CLI Tool      │  │   Webhook API   │              │
│  │   (Delta-V2)    │  │   (Omega-CLI)   │  │   (External)    │              │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘              │
└───────────┼────────────────────┼────────────────────┼────────────────────────┘
            │                    │                    │
            ▼                    ▼                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                           API GATEWAY LAYER                                   │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────────────────────────┐ │
│  │                    Express API Server (Beta Patterns)                   │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐ │ │
│  │  │ Task Routes  │  │ApprovalRoutes│  │ Lock Routes  │  │  WebSocket  │ │ │
│  │  │              │  │              │  │              │  │   Handler   │ │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘  └─────────────┘ │ │
│  └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────┬────────────────────────────────────────────┘
                                  │
            ┌─────────────────────┼─────────────────────┐
            │                     │                     │
            ▼                     ▼                     ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   REDIS LAYER   │    │  POSTGRESQL     │    │   EVENT BUS     │
│   (Alpha)       │    │  (Persistence)  │    │   (WebSocket)   │
├─────────────────┤    ├─────────────────┤    ├─────────────────┤
│ • Locks         │    │ • Task History  │    │ • approval:*    │
│ • Queues        │    │ • Audit Log     │    │ • lock:*        │
│ • Sessions      │    │ • User Policies │    │ • task:*        │
│ • Rate Limits   │    │ • Delegations   │    │                 │
└────────┬────────┘    └─────────────────┘    └─────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                          WORKER LAYER (Gamma)                                 │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │  Lock Manager   │  │  Task Executor  │  │  Heartbeat Mon  │              │
│  │                 │  │                 │  │                 │              │
│  │ • Acquire locks │  │ • Check locks   │  │ • Watchdog      │              │
│  │ • Queue waiters │  │ • Execute apply │  │ • Deadlock det  │              │
│  │ • Release/clean │  │ • Rollback on   │  │ • Auto-recovery │              │
│  │                 │  │   failure       │  │                 │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
└──────────────────────────────────────────────────────────────────────────────┘

Data Flow: Submit → Approve → Apply

┌─────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  USER   │     │   SYSTEM    │     │   SYSTEM    │     │   WORKER    │
└────┬────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
     │                 │                   │                   │
     │ POST /tasks     │                   │                   │
     │ {config}        │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │──┐                │                   │
     │                 │  │ Validate       │                   │
     │                 │  │ Calculate risk │                   │
     │                 │  │ Generate preview│                  │
     │                 │←─┘                │                   │
     │                 │                   │                   │
     │ 201 Created     │                   │                   │
     │ task_id: xyz    │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │ POST /tasks/xyz │                   │                   │
     │ /submit         │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │──┐                │                   │
     │                 │  │ State:SUBMITTED│                   │
     │                 │  │ Lock resources │                   │
     │                 │  │ (preview only) │                   │
     │                 │←─┘                │                   │
     │                 │                   │                   │
     │ 202 Accepted    │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │                 │   approval:requested                  │
     │                 ├──────────────────→│                   │
     │                 │                   │                   │
     │                 │                   │──┐                │
     │                 │                   │  │ Check policies  │
     │                 │                   │  │ Notify reviewers│
     │                 │                   │←─┘                │
     │                 │                   │                   │
     │  [Time passes]  │                   │                   │
     │                 │                   │                   │
     │ POST /approvals │                   │                   │
     │ /{id}/approve   │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │                   │──┐                │
     │                 │                   │  │ Record approval │
     │                 │                   │  │ Check quorum    │
     │                 │                   │←─┘                │
     │                 │                   │                   │
     │ 200 OK          │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │                 │                   │   approval:approved│
     │                 │←──────────────────┤                   │
     │                 │                   │                   │
     │                 │──┐                                    │
     │                 │  │ State:APPROVED                     │
     │                 │  │ Queue for apply                    │
     │                 │←─┘                                    │
     │                 │                   │                   │
     │                 │   task:approved   │                   │
     │                 ├──────────────────────────────────────→│
     │                 │                   │                   │
     │                 │                   │                   │──┐
     │                 │                   │                   │  │ Acquire apply lock
     │                 │                   │                   │  │ State:APPLYING
     │                 │                   │                   │  │ Execute changes
     │                 │                   │                   │←─┘
     │                 │                   │                   │
     │                 │   lock:acquired   │                   │
     │                 │←──────────────────────────────────────┤
     │                 │                   │                   │
     │                 │                   │                   │──┐
     │                 │                   │                   │  │ Apply succeeded
     │                 │                   │                   │  │ State:COMPLETED
     │                 │                   │                   │  │ Release locks
     │                 │                   │                   │←─┘
     │                 │                   │                   │
     │                 │   task:completed  │                   │
     │                 │←──────────────────────────────────────┤
     │                 │                   │                   │

Lock System Deep Dive

Lock Types

1. Task Lock (Exclusive)

Key: lock:task:{task_id}
Value: { holder_agent_id, acquired_at, expires_at, purpose }
TTL: 30s with automatic extension on heartbeat

Prevents concurrent execution of the same task. Released on completion or failure.

2. Resource Lock (Shared/Exclusive)

Key: lock:resource:{resource_type}:{resource_id}
Value: {
  mode: 'exclusive' | 'shared',
  holders: [{ agent_id, acquired_at }],
  queue: [{ agent_id, mode, requested_at }]
}

Allows multiple concurrent readers (shared) or a single writer (exclusive). Waiting requests are granted from the queue in FIFO order.
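
The shared/exclusive behavior can be sketched in memory as follows. This is an illustration under assumptions, not the Redis implementation: `requestLock` and `releaseLock` are hypothetical names, `mode` is allowed to be `null` when nobody holds the lock, and new shared requests are queued behind any existing waiter so that a queued writer is not starved.

```typescript
type LockMode = "shared" | "exclusive";

interface ResourceLock {
  mode: LockMode | null; // null when nobody holds the lock (assumption)
  holders: { agent_id: string; acquired_at: number }[];
  queue: { agent_id: string; mode: LockMode; requested_at: number }[];
}

// Returns true if the lock was granted, false if the request was queued.
function requestLock(lock: ResourceLock, agentId: string, mode: LockMode, now: number): boolean {
  const free = lock.holders.length === 0;
  const compatible = lock.mode === "shared" && mode === "shared";
  // Grant only when nobody is waiting, which preserves FIFO order.
  if (lock.queue.length === 0 && (free || compatible)) {
    lock.mode = mode;
    lock.holders.push({ agent_id: agentId, acquired_at: now });
    return true;
  }
  lock.queue.push({ agent_id: agentId, mode, requested_at: now });
  return false;
}

function releaseLock(lock: ResourceLock, agentId: string, now: number): void {
  lock.holders = lock.holders.filter((h) => h.agent_id !== agentId);
  if (lock.holders.length > 0) return; // other readers still hold the lock
  lock.mode = null;
  // Promote waiters from the head: one exclusive waiter, or a run of shared waiters.
  while (lock.queue.length > 0) {
    const next = lock.queue[0];
    if (lock.holders.length > 0 && next.mode === "exclusive") break;
    lock.queue.shift();
    lock.mode = next.mode;
    lock.holders.push({ agent_id: next.agent_id, acquired_at: now });
    if (next.mode === "exclusive") break;
  }
}
```

In a real deployment the check-and-update would have to be atomic on the Redis side; this sketch only shows the grant rules.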

3. Agent Capacity Lock

Key: lock:agent:{agent_id}:capacity
Value: { active_tasks: number, max_tasks: number }

Prevents agent overload. Each agent has configurable concurrency limits.

Deadlock Detection

Algorithm: Wait-For Graph

If Agent A holds Lock X and waits for Lock Y
And Agent B holds Lock Y and waits for Lock X
→ Deadlock detected
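
Detecting a deadlock in a wait-for graph amounts to finding a cycle. The following is a minimal sketch using depth-first search; the graph representation (a map from each agent to the agents it is waiting on) and the name `findDeadlock` are assumptions for illustration.

```typescript
// Maps each agent to the agents that hold locks it is waiting for.
type WaitForGraph = Map<string, string[]>;

// Returns one cycle as a list of agent IDs, or null if there is no deadlock.
function findDeadlock(graph: WaitForGraph): string[] | null {
  const done = new Set<string>(); // fully explored, known to be cycle-free
  const path: string[] = [];      // agents on the current DFS path, in order
  const onPath = new Set<string>();

  const visit = (agent: string): string[] | null => {
    if (onPath.has(agent)) return path.slice(path.indexOf(agent)); // back edge
    if (done.has(agent)) return null;
    path.push(agent);
    onPath.add(agent);
    for (const holder of graph.get(agent) ?? []) {
      const cycle = visit(holder);
      if (cycle) return cycle;
    }
    path.pop();
    onPath.delete(agent);
    done.add(agent);
    return null;
  };

  for (const agent of Array.from(graph.keys())) {
    const cycle = visit(agent);
    if (cycle) return cycle;
  }
  return null;
}
```

For the two-agent example above, a graph where A waits on B and B waits on A yields the cycle `["A", "B"]`.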

Resolution:

  1. Abort the youngest transaction (it has done the least work, so aborting it costs the least)
  2. Release all held locks
  3. Notify owner with DEADLOCK_DETECTED error
  4. Auto-retry with exponential backoff

Lock Heartbeat Protocol

interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;
  ttl_extension: number; // seconds
}

// Client must send heartbeat every 10s (configurable)
// Server extends TTL on receipt
// If no heartbeat for 30s, lock auto-expires
// Expired locks trigger cleanup and notify waiters
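
The server side of this protocol might look like the sketch below. It is an in-memory illustration only: the class `HeartbeatMonitor`, the `LockRecord` shape, and the choice of milliseconds for `timestamp` are assumptions, and a real server would likely use its own clock rather than trusting the client's timestamp.

```typescript
interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;     // assumed to be milliseconds since epoch
  ttl_extension: number; // seconds
}

interface LockRecord { holder: string; expires_at: number } // expires_at in ms

class HeartbeatMonitor {
  private locks = new Map<string, LockRecord>();

  acquire(lockId: string, agentId: string, now: number, ttlSeconds = 30): void {
    this.locks.set(lockId, { holder: agentId, expires_at: now + ttlSeconds * 1000 });
  }

  // Extends the TTL only if the sender still holds an unexpired lock.
  heartbeat(msg: HeartbeatMessage): boolean {
    const lock = this.locks.get(msg.lock_id);
    if (!lock || lock.holder !== msg.agent_id || lock.expires_at <= msg.timestamp) {
      return false;
    }
    lock.expires_at = msg.timestamp + msg.ttl_extension * 1000;
    return true;
  }

  // Removes expired locks and returns their IDs so waiters can be notified.
  sweep(now: number): string[] {
    const expired: string[] = [];
    this.locks.forEach((lock, id) => {
      if (lock.expires_at <= now) expired.push(id);
    });
    expired.forEach((id) => this.locks.delete(id));
    return expired;
  }
}
```

With a 10-second heartbeat and a 30-second TTL, a holder can miss two consecutive heartbeats before its lock expires.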

Approval Engine

Reviewer Assignment

interface ReviewerPolicy {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
  risk_threshold: number;
  auto_approve_if: {
    risk_below: number;
    author_has_role: string[];
    resources_in_scope: string[];
  };
}

Assignment Algorithm:

  1. Match task against all policies
  2. Union all required reviewers from matching policies
  3. Check for delegation chains
  4. Filter out auto-approved reviewers (based on policy)
  5. Calculate minimum approvals needed
  6. Create approval requests
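
Steps 1, 2, 4 and 5 can be sketched as below. Several semantics are assumptions, since the design does not spell them out: a policy is taken to match when the task type is listed, the risk is at or above `risk_threshold`, and at least one resource matches a pattern; all three `auto_approve_if` conditions must hold; the minimum approvals is the largest `min_approvers` among active policies. The `Task` shape, the trailing-`*` matcher, and all function names are hypothetical. Delegation (step 3) is omitted.

```typescript
interface ReviewerPolicy {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
  risk_threshold: number;
  auto_approve_if: { risk_below: number; author_has_role: string[]; resources_in_scope: string[] };
}

interface Task { type: string; resources: string[]; risk: number; authorRoles: string[] }

// Hypothetical matcher: supports only an exact name or a trailing "*" wildcard.
function matchesPattern(pattern: string, resource: string): boolean {
  return pattern.endsWith("*")
    ? resource.startsWith(pattern.slice(0, -1))
    : resource === pattern;
}

function policyMatches(p: ReviewerPolicy, t: Task): boolean {
  return (
    p.task_types.includes(t.type) &&
    t.risk >= p.risk_threshold &&
    t.resources.some((r) => p.resource_patterns.some((pat) => matchesPattern(pat, r)))
  );
}

function autoApproved(p: ReviewerPolicy, t: Task): boolean {
  const a = p.auto_approve_if;
  return (
    t.risk < a.risk_below &&
    t.authorRoles.some((role) => a.author_has_role.includes(role)) &&
    t.resources.every((r) => a.resources_in_scope.some((pat) => matchesPattern(pat, r)))
  );
}

// Unions required roles across active policies and takes the strictest minimum.
function planReview(policies: ReviewerPolicy[], task: Task): { requiredRoles: string[]; minApprovers: number } {
  const active = policies.filter((p) => policyMatches(p, task) && !autoApproved(p, task));
  const roles = new Set<string>();
  let minApprovers = 0;
  for (const p of active) {
    p.required_roles.forEach((r) => roles.add(r));
    minApprovers = Math.max(minApprovers, p.min_approvers);
  }
  return { requiredRoles: Array.from(roles).sort(), minApprovers };
}
```

A result with `minApprovers` of 0 means no active policy requires human review for the task.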

Delegation Chains

Alice delegates to Bob when:
  - Task type is "infrastructure"
  - Risk score > 50

Bob delegates to Carol when:
  - Resource matches "prod-*"

Result: For prod infrastructure with high risk,
        only Carol's approval is needed

Resolution: Depth-first traversal with cycle detection.
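
The traversal can be sketched as follows, using the Alice, Bob and Carol example. The `Delegation` shape (a target plus a predicate over the task) and the name `resolveApprovers` are assumptions for illustration. Note that if every chain ends in a cycle the result is empty; how to handle that case is left open here.

```typescript
interface TaskContext { type: string; risk: number; resource: string }
interface Delegation { to: string; applies: (t: TaskContext) => boolean }

// Follows applicable delegations depth-first and returns the final approvers.
function resolveApprovers(
  start: string,
  delegations: Map<string, Delegation[]>,
  task: TaskContext,
): string[] {
  const visited = new Set<string>();
  const result: string[] = [];
  const visit = (user: string): void => {
    if (visited.has(user)) return; // cycle, or already reached by another chain
    visited.add(user);
    const next = (delegations.get(user) ?? []).filter((d) => d.applies(task));
    if (next.length === 0) {
      result.push(user); // end of chain: this user must approve
      return;
    }
    next.forEach((d) => visit(d.to));
  };
  visit(start);
  return result;
}

const delegations = new Map<string, Delegation[]>([
  ["alice", [{ to: "bob", applies: (t) => t.type === "infrastructure" && t.risk > 50 }]],
  ["bob", [{ to: "carol", applies: (t) => t.resource.startsWith("prod-") }]],
]);
```

For a high-risk infrastructure task on `prod-db`, resolving from `alice` ends at `carol`; for a low-risk task the chain never starts and `alice` remains the approver.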

Batch Operations

Batch Approve:

POST /approvals/batch
{
  approval_ids: string[];
  action: 'approve' | 'reject';
  reason?: string;
  options: {
    skip_validation: boolean;
    apply_immediately: boolean;
  }
}

Atomic operation: either all approvals succeed or all fail.
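
One way to get all-or-nothing behavior is to validate the whole batch before changing anything. The sketch below shows that idea against an in-memory map; the `ApprovalStatus` values and the function `applyBatch` are hypothetical, and a production version would presumably wrap the writes in a database transaction instead.

```typescript
type Decision = "approve" | "reject";
type ApprovalStatus = "pending" | "approved" | "rejected";

function applyBatch(
  store: Map<string, ApprovalStatus>,
  approvalIds: string[],
  action: Decision,
): void {
  // Phase 1: check every ID before touching anything.
  for (const id of approvalIds) {
    const status = store.get(id);
    if (status !== "pending") {
      throw new Error(`Approval ${id} is not pending (${status ?? "missing"})`);
    }
  }
  // Phase 2: nothing below can fail, so the batch applies as a unit.
  const next: ApprovalStatus = action === "approve" ? "approved" : "rejected";
  for (const id of approvalIds) {
    store.set(id, next);
  }
}
```

If any ID in the batch is missing or already decided, the call throws and the store is left unchanged.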


Error Handling & Edge Cases

Lock Acquisition Failures

| Scenario | Response | Retry Strategy |
|----------|----------|----------------|
| Lock held by another agent | 423 Locked | Queue and wait |
| Lock expired during operation | 409 Conflict | Abort, notify, retry |
| Deadlock detected | 423 Deadlock | Abort, auto-retry with backoff |
| Max queue depth exceeded | 503 Queue Full | Fail fast, notify operator |

Approval Edge Cases

| Scenario | Behavior |
|----------|----------|
| Approver leaves organization | Auto-reassign to delegate or manager |
| Approval timeout (48h default) | Escalate to next level, notify on-call |
| Required reviewer unavailable | Bypass with admin override + audit |
| Task modified during review | Invalidate approvals, restart review |
| Concurrent approvals | Last write wins, notify others of resolution |

System Degradation

| Condition | Response |
|-----------|----------|
| Redis unavailable | Queue in PostgreSQL, async recovery |
| High lock contention | Exponential backoff, circuit breaker |
| Approval queue backlog | Priority escalation, auto-approve low-risk |
| WebSocket failure | Polling fallback, queued events |

Security Model

Permission Matrix

Action Author Reviewer Admin System
Create task
Submit for approval
Approve/reject
Force apply
Cancel task
Override policy ✓*
View audit log

*Requires secondary approval and incident ticket

Audit Requirements

Every state transition is logged with:

  • Who (user ID, session)
  • What (from state, to state, action)
  • When (timestamp with microsecond precision)
  • Where (IP, user agent, service)
  • Why (reason, ticket reference)
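
The five dimensions map naturally onto a record type. The following is a sketch with hypothetical field names; the in-memory `auditLog` array stands in for the PostgreSQL audit table, and the microsecond timestamp is assumed to come from the caller because `Date.now()` only offers millisecond resolution.

```typescript
interface AuditEntry {
  who: { user_id: string; session_id: string };
  what: { from_state: string; to_state: string; action: string };
  when_us: number; // microseconds since epoch, supplied by the caller
  where: { ip: string; user_agent: string; service: string };
  why: { reason: string; ticket?: string };
}

const auditLog: AuditEntry[] = [];

// Appends an entry and returns a shallowly frozen copy of it.
function recordTransition(entry: AuditEntry): Readonly<AuditEntry> {
  if (entry.what.from_state === entry.what.to_state) {
    throw new Error("Audit entries must describe an actual state change");
  }
  const frozen = Object.freeze({ ...entry });
  auditLog.push(frozen);
  return frozen;
}
```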

Scalability Considerations

Horizontal Scaling

  • API Layer: Stateless, scale via load balancer
  • Redis: Cluster mode, hash tags for lock locality
  • PostgreSQL: Read replicas for audit queries
  • WebSocket: Sticky sessions or Redis pub/sub
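
In Redis Cluster, when a key contains a non-empty `{...}` section, only the text between the first `{` and the next `}` is hashed to pick the slot. The sketch below shows how lock keys could use that to keep related keys on one node. The key builders and the `:queue` suffix are hypothetical, and `hashTag` only mirrors the tag-extraction rule; it does not compute the slot. The `{id}` braces in the key patterns earlier in this document are placeholders, not hash tags.

```typescript
// Returns the part of the key that Redis Cluster would hash.
function hashTag(key: string): string {
  const open = key.indexOf("{");
  if (open === -1) return key;
  const close = key.indexOf("}", open + 1);
  // No closing brace, or an empty "{}": the whole key is hashed.
  if (close === -1 || close === open + 1) return key;
  return key.slice(open + 1, close);
}

// Hypothetical key builders: the resource identity is wrapped in a hash tag.
const resourceLockKey = (type: string, id: string): string =>
  `lock:resource:{${type}:${id}}`;
const resourceQueueKey = (type: string, id: string): string =>
  `lock:resource:{${type}:${id}}:queue`;
```

Because both keys for a given resource share the tag `type:id`, a multi-key script or transaction that touches the lock and its queue stays within a single slot.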

Performance Targets

| Metric | Target | Peak |
|--------|--------|------|
| Lock acquisition | < 10ms | < 50ms @ p99 |
| Approval latency | < 100ms | < 500ms @ p99 |
| Task throughput | 1000/min | 5000/min burst |
| Concurrent locks | 10,000 | 50,000 |

Deployment Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        KUBERNETES CLUSTER                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │  API Pod    │  │  API Pod    │  │  API Pod    │             │
│  │(3+ replicas)│  │             │  │             │             │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘             │
│         │                │                │                     │
│         └────────────────┼────────────────┘                     │
│                          │                                      │
│                   ┌──────┴──────┐                              │
│                   │   Ingress   │                              │
│                   │  Controller │                              │
│                   └──────┬──────┘                              │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│  ┌───────────────────────┴───────────────────────┐             │
│  │              Redis Cluster                    │             │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐       │             │
│  │  │ Master  │  │ Master  │  │ Master  │       │             │
│  │  │  + Repl │  │  + Repl │  │  + Repl │       │             │
│  │  └─────────┘  └─────────┘  └─────────┘       │             │
│  └───────────────────────────────────────────────┘             │
│                                                                 │
│  ┌─────────────────────────────────────────────────┐           │
│  │         PostgreSQL (HA: Patroni)                │           │
│  │         Primary + 2 Replicas                    │           │
│  └─────────────────────────────────────────────────┘           │
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │ Worker Pod  │  │ Worker Pod  │  │ Worker Pod  │             │
│  │ (HPA: 2-20) │  │             │  │             │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Future Enhancements

  1. ML-Based Risk Scoring: Train models on historical task outcomes
  2. Predictive Locking: Pre-acquire locks based on task patterns
  3. Approval Simulation: "What if" analysis before submitting
  4. Time-Based Policies: Different rules for on-call hours
  5. Integration Marketplace: Slack, PagerDuty, ServiceNow webhooks

Glossary

| Term | Definition |
|------|------------|
| Clean Apply | Applying changes only after successful lock acquisition and approval |
| Deadlock | Circular wait condition between multiple lock holders |
| Delegation Chain | Hierarchical approval routing |
| Lock Queue | FIFO waiting list for lock acquisition |
| Risk Score | Calculated metric (0-100) indicating task danger |
| Stash | Saved state for potential rollback |
| Wait-For Graph | Data structure for deadlock detection |