Ani (Annie Tunturi) ce8dd84840 feat: Add approval system and agent config UI

- Omega (Kimi-K2.5): Approval system architecture
  - design.md: Full system architecture with state machines
  - api-spec.ts: Express routes + Zod schemas (33KB)
  - redis-schema.md: Redis key patterns (19KB)
  - ui-components.md: Dashboard UI specs (31KB)

- Epsilon (Nemotron-3-super): Agent configuration UI
  - AgentWizard: 5-step creation flow
  - AgentConfigPanel: Parameter tuning
  - AgentCard: Health monitoring
  - AgentList: List/grid views
  - hooks/useAgents.ts: WebSocket integration
  - types/agent.ts: TypeScript definitions

Total: 150KB new code, 22 components

👾 Generated with [Letta Code](https://letta.com)

2026-03-18 12:23:59 -04:00

29 KiB

Community ADE Approval System Architecture

Executive Summary

The Approval System provides governance and safety controls for the Community ADE platform. It introduces human-in-the-loop validation for task execution, distributed locking for resource protection, and a complete audit trail for compliance.

Core Concepts

1. Clean Apply Locks

Philosophy: A lock grants permission to attempt an operation; it does not guarantee that the operation succeeds. Locks are advisory in that sense, but the system enforces them strictly: no apply proceeds without first acquiring the relevant lock.

Lock Hierarchy:

Task Lock (lock:task:{task_id})                             - Single task execution
Resource Lock (lock:resource:{resource_type}:{resource_id}) - Shared resource protection
Agent Lock (lock:agent:{agent_id}:capacity)                 - Agent capacity management

Lock Properties:

Ownership: UUID of the lock holder
TTL: 30 seconds default, extendable via heartbeats
Queue: FIFO ordered waiting list for fairness
Metadata: Timestamp, purpose, agent info

2. Approval Lifecycle State Machine

┌─────────────────────────────────────────────────────────────────────────────┐
│                         APPROVAL LIFECYCLE                                  │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────┐    ┌──────────┐    ┌───────────┐    ┌─────────┐    ┌─────────┐  │
│   │DRAFT │───→│ SUBMITTED│───→│ REVIEWING │───→│ APPROVED│───→│ APPLYING│  │
│   └──────┘    └──────────┘    └───────────┘    └─────────┘    └────┬────┘  │
│       │              │                │              │              │      │
│       │              │                │              │              ▼      │
│       │              │                │              │         ┌─────────┐ │
│       │              │                │              │         │COMPLETED│ │
│       │              │                │              │         └─────────┘ │
│       │              │                │              │                     │
│       │              │                └──────────────┘                     │
│       │              │                       │                            │
│       │              │                       ▼                            │
│       │              │                  ┌─────────┐                        │
│       │              │                  │REJECTED │                        │
│       │              │                  └─────────┘                        │
│       │              │                       │                            │
│       │              └───────────────────────┘                            │
│       │                                                                    │
│       ▼                                                                    │
│   ┌─────────┐                                                              │
│   │CANCELLED│                                                              │
│   └─────────┘                                                              │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

State Descriptions:

State	Description	Permissions
`DRAFT`	Task created but not submitted	Edit, Delete, Submit
`SUBMITTED`	Validation complete, awaiting review	None (locked)
`REVIEWING`	Under active review by approvers	Add comments
`APPROVED`	All required approvals received	Queue for apply
`APPLYING`	Lock acquired, executing changes	Read-only
`COMPLETED`	Changes successfully applied	Read-only, audit
`REJECTED`	Approval denied	Can resubmit as new
`CANCELLED`	Aborted before apply	Archive only
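
The lifecycle can also be read as a transition table. The following is a minimal sketch, assuming only the edges that are unambiguous in the diagram: the happy path, DRAFT to CANCELLED, and REVIEWING to REJECTED. The names `TRANSITIONS`, `canTransition`, and `transition` are hypothetical and not taken from api-spec.ts.

```typescript
// State names come from the State Descriptions table.
// The edge set is one interpretation of the diagram, not a confirmed spec.
type ApprovalState =
  | "DRAFT" | "SUBMITTED" | "REVIEWING" | "APPROVED"
  | "APPLYING" | "COMPLETED" | "REJECTED" | "CANCELLED";

const TRANSITIONS: Record<ApprovalState, readonly ApprovalState[]> = {
  DRAFT: ["SUBMITTED", "CANCELLED"],
  SUBMITTED: ["REVIEWING"],
  REVIEWING: ["APPROVED", "REJECTED"],
  APPROVED: ["APPLYING"],
  APPLYING: ["COMPLETED"],
  COMPLETED: [], // terminal
  REJECTED: [],  // terminal: a resubmission is created as a new task
  CANCELLED: [], // terminal
};

// True when the table lists `to` as a successor of `from`.
function canTransition(from: ApprovalState, to: ApprovalState): boolean {
  return TRANSITIONS[from].includes(to);
}

// Returns the new state, or throws on an edge the table does not allow.
function transition(from: ApprovalState, to: ApprovalState): ApprovalState {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the edges in a single table means the API routes and the workers can share one definition of what is legal, rather than each re-encoding the diagram.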

3. Human Gates

Review Policies:

Auto-approve: Tasks below risk threshold skip human review
Required reviewers: Based on task type, resource scope, risk score
Delegation chains: "If my manager approves, auto-approve for me"
Quorum rules: N-of-M approvals required
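
An N-of-M quorum rule can be evaluated as below. This is a sketch under stated assumptions: `QuorumRule`, `ReviewerDecision`, and `evaluateQuorum` are hypothetical names, and treating the request as rejected once the quorum becomes unreachable is an assumption (a stricter policy might reject on the first rejection).

```typescript
type ReviewerDecision = "approve" | "reject";
type QuorumStatus = "approved" | "rejected" | "pending";

interface QuorumRule {
  required: number;    // N: approvals needed
  reviewers: string[]; // the M eligible reviewers
}

// Counts only decisions made by eligible reviewers.
function evaluateQuorum(
  rule: QuorumRule,
  decisions: Map<string, ReviewerDecision>,
): QuorumStatus {
  let approvals = 0;
  let rejections = 0;
  for (const reviewer of rule.reviewers) {
    const decision = decisions.get(reviewer);
    if (decision === "approve") approvals++;
    else if (decision === "reject") rejections++;
  }
  if (approvals >= rule.required) return "approved";
  // Assumption: reject once too few reviewers remain to ever reach N.
  if (rule.reviewers.length - rejections < rule.required) return "rejected";
  return "pending";
}
```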

Risk Scoring:

RiskScore = (
  resource_criticality * 0.4 +
  change_magnitude * 0.3 +
  blast_radius * 0.2 +
  historical_failure_rate * 0.1
) // 0-100 scale
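
As a worked example of the formula, here is a small sketch. It assumes each factor is already normalized to 0-100, which the source implies but does not state; `RiskFactors`, `riskScore`, and the clamping and rounding steps are illustrative additions.

```typescript
// Assumed input shape: each factor normalized to the 0-100 range.
interface RiskFactors {
  resource_criticality: number;
  change_magnitude: number;
  blast_radius: number;
  historical_failure_rate: number;
}

// Weights from the formula; they sum to 1.0, so the score stays in 0-100.
const WEIGHTS = {
  resource_criticality: 0.4,
  change_magnitude: 0.3,
  blast_radius: 0.2,
  historical_failure_rate: 0.1,
};

function clamp(value: number): number {
  return Math.min(100, Math.max(0, value));
}

function riskScore(f: RiskFactors): number {
  const raw =
    clamp(f.resource_criticality) * WEIGHTS.resource_criticality +
    clamp(f.change_magnitude) * WEIGHTS.change_magnitude +
    clamp(f.blast_radius) * WEIGHTS.blast_radius +
    clamp(f.historical_failure_rate) * WEIGHTS.historical_failure_rate;
  // Round to one decimal place to hide floating-point noise.
  return Math.round(raw * 10) / 10;
}

// A critical resource with a small, well-contained change:
// 90*0.4 + 20*0.3 + 10*0.2 + 5*0.1 = 36 + 6 + 2 + 0.5 = 44.5
const exampleScore = riskScore({
  resource_criticality: 90,
  change_magnitude: 20,
  blast_radius: 10,
  historical_failure_rate: 5,
});
```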

System Architecture

Component Diagram

┌──────────────────────────────────────────────────────────────────────────────┐
│                              CLIENT LAYER                                     │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │   Dashboard UI  │  │   CLI Tool      │  │   Webhook API   │              │
│  │   (Delta-V2)    │  │   (Omega-CLI)   │  │   (External)    │              │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘              │
└───────────┼────────────────────┼────────────────────┼────────────────────────┘
            │                    │                    │
            ▼                    ▼                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                           API GATEWAY LAYER                                   │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────────────────────────────────────────────────────────────┐ │
│  │                    Express API Server (Beta Patterns)                   │ │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐ │ │
│  │  │ Task Routes  │  │ApprovalRoutes│  │ Lock Routes  │  │  WebSocket  │ │ │
│  │  │              │  │              │  │              │  │   Handler   │ │ │
│  │  └──────────────┘  └──────────────┘  └──────────────┘  └─────────────┘ │ │
│  └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────┬────────────────────────────────────────────┘
                                  │
            ┌─────────────────────┼─────────────────────┐
            │                     │                     │
            ▼                     ▼                     ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   REDIS LAYER   │    │  POSTGRESQL     │    │   EVENT BUS     │
│   (Alpha)       │    │  (Persistence)  │    │   (WebSocket)   │
├─────────────────┤    ├─────────────────┤    ├─────────────────┤
│ • Locks         │    │ • Task History  │    │ • approval:*    │
│ • Queues        │    │ • Audit Log     │    │ • lock:*        │
│ • Sessions      │    │ • User Policies │    │ • task:*        │
│ • Rate Limits   │    │ • Delegations   │    │                 │
└────────┬────────┘    └─────────────────┘    └─────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                          WORKER LAYER (Gamma)                                 │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐              │
│  │  Lock Manager   │  │  Task Executor  │  │  Heartbeat Mon  │              │
│  │                 │  │                 │  │                 │              │
│  │ • Acquire locks │  │ • Check locks   │  │ • Watchdog      │              │
│  │ • Queue waiters │  │ • Execute apply │  │ • Deadlock det  │              │
│  │ • Release/clean │  │ • Rollback on   │  │ • Auto-recovery │              │
│  │                 │  │   failure       │  │                 │              │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘              │
└──────────────────────────────────────────────────────────────────────────────┘

Data Flow: Submit → Approve → Apply

┌─────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  USER   │     │   SYSTEM    │     │   SYSTEM    │     │   WORKER    │
└────┬────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
     │                 │                   │                   │
     │ POST /tasks     │                   │                   │
     │ {config}        │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │──┐                │                   │
     │                 │  │ Validate       │                   │
     │                 │  │ Calculate risk │                   │
     │                 │  │ Generate preview│                  │
     │                 │←─┘                │                   │
     │                 │                   │                   │
     │ 201 Created     │                   │                   │
     │ task_id: xyz    │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │ POST /tasks/xyz │                   │                   │
     │ /submit         │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │──┐                │                   │
     │                 │  │ State:SUBMITTED│                   │
     │                 │  │ Lock resources │                   │
     │                 │  │ (preview only) │                   │
     │                 │←─┘                │                   │
     │                 │                   │                   │
     │ 202 Accepted    │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │                 │   approval:requested                  │
     │                 ├──────────────────→│                   │
     │                 │                   │                   │
     │                 │                   │──┐                │
     │                 │                   │  │ Check policies  │
     │                 │                   │  │ Notify reviewers│
     │                 │                   │←─┘                │
     │                 │                   │                   │
     │  [Time passes]  │                   │                   │
     │                 │                   │                   │
     │ POST /approvals │                   │                   │
     │ /{id}/approve   │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │                   │──┐                │
     │                 │                   │  │ Record approval │
     │                 │                   │  │ Check quorum    │
     │                 │                   │←─┘                │
     │                 │                   │                   │
     │ 200 OK          │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │                 │                   │   approval:approved│
     │                 │←──────────────────┤                   │
     │                 │                   │                   │
     │                 │──┐                                    │
     │                 │  │ State:APPROVED                     │
     │                 │  │ Queue for apply                    │
     │                 │←─┘                                    │
     │                 │                   │                   │
     │                 │   task:approved   │                   │
     │                 ├──────────────────────────────────────→│
     │                 │                   │                   │
     │                 │                   │                   │──┐
     │                 │                   │                   │  │ Acquire apply lock
     │                 │                   │                   │  │ State:APPLYING
     │                 │                   │                   │  │ Execute changes
     │                 │                   │                   │←─┘
     │                 │                   │                   │
     │                 │   lock:acquired   │                   │
     │                 │←──────────────────────────────────────┤
     │                 │                   │                   │
     │                 │                   │                   │──┐
     │                 │                   │                   │  │ Apply succeeded
     │                 │                   │                   │  │ State:COMPLETED
     │                 │                   │                   │  │ Release locks
     │                 │                   │                   │←─┘
     │                 │                   │                   │
     │                 │   task:completed  │                   │
     │                 │←──────────────────────────────────────┤
     │                 │                   │                   │

Lock System Deep Dive

Lock Types

1. Task Lock (Exclusive)

Key: lock:task:{task_id}
Value: { holder_agent_id, acquired_at, expires_at, purpose }
TTL: 30s with automatic extension on heartbeat

Prevents concurrent execution of the same task. Released on completion or failure.
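
The exclusive task lock can be sketched with an in-memory table standing in for Redis. `TaskLockTable` and its methods are hypothetical; a real deployment would need an atomic set-if-absent with expiry on the Redis side, which this sketch does not attempt to model. The clock is passed in so the behaviour is deterministic.

```typescript
// In-memory stand-in for the Redis key lock:task:{task_id}.
interface TaskLock {
  holder_agent_id: string;
  acquired_at: number; // ms since epoch
  expires_at: number;  // ms since epoch
  purpose: string;
}

const DEFAULT_TTL_MS = 30 * 1000; // 30s default TTL

class TaskLockTable {
  private locks = new Map<string, TaskLock>();

  // Succeeds only if the key is free or the previous lock has expired.
  acquire(taskId: string, agentId: string, purpose: string, now: number): boolean {
    const key = `lock:task:${taskId}`;
    const current = this.locks.get(key);
    if (current && current.expires_at > now) return false; // still held
    this.locks.set(key, {
      holder_agent_id: agentId,
      acquired_at: now,
      expires_at: now + DEFAULT_TTL_MS,
      purpose,
    });
    return true;
  }

  // Only the current holder may release; anyone else gets false.
  release(taskId: string, agentId: string): boolean {
    const key = `lock:task:${taskId}`;
    const current = this.locks.get(key);
    if (!current || current.holder_agent_id !== agentId) return false;
    this.locks.delete(key);
    return true;
  }
}
```

Checking the holder on release matters: without it, an agent whose lock expired could delete a lock that has since been granted to someone else.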

2. Resource Lock (Shared/Exclusive)

Key: lock:resource:{resource_type}:{resource_id}
Value: {
  mode: 'exclusive' | 'shared',
  holders: [{ agent_id, acquired_at }],
  queue: [{ agent_id, mode, requested_at }]
}

Allows multiple readers (shared) or single writer (exclusive). Queue ensures FIFO ordering.
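
A minimal in-memory sketch of the shared/exclusive semantics follows. The class `ResourceLock` and its methods are hypothetical. One design assumption is worth calling out: a new shared request is queued whenever anyone is already waiting, even if it would be compatible with the current holders, so that a queued writer is not starved.

```typescript
type ResourceLockMode = "exclusive" | "shared";

interface Holder { agent_id: string; acquired_at: number; }
interface Waiter { agent_id: string; mode: ResourceLockMode; requested_at: number; }

class ResourceLock {
  mode: ResourceLockMode = "shared";
  holders: Holder[] = [];
  queue: Waiter[] = [];

  // Grants immediately when compatible and nobody is queued ahead; otherwise enqueues.
  request(agentId: string, mode: ResourceLockMode, now: number): boolean {
    if (this.isCompatible(mode) && this.queue.length === 0) {
      this.grant(agentId, mode, now);
      return true;
    }
    this.queue.push({ agent_id: agentId, mode, requested_at: now });
    return false;
  }

  // Removes a holder, then admits waiters strictly from the head of the queue.
  // Returns the agent IDs that were granted the lock as a result.
  release(agentId: string, now: number): string[] {
    this.holders = this.holders.filter((h) => h.agent_id !== agentId);
    const granted: string[] = [];
    while (this.queue.length > 0 && this.isCompatible(this.queue[0].mode)) {
      const next = this.queue.shift() as Waiter;
      this.grant(next.agent_id, next.mode, now);
      granted.push(next.agent_id);
    }
    return granted;
  }

  // Free locks accept any mode; held locks accept only shared-on-shared.
  private isCompatible(mode: ResourceLockMode): boolean {
    if (this.holders.length === 0) return true;
    return mode === "shared" && this.mode === "shared";
  }

  private grant(agentId: string, mode: ResourceLockMode, now: number): void {
    this.mode = mode;
    this.holders.push({ agent_id: agentId, acquired_at: now });
  }
}
```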

3. Agent Capacity Lock

Key: lock:agent:{agent_id}:capacity
Value: { active_tasks: number, max_tasks: number }

Prevents agent overload. Each agent has configurable concurrency limits.
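
The capacity check reduces to a bounded counter. A sketch, with `tryReserveSlot` and `releaseSlot` as hypothetical helper names; against Redis, the check and the increment would have to happen as one atomic step, which this in-memory version does not need to worry about.

```typescript
interface AgentCapacity {
  active_tasks: number;
  max_tasks: number;
}

// Reserves one slot if the agent is below its concurrency limit.
function tryReserveSlot(cap: AgentCapacity): boolean {
  if (cap.active_tasks >= cap.max_tasks) return false;
  cap.active_tasks += 1;
  return true;
}

// Frees one slot; never drops below zero.
function releaseSlot(cap: AgentCapacity): void {
  cap.active_tasks = Math.max(0, cap.active_tasks - 1);
}
```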

Deadlock Detection

Algorithm: Wait-For Graph

If Agent A holds Lock X and waits for Lock Y
And Agent B holds Lock Y and waits for Lock X
→ Deadlock detected

Resolution:

1. Abort youngest transaction (lower cost)
2. Release all held locks
3. Notify owner with DEADLOCK_DETECTED error
4. Auto-retry with exponential backoff
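
Detecting a deadlock in a wait-for graph amounts to finding a cycle. The sketch below uses a depth-first search and then picks the youngest participant as the victim. `WaitForGraph`, `findDeadlock`, and `pickVictim` are hypothetical names, and using a start timestamp to decide which transaction is "youngest" is an assumption.

```typescript
// Edges point from a waiting agent to the agents holding the locks it wants.
type WaitForGraph = Map<string, string[]>;

// Returns one cycle as a list of agent IDs, or null if the graph has none.
function findDeadlock(graph: WaitForGraph): string[] | null {
  const visiting = new Set<string>(); // nodes on the current DFS path
  const done = new Set<string>();     // fully explored, known cycle-free
  const path: string[] = [];

  function dfs(node: string): string[] | null {
    if (done.has(node)) return null;
    if (visiting.has(node)) return path.slice(path.indexOf(node));
    visiting.add(node);
    path.push(node);
    for (const next of graph.get(node) ?? []) {
      const cycle = dfs(next);
      if (cycle) return cycle;
    }
    path.pop();
    visiting.delete(node);
    done.add(node);
    return null;
  }

  for (const node of Array.from(graph.keys())) {
    const cycle = dfs(node);
    if (cycle) return cycle;
  }
  return null;
}

// Youngest = largest start timestamp (assumed tie-break: first seen wins).
function pickVictim(cycle: string[], startedAt: Map<string, number>): string {
  return cycle.reduce((youngest, id) =>
    (startedAt.get(id) ?? 0) > (startedAt.get(youngest) ?? 0) ? id : youngest);
}

// Agent A holds X and waits for Y (held by B); B holds Y and waits for X.
const waitFor: WaitForGraph = new Map([
  ["A", ["B"]],
  ["B", ["A"]],
]);
const deadlockCycle = findDeadlock(waitFor); // ["A", "B"]
```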

Lock Heartbeat Protocol

interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;
  ttl_extension: number; // seconds
}

// Client must send heartbeat every 10s (configurable)
// Server extends TTL on receipt
// If no heartbeat for 30s, lock auto-expires
// Expired locks trigger cleanup and notify waiters
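
The server side of this protocol can be sketched as two small functions. `LockRecord`, `applyHeartbeat`, and `sweepExpired` are hypothetical names; the sketch assumes timestamps are in milliseconds and that the server extends the TTL from its own clock rather than trusting the client's `timestamp`.

```typescript
interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;     // assumed to be ms since epoch
  ttl_extension: number; // seconds
}

interface LockRecord {
  lock_id: string;
  holder_agent_id: string;
  expires_at: number; // ms since epoch
}

// Extends the lock if it exists, is unexpired, and the sender is the holder.
function applyHeartbeat(
  locks: Map<string, LockRecord>,
  msg: HeartbeatMessage,
  now: number,
): boolean {
  const lock = locks.get(msg.lock_id);
  if (!lock) return false;
  if (lock.holder_agent_id !== msg.agent_id) return false;
  if (lock.expires_at <= now) return false; // heartbeat arrived too late
  lock.expires_at = now + msg.ttl_extension * 1000;
  return true;
}

// Removes expired locks and returns their IDs so waiters can be notified.
function sweepExpired(locks: Map<string, LockRecord>, now: number): string[] {
  const expired: string[] = [];
  locks.forEach((lock, id) => {
    if (lock.expires_at <= now) expired.push(id);
  });
  for (const id of expired) locks.delete(id);
  return expired;
}
```

With a 10s heartbeat interval and a 30s TTL, a holder can miss two consecutive heartbeats before its lock expires.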

Approval Engine

Reviewer Assignment

interface ReviewerPolicy {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
  risk_threshold: number;
  auto_approve_if: {
    risk_below: number;
    author_has_role: string[];
    resources_in_scope: string[];
  };
}

Assignment Algorithm:

1. Match task against all policies
2. Union all required reviewers from matching policies
3. Check for delegation chains
4. Filter out auto-approved reviewers (based on policy)
5. Calculate minimum approvals needed
6. Create approval requests
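
The matching, union, auto-approve filter, and minimum-approval steps might look roughly like this. It is a sketch under several assumptions: it uses a trimmed policy shape (`PolicySubset`), treats resource patterns as literals or prefixes ending in `*`, takes the largest `min_approvers` across matching policies, and treats a task that matches no policy as needing no review. Delegation handling is left out. All helper names are hypothetical.

```typescript
// Trimmed version of ReviewerPolicy: only the fields this sketch uses.
interface PolicySubset {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
  auto_approve_if: { risk_below: number };
}

interface ReviewTask { type: string; resource: string; risk: number; }

interface ReviewPlan {
  auto_approved: boolean;
  required_roles: string[];
  min_approvers: number;
}

// Assumption: patterns are literals or end in a single "*" wildcard.
function matchesPattern(pattern: string, resource: string): boolean {
  return pattern.endsWith("*")
    ? resource.startsWith(pattern.slice(0, -1))
    : pattern === resource;
}

function policyMatches(p: PolicySubset, t: ReviewTask): boolean {
  return p.task_types.includes(t.type) &&
    p.resource_patterns.some((pat) => matchesPattern(pat, t.resource));
}

function planReview(task: ReviewTask, policies: PolicySubset[]): ReviewPlan {
  // Match the task against all policies.
  const matching = policies.filter((p) => policyMatches(p, task));
  // Drop policies whose auto-approve threshold the task falls under.
  const active = matching.filter((p) => task.risk >= p.auto_approve_if.risk_below);
  // Assumption: nothing left to enforce means no human review is required.
  if (active.length === 0) {
    return { auto_approved: true, required_roles: [], min_approvers: 0 };
  }
  // Union required roles; take the strictest approval count.
  const roles = new Set<string>();
  let minApprovers = 0;
  for (const p of active) {
    p.required_roles.forEach((r) => roles.add(r));
    minApprovers = Math.max(minApprovers, p.min_approvers);
  }
  return {
    auto_approved: false,
    required_roles: Array.from(roles).sort(),
    min_approvers: minApprovers,
  };
}
```

A production system may prefer to fail closed when no policy matches; the permissive default here is only to keep the sketch short.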

Delegation Chains

Alice delegates to Bob when:
  - Task type is "infrastructure"
  - Risk score > 50

Bob delegates to Carol when:
  - Resource matches "prod-*"

Result: For prod infrastructure with high risk,
        only Carol's approval is needed

Resolution: Depth-first traversal with cycle detection.
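
A depth-first resolution with cycle detection could be sketched as follows, using the Alice, Bob, and Carol rules as test data. `Delegation`, `DelegatedTask`, and `resolveApprovers` are hypothetical names, and raising an error on a cycle (rather than silently stopping) is an assumption.

```typescript
interface DelegatedTask { type: string; resource: string; risk: number; }

interface Delegation {
  from: string;
  to: string;
  when: (task: DelegatedTask) => boolean;
}

// Follows every delegation that applies to the task and returns the people
// whose approval is finally needed. Throws if the chain loops back on itself.
function resolveApprovers(
  start: string,
  task: DelegatedTask,
  delegations: Delegation[],
): string[] {
  const finalApprovers = new Set<string>();
  const onPath = new Set<string>(); // people on the current DFS path

  function visit(person: string): void {
    if (onPath.has(person)) {
      throw new Error(`Delegation cycle detected at ${person}`);
    }
    const active = delegations.filter((d) => d.from === person && d.when(task));
    if (active.length === 0) {
      finalApprovers.add(person); // no applicable delegation: this person decides
      return;
    }
    onPath.add(person);
    for (const d of active) visit(d.to);
    onPath.delete(person);
  }

  visit(start);
  return Array.from(finalApprovers).sort();
}

const chain: Delegation[] = [
  { from: "Alice", to: "Bob", when: (t) => t.type === "infrastructure" && t.risk > 50 },
  { from: "Bob", to: "Carol", when: (t) => t.resource.startsWith("prod-") },
];

// High-risk prod infrastructure resolves to Carol alone.
const prodApprovers = resolveApprovers(
  "Alice",
  { type: "infrastructure", resource: "prod-db", risk: 80 },
  chain,
);
```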

Batch Operations

Batch Approve:

POST /approvals/batch
{
  approval_ids: string[];
  action: 'approve' | 'reject';
  reason?: string;
  options: {
    skip_validation: boolean;
    apply_immediately: boolean;
  }
}

Atomic operation: either all approvals succeed or all fail.
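
One way to get all-or-nothing behaviour is to validate every item before mutating anything. The sketch below shows that two-phase shape against an in-memory map; `applyBatch`, `ApprovalStatus`, and the rule that only pending approvals can be decided are assumptions, and a real implementation would wrap the commit phase in a database transaction.

```typescript
type BatchAction = "approve" | "reject";
type ApprovalStatus = "pending" | "approved" | "rejected";

interface BatchRequest {
  approval_ids: string[];
  action: BatchAction;
  reason?: string;
}

interface BatchResult { ok: boolean; errors: string[]; }

function applyBatch(store: Map<string, ApprovalStatus>, req: BatchRequest): BatchResult {
  // Phase 1: validate every ID without touching the store.
  const errors: string[] = [];
  for (const id of req.approval_ids) {
    const current = store.get(id);
    if (current === undefined) errors.push(`${id}: not found`);
    else if (current !== "pending") errors.push(`${id}: already ${current}`);
  }
  if (errors.length > 0) return { ok: false, errors };

  // Phase 2: commit. Reached only when every item passed validation.
  const next: ApprovalStatus = req.action === "approve" ? "approved" : "rejected";
  for (const id of req.approval_ids) store.set(id, next);
  return { ok: true, errors: [] };
}
```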

Error Handling & Edge Cases

Lock Acquisition Failures

Scenario	Response	Retry Strategy
Lock held by another agent	423 Locked	Queue and wait
Lock expired during operation	409 Conflict	Abort, notify, retry
Deadlock detected	423 Locked (error code DEADLOCK_DETECTED)	Abort, auto-retry with backoff
Max queue depth exceeded	503 Service Unavailable (queue full)	Fail fast, notify operator
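
The backoff used for automatic retries can be sketched as a capped exponential schedule with optional jitter. The option names, the default values in the example, and the choice of "full jitter" are assumptions; the source only says that retries back off exponentially.

```typescript
interface BackoffOptions {
  base_ms: number;      // delay before the first retry
  max_ms: number;       // upper bound on any single delay
  max_attempts: number; // number of retries before giving up
}

// Delay for a zero-based attempt number: base * 2^attempt, capped at max_ms.
function backoffDelay(attempt: number, opts: BackoffOptions): number {
  return Math.min(opts.max_ms, opts.base_ms * Math.pow(2, attempt));
}

// The full list of delays a client would wait through before giving up.
function retrySchedule(opts: BackoffOptions): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < opts.max_attempts; attempt++) {
    delays.push(backoffDelay(attempt, opts));
  }
  return delays;
}

// Full jitter: pick a uniform delay in [0, delayMs) so that agents which
// collided once do not retry in lockstep. `random` is injectable for tests.
function withFullJitter(delayMs: number, random: () => number = Math.random): number {
  return Math.floor(random() * delayMs);
}

const schedule = retrySchedule({ base_ms: 100, max_ms: 1000, max_attempts: 5 });
// [100, 200, 400, 800, 1000]
```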

Approval Edge Cases

Scenario	Behavior
Approver leaves organization	Auto-reassign to delegate or manager
Approval timeout (48h default)	Escalate to next level, notify on-call
Required reviewer unavailable	Bypass with admin override + audit
Task modified during review	Invalidate approvals, restart review
Concurrent approvals	Last write wins, notify others of resolution

System Degradation

Condition	Response
Redis unavailable	Queue in PostgreSQL, async recovery
High lock contention	Exponential backoff, circuit breaker
Approval queue backlog	Priority escalation, auto-approve low-risk
WebSocket failure	Polling fallback, queued events

Security Model

Permission Matrix

Action	Author	Reviewer	Admin	System
Create task	✓	✗	✓	✗
Submit for approval	✓	✗	✓	✗
Approve/reject	✗	✓	✓	✗
Force apply	✗	✗	✓	✗
Cancel task	✓	✗	✓	✓
Override policy	✗	✗	✓*	✗
View audit log	✓	✓	✓	✓

*Requires secondary approval and incident ticket

Audit Requirements

Every state transition logged:

Who (user ID, session)
What (from state, to state, action)
When (timestamp with microsecond precision)
Where (IP, user agent, service)
Why (reason, ticket reference)
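
The five dimensions map naturally onto a record type. The shape below is a hypothetical sketch, not the schema from api-spec.ts. Because JavaScript's `Date` only resolves milliseconds, the sketch carries microseconds as an integer count since the Unix epoch and formats them by hand; how the real system obtains microsecond timestamps is not specified in the source.

```typescript
// Hypothetical record shape; field names are illustrative.
interface AuditEntry {
  who: { user_id: string; session_id: string };
  what: { from_state: string; to_state: string; action: string };
  when: string; // ISO-8601 UTC with six fractional digits
  where: { ip: string; user_agent: string; service: string };
  why: { reason: string; ticket_ref?: string };
}

// Formats integer microseconds since the epoch as an ISO-8601 UTC string.
function toIsoMicros(epochMicros: number): string {
  const millis = Math.floor(epochMicros / 1000);
  const micros = epochMicros - millis * 1000; // remaining 0..999 microseconds
  const iso = new Date(millis).toISOString(); // ends in ".sssZ"
  return iso.slice(0, -1) + ("00" + micros).slice(-3) + "Z";
}

const sampleEntry: AuditEntry = {
  who: { user_id: "u-123", session_id: "s-456" },
  what: { from_state: "REVIEWING", to_state: "APPROVED", action: "approve" },
  when: toIsoMicros(1500042), // "1970-01-01T00:00:01.500042Z"
  where: { ip: "203.0.113.7", user_agent: "cli", service: "approval-api" },
  why: { reason: "Change reviewed" },
};
```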

Scalability Considerations

Horizontal Scaling

API Layer: Stateless, scale via load balancer
Redis: Cluster mode, hash tags for lock locality
PostgreSQL: Read replicas for audit queries
WebSocket: Sticky sessions or Redis pub/sub
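
In Redis Cluster, when a key contains a non-empty `{...}` section, only the text inside the first such section is hashed to choose a slot, so keys sharing that tag land on the same node. The sketch below shows how lock-related keys for one resource could share a tag. The key layouts (`lock:resource:{...}`, `queue:resource:{...}`) and helper names are hypothetical, and the function only extracts the tag; it does not compute the slot number.

```typescript
// Returns the portion of a key that Redis Cluster would hash:
// the text between the first "{" and the next "}", if that text is non-empty;
// otherwise the whole key.
function hashTag(key: string): string {
  const start = key.indexOf("{");
  if (start === -1) return key;
  const end = key.indexOf("}", start + 1);
  if (end === -1 || end === start + 1) return key; // unclosed or empty tag
  return key.slice(start + 1, end);
}

// Hypothetical key builders that put the resource identity in the hash tag,
// so a lock and its wait queue can be touched in one multi-key operation.
function resourceLockKey(type: string, id: string): string {
  return `lock:resource:{${type}:${id}}`;
}

function resourceQueueKey(type: string, id: string): string {
  return `queue:resource:{${type}:${id}}`;
}
```

Without a shared tag, the lock key and its queue key could hash to different slots, and a script or transaction that touches both would be rejected by the cluster.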

Performance Targets

Metric	Target	Peak
Lock acquisition	< 10ms	< 50ms @ p99
Approval latency	< 100ms	< 500ms @ p99
Task throughput	1000/min	5000/min burst
Concurrent locks	10,000	50,000

Deployment Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        KUBERNETES CLUSTER                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │  API Pod    │  │  API Pod    │  │  API Pod    │             │
│  │(3+ replicas)│  │             │  │             │             │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘             │
│         │                │                │                     │
│         └────────────────┼────────────────┘                     │
│                          │                                      │
│                   ┌──────┴──────┐                              │
│                   │   Ingress   │                              │
│                   │  Controller │                              │
│                   └──────┬──────┘                              │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│  ┌───────────────────────┴───────────────────────┐             │
│  │              Redis Cluster                    │             │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐       │             │
│  │  │ Master  │  │ Master  │  │ Master  │       │             │
│  │  │  + Repl │  │  + Repl │  │  + Repl │       │             │
│  │  └─────────┘  └─────────┘  └─────────┘       │             │
│  └───────────────────────────────────────────────┘             │
│                                                                 │
│  ┌─────────────────────────────────────────────────┐           │
│  │         PostgreSQL (HA: Patroni)                │           │
│  │         Primary + 2 Replicas                    │           │
│  └─────────────────────────────────────────────────┘           │
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐             │
│  │ Worker Pod  │  │ Worker Pod  │  │ Worker Pod  │             │
│  │ (HPA: 2-20) │  │             │  │             │             │
│  └─────────────┘  └─────────────┘  └─────────────┘             │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Future Enhancements

ML-Based Risk Scoring: Train models on historical task outcomes
Predictive Locking: Pre-acquire locks based on task patterns
Approval Simulation: "What if" analysis before submitting
Time-Based Policies: Different rules for on-call hours
Integration Marketplace: Slack, PagerDuty, ServiceNow webhooks

Glossary

Term	Definition
Clean Apply	Applying changes only after successful lock acquisition and approval
Deadlock	Circular wait condition between multiple lock holders
Delegation Chain	Hierarchical approval routing
Lock Queue	FIFO waiting list for lock acquisition
Risk Score	Calculated metric (0-100) indicating task danger
Stash	Saved state for potential rollback
Wait-For Graph	Data structure for deadlock detection
