Community ADE Approval System Architecture
Executive Summary
The Approval System provides governance and safety controls for the Community ADE platform. It introduces human-in-the-loop validation for task execution, distributed locking for resource protection, and a complete audit trail for compliance.
Core Concepts
1. Clean Apply Locks
Philosophy: A lock grants permission to attempt an operation; it does not guarantee that the operation succeeds. Locks are advisory in that sense, but the system enforces them strictly: changes are applied only after the lock has been acquired.
Lock Hierarchy:
Task Lock (task:{id}:lock) - Single task execution
Resource Lock (resource:{type}:{id}:lock) - Shared resource protection
Agent Lock (agent:{id}:lock) - Agent capacity management
Lock Properties:
- Ownership: UUID of the lock holder
- TTL: 30 seconds default, extendable via heartbeats
- Queue: FIFO ordered waiting list for fairness
- Metadata: Timestamp, purpose, agent info
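The four properties above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: `LockTable`, `tryAcquire`, and `release` are hypothetical names, the clock is passed in explicitly so the behaviour is easy to follow, and a real deployment would hold this state in Redis rather than in process memory.

```typescript
// Hypothetical in-memory model of the lock properties; not the production Redis layout.
interface LockRecord {
  owner: string | null; // UUID of the current holder, or null when free
  expiresAt: number;    // epoch milliseconds
  purpose: string;      // free-form metadata
  queue: string[];      // FIFO list of waiting owners
}

const DEFAULT_TTL_MS = 30_000; // 30-second default TTL

class LockTable {
  private locks = new Map<string, LockRecord>();

  // Grants the lock if it is free (or expired) and the caller is next in line;
  // otherwise queues the caller and returns false.
  tryAcquire(key: string, owner: string, purpose: string, now: number): boolean {
    let rec = this.locks.get(key);
    if (rec === undefined) {
      rec = { owner: null, expiresAt: 0, purpose: "", queue: [] };
      this.locks.set(key, rec);
    }
    const free = rec.owner === null || rec.expiresAt <= now;
    if (!free && rec.owner === owner) return true; // caller already holds it
    const nextInLine = rec.queue.length === 0 || rec.queue[0] === owner;
    if (free && nextInLine) {
      rec.queue.shift(); // removes the caller from the head; no-op on an empty queue
      rec.owner = owner;
      rec.expiresAt = now + DEFAULT_TTL_MS;
      rec.purpose = purpose;
      return true;
    }
    if (rec.queue.indexOf(owner) === -1) rec.queue.push(owner);
    return false;
  }

  // Only the current owner may release the lock.
  release(key: string, owner: string): boolean {
    const rec = this.locks.get(key);
    if (rec === undefined || rec.owner !== owner) return false;
    rec.owner = null;
    return true;
  }
}
```

In this model an expired lock is not handed to whoever asks first: the head of the queue keeps its place, which is one way to honour the fairness property.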
2. Approval Lifecycle State Machine
┌─────────────────────────────────────────────────────────────────────────────┐
│                             APPROVAL LIFECYCLE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────┐    ┌──────────┐    ┌───────────┐    ┌─────────┐    ┌─────────┐   │
│   │DRAFT │───→│ SUBMITTED│───→│ REVIEWING │───→│ APPROVED│───→│ APPLYING│   │
│   └──────┘    └──────────┘    └───────────┘    └─────────┘    └────┬────┘   │
│      │             │                 │              │              │        │
│      │             │                 │              │              ▼        │
│      │             │                 │              │         ┌─────────┐   │
│      │             │                 │              │         │COMPLETED│   │
│      │             │                 │              │         └─────────┘   │
│      │             │                 │              │                       │
│      │             │                 └──────────────┘                       │
│      │             │                       │                                │
│      │             │                       ▼                                │
│      │             │                  ┌─────────┐                           │
│      │             │                  │REJECTED │                           │
│      │             │                  └─────────┘                           │
│      │             │                       │                                │
│      │             └───────────────────────┘                                │
│      │                                                                      │
│      ▼                                                                      │
│ ┌─────────┐                                                                 │
│ │CANCELLED│                                                                 │
│ └─────────┘                                                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
State Descriptions:
| State | Description | Permissions |
|---|---|---|
| DRAFT | Task created but not submitted | Edit, Delete, Submit |
| SUBMITTED | Validation complete, awaiting review | None (locked) |
| REVIEWING | Under active review by approvers | Add comments |
| APPROVED | All required approvals received | Queue for apply |
| APPLYING | Lock acquired, executing changes | Read-only |
| COMPLETED | Changes successfully applied | Read-only, audit |
| REJECTED | Approval denied | Can resubmit as new |
| CANCELLED | Aborted before apply | Archive only |
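One way to enforce the lifecycle in code is a transition table keyed by state. The sketch below is an interpretation, not the specified behaviour: the edges into REJECTED and CANCELLED are one reading of the diagram and the "Aborted before apply" description, a failed apply is not modelled, and `TRANSITIONS`, `canTransition`, and `transition` are hypothetical names.

```typescript
type ApprovalState =
  | "DRAFT" | "SUBMITTED" | "REVIEWING" | "APPROVED"
  | "APPLYING" | "COMPLETED" | "REJECTED" | "CANCELLED";

// Assumed edges: the happy path from the diagram, rejection during review,
// and cancellation from any state that precedes APPLYING.
const TRANSITIONS: Record<ApprovalState, ApprovalState[]> = {
  DRAFT: ["SUBMITTED", "CANCELLED"],
  SUBMITTED: ["REVIEWING", "CANCELLED"],
  REVIEWING: ["APPROVED", "REJECTED", "CANCELLED"],
  APPROVED: ["APPLYING", "CANCELLED"],
  APPLYING: ["COMPLETED"],
  COMPLETED: [], // terminal
  REJECTED: [],  // terminal; a resubmission is created as a new task
  CANCELLED: [], // terminal
};

function canTransition(from: ApprovalState, to: ApprovalState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: ApprovalState, to: ApprovalState): ApprovalState {
  if (!canTransition(from, to)) {
    throw new Error(`Illegal transition ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table as data means the same structure can drive both server-side validation and the set of actions a dashboard offers for a given state.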
3. Human Gates
Review Policies:
- Auto-approve: Tasks below risk threshold skip human review
- Required reviewers: Based on task type, resource scope, risk score
- Delegation chains: "If my manager approves, auto-approve for me"
- Quorum rules: N-of-M approvals required
Risk Scoring:
RiskScore = (
resource_criticality * 0.4 +
change_magnitude * 0.3 +
blast_radius * 0.2 +
historical_failure_rate * 0.1
) // 0-100 scale
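As a worked example, the formula can be written as a small function. The sketch assumes each factor is already normalised to 0-100, which the weights (summing to 1.0) then keep on the same scale; `RiskFactors` and `riskScore` are hypothetical names.

```typescript
// Assumption: every factor is pre-normalised to the 0-100 range.
interface RiskFactors {
  resource_criticality: number;
  change_magnitude: number;
  blast_radius: number;
  historical_failure_rate: number;
}

function riskScore(f: RiskFactors): number {
  const raw =
    f.resource_criticality * 0.4 +
    f.change_magnitude * 0.3 +
    f.blast_radius * 0.2 +
    f.historical_failure_rate * 0.1;
  // Clamp defensively in case a caller passes an out-of-range factor.
  return Math.min(100, Math.max(0, raw));
}
```

For instance, factors of 80, 50, 40, and 10 give 32 + 15 + 8 + 1 = 56.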
System Architecture
Component Diagram
┌──────────────────────────────────────────────────────────────────────────────┐
│ CLIENT LAYER │
├──────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Dashboard UI │ │ CLI Tool │ │ Webhook API │ │
│ │ (Delta-V2) │ │ (Omega-CLI) │ │ (External) │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
└───────────┼────────────────────┼────────────────────┼────────────────────────┘
│ │ │
▼ ▼ ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ API GATEWAY LAYER │
├──────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────────────────────────┐ │
│ │ Express API Server (Beta Patterns) │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │ │
│ │ │ Task Routes │ │ApprovalRoutes│ │ Lock Routes │ │ WebSocket │ │ │
│ │ │ │ │ │ │ │ │ Handler │ │ │
│ │ └──────────────┘ └──────────────┘ └──────────────┘ └─────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────┬────────────────────────────────────────────┘
│
┌─────────────────────┼─────────────────────┐
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ REDIS LAYER │ │ POSTGRESQL │ │ EVENT BUS │
│ (Alpha) │ │ (Persistence) │ │ (WebSocket) │
├─────────────────┤ ├─────────────────┤ ├─────────────────┤
│ • Locks │ │ • Task History │ │ • approval:* │
│ • Queues │ │ • Audit Log │ │ • lock:* │
│ • Sessions │ │ • User Policies │ │ • task:* │
│ • Rate Limits │ │ • Delegations │ │ │
└────────┬────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────────────────┐
│ WORKER LAYER (Gamma) │
├──────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Lock Manager │ │ Task Executor │ │ Heartbeat Mon │ │
│ │ │ │ │ │ │ │
│ │ • Acquire locks │ │ • Check locks │ │ • Watchdog │ │
│ │ • Queue waiters │ │ • Execute apply │ │ • Deadlock det │ │
│ │ • Release/clean │ │ • Rollback on │ │ • Auto-recovery │ │
│ │ │ │ failure │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
Data Flow: Submit → Approve → Apply
┌─────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ USER │ │ SYSTEM │ │ SYSTEM │ │ WORKER │
└────┬────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │ │
│ POST /tasks │ │ │
│ {config} │ │ │
├────────────────→│ │ │
│ │ │ │
│ │──┐ │ │
│ │ │ Validate │ │
│ │ │ Calculate risk │ │
│ │ │ Generate preview│ │
│ │←─┘ │ │
│ │ │ │
│ 201 Created │ │ │
│ task_id: xyz │ │ │
│←────────────────│ │ │
│ │ │ │
│ POST /tasks/xyz │ │ │
│ /submit │ │ │
├────────────────→│ │ │
│ │ │ │
│ │──┐ │ │
│ │ │ State:SUBMITTED│ │
│ │ │ Lock resources │ │
│ │ │ (preview only) │ │
│ │←─┘ │ │
│ │ │ │
│ 202 Accepted │ │ │
│←────────────────│ │ │
│ │ │ │
│ │ approval:requested │
│ ├──────────────────→│ │
│ │ │ │
│ │ │──┐ │
│ │ │ │ Check policies │
│ │ │ │ Notify reviewers│
│ │ │←─┘ │
│ │ │ │
│ [Time passes] │ │ │
│ │ │ │
│ POST /approvals │ │ │
│ /{id}/approve │ │ │
├────────────────→│ │ │
│ │ │ │
│ │ │──┐ │
│ │ │ │ Record approval │
│ │ │ │ Check quorum │
│ │ │←─┘ │
│ │ │ │
│ 200 OK │ │ │
│←────────────────│ │ │
│ │ │ │
│ │ │ approval:approved│
│ │←──────────────────┤ │
│ │ │ │
│ │──┐ │
│ │ │ State:APPROVED │
│ │ │ Queue for apply │
│ │←─┘ │
│ │ │ │
│ │ task:approved │ │
│ ├──────────────────────────────────────→│
│ │ │ │
│ │ │ │──┐
│ │ │ │ │ Acquire apply lock
│ │ │ │ │ State:APPLYING
│ │ │ │ │ Execute changes
│ │ │ │←─┘
│ │ │ │
│ │ lock:acquired │ │
│ │←──────────────────────────────────────┤
│ │ │ │
│ │ │ │──┐
│ │ │ │ │ Apply succeeded
│ │ │ │ │ State:COMPLETED
│ │ │ │ │ Release locks
│ │ │ │←─┘
│ │ │ │
│ │ task:completed │ │
│ │←──────────────────────────────────────┤
│ │ │ │
Lock System Deep Dive
Lock Types
1. Task Lock (Exclusive)
Key: lock:task:{task_id}
Value: { holder_agent_id, acquired_at, expires_at, purpose }
TTL: 30s with automatic extension on heartbeat
Prevents concurrent execution of the same task. Released on completion or failure.
2. Resource Lock (Shared/Exclusive)
Key: lock:resource:{resource_type}:{resource_id}
Value: {
mode: 'exclusive' | 'shared',
holders: [{ agent_id, acquired_at }],
queue: [{ agent_id, mode, requested_at }]
}
Allows multiple readers (shared) or single writer (exclusive). Queue ensures FIFO ordering.
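The shared/exclusive rules can be sketched against the value shape shown above. This is an illustrative in-memory version with hypothetical function names (`requestLock`, `releaseLock`); it assumes a shared request may only join existing readers when nobody is queued, so a waiting writer is not overtaken.

```typescript
type LockMode = "exclusive" | "shared";
interface Holder { agent_id: string; acquired_at: number; }
interface Waiter { agent_id: string; mode: LockMode; requested_at: number; }
interface ResourceLock { mode: LockMode; holders: Holder[]; queue: Waiter[]; }

// Returns true if the lock was granted, false if the request was queued.
function requestLock(lock: ResourceLock, agentId: string, mode: LockMode, now: number): boolean {
  const nobodyAhead = lock.queue.length === 0;
  const free = lock.holders.length === 0;
  const canShare = mode === "shared" && lock.mode === "shared";
  if (nobodyAhead && (free || canShare)) {
    lock.mode = mode;
    lock.holders.push({ agent_id: agentId, acquired_at: now });
    return true;
  }
  lock.queue.push({ agent_id: agentId, mode, requested_at: now });
  return false;
}

// Removes the holder and, once the lock is empty, promotes waiters in FIFO order.
function releaseLock(lock: ResourceLock, agentId: string, now: number): void {
  lock.holders = lock.holders.filter((h) => h.agent_id !== agentId);
  if (lock.holders.length > 0) return; // other readers still hold it
  const head = lock.queue.shift();
  if (head === undefined) return;
  lock.mode = head.mode;
  lock.holders.push({ agent_id: head.agent_id, acquired_at: now });
  // A shared grant also admits the run of shared waiters directly behind it.
  while (head.mode === "shared" && lock.queue.length > 0 && lock.queue[0].mode === "shared") {
    const next = lock.queue.shift() as Waiter;
    lock.holders.push({ agent_id: next.agent_id, acquired_at: now });
  }
}
```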
3. Agent Capacity Lock
Key: lock:agent:{agent_id}:capacity
Value: { active_tasks: number, max_tasks: number }
Prevents agent overload. Each agent has configurable concurrency limits.
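A capacity check against this value reduces to a guarded counter. The sketch below is in-memory and uses hypothetical helper names; against Redis the check and the increment would have to happen atomically (for example inside a Lua script), which is not shown here.

```typescript
interface AgentCapacity { active_tasks: number; max_tasks: number; }

// Reserves one slot if the agent is below its limit.
function tryReserveSlot(cap: AgentCapacity): boolean {
  if (cap.active_tasks >= cap.max_tasks) return false;
  cap.active_tasks += 1;
  return true;
}

// Frees one slot; never drops below zero.
function releaseSlot(cap: AgentCapacity): void {
  cap.active_tasks = Math.max(0, cap.active_tasks - 1);
}
```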
Deadlock Detection
Algorithm: Wait-For Graph
If Agent A holds Lock X and waits for Lock Y
And Agent B holds Lock Y and waits for Lock X
→ Deadlock detected
Resolution:
- Abort youngest transaction (lower cost)
- Release all held locks
- Notify owner with `DEADLOCK_DETECTED` error
- Auto-retry with exponential backoff
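Detection on a wait-for graph amounts to finding a cycle. A minimal sketch, assuming the graph is available as a map from each waiting agent to the agents that hold the locks it wants (`WaitForGraph` and `findDeadlock` are hypothetical names):

```typescript
// agent -> agents holding locks that this agent is waiting for
type WaitForGraph = Map<string, string[]>;

// Depth-first search; returns the agents on one cycle, or null if there is none.
function findDeadlock(graph: WaitForGraph): string[] | null {
  const visiting = new Set<string>(); // agents on the current DFS path
  const done = new Set<string>();     // agents fully explored, known cycle-free
  const path: string[] = [];

  function visit(agent: string): string[] | null {
    if (done.has(agent)) return null;
    if (visiting.has(agent)) return path.slice(path.indexOf(agent));
    visiting.add(agent);
    path.push(agent);
    for (const next of graph.get(agent) ?? []) {
      const cycle = visit(next);
      if (cycle !== null) return cycle;
    }
    path.pop();
    visiting.delete(agent);
    done.add(agent);
    return null;
  }

  for (const agent of Array.from(graph.keys())) {
    const cycle = visit(agent);
    if (cycle !== null) return cycle;
  }
  return null;
}
```

The returned cycle gives the resolution step its candidates: the youngest transaction among those agents is the one to abort.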
Lock Heartbeat Protocol
interface HeartbeatMessage {
lock_id: string;
agent_id: string;
timestamp: number;
ttl_extension: number; // seconds
}
// Client must send heartbeat every 10s (configurable)
// Server extends TTL on receipt
// If no heartbeat for 30s, lock auto-expires
// Expired locks trigger cleanup and notify waiters
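The server side of this protocol can be sketched as a single function. The assumptions are loud ones: `LockState` and `applyHeartbeat` are hypothetical, expiry is tracked in epoch milliseconds, and the extension is measured from the time the heartbeat is received rather than from the message's own `timestamp`.

```typescript
interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;
  ttl_extension: number; // seconds
}

// Hypothetical server-side view of a held lock.
interface LockState { holder: string; expiresAt: number; } // expiresAt in epoch ms

type HeartbeatResult = "extended" | "expired" | "not_holder" | "unknown_lock";

function applyHeartbeat(
  locks: Map<string, LockState>,
  msg: HeartbeatMessage,
  now: number,
): HeartbeatResult {
  const lock = locks.get(msg.lock_id);
  if (lock === undefined) return "unknown_lock";
  if (lock.holder !== msg.agent_id) return "not_holder";
  if (lock.expiresAt <= now) {
    // Too late: drop the lock so cleanup and waiter notification can run.
    locks.delete(msg.lock_id);
    return "expired";
  }
  lock.expiresAt = now + msg.ttl_extension * 1000;
  return "extended";
}
```

With a 10-second heartbeat interval and a 30-second TTL, a holder can miss two consecutive heartbeats before the lock lapses.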
Approval Engine
Reviewer Assignment
interface ReviewerPolicy {
task_types: string[];
resource_patterns: string[];
min_approvers: number;
required_roles: string[];
risk_threshold: number;
auto_approve_if: {
risk_below: number;
author_has_role: string[];
resources_in_scope: string[];
};
}
Assignment Algorithm:
- Match task against all policies
- Union all required reviewers from matching policies
- Check for delegation chains
- Filter out auto-approved reviewers (based on policy)
- Calculate minimum approvals needed
- Create approval requests
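The matching, union, and auto-approve steps can be sketched as follows. Several things here are assumptions rather than specified behaviour: the `TaskSummary` shape, patterns limited to literal names or a trailing `*`, all three `auto_approve_if` conditions having to hold together, and the minimum approvals being the largest `min_approvers` among matching policies. The sketch leaves `risk_threshold` uninterpreted and omits delegation and request creation.

```typescript
interface ReviewerPolicy {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
  risk_threshold: number; // not interpreted by this sketch
  auto_approve_if: { risk_below: number; author_has_role: string[]; resources_in_scope: string[] };
}

// Hypothetical task shape; not defined in this document.
interface TaskSummary { type: string; resource: string; risk: number; author_roles: string[]; }

// Assumption: a pattern is a literal name or a prefix followed by a trailing "*".
function matchesPattern(pattern: string, value: string): boolean {
  const star = pattern.length - 1;
  return pattern.charAt(star) === "*"
    ? value.slice(0, star) === pattern.slice(0, star)
    : pattern === value;
}

function autoApproves(p: ReviewerPolicy, t: TaskSummary): boolean {
  const a = p.auto_approve_if;
  return t.risk < a.risk_below
    && a.author_has_role.some((r) => t.author_roles.indexOf(r) !== -1)
    && a.resources_in_scope.some((pat) => matchesPattern(pat, t.resource));
}

function planReview(t: TaskSummary, policies: ReviewerPolicy[]) {
  // Step 1: match the task against all policies.
  const matching = policies.filter((p) =>
    p.task_types.indexOf(t.type) !== -1 &&
    p.resource_patterns.some((pat) => matchesPattern(pat, t.resource)));
  // Step 4: drop policies whose auto-approve conditions are met.
  const needingReview = matching.filter((p) => !autoApproves(p, t));
  // Steps 2 and 5: union the required roles and take the strictest minimum.
  const roles: string[] = [];
  let minApprovals = 0;
  for (const p of needingReview) {
    for (const r of p.required_roles) if (roles.indexOf(r) === -1) roles.push(r);
    minApprovals = Math.max(minApprovals, p.min_approvers);
  }
  return {
    auto_approved: matching.length > 0 && needingReview.length === 0,
    required_roles: roles.sort(),
    min_approvals: minApprovals,
  };
}
```

A task that matches no policy comes back with no reviewers and `auto_approved: false`; how such tasks should be handled is left open here.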
Delegation Chains
Alice delegates to Bob when:
- Task type is "infrastructure"
- Risk score > 50
Bob delegates to Carol when:
- Resource matches "prod-*"
Result: For prod infrastructure with high risk,
only Carol's approval is needed
Resolution: Depth-first traversal with cycle detection.
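The Alice, Bob, and Carol example can be traced with a short depth-first resolver. The rule shape (`Delegation`, with a `when` predicate) and the fallback to the original reviewer when every branch ends in a cycle are assumptions made for this sketch.

```typescript
interface TaskInfo { type: string; risk: number; resource: string; }

// Hypothetical rule shape: `from` hands the approval to `to` when `when` holds.
interface Delegation { from: string; to: string; when: (t: TaskInfo) => boolean; }

// Follows applicable delegations depth-first and returns the final approvers.
function resolveApprovers(start: string, task: TaskInfo, rules: Delegation[]): string[] {
  const visited = new Set<string>();
  const result = new Set<string>();

  function dfs(person: string): void {
    if (visited.has(person)) return; // cycle, or already explored
    visited.add(person);
    const next = rules.filter((r) => r.from === person && r.when(task));
    if (next.length === 0) {
      result.add(person); // end of the chain: this person must approve
      return;
    }
    next.forEach((r) => dfs(r.to));
  }

  dfs(start);
  // Assumption: if every branch loops back, the original reviewer keeps the task.
  return result.size > 0 ? Array.from(result).sort() : [start];
}

const exampleRules: Delegation[] = [
  { from: "Alice", to: "Bob", when: (t) => t.type === "infrastructure" && t.risk > 50 },
  { from: "Bob", to: "Carol", when: (t) => t.resource.indexOf("prod-") === 0 },
];
```

With `exampleRules`, a high-risk infrastructure task on `prod-db` resolves to Carol alone, while the same task on a non-production resource stops at Bob.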
Batch Operations
Batch Approve:
POST /approvals/batch
{
approval_ids: string[];
action: 'approve' | 'reject';
reason?: string;
options: {
skip_validation: boolean;
apply_immediately: boolean;
}
}
Atomic operation: either all approvals succeed or all fail.
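All-or-nothing behaviour can be illustrated with a validate-then-apply pass. This is an in-memory sketch only: `Approval`, `applyBatch`, and the rule that only pending approvals may be acted on are assumptions, the `options` field is left out, and a real implementation would rely on a database transaction for the same guarantee.

```typescript
type BatchAction = "approve" | "reject";
interface Approval { id: string; status: "pending" | "approved" | "rejected"; }
interface BatchRequest { approval_ids: string[]; action: BatchAction; reason?: string; }

function applyBatch(
  store: Map<string, Approval>,
  req: BatchRequest,
): { ok: boolean; failed: string[] } {
  // Phase 1: check every id without mutating anything.
  const failed = req.approval_ids.filter((id) => {
    const a = store.get(id);
    return a === undefined || a.status !== "pending";
  });
  if (failed.length > 0) return { ok: false, failed };

  // Phase 2: every id is valid, so apply the action to all of them.
  const next: Approval["status"] = req.action === "approve" ? "approved" : "rejected";
  req.approval_ids.forEach((id) => {
    (store.get(id) as Approval).status = next;
  });
  return { ok: true, failed: [] };
}
```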
Error Handling & Edge Cases
Lock Acquisition Failures
| Scenario | Response | Retry Strategy |
|---|---|---|
| Lock held by another agent | 423 Locked | Queue and wait |
| Lock expired during operation | 409 Conflict | Abort, notify, retry |
| Deadlock detected | 423 Deadlock | Abort, auto-retry with backoff |
| Max queue depth exceeded | 503 Queue Full | Fail fast, notify operator |
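For the rows that call for retrying with backoff, a delay calculation might look like the sketch below. The base delay, the cap, and the use of full jitter are illustrative choices, not values taken from this design; `backoffDelayMs` is a hypothetical helper, and the random source is injectable so the result can be reproduced.

```typescript
// Exponential backoff with full jitter: a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,                    // 0 for the first retry
  baseMs: number = 100,               // illustrative default
  capMs: number = 10_000,             // illustrative default
  rand: () => number = Math.random,   // injectable for deterministic tests
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(rand() * ceiling);
}
```

Jitter spreads out retries from agents that were aborted at the same moment, which matters most in the deadlock case where several parties retry together.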
Approval Edge Cases
| Scenario | Behavior |
|---|---|
| Approver leaves organization | Auto-reassign to delegate or manager |
| Approval timeout (48h default) | Escalate to next level, notify on-call |
| Required reviewer unavailable | Bypass with admin override + audit |
| Task modified during review | Invalidate approvals, restart review |
| Concurrent approvals | Last write wins, notify others of resolution |
System Degradation
| Condition | Response |
|---|---|
| Redis unavailable | Queue in PostgreSQL, async recovery |
| High lock contention | Exponential backoff, circuit breaker |
| Approval queue backlog | Priority escalation, auto-approve low-risk |
| WebSocket failure | Polling fallback, queued events |
Security Model
Permission Matrix
| Action | Author | Reviewer | Admin | System |
|---|---|---|---|---|
| Create task | ✓ | ✗ | ✓ | ✗ |
| Submit for approval | ✓ | ✗ | ✓ | ✗ |
| Approve/reject | ✗ | ✓ | ✓ | ✗ |
| Force apply | ✗ | ✗ | ✓ | ✗ |
| Cancel task | ✓ | ✗ | ✓ | ✓ |
| Override policy | ✗ | ✗ | ✓* | ✗ |
| View audit log | ✓ | ✓ | ✓ | ✓ |
*Requires secondary approval and incident ticket
Audit Requirements
Every state transition logged:
- Who (user ID, session)
- What (from state, to state, action)
- When (timestamp with microsecond precision)
- Where (IP, user agent, service)
- Why (reason, ticket reference)
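The five questions map naturally onto a record type. The field names below are assumptions for illustration. The timestamp is carried as a string because JavaScript's `Date` only resolves to milliseconds, so a microsecond-precision value would have to be produced elsewhere, for example by the database.

```typescript
// Hypothetical audit record; field names are illustrative.
interface AuditEntry {
  who: { user_id: string; session_id: string };
  what: { from_state: string; to_state: string; action: string };
  when: string; // ISO-8601 with microseconds, supplied by the caller
  where: { ip: string; user_agent: string; service: string };
  why: { reason: string; ticket?: string };
}

// Renders one entry as a single log line.
function formatAuditLine(e: AuditEntry): string {
  const ticket = e.why.ticket !== undefined ? ` [${e.why.ticket}]` : "";
  return `${e.when} ${e.who.user_id} ${e.what.action} ` +
    `${e.what.from_state}->${e.what.to_state} (${e.why.reason})${ticket}`;
}
```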
Scalability Considerations
Horizontal Scaling
- API Layer: Stateless, scale via load balancer
- Redis: Cluster mode, hash tags for lock locality
- PostgreSQL: Read replicas for audit queries
- WebSocket: Sticky sessions or Redis pub/sub
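On the hash-tag point: Redis Cluster hashes only the text between the first `{` and the next `}` when that text is non-empty, so keys sharing a tag land in the same slot. The sketch below mirrors that extraction rule and shows one possible key layout; the literal braces around the resource and the `:queue` suffix are assumptions, not the schema defined for this system.

```typescript
// Returns the part of the key that Redis Cluster would hash.
function hashTagOf(key: string): string {
  const open = key.indexOf("{");
  if (open === -1) return key;
  const close = key.indexOf("}", open + 1);
  // No closing brace, or nothing between the braces: the whole key is hashed.
  if (close === -1 || close === open + 1) return key;
  return key.slice(open + 1, close);
}

// Hypothetical key builders that keep a lock and its queue in the same slot.
function resourceLockKey(type: string, id: string): string {
  return `lock:resource:{${type}:${id}}`;
}
function resourceQueueKey(type: string, id: string): string {
  return `lock:resource:{${type}:${id}}:queue`;
}
```

Keeping a lock and its wait queue under one tag is what allows a single multi-key script or transaction to touch both in cluster mode.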
Performance Targets
| Metric | Target | Peak |
|---|---|---|
| Lock acquisition | < 10ms | < 50ms @ p99 |
| Approval latency | < 100ms | < 500ms @ p99 |
| Task throughput | 1000/min | 5000/min burst |
| Concurrent locks | 10,000 | 50,000 |
Deployment Architecture
┌─────────────────────────────────────────────────────────────────┐
│ KUBERNETES CLUSTER │
├─────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ API Pod │ │ API Pod │ │ API Pod │ │
│ │ (3+ replicas)│ │ │ │ │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │ │
│ └────────────────┼────────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Ingress │ │
│ │ Controller │ │
│ └──────┬──────┘ │
│ │ │
├──────────────────────────┼──────────────────────────────────────┤
│ ┌───────────────────────┴───────────────────────┐ │
│ │ Redis Cluster │ │
│ │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ │
│ │ │ Master │ │ Master │ │ Master │ │ │
│ │ │ + Repl │ │ + Repl │ │ + Repl │ │ │
│ │ └─────────┘ └─────────┘ └─────────┘ │ │
│ └───────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ PostgreSQL (HA: Patroni) │ │
│ │ Primary + 2 Replicas │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Worker Pod │ │ Worker Pod │ │ Worker Pod │ │
│ │ (HPA: 2-20) │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
Future Enhancements
- ML-Based Risk Scoring: Train models on historical task outcomes
- Predictive Locking: Pre-acquire locks based on task patterns
- Approval Simulation: "What if" analysis before submitting
- Time-Based Policies: Different rules for on-call hours
- Integration Marketplace: Slack, PagerDuty, ServiceNow webhooks
Glossary
| Term | Definition |
|---|---|
| Clean Apply | Applying changes only after successful lock acquisition and approval |
| Deadlock | Circular wait condition between multiple lock holders |
| Delegation Chain | Hierarchical approval routing |
| Lock Queue | FIFO waiting list for lock acquisition |
| Risk Score | Calculated metric (0-100) indicating task danger |
| Stash | Saved state for potential rollback |
| Wait-For Graph | Data structure for deadlock detection |