- Omega (Kimi-K2.5): Approval system architecture - design.md: Full system architecture with state machines - api-spec.ts: Express routes + Zod schemas (33KB) - redis-schema.md: Redis key patterns (19KB) - ui-components.md: Dashboard UI specs (31KB) - Epsilon (Nemotron-3-super): Agent configuration UI - AgentWizard: 5-step creation flow - AgentConfigPanel: Parameter tuning - AgentCard: Health monitoring - AgentList: List/grid views - hooks/useAgents.ts: WebSocket integration - types/agent.ts: TypeScript definitions Total: 150KB new code, 22 components 👾 Generated with [Letta Code](https://letta.com)
# Community ADE Approval System Architecture

## Executive Summary

The Approval System provides governance and safety controls for the Community ADE platform. It introduces human-in-the-loop validation for task execution, distributed locking for resource protection, and a complete audit trail for compliance.

---

## Core Concepts

### 1. Clean Apply Locks

**Philosophy:** A lock should only grant permission to attempt an operation, not guarantee success. Locks are advisory but strictly enforced by the system.

**Lock Hierarchy:**
```
Task Lock (task:{id}:lock)                - Single task execution
Resource Lock (resource:{type}:{id}:lock) - Shared resource protection
Agent Lock (agent:{id}:lock)              - Agent capacity management
```

**Lock Properties:**
- **Ownership:** UUID of the lock holder
- **TTL:** 30 seconds default, extendable via heartbeats
- **Queue:** FIFO ordered waiting list for fairness
- **Metadata:** Timestamp, purpose, agent info

### 2. Approval Lifecycle State Machine

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                             APPROVAL LIFECYCLE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  ┌──────┐    ┌──────────┐    ┌───────────┐    ┌─────────┐    ┌─────────┐    │
│  │DRAFT │───→│ SUBMITTED│───→│ REVIEWING │───→│ APPROVED│───→│ APPLYING│    │
│  └──────┘    └──────────┘    └───────────┘    └─────────┘    └────┬────┘    │
│      │            │                │              │               │         │
│      │            │                │              │               ▼         │
│      │            │                │              │          ┌─────────┐    │
│      │            │                │              │          │COMPLETED│    │
│      │            │                │              │          └─────────┘    │
│      │            │                │              │                         │
│      │            │                └──────────────┘                         │
│      │            │                       │                                 │
│      │            │                       ▼                                 │
│      │            │                  ┌─────────┐                            │
│      │            │                  │REJECTED │                            │
│      │            │                  └─────────┘                            │
│      │            │                       │                                 │
│      │            └───────────────────────┘                                 │
│      │                                                                      │
│      ▼                                                                      │
│ ┌─────────┐                                                                 │
│ │CANCELLED│                                                                 │
│ └─────────┘                                                                 │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```
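
The lifecycle can be sketched as a transition table with a guard function. This is a minimal sketch, not the system's actual implementation: the edge set is one assumed reading of the diagram (the forward path, rejection from `REVIEWING` or `APPROVED`, and cancellation from any state before `APPLYING`), and the names `TRANSITIONS`, `canTransition`, and `transition` are hypothetical.

```typescript
type ApprovalState =
  | "DRAFT" | "SUBMITTED" | "REVIEWING" | "APPROVED"
  | "APPLYING" | "COMPLETED" | "REJECTED" | "CANCELLED";

// Assumed edges; adjust to match the authoritative state machine.
const TRANSITIONS: Record<ApprovalState, ApprovalState[]> = {
  DRAFT: ["SUBMITTED", "CANCELLED"],
  SUBMITTED: ["REVIEWING", "CANCELLED"],
  REVIEWING: ["APPROVED", "REJECTED", "CANCELLED"],
  APPROVED: ["APPLYING", "REJECTED", "CANCELLED"],
  APPLYING: ["COMPLETED"],
  COMPLETED: [], // terminal
  REJECTED: [],  // terminal: resubmission creates a new task
  CANCELLED: [], // terminal
};

function canTransition(from: ApprovalState, to: ApprovalState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not allowed.
function transition(from: ApprovalState, to: ApprovalState): ApprovalState {
  if (!canTransition(from, to)) {
    throw new Error("Illegal transition " + from + " -> " + to);
  }
  return to;
}

const afterSubmit = transition("DRAFT", "SUBMITTED");
const reopenAllowed = canTransition("COMPLETED", "DRAFT");
```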
**State Descriptions:**

| State | Description | Permissions |
|-------|-------------|-------------|
| `DRAFT` | Task created but not submitted | Edit, Delete, Submit |
| `SUBMITTED` | Validation complete, awaiting review | None (locked) |
| `REVIEWING` | Under active review by approvers | Add comments |
| `APPROVED` | All required approvals received | Queue for apply |
| `APPLYING` | Lock acquired, executing changes | Read-only |
| `COMPLETED` | Changes successfully applied | Read-only, audit |
| `REJECTED` | Approval denied | Can resubmit as new |
| `CANCELLED` | Aborted before apply | Archive only |

### 3. Human Gates

**Review Policies:**
- **Auto-approve:** Tasks below risk threshold skip human review
- **Required reviewers:** Based on task type, resource scope, risk score
- **Delegation chains:** "If my manager approves, auto-approve for me"
- **Quorum rules:** N-of-M approvals required

**Risk Scoring:**
```typescript
RiskScore = (
  resource_criticality * 0.4 +
  change_magnitude * 0.3 +
  blast_radius * 0.2 +
  historical_failure_rate * 0.1
) // 0-100 scale
```
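
As a worked example of the weighted sum, the sketch below assumes every factor is already normalized to 0-100, which keeps the result on the same scale because the weights add up to 1.0. The `RiskFactors` type, the `needsHumanReview` helper, and the threshold of 30 are illustrative assumptions, not values from the specification. For factors of 80, 50, 20, and 10 the score is 32 + 15 + 4 + 1 = 52.

```typescript
interface RiskFactors {
  resource_criticality: number;    // assumed 0-100
  change_magnitude: number;        // assumed 0-100
  blast_radius: number;            // assumed 0-100
  historical_failure_rate: number; // assumed 0-100
}

// Weighted sum using the weights from the formula above.
function riskScore(f: RiskFactors): number {
  return (
    f.resource_criticality * 0.4 +
    f.change_magnitude * 0.3 +
    f.blast_radius * 0.2 +
    f.historical_failure_rate * 0.1
  );
}

// Hypothetical auto-approve gate: scores below the threshold skip human review.
function needsHumanReview(f: RiskFactors, autoApproveBelow: number): boolean {
  return riskScore(f) >= autoApproveBelow;
}

const exampleFactors: RiskFactors = {
  resource_criticality: 80,
  change_magnitude: 50,
  blast_radius: 20,
  historical_failure_rate: 10,
};
const exampleScore = riskScore(exampleFactors); // 32 + 15 + 4 + 1
const exampleNeedsReview = needsHumanReview(exampleFactors, 30);
```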
---

## System Architecture

### Component Diagram

```
┌──────────────────────────────────────────────────────────────────────────────┐
│                                 CLIENT LAYER                                 │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐               │
│  │  Dashboard UI   │  │    CLI Tool     │  │   Webhook API   │               │
│  │   (Delta-V2)    │  │   (Omega-CLI)   │  │   (External)    │               │
│  └────────┬────────┘  └────────┬────────┘  └────────┬────────┘               │
└───────────┼────────────────────┼────────────────────┼────────────────────────┘
            │                    │                    │
            ▼                    ▼                    ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                              API GATEWAY LAYER                               │
├──────────────────────────────────────────────────────────────────────────────┤
│ ┌─────────────────────────────────────────────────────────────────────────┐  │
│ │                   Express API Server (Beta Patterns)                    │  │
│ │ ┌──────────────┐  ┌──────────────┐  ┌──────────────┐  ┌─────────────┐   │  │
│ │ │ Task Routes  │  │ApprovalRoutes│  │ Lock Routes  │  │  WebSocket  │   │  │
│ │ │              │  │              │  │              │  │   Handler   │   │  │
│ │ └──────────────┘  └──────────────┘  └──────────────┘  └─────────────┘   │  │
│ └─────────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────┬────────────────────────────────────────────┘
                                  │
            ┌─────────────────────┼─────────────────────┐
            │                     │                     │
            ▼                     ▼                     ▼
   ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
   │   REDIS LAYER   │   │   POSTGRESQL    │   │    EVENT BUS    │
   │     (Alpha)     │   │  (Persistence)  │   │   (WebSocket)   │
   ├─────────────────┤   ├─────────────────┤   ├─────────────────┤
   │ • Locks         │   │ • Task History  │   │ • approval:*    │
   │ • Queues        │   │ • Audit Log     │   │ • lock:*        │
   │ • Sessions      │   │ • User Policies │   │ • task:*        │
   │ • Rate Limits   │   │ • Delegations   │   │                 │
   └────────┬────────┘   └─────────────────┘   └─────────────────┘
            │
            ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                             WORKER LAYER (Gamma)                             │
├──────────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐               │
│  │  Lock Manager   │  │  Task Executor  │  │  Heartbeat Mon  │               │
│  │                 │  │                 │  │                 │               │
│  │ • Acquire locks │  │ • Check locks   │  │ • Watchdog      │               │
│  │ • Queue waiters │  │ • Execute apply │  │ • Deadlock det  │               │
│  │ • Release/clean │  │ • Rollback on   │  │ • Auto-recovery │               │
│  │                 │  │   failure       │  │                 │               │
│  └─────────────────┘  └─────────────────┘  └─────────────────┘               │
└──────────────────────────────────────────────────────────────────────────────┘
```

### Data Flow: Submit → Approve → Apply

```
┌─────────┐     ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  USER   │     │   SYSTEM    │     │   SYSTEM    │     │   WORKER    │
└────┬────┘     └──────┬──────┘     └──────┬──────┘     └──────┬──────┘
     │                 │                   │                   │
     │ POST /tasks     │                   │                   │
     │ {config}        │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │──┐                │                   │
     │                 │  │ Validate       │                   │
     │                 │  │ Calculate risk │                   │
     │                 │  │Generate preview│                   │
     │                 │←─┘                │                   │
     │                 │                   │                   │
     │ 201 Created     │                   │                   │
     │ task_id: xyz    │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │ POST /tasks/xyz │                   │                   │
     │ /submit         │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │──┐                │                   │
     │                 │  │ State:SUBMITTED│                   │
     │                 │  │ Lock resources │                   │
     │                 │  │ (preview only) │                   │
     │                 │←─┘                │                   │
     │                 │                   │                   │
     │ 202 Accepted    │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │                 │ approval:requested│                   │
     │                 ├──────────────────→│                   │
     │                 │                   │                   │
     │                 │                   │──┐                │
     │                 │                   │  │ Check policies │
     │                 │                   │  │Notify reviewers│
     │                 │                   │←─┘                │
     │                 │                   │                   │
     │ [Time passes]   │                   │                   │
     │                 │                   │                   │
     │ POST /approvals │                   │                   │
     │ /{id}/approve   │                   │                   │
     ├────────────────→│                   │                   │
     │                 │                   │                   │
     │                 │                   │──┐                │
     │                 │                   │  │ Record approval│
     │                 │                   │  │ Check quorum   │
     │                 │                   │←─┘                │
     │                 │                   │                   │
     │ 200 OK          │                   │                   │
     │←────────────────│                   │                   │
     │                 │                   │                   │
     │                 │                   │ approval:approved │
     │                 │←──────────────────┤                   │
     │                 │                   │                   │
     │                 │──┐                │                   │
     │                 │  │ State:APPROVED │                   │
     │                 │  │ Queue for apply│                   │
     │                 │←─┘                │                   │
     │                 │                   │                   │
     │                 │ task:approved     │                   │
     │                 ├──────────────────────────────────────→│
     │                 │                   │                   │
     │                 │                   │                   │──┐
     │                 │                   │                   │  │ Acquire apply lock
     │                 │                   │                   │  │ State:APPLYING
     │                 │                   │                   │  │ Execute changes
     │                 │                   │                   │←─┘
     │                 │                   │                   │
     │                 │ lock:acquired     │                   │
     │                 │←──────────────────────────────────────┤
     │                 │                   │                   │
     │                 │                   │                   │──┐
     │                 │                   │                   │  │ Apply succeeded
     │                 │                   │                   │  │ State:COMPLETED
     │                 │                   │                   │  │ Release locks
     │                 │                   │                   │←─┘
     │                 │                   │                   │
     │                 │ task:completed    │                   │
     │                 │←──────────────────────────────────────┤
     │                 │                   │                   │
```

---

## Lock System Deep Dive

### Lock Types

#### 1. Task Lock (Exclusive)
```
Key: lock:task:{task_id}
Value: { holder_agent_id, acquired_at, expires_at, purpose }
TTL: 30s with automatic extension on heartbeat
```

Prevents concurrent execution of the same task. Released on completion or failure.

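
The acquire and release semantics of an exclusive task lock can be sketched with an in-memory table. This is an illustration under stated assumptions, not the production code: the `TaskLockTable` class and its method names are hypothetical, time is passed in explicitly as epoch milliseconds, and a real deployment would need the check-and-set to be atomic in the lock store (with Redis, `SET` with the `NX` and `PX` options is a common way to get that).

```typescript
interface TaskLock {
  holder_agent_id: string;
  acquired_at: number; // epoch milliseconds
  expires_at: number;  // epoch milliseconds
  purpose: string;
}

const DEFAULT_TTL_MS = 30000; // 30s default TTL from the lock definition above

class TaskLockTable {
  private locks: { [key: string]: TaskLock | undefined } = {};

  // Returns true if the caller now holds the lock.
  acquire(taskId: string, agentId: string, purpose: string, now: number): boolean {
    const key = "lock:task:" + taskId;
    const current = this.locks[key];
    if (current !== undefined && current.expires_at > now) {
      return false; // still held and not yet expired
    }
    this.locks[key] = {
      holder_agent_id: agentId,
      acquired_at: now,
      expires_at: now + DEFAULT_TTL_MS,
      purpose: purpose,
    };
    return true;
  }

  // Only the current holder may release; returns true if a lock was removed.
  release(taskId: string, agentId: string): boolean {
    const key = "lock:task:" + taskId;
    const current = this.locks[key];
    if (current === undefined || current.holder_agent_id !== agentId) {
      return false;
    }
    delete this.locks[key];
    return true;
  }
}

const lockTable = new TaskLockTable();
const firstAcquire = lockTable.acquire("xyz", "agent-a", "apply", 0);      // granted
const contended = lockTable.acquire("xyz", "agent-b", "apply", 1000);      // refused: agent-a holds it
const afterExpiry = lockTable.acquire("xyz", "agent-b", "apply", 31000);   // granted: TTL elapsed
const staleRelease = lockTable.release("xyz", "agent-a");                  // refused: no longer the holder
```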
#### 2. Resource Lock (Shared/Exclusive)
```
Key: lock:resource:{resource_type}:{resource_id}
Value: {
  mode: 'exclusive' | 'shared',
  holders: [{ agent_id, acquired_at }],
  queue: [{ agent_id, mode, requested_at }]
}
```

Allows multiple readers (shared) or single writer (exclusive). Queue ensures FIFO ordering.

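
One way to combine shared/exclusive modes with a FIFO queue is sketched below. The functions `requestLock` and `releaseLock` are hypothetical, and the fairness rule (a shared request does not join existing shared holders while anyone is queued) is an assumption chosen so that a waiting writer is never overtaken; the actual policy may differ.

```typescript
type LockMode = "shared" | "exclusive";

interface ResourceLock {
  mode: LockMode;
  holders: { agent_id: string; acquired_at: number }[];
  queue: { agent_id: string; mode: LockMode; requested_at: number }[];
}

function requestLock(lock: ResourceLock, agentId: string, mode: LockMode, now: number): "granted" | "queued" {
  const free = lock.holders.length === 0;
  // Join existing shared holders only when the queue is empty (FIFO fairness).
  const canShare = mode === "shared" && lock.mode === "shared" && lock.queue.length === 0;
  if (free || canShare) {
    lock.mode = mode;
    lock.holders.push({ agent_id: agentId, acquired_at: now });
    return "granted";
  }
  lock.queue.push({ agent_id: agentId, mode: mode, requested_at: now });
  return "queued";
}

// Removes the holder, then promotes waiters from the head of the queue.
// Returns the agent ids that were granted the lock by this release.
function releaseLock(lock: ResourceLock, agentId: string, now: number): string[] {
  lock.holders = lock.holders.filter((h) => h.agent_id !== agentId);
  const granted: string[] = [];
  while (lock.queue.length > 0) {
    const next = lock.queue[0];
    const free = lock.holders.length === 0;
    const compatible = next.mode === "shared" && lock.mode === "shared";
    if (!free && !compatible) break; // head of queue must keep waiting
    lock.queue.shift();
    lock.mode = next.mode;
    lock.holders.push({ agent_id: next.agent_id, acquired_at: now });
    granted.push(next.agent_id);
  }
  return granted;
}

const dbLock: ResourceLock = { mode: "shared", holders: [], queue: [] };
const resA = requestLock(dbLock, "A", "shared", 1);    // granted
const resB = requestLock(dbLock, "B", "shared", 2);    // granted alongside A
const resC = requestLock(dbLock, "C", "exclusive", 3); // queued
const resD = requestLock(dbLock, "D", "shared", 4);    // queued behind C
const afterA = releaseLock(dbLock, "A", 5); // nobody promoted: B still holds
const afterB = releaseLock(dbLock, "B", 6); // C promoted
const afterC = releaseLock(dbLock, "C", 7); // D promoted
```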
#### 3. Agent Capacity Lock
```
Key: lock:agent:{agent_id}:capacity
Value: { active_tasks: number, max_tasks: number }
```

Prevents agent overload. Each agent has configurable concurrency limits.

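
A capacity check amounts to a bounded counter. The sketch below uses hypothetical helper names (`tryReserveSlot`, `releaseSlot`) and operates on a plain object; in a shared store the check and the increment would have to happen atomically, which this single-process example does not demonstrate.

```typescript
interface AgentCapacity {
  active_tasks: number;
  max_tasks: number;
}

// Reserves one slot if the agent is below its limit.
function tryReserveSlot(cap: AgentCapacity): boolean {
  if (cap.active_tasks >= cap.max_tasks) {
    return false; // agent is at capacity
  }
  cap.active_tasks += 1;
  return true;
}

// Frees one slot, never dropping below zero.
function releaseSlot(cap: AgentCapacity): void {
  cap.active_tasks = Math.max(0, cap.active_tasks - 1);
}

const agentCap: AgentCapacity = { active_tasks: 0, max_tasks: 2 };
const slot1 = tryReserveSlot(agentCap); // accepted
const slot2 = tryReserveSlot(agentCap); // accepted
const slot3 = tryReserveSlot(agentCap); // refused: limit of 2 reached
releaseSlot(agentCap);
const slot4 = tryReserveSlot(agentCap); // accepted again after a release
```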
### Deadlock Detection

**Algorithm:** Wait-For Graph

```
If Agent A holds Lock X and waits for Lock Y
And Agent B holds Lock Y and waits for Lock X
→ Deadlock detected
```

**Resolution:**
1. Abort youngest transaction (lower cost)
2. Release all held locks
3. Notify owner with `DEADLOCK_DETECTED` error
4. Auto-retry with exponential backoff

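
Detecting a deadlock in a wait-for graph reduces to finding a cycle. The sketch below uses a depth-first search over a hypothetical graph shape (agent id mapped to the agents whose locks it is waiting for) and picks the victim by latest start time, matching resolution step 1. The names `findCycle` and `pickVictim` and the `startedAt` map are assumptions for illustration.

```typescript
// agent -> agents whose locks it is waiting for
type WaitForGraph = { [agent: string]: string[] | undefined };

function findCycle(graph: WaitForGraph): string[] | null {
  // 1 = on the current DFS path, 2 = fully explored
  const state: { [agent: string]: number | undefined } = {};
  const path: string[] = [];

  function visit(agent: string): string[] | null {
    state[agent] = 1;
    path.push(agent);
    for (const next of graph[agent] || []) {
      if (state[next] === 1) {
        return path.slice(path.indexOf(next)); // back edge closes a cycle
      }
      if (state[next] === undefined) {
        const found = visit(next);
        if (found !== null) return found;
      }
    }
    path.pop();
    state[agent] = 2;
    return null;
  }

  for (const agent of Object.keys(graph)) {
    if (state[agent] === undefined) {
      const found = visit(agent);
      if (found !== null) return found;
    }
  }
  return null;
}

// Youngest transaction = the one with the latest start time.
function pickVictim(cycle: string[], startedAt: { [agent: string]: number }): string {
  let victim = cycle[0];
  for (const agent of cycle) {
    if (startedAt[agent] > startedAt[victim]) victim = agent;
  }
  return victim;
}

// Agent A waits for B (holder of Lock Y); B waits for A (holder of Lock X).
const deadlocked = findCycle({ A: ["B"], B: ["A"] });
const healthy = findCycle({ A: ["B"], B: [] });
const victimAgent = deadlocked === null ? null : pickVictim(deadlocked, { A: 100, B: 200 });
```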
### Lock Heartbeat Protocol

```typescript
interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;
  ttl_extension: number; // seconds
}

// Client must send heartbeat every 10s (configurable)
// Server extends TTL on receipt
// If no heartbeat for 30s, lock auto-expires
// Expired locks trigger cleanup and notify waiters
```
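
The server side of the heartbeat protocol can be sketched as a pure function over a lock record. The assumptions here are that `timestamp` is in epoch milliseconds, that a heartbeat arriving after expiry is rejected rather than reviving the lock, and that the `HeldLock` shape and `applyHeartbeat` name are hypothetical. With a 10s heartbeat interval and a 30s extension, a holder can miss two consecutive heartbeats before the lock expires.

```typescript
interface HeartbeatMessage {
  lock_id: string;
  agent_id: string;
  timestamp: number;     // assumed: epoch milliseconds
  ttl_extension: number; // seconds
}

interface HeldLock {
  lock_id: string;
  agent_id: string;
  expires_at: number; // epoch milliseconds
}

// Extends the lock if the heartbeat comes from the holder and arrives in time.
function applyHeartbeat(held: HeldLock, hb: HeartbeatMessage): boolean {
  if (hb.lock_id !== held.lock_id || hb.agent_id !== held.agent_id) {
    return false; // not the holder of this lock
  }
  if (hb.timestamp >= held.expires_at) {
    return false; // too late: the lock has already expired
  }
  held.expires_at = hb.timestamp + hb.ttl_extension * 1000;
  return true;
}

const heldLock: HeldLock = { lock_id: "lock:task:xyz", agent_id: "agent-a", expires_at: 30000 };
const onTime = applyHeartbeat(heldLock, {
  lock_id: "lock:task:xyz", agent_id: "agent-a", timestamp: 10000, ttl_extension: 30,
});
const expiryAfterHeartbeat = heldLock.expires_at; // 10000 + 30 * 1000
const tooLate = applyHeartbeat(heldLock, {
  lock_id: "lock:task:xyz", agent_id: "agent-a", timestamp: 45000, ttl_extension: 30,
});
```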
---

## Approval Engine

### Reviewer Assignment

```typescript
interface ReviewerPolicy {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
  risk_threshold: number;
  auto_approve_if: {
    risk_below: number;
    author_has_role: string[];
    resources_in_scope: string[];
  };
}
```

**Assignment Algorithm:**
1. Match task against all policies
2. Union all required reviewers from matching policies
3. Check for delegation chains
4. Filter out auto-approved reviewers (based on policy)
5. Calculate minimum approvals needed
6. Create approval requests

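
Steps 1, 2, 4, and 5 of the assignment algorithm can be sketched as follows. This is a simplified illustration: `PolicyLite` is a hypothetical subset of `ReviewerPolicy`, resource patterns are assumed to support only a trailing `*` wildcard, auto-approval is reduced to the risk check, and taking the largest `min_approvers` among matching policies is an assumption about how conflicts are resolved. Delegation and request creation are left out.

```typescript
interface PolicyLite {
  task_types: string[];
  resource_patterns: string[];
  min_approvers: number;
  required_roles: string[];
  auto_approve_risk_below: number; // stands in for auto_approve_if.risk_below
}

interface TaskInfo {
  type: string;
  resource: string;
  risk: number;
}

// Assumed semantics: exact match, or prefix match when the pattern ends in "*".
function patternMatches(pattern: string, value: string): boolean {
  if (pattern.charAt(pattern.length - 1) === "*") {
    return value.indexOf(pattern.slice(0, -1)) === 0;
  }
  return pattern === value;
}

function planReview(policies: PolicyLite[], task: TaskInfo): { required_roles: string[]; min_approvers: number } {
  const roles: string[] = [];
  let minApprovers = 0;
  for (const p of policies) {
    const typeOk = p.task_types.indexOf(task.type) !== -1;
    const resourceOk = p.resource_patterns.some((pat) => patternMatches(pat, task.resource));
    if (!typeOk || !resourceOk) continue;                // step 1: policy does not match
    if (task.risk < p.auto_approve_risk_below) continue; // step 4: auto-approved under this policy
    for (const role of p.required_roles) {
      if (roles.indexOf(role) === -1) roles.push(role);  // step 2: union of required roles
    }
    minApprovers = Math.max(minApprovers, p.min_approvers); // step 5: strictest policy wins
  }
  return { required_roles: roles, min_approvers: minApprovers };
}

const samplePolicies: PolicyLite[] = [
  { task_types: ["infrastructure"], resource_patterns: ["prod-*"], min_approvers: 2, required_roles: ["sre", "security"], auto_approve_risk_below: 20 },
  { task_types: ["infrastructure", "config"], resource_patterns: ["*"], min_approvers: 1, required_roles: ["sre"], auto_approve_risk_below: 40 },
];
const prodPlan = planReview(samplePolicies, { type: "infrastructure", resource: "prod-db", risk: 70 });
const lowRiskPlan = planReview(samplePolicies, { type: "config", resource: "staging-web", risk: 10 });
```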
### Delegation Chains

```
Alice delegates to Bob when:
  - Task type is "infrastructure"
  - Risk score > 50

Bob delegates to Carol when:
  - Resource matches "prod-*"

Result: For prod infrastructure with high risk,
        only Carol's approval is needed
```

**Resolution:** Depth-first traversal with cycle detection.

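
The Alice, Bob, and Carol example can be reproduced with a small resolver. The sketch assumes at most one applicable delegation per person, so the depth-first traversal degenerates into following a single chain; the `Delegation` shape and `resolveApprover` name are hypothetical, and a cycle is reported by throwing.

```typescript
interface TaskContext {
  type: string;
  risk: number;
  resource: string;
}

interface Delegation {
  from: string;
  to: string;
  applies: (task: TaskContext) => boolean;
}

// Follows applicable delegations from `start` until nobody delegates further.
function resolveApprover(start: string, delegations: Delegation[], task: TaskContext): string {
  const visited: string[] = [];
  let current = start;
  for (;;) {
    if (visited.indexOf(current) !== -1) {
      throw new Error("Delegation cycle: " + visited.concat(current).join(" -> "));
    }
    visited.push(current);
    let next: string | null = null;
    for (const d of delegations) {
      if (d.from === current && d.applies(task)) {
        next = d.to;
        break;
      }
    }
    if (next === null) return current; // end of the chain
    current = next;
  }
}

const chain: Delegation[] = [
  { from: "Alice", to: "Bob", applies: (t) => t.type === "infrastructure" && t.risk > 50 },
  { from: "Bob", to: "Carol", applies: (t) => t.resource.indexOf("prod-") === 0 },
];
const highRiskProd = resolveApprover("Alice", chain, { type: "infrastructure", risk: 70, resource: "prod-db" });
const lowRisk = resolveApprover("Alice", chain, { type: "infrastructure", risk: 20, resource: "prod-db" });
```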
### Batch Operations

**Batch Approve:** The operation is atomic: either all approvals in the batch succeed or all fail.
```typescript
POST /approvals/batch
{
  approval_ids: string[];
  action: 'approve' | 'reject';
  reason?: string;
  options: {
    skip_validation: boolean;
    apply_immediately: boolean;
  }
}
```
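
All-or-nothing behavior can be illustrated with a validate-then-apply pass over an in-memory store. This is only a sketch: `applyBatch` and the store shape are hypothetical, the only validation performed is that each approval is still pending, and a real service would more likely rely on a database transaction to get the same guarantee.

```typescript
type BatchAction = "approve" | "reject";
type ApprovalStatus = "pending" | "approved" | "rejected";

interface BatchResult {
  ok: boolean;
  error?: string;
}

function applyBatch(
  store: { [id: string]: ApprovalStatus | undefined },
  ids: string[],
  action: BatchAction
): BatchResult {
  // Phase 1: validate every id before touching the store.
  for (const id of ids) {
    if (store[id] !== "pending") {
      return { ok: false, error: "approval " + id + " is not pending" };
    }
  }
  // Phase 2: every id is valid, so apply the action to all of them.
  const nextStatus: ApprovalStatus = action === "approve" ? "approved" : "rejected";
  for (const id of ids) {
    store[id] = nextStatus;
  }
  return { ok: true };
}

const approvalStore: { [id: string]: ApprovalStatus | undefined } = {
  a1: "pending",
  a2: "pending",
  a3: "approved",
};
const failedBatch = applyBatch(approvalStore, ["a1", "a3"], "approve"); // a3 is not pending
const statusAfterFailure = approvalStore["a1"]; // unchanged by the failed batch
const okBatch = applyBatch(approvalStore, ["a1", "a2"], "approve");
```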
---

## Error Handling & Edge Cases

### Lock Acquisition Failures

| Scenario | Response | Retry Strategy |
|----------|----------|----------------|
| Lock held by another agent | 423 Locked | Queue and wait |
| Lock expired during operation | 409 Conflict | Abort, notify, retry |
| Deadlock detected | 423 Locked (`DEADLOCK_DETECTED`) | Abort, auto-retry with backoff |
| Max queue depth exceeded | 503 Service Unavailable (queue full) | Fail fast, notify operator |

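
The retry strategies above lean on exponential backoff. A minimal sketch of the delay calculation and the status mapping follows; the base delay, the cap, and the failure names in `LockFailure` are illustrative assumptions, and production code would usually add random jitter to avoid synchronized retries.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs: number, capMs: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

type LockFailure = "HELD" | "EXPIRED" | "DEADLOCK" | "QUEUE_FULL";

// HTTP status per scenario, taken from the table above.
const STATUS_FOR_FAILURE: Record<LockFailure, number> = {
  HELD: 423,
  EXPIRED: 409,
  DEADLOCK: 423,
  QUEUE_FULL: 503,
};

// Only these scenarios are retried automatically in this sketch.
function shouldAutoRetry(failure: LockFailure): boolean {
  return failure === "DEADLOCK" || failure === "EXPIRED";
}

const retryDelays: number[] = [];
for (let attempt = 0; attempt < 5; attempt++) {
  retryDelays.push(backoffDelayMs(attempt, 100, 1000));
}
// retryDelays: 100, 200, 400, 800, 1000 (capped)
```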
### Approval Edge Cases

| Scenario | Behavior |
|----------|----------|
| Approver leaves organization | Auto-reassign to delegate or manager |
| Approval timeout (48h default) | Escalate to next level, notify on-call |
| Required reviewer unavailable | Bypass with admin override + audit |
| Task modified during review | Invalidate approvals, restart review |
| Concurrent approvals | Last write wins, notify others of resolution |

### System Degradation

| Condition | Response |
|-----------|----------|
| Redis unavailable | Queue in PostgreSQL, async recovery |
| High lock contention | Exponential backoff, circuit breaker |
| Approval queue backlog | Priority escalation, auto-approve low-risk |
| WebSocket failure | Polling fallback, queued events |

---

## Security Model

### Permission Matrix

| Action | Author | Reviewer | Admin | System |
|--------|--------|----------|-------|--------|
| Create task | ✓ | ✗ | ✓ | ✗ |
| Submit for approval | ✓ | ✗ | ✓ | ✗ |
| Approve/reject | ✗ | ✓ | ✓ | ✗ |
| Force apply | ✗ | ✗ | ✓ | ✗ |
| Cancel task | ✓ | ✗ | ✓ | ✓ |
| Override policy | ✗ | ✗ | ✓* | ✗ |
| View audit log | ✓ | ✓ | ✓ | ✓ |

*Requires secondary approval and incident ticket

### Audit Requirements

Every state transition is logged with:
- Who (user ID, session)
- What (from state, to state, action)
- When (timestamp with microsecond precision)
- Where (IP, user agent, service)
- Why (reason, ticket reference)

---

## Scalability Considerations

### Horizontal Scaling

- **API Layer:** Stateless, scale via load balancer
- **Redis:** Cluster mode, hash tags for lock locality
- **PostgreSQL:** Read replicas for audit queries
- **WebSocket:** Sticky sessions or Redis pub/sub

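
To make "hash tags for lock locality" concrete: in Redis Cluster, when a key contains `{...}` with at least one character between the first `{` and the next `}`, only that substring is hashed to pick the slot, so keys sharing a tag land on the same node. The sketch below extracts the hashed portion of a key under that rule. The key names are illustrative only; note that the `{id}` placeholders used in the key patterns earlier in this document are template variables, not necessarily hash tags.

```typescript
// Returns the part of the key that Redis Cluster would hash for slot selection.
function hashTagOf(key: string): string {
  const open = key.indexOf("{");
  if (open === -1) return key; // no tag: the whole key is hashed
  const close = key.indexOf("}", open + 1);
  if (close === -1 || close === open + 1) return key; // unclosed or empty tag
  return key.slice(open + 1, close);
}

// Hypothetical keys that should live on the same node.
const taskKey = "lock:{task-42}:task";
const resourceKey = "lock:{task-42}:resource:db";
const colocated = hashTagOf(taskKey) === hashTagOf(resourceKey);
const untagged = hashTagOf("lock:task:42"); // hashed as a whole
```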
### Performance Targets

| Metric | Target | Peak |
|--------|--------|------|
| Lock acquisition | < 10ms | < 50ms @ p99 |
| Approval latency | < 100ms | < 500ms @ p99 |
| Task throughput | 1000/min | 5000/min burst |
| Concurrent locks | 10,000 | 50,000 |

---

## Deployment Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                       KUBERNETES CLUSTER                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │   API Pod   │  │   API Pod   │  │   API Pod   │              │
│  │(3+ replicas)│  │             │  │             │              │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘              │
│         │                │                │                     │
│         └────────────────┼────────────────┘                     │
│                          │                                      │
│                   ┌──────┴──────┐                               │
│                   │   Ingress   │                               │
│                   │ Controller  │                               │
│                   └──────┬──────┘                               │
│                          │                                      │
├──────────────────────────┼──────────────────────────────────────┤
│  ┌───────────────────────┴───────────────────────┐              │
│  │                 Redis Cluster                 │              │
│  │    ┌─────────┐   ┌─────────┐   ┌─────────┐    │              │
│  │    │ Master  │   │ Master  │   │ Master  │    │              │
│  │    │ + Repl  │   │ + Repl  │   │ + Repl  │    │              │
│  │    └─────────┘   └─────────┘   └─────────┘    │              │
│  └───────────────────────────────────────────────┘              │
│                                                                 │
│  ┌───────────────────────────────────────────────┐              │
│  │           PostgreSQL (HA: Patroni)            │              │
│  │             Primary + 2 Replicas              │              │
│  └───────────────────────────────────────────────┘              │
│                                                                 │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │
│  │ Worker Pod  │  │ Worker Pod  │  │ Worker Pod  │              │
│  │ (HPA: 2-20) │  │             │  │             │              │
│  └─────────────┘  └─────────────┘  └─────────────┘              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
---

## Future Enhancements

1. **ML-Based Risk Scoring:** Train models on historical task outcomes
2. **Predictive Locking:** Pre-acquire locks based on task patterns
3. **Approval Simulation:** "What if" analysis before submitting
4. **Time-Based Policies:** Different rules for on-call hours
5. **Integration Marketplace:** Slack, PagerDuty, ServiceNow webhooks

---

## Glossary

| Term | Definition |
|------|------------|
| **Clean Apply** | Applying changes only after successful lock acquisition and approval |
| **Deadlock** | Circular wait condition between multiple lock holders |
| **Delegation Chain** | Hierarchical approval routing |
| **Lock Queue** | FIFO waiting list for lock acquisition |
| **Risk Score** | Calculated metric (0-100) indicating task danger |
| **Stash** | Saved state for potential rollback |
| **Wait-For Graph** | Data structure for deadlock detection |