# P4-006: Architecture Documentation Gaps and Validation

**Priority:** P4 (Technical Debt)
**Source Reference:** From analysis of ARCHITECTURE.md and implementation gaps
**Date Identified:** 2025-11-12

## Problem Description

Architecture documentation exists but lacks detailed implementation specifications, design decision documentation, and verification procedures. Critical architectural components like security systems, data flows, and deployment patterns are documented at a high level but lack the detail needed for implementation validation and team alignment.

## Impact

- **Implementation Drift:** Code implementations diverge from documented architecture
- **Knowledge Silos:** Architectural decisions exist only in team members' heads
- **Onboarding Challenges:** New developers cannot understand system design
- **Maintenance Complexity:** Changes may violate architectural principles
- **Design Rationale Lost:** Future teams cannot understand decision context

## Current Architecture Documentation Analysis

### ✅ Existing Documentation

- **ARCHITECTURE.md**: High-level system overview and component relationships
- **SECURITY.md**: Detailed security architecture and threat model
- Basic database schema documentation
- API endpoint documentation in code comments

### ❌ Missing Critical Documentation

- Detailed component interaction diagrams
- Data flow specifications
- Security implementation details
- Deployment architecture patterns
- Performance characteristics documentation
- Error handling and resilience patterns
- Technology selection rationale
- Integration patterns and contracts

## Specific Gaps Identified

### 1. Component Interaction Details

```markdown
# MISSING: Detailed Component Interaction Specifications

## Current Status: High-level overview only
## Needed: Detailed interaction patterns, contracts, and error handling
```

### 2. Data Flow Documentation

```markdown
# MISSING: Comprehensive Data Flow Documentation

## Current Status: Basic agent check-in flow documented
## Needed: Complete data lifecycle, transformation, and persistence patterns
```

### 3. Security Implementation Details

```markdown
# MISSING: Security Implementation Specifications

## Current Status: High-level security model documented
## Needed: Implementation details, key management, and validation procedures
```

## Proposed Solution

Create a comprehensive architecture documentation suite:

### 1. System Architecture Specification

````markdown
# RedFlag System Architecture Specification

## Executive Summary

RedFlag is a distributed update management system consisting of three primary components: a central server, cross-platform agents, and a web dashboard. The system uses a secure agent-server communication model with cryptographic verification and authorization.

## Component Architecture

### Server Component (`aggregator-server`)

**Purpose**: Central management and coordination
**Technology Stack**: Go + Gin HTTP Framework + PostgreSQL
**Key Responsibilities**:
- Agent registration and authentication
- Command distribution and orchestration
- Update package management and signing
- Web API and authentication
- Audit logging and monitoring

**Critical Subcomponents**:
- API Layer: RESTful endpoints with authentication middleware
- Business Logic: Command processing, agent management
- Data Layer: PostgreSQL with event sourcing patterns
- Security Layer: Ed25519 signing, JWT authentication
- Scheduler: Priority-based job scheduling

### Agent Component (`aggregator-agent`)

**Purpose**: Distributed update scanning and execution
**Technology Stack**: Go with platform-specific integrations
**Key Responsibilities**:
- System update scanning (multiple package managers)
- Command execution and reporting
- Secure communication with server
- Local state management and persistence
- Service integration (systemd/Windows Services)

**Critical Subcomponents**:
- Scanner Factory: Platform-specific update scanners
- Installer Factory: Package manager installers
- Orchestrator: Command execution and coordination
- Communication Layer: Secure HTTP client with retry logic
- State Management: Local file persistence and recovery

### Web Dashboard (`aggregator-web`)

**Purpose**: Administrative interface and visualization
**Technology Stack**: React + TypeScript + Vite
**Key Responsibilities**:
- Agent management and monitoring
- Command creation and scheduling
- System metrics visualization
- User authentication and settings

## Interaction Patterns

### Agent Registration Flow

```
1. Agent Discovery → 2. Token Validation → 3. Machine Binding → 4. Key Exchange → 5. Persistent Session
```

### Command Distribution Flow

```
1. Command Creation → 2. Security Signing → 3. Agent Distribution → 4. Secure Execution → 5. Acknowledgment Processing
```

### Update Package Flow

```
1. Package Build → 2. Cryptographic Signing → 3. Secure Distribution → 4. Signature Verification → 5. Atomic Installation
```

## Data Architecture

### Data Flow Patterns

- **Command Flow**: Server → Agent → Server (acknowledgment)
- **Update Data Flow**: Agent → Server → Web Dashboard
- **Authentication Flow**: Client → Server → JWT Token → Protected Resources
- **Update Package Flow**: Server → Agent (with verification)

### Data Persistence Patterns

- **Event Sourcing**: Complete audit trail for all operations
- **State Snapshots**: Current system state in normalized tables
- **Temporal Data**: Time-series metrics and historical data
- **File-based State**: Agent local state with conflict resolution

### Data Consistency Models

- **Strong Consistency**: Database operations within transactions
- **Eventual Consistency**: Agent synchronization with server
- **Conflict Resolution**: Last-write-wins with version validation
````

### 2. Security Architecture Implementation

````markdown
# Security Architecture Implementation Guide

## Cryptographic Operations

### Ed25519 Signing System

**Purpose**: Authenticity verification for update packages and commands
**Implementation Details**:
- Key generation using `crypto/ed25519` with `crypto/rand.Reader`
- Private key storage in environment variables (HSM recommended)
- Public key distribution via `/api/v1/public-key` endpoint
- Signature verification on agent before package installation

**Key Management**:
```go
import (
	"crypto/ed25519"
	"crypto/rand"
)

// Key generation (GenerateKey returns the public key first, then the private key)
publicKey, privateKey, err := ed25519.GenerateKey(rand.Reader)

// Signing
signature := ed25519.Sign(privateKey, message)

// Verification
valid := ed25519.Verify(publicKey, message, signature)
```

### Nonce-Based Replay Protection

**Purpose**: Prevent command replay attacks
**Implementation Details**:
- UUID-based nonce with Unix timestamp
- Ed25519 signature for nonce authenticity
- 5-minute freshness window
- Server-side nonce tracking and validation

**Nonce Structure**:
```json
{
  "nonce_uuid": "550e8400-e29b-41d4-a716-446655440000",
  "nonce_timestamp": 1704067200,
  "nonce_signature": "ed25519-signature-hex"
}
```

### Machine Binding System

**Purpose**: Prevent agent configuration sharing
**Implementation Details**:
- Hardware fingerprint using `github.com/denisbrodbeck/machineid`
- Database constraint enforcement for uniqueness
- Version enforcement for minimum security requirements
- Migration handling for agent upgrades

**Fingerprint Components**:
- Machine ID (primary identifier)
- CPU information
- Memory configuration
- System UUID
- Network interface MAC addresses

## Authentication Architecture

### JWT Token System

- **Access Tokens**: 24-hour lifetime for API operations
- **Refresh Tokens**: 90-day sliding window for agent continuity
- **Token Storage**: SHA-256 hashed tokens in database
- **Sliding Window**: Active agents never expire, inactive agents auto-expire

### Multi-Tier Authentication

```mermaid
graph LR
    A[Registration Token] --> B[Initial JWT Access Token]
    B --> C[Refresh Token Flow]
    C --> D[Continuous JWT Renewal]
```

### Session Management

- **Agent Sessions**: Long-lived with sliding window renewal
- **User Sessions**: Standard web session with timeout
- **Token Revocation**: Immediate revocation capability
- **Audit Trail**: Complete token lifecycle logging

## Network Security

### Transport Security

- **HTTPS/TLS**: All communications encrypted
- **Certificate Validation**: Proper certificate chain verification
- **HSTS Headers**: HTTP Strict Transport Security
- **Certificate Pinning**: Optional for enhanced security

### API Security

- **Rate Limiting**: Endpoint-specific rate limiting
- **Input Validation**: Comprehensive input sanitization
- **CORS Protection**: Proper cross-origin resource sharing
- **Security Headers**: X-Frame-Options, X-Content-Type-Options

### Agent Communication Security

- **Mutual Authentication**: Both ends verify identity
- **Command Signing**: Cryptographic command verification
- **Replay Protection**: Nonce-based freshness validation
- **Secure Storage**: Local state encrypted at rest
````

### 3. Deployment Architecture Patterns

````markdown
# Deployment Architecture Guide

## Deployment Topologies

### Single-Node Deployment

**Use Case**: Homelab, small environments (<50 agents)
**Architecture**: All components on single host
**Requirements**:
- Docker and Docker Compose
- PostgreSQL database
- SSL certificates (optional for homelab)

**Deployment Pattern**:
```
Host
├── Docker Containers
│   ├── PostgreSQL (port 5432)
│   ├── RedFlag Server (port 8080)
│   ├── RedFlag Web (port 3000)
│   └── Nginx Reverse Proxy (port 443/80)
└── System Resources
    ├── Data Volume (PostgreSQL)
    ├── Log Volume (Containers)
    └── SSL Certificates
```

### Multi-Node Deployment

**Use Case**: Medium environments (50-1000 agents)
**Architecture**: Separated database and application servers
**Requirements**:
- Separate database server
- Load balancer for web traffic
- SSL certificates
- Backup infrastructure

**Deployment Pattern**:
```
Load Balancer (HTTPS)
    ↓
Web Servers (2+ instances)
    ↓
Application Servers (2+ instances)
    ↓
Database Cluster (Primary + Replica)
```

### High-Availability Deployment

**Use Case**: Large environments (1000+ agents)
**Architecture**: Fully redundant with failover
**Requirements**:
- Database clustering
- Application load balancing
- Geographic distribution
- Disaster recovery planning

## Scaling Patterns

### Horizontal Scaling

- **Stateless Application Servers**: Easy horizontal scaling
- **Database Read Replicas**: Read scaling for API calls
- **Agent Load Distribution**: Natural geographic distribution
- **Web Frontend Scaling**: CDN and static asset optimization

### Vertical Scaling

- **Database Performance**: Connection pooling, query optimization
- **Memory Usage**: Efficient in-memory operations
- **CPU Optimization**: Go's concurrency for handling many agents
- **Storage Performance**: SSD for database, appropriate sizing

## Security Deployment Patterns

### Network Isolation

- **Database Access**: Restricted to application servers only
- **Agent Access**: VPN or dedicated network paths
- **Admin Access**: Bastion hosts or VPN requirements
- **Monitoring**: Isolated monitoring network

### Secret Management

- **Environment Variables**: Sensitive configuration
- **Key Management**: Hardware security modules for production
- **Certificate Management**: Automated certificate rotation
- **Backup Encryption**: Encrypted backup storage

## Infrastructure as Code

### Docker Compose Configuration

```yaml
version: '3.8'
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: aggregator
      POSTGRES_USER: aggregator
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    restart: unless-stopped

  server:
    build: ./aggregator-server
    environment:
      DATABASE_URL: postgres://aggregator:${DB_PASSWORD}@postgres:5432/aggregator
      REDFLAG_SIGNING_PRIVATE_KEY: ${REDFLAG_SIGNING_PRIVATE_KEY}
    depends_on:
      - postgres
    restart: unless-stopped

  web:
    build: ./aggregator-web
    environment:
      VITE_API_URL: http://localhost:8080/api/v1
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "443:443"
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
      - ./ssl:/etc/ssl/certs
    depends_on:
      - server
      - web
    restart: unless-stopped

# Named volumes referenced by a service must be declared at the top level
volumes:
  postgres_data:
```

### Kubernetes Deployment

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redflag-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redflag-server
  template:
    metadata:
      labels:
        app: redflag-server
    spec:
      containers:
        - name: server
          image: redflag/server:latest
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: redflag-secrets
                  key: database-url
          ports:
            - containerPort: 8080
```
````

### 4. Performance and Scalability Documentation

```markdown
# Performance and Scalability Guide

## Performance Characteristics

### Agent Performance

- **Memory Usage**: 50-200MB typical (depends on scan results)
- **CPU Usage**: <5% during normal operations, spikes during scans
- **Network Usage**: 300 bytes per check-in (typical)
- **Storage Usage**: State files proportional to update count

### Server Performance

- **Memory Usage**: ~100MB base + queue overhead
- **CPU Usage**: Low for API calls, moderate during command processing
- **Database Performance**: Optimized for concurrent agent check-ins
- **Network Usage**: Scales with agent count and command frequency

### Web Dashboard Performance

- **Load Time**: <2 seconds for typical pages
- **API Response**: <500ms for most endpoints
- **Memory Usage**: Browser-dependent, typically <50MB
- **Concurrent Users**: Supports 50+ simultaneous users

## Scalability Targets

### Agent Scaling

- **Target**: 10,000+ agents per server instance
- **Check-in Pattern**: 5-minute intervals with jitter
- **Database Connections**: Connection pooling for efficiency
- **Memory Requirements**: 1MB per 4,000 active jobs in queue

### Database Scaling

- **Read Scaling**: Read replicas for dashboard queries
- **Write Scaling**: Optimized for concurrent check-ins
- **Storage Growth**: Linear with agent count and history retention
- **Backup Performance**: Incremental backups for large datasets

### Web Interface Scaling

- **User Scaling**: 100+ concurrent administrators
- **API Rate Limiting**: Prevents abuse and ensures fairness
- **Caching Strategy**: Browser caching for static assets
- **CDN Integration**: Optional for large deployments

## Performance Optimization

### Database Optimization

- **Indexing Strategy**: Optimized indexes for common queries
- **Connection Pooling**: Efficient database connection reuse
- **Query Optimization**: Minimize N+1 query patterns
- **Partitioning**: Time-based partitioning for historical data

### Application Optimization

- **In-Memory Operations**: Priority queue for job scheduling
- **Efficient Serialization**: JSON with minimal overhead
- **Batch Operations**: Bulk database operations where possible
- **Caching**: Appropriate caching for frequently accessed data

### Network Optimization

- **Compression**: Gzip compression for API responses
- **Keep-Alive**: Persistent HTTP connections
- **Efficient Protocols**: HTTP/2 support where beneficial
- **Geographic Distribution**: Edge caching for agent downloads

## Monitoring and Alerting

### Key Performance Indicators

- **Agent Check-in Rate**: Should be >95% success
- **API Response Times**: <500ms for 95th percentile
- **Database Query Performance**: <100ms for critical queries
- **Memory Usage**: Alert on >80% usage
- **CPU Usage**: Alert on >80% sustained usage

### Alert Thresholds

- **Agent Connectivity**: <90% check-in success rate
- **API Error Rate**: >5% error rate
- **Database Performance**: >1 second for any query
- **System Resources**: >80% usage for sustained periods
- **Security Events**: Any authentication failures

## Capacity Planning

### Resource Requirements by Scale

#### Small Deployment (<100 agents)

- **CPU**: 2 cores
- **Memory**: 4GB RAM
- **Storage**: 20GB SSD
- **Network**: 10 Mbps

#### Medium Deployment (100-1000 agents)

- **CPU**: 4 cores
- **Memory**: 8GB RAM
- **Storage**: 100GB SSD
- **Network**: 100 Mbps

#### Large Deployment (1000-10000 agents)

- **CPU**: 8+ cores
- **Memory**: 16GB+ RAM
- **Storage**: 500GB+ SSD
- **Network**: 1 Gbps

## Performance Testing

### Load Testing Scenarios

- **Agent Check-in Load**: Simulate 10,000 concurrent agents
- **API Stress Testing**: High-volume dashboard usage
- **Database Performance**: Concurrent query testing
- **Memory Leak Testing**: Long-running stability tests

### Benchmark Results

- **Agent Check-ins**: 1000+ agents per minute
- **API Requests**: 500+ requests per second
- **Database Queries**: 10,000+ queries per second
- **Memory Stability**: No leaks over 7-day runs
```

## Definition of Done

- [ ] System architecture specification created
- [ ] Security implementation guide documented
- [ ] Deployment architecture patterns defined
- [ ] Performance characteristics documented
- [ ] Component interaction diagrams created
- [ ] Design decision rationale documented
- [ ] Technology selection justification documented
- [ ] Integration patterns and contracts specified

## Implementation Details

### Documentation Structure

```
docs/
├── architecture/
│   ├── system-overview.md
│   ├── components/
│   │   ├── server.md
│   │   ├── agent.md
│   │   └── web-dashboard.md
│   ├── security/
│   │   ├── authentication.md
│   │   ├── cryptographic-operations.md
│   │   └── network-security.md
│   ├── deployment/
│   │   ├── single-node.md
│   │   ├── multi-node.md
│   │   └── high-availability.md
│   ├── performance/
│   │   ├── scalability.md
│   │   ├── optimization.md
│   │   └── monitoring.md
│   └── decisions/
│       ├── technology-choices.md
│       ├── design-patterns.md
│       └── trade-offs.md
└── diagrams/
    ├── system-architecture.drawio
    ├── data-flow.drawio
    ├── security-model.drawio
    └── deployment-patterns.drawio
```

### Architecture Decision Records (ADRs)

```markdown
# ADR-001: Technology Stack Selection

## Status
Accepted

## Context
Need to select technology stack for RedFlag update management system.
## Decision
- Backend: Go + Gin HTTP Framework
- Database: PostgreSQL
- Frontend: React + TypeScript
- Agent: Go (cross-platform)

## Rationale
- Go: Cross-platform compilation, strong cryptography, good performance
- PostgreSQL: Strong consistency, mature, good tooling
- React: Component-based, good ecosystem, TypeScript support
- Gin: High performance, good middleware support

## Consequences
- Single language across backend and agent
- Strong typing with TypeScript
- PostgreSQL expertise required
- Go ecosystem for security libraries
```

## Prerequisites

- Architecture review process established
- Design documentation templates created
- Diagram creation tools available
- Technical writing resources allocated
- Review and approval workflow defined

## Effort Estimate

**Complexity:** High
**Effort:** 3-4 weeks

- Week 1: System architecture and component documentation
- Week 2: Security and deployment architecture
- Week 3: Performance and scalability documentation
- Week 4: Review, diagrams, and ADRs

## Success Metrics

- Implementation alignment with documented architecture
- New developer understanding of system design
- Reduced architectural drift in codebase
- Easier system maintenance and evolution
- Better decision making for future changes
- Improved team communication about design

## Monitoring

Track these metrics after implementation:
- Architecture compliance in code reviews
- Developer understanding assessments
- Implementation decision documentation coverage
- System design change tracking
- Team feedback on documentation usefulness