# RedFlag Clean Architecture Implementation Master Plan

**Date**: 2025-12-19
**Version**: v1.0
**Total Implementation Time**: 3-4 hours (including migration fixes and command deduplication)
**Status**: READY FOR EXECUTION

---

## Executive Summary

Complete implementation plan for fixing critical ETHOS violations and implementing clean architecture patterns across RedFlag v0.1.27. Addresses duplicate command generation, lost frontend errors, and migration system bugs.

**Three Core Objectives:**

1. ✅ Fix migration system (blocks everything else)
2. ✅ Implement command factory pattern (prevents duplicate key violations)
3. ✅ Build frontend error logging system (ETHOS #1 compliance)

---

## Table of Contents

1. [Pre-Implementation: Migration System Fix](#pre-implementation-migration-system-fix)
2. [Phase 1: Command Factory Pattern](#phase-1-command-factory-pattern)
3. [Phase 2: Database Schema](#phase-2-database-schema)
4. [Phase 3: Backend Error Handler](#phase-3-backend-error-handler)
5. [Phase 4: Frontend Error Logger](#phase-4-frontend-error-logger)
6. [Phase 5: Toast Integration](#phase-5-toast-integration)
7. [Phase 6: Verification & Testing](#phase-6-verification-and-testing)
8. [Implementation Checklist](#implementation-checklist)
9. [Risk Mitigation](#risk-mitigation)
10. [Post-Implementation Review](#post-implementation-review)

---

## Pre-Implementation: Migration System Fix

**⚠️ CRITICAL: Must be completed first - blocks all other work**

### Problem

The migration runner has duplicate INSERT logic, causing "duplicate key value violates unique constraint" errors on fresh installations.
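The contract the fix restores — each migration filename recorded exactly once, and already-recorded migrations skipped on re-run — can be sketched with a toy model in Go. A map stands in for the `schema_migrations` table here; the real runner in `db.go` does the same bookkeeping with SQL inside a transaction.

```go
package main

import "fmt"

// apply runs each not-yet-applied migration once and records it exactly once.
// The applied map is a stand-in for the schema_migrations table.
func apply(applied map[string]bool, files []string) []string {
	var ran []string
	for _, f := range files {
		if applied[f] {
			continue // already recorded; never insert a second time
		}
		// (the real runner executes the migration SQL here)
		applied[f] = true // single INSERT INTO schema_migrations
		ran = append(ran, f)
	}
	return ran
}

func main() {
	applied := map[string]bool{"001_init.up.sql": true}
	files := []string{"001_init.up.sql", "002_agents.up.sql"}
	fmt.Println(apply(applied, files)) // only 002 runs
	fmt.Println(apply(applied, files)) // re-run is a no-op
}
```

The broken runner inserted the version twice per loop iteration, which is exactly a second `applied[f] = true` expressed as a second `INSERT` against a primary-key column — hence the constraint violation.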
### Root Cause

File: `aggregator-server/internal/database/db.go`

- Line 103: Executes `INSERT INTO schema_migrations (version) VALUES ($1)`
- Line 116: Executes the exact same INSERT statement
- Result: Every migration filename gets inserted twice

### Solution

```go
// File: aggregator-server/internal/database/db.go
// Lines 95-120: Fix duplicate INSERT logic

func (db *DB) Migrate() error {
	// ... existing code ...

	for _, file := range files {
		filename := file.Name()

		// ❌ REMOVE THIS - Line 103 duplicates line 116
		// if _, err = tx.Exec("INSERT INTO schema_migrations (version) VALUES ($1)", filename); err != nil {
		//     return fmt.Errorf("failed to mark migration %s as applied: %w", filename, err)
		// }

		// Keep only the EXECUTE + INSERT combo at lines 110-116
		if _, err = tx.Exec(string(content)); err != nil {
			log.Printf("Migration %s failed, marking as applied: %v", filename, err)
		}

		// ✅ Keep this INSERT - it's the correct location
		if _, err = tx.Exec("INSERT INTO schema_migrations (version) VALUES ($1)", filename); err != nil {
			return fmt.Errorf("failed to mark migration %s as applied: %w", filename, err)
		}
	}

	// ... rest of function ...
}
```

### Validation Steps

1. Wipe database completely: `docker-compose down -v`
2. Start fresh: `docker-compose up -d`
3. Check migration logs: Should see all migrations apply without duplicate key errors
4. Verify: `SELECT COUNT(DISTINCT version) = COUNT(version) FROM schema_migrations`

### Time Required: 5 minutes

**Blocker Status**: 🔴 CRITICAL - Do not proceed until fixed

---

## Phase 1: Command Factory Pattern

### Objective

Prevent duplicate command key violations by ensuring all commands have properly generated UUIDs at creation time.
### Files to Create

#### 1.1 Command Factory

**File**: `aggregator-server/internal/command/factory.go`

```go
package command

import (
	"database/sql"
	"errors"
	"fmt"
	"time"

	"github.com/google/uuid"

	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)

// Factory creates validated AgentCommand instances
type Factory struct {
	validator *Validator
}

// NewFactory creates a new command factory
func NewFactory() *Factory {
	return &Factory{
		validator: NewValidator(),
	}
}

// Create generates a new validated AgentCommand with unique ID
func (f *Factory) Create(agentID uuid.UUID, commandType string, params map[string]interface{}) (*models.AgentCommand, error) {
	cmd := &models.AgentCommand{
		ID:          uuid.New(), // Immediate, explicit generation
		AgentID:     agentID,
		CommandType: commandType,
		Status:      "pending",
		Source:      determineSource(commandType),
		Params:      params,
		CreatedAt:   time.Now(),
		UpdatedAt:   time.Now(),
	}

	if err := f.validator.Validate(cmd); err != nil {
		return nil, fmt.Errorf("command validation failed: %w", err)
	}

	return cmd, nil
}

// CreateWithIdempotency generates a command with idempotency protection
func (f *Factory) CreateWithIdempotency(agentID uuid.UUID, commandType string, params map[string]interface{}, idempotencyKey string) (*models.AgentCommand, error) {
	// Check for existing command with same idempotency key.
	// findByIdempotencyKey and storeIdempotencyKey are assumed helpers whose
	// implementation depends on the database query layer.
	existing, err := f.findByIdempotencyKey(agentID, idempotencyKey)
	if err != nil && !errors.Is(err, sql.ErrNoRows) {
		return nil, fmt.Errorf("failed to check idempotency: %w", err)
	}
	if existing != nil {
		return existing, nil // Return existing command instead of creating duplicate
	}

	cmd, err := f.Create(agentID, commandType, params)
	if err != nil {
		return nil, err
	}

	// Store idempotency key with command
	if err := f.storeIdempotencyKey(cmd.ID, agentID, idempotencyKey); err != nil {
		return nil, fmt.Errorf("failed to store idempotency key: %w", err)
	}

	return cmd, nil
}

// determineSource classifies command source based on type
func
determineSource(commandType string) string {
	if isSystemCommand(commandType) {
		return "system"
	}
	return "manual"
}

func isSystemCommand(commandType string) bool {
	systemCommands := []string{
		"enable_heartbeat",
		"disable_heartbeat",
		"update_check",
		"cleanup_old_logs",
	}
	for _, cmd := range systemCommands {
		if commandType == cmd {
			return true
		}
	}
	return false
}
```

#### 1.2 Command Validator

**File**: `aggregator-server/internal/command/validator.go`

```go
package command

import (
	"errors"
	"fmt"

	"github.com/google/uuid"

	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)

// Validator validates command parameters
type Validator struct {
	minCheckInSeconds int
	maxCheckInSeconds int
	minScannerMinutes int
	maxScannerMinutes int
}

// NewValidator creates a new command validator
func NewValidator() *Validator {
	return &Validator{
		minCheckInSeconds: 60,   // 1 minute minimum
		maxCheckInSeconds: 3600, // 1 hour maximum
		minScannerMinutes: 1,    // 1 minute minimum
		maxScannerMinutes: 1440, // 24 hours maximum
	}
}

// Validate performs comprehensive command validation
func (v *Validator) Validate(cmd *models.AgentCommand) error {
	if cmd == nil {
		return errors.New("command cannot be nil")
	}
	if cmd.ID == uuid.Nil {
		return errors.New("command ID cannot be zero UUID")
	}
	if cmd.AgentID == uuid.Nil {
		return errors.New("agent ID is required")
	}
	if cmd.CommandType == "" {
		return errors.New("command type is required")
	}
	if cmd.Status == "" {
		return errors.New("status is required")
	}

	validStatuses := []string{"pending", "running", "completed", "failed", "cancelled"}
	if !contains(validStatuses, cmd.Status) {
		return fmt.Errorf("invalid status: %s", cmd.Status)
	}

	if cmd.Source != "manual" && cmd.Source != "system" {
		return fmt.Errorf("source must be 'manual' or 'system', got: %s", cmd.Source)
	}

	// Validate command type format
	if err := v.validateCommandType(cmd.CommandType); err != nil {
		return err
	}

	return nil
}

// ValidateSubsystemAction validates subsystem-specific actions
func (v *Validator) ValidateSubsystemAction(subsystem string, action string) error {
	validActions := map[string][]string{
		"storage": {"trigger", "enable", "disable", "set_interval"},
		"system":  {"trigger", "enable", "disable", "set_interval"},
		"docker":  {"trigger", "enable", "disable", "set_interval"},
		"updates": {"trigger", "enable", "disable", "set_interval"},
	}

	actions, ok := validActions[subsystem]
	if !ok {
		return fmt.Errorf("unknown subsystem: %s", subsystem)
	}
	if !contains(actions, action) {
		return fmt.Errorf("invalid action '%s' for subsystem '%s'", action, subsystem)
	}
	return nil
}

// ValidateInterval ensures scanner intervals are within bounds
func (v *Validator) ValidateInterval(subsystem string, minutes int) error {
	if minutes < v.minScannerMinutes {
		return fmt.Errorf("interval %d minutes below minimum %d for subsystem %s", minutes, v.minScannerMinutes, subsystem)
	}
	if minutes > v.maxScannerMinutes {
		return fmt.Errorf("interval %d minutes above maximum %d for subsystem %s", minutes, v.maxScannerMinutes, subsystem)
	}
	return nil
}

func (v *Validator) validateCommandType(commandType string) error {
	validPrefixes := []string{"scan_", "install_", "update_", "enable_", "disable_", "reboot"}
	for _, prefix := range validPrefixes {
		if len(commandType) >= len(prefix) && commandType[:len(prefix)] == prefix {
			return nil
		}
	}
	return fmt.Errorf("invalid command type format: %s", commandType)
}

func contains(slice []string, item string) bool {
	for _, s := range slice {
		if s == item {
			return true
		}
	}
	return false
}
```

#### 1.3 Update AgentCommand Model

**File**: `aggregator-server/internal/models/command.go`

```go
package models

import (
	"database/sql"
	"errors"
	"time"

	"github.com/google/uuid"
)

// AgentCommand represents a command sent to an agent
type AgentCommand struct {
	ID          uuid.UUID `db:"id" json:"id"`
	AgentID     uuid.UUID `db:"agent_id" json:"agent_id"`
	CommandType string    `db:"command_type" json:"command_type"`
	Status      string    `db:"status" json:"status"`
	Source      string    `db:"source" json:"source"`
	// Params is serialized to the JSONB params column by the storage layer.
	Params      map[string]interface{} `db:"params" json:"params"`
	Result      sql.NullString         `db:"result" json:"result,omitempty"`
	Error       sql.NullString         `db:"error" json:"error,omitempty"`
	RetryCount  int                    `db:"retry_count" json:"retry_count"`
	CreatedAt   time.Time              `db:"created_at" json:"created_at"`
	UpdatedAt   time.Time              `db:"updated_at" json:"updated_at"`
	CompletedAt sql.NullTime           `db:"completed_at" json:"completed_at,omitempty"`

	// Idempotency support (string form, matching the VARCHAR(64) column
	// added in migration 023a and the factory's key format)
	IdempotencyKey sql.NullString `db:"idempotency_key" json:"-"`
}

// Validate checks if the command is valid
func (c *AgentCommand) Validate() error {
	if c.ID == uuid.Nil {
		return ErrCommandIDRequired
	}
	if c.AgentID == uuid.Nil {
		return ErrAgentIDRequired
	}
	if c.CommandType == "" {
		return ErrCommandTypeRequired
	}
	if c.Status == "" {
		return ErrStatusRequired
	}
	if c.Source != "manual" && c.Source != "system" {
		return ErrInvalidSource
	}
	return nil
}

// IsTerminal returns true if the command is in a terminal state
func (c *AgentCommand) IsTerminal() bool {
	return c.Status == "completed" || c.Status == "failed" || c.Status == "cancelled"
}

// CanRetry returns true if the command can be retried
func (c *AgentCommand) CanRetry() bool {
	return c.Status == "failed" && c.RetryCount < 3
}

// Predefined errors for validation
var (
	ErrCommandIDRequired   = errors.New("command ID cannot be zero UUID")
	ErrAgentIDRequired     = errors.New("agent ID is required")
	ErrCommandTypeRequired = errors.New("command type is required")
	ErrStatusRequired      = errors.New("status is required")
	ErrInvalidSource       = errors.New("source must be 'manual' or 'system'")
)
```

#### 1.4 Update Subsystem Handler

**File**: `aggregator-server/internal/api/handlers/subsystems.go`

```go
package handlers

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
	"github.com/jmoiron/sqlx"

	"github.com/Fimeg/RedFlag/aggregator-server/internal/command"
	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)

type SubsystemHandler struct {
	db             *sqlx.DB
	commandFactory *command.Factory
}

func NewSubsystemHandler(db *sqlx.DB) *SubsystemHandler {
	return &SubsystemHandler{
		db:             db,
		commandFactory: command.NewFactory(),
	}
}

// TriggerSubsystem creates and enqueues a subsystem command
func (h *SubsystemHandler) TriggerSubsystem(c *gin.Context) {
	agentID, err := uuid.Parse(c.Param("id"))
	if err != nil {
		log.Printf("[ERROR] [server] [subsystem] invalid_agent_id error=%v", err)
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid agent ID"})
		return
	}

	subsystem := c.Param("subsystem")
	if err := h.validateSubsystem(subsystem); err != nil {
		log.Printf("[ERROR] [server] [subsystem] invalid_subsystem subsystem=%s error=%v", subsystem, err)
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}

	// DEDUPLICATION CHECK: Prevent multiple pending scans
	existingCmd, err := h.getPendingScanCommand(agentID, subsystem)
	if err != nil {
		log.Printf("[ERROR] [server] [subsystem] query_failed agent_id=%s subsystem=%s error=%v", agentID, subsystem, err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "internal error"})
		return
	}
	if existingCmd != nil {
		log.Printf("[INFO] [server] [subsystem] scan_already_pending agent_id=%s subsystem=%s command_id=%s", agentID, subsystem, existingCmd.ID)
		log.Printf("[HISTORY] [server] [scan_%s] duplicate_request_prevented agent_id=%s command_id=%s timestamp=%s",
			subsystem, agentID, existingCmd.ID, time.Now().Format(time.RFC3339))
		c.JSON(http.StatusConflict, gin.H{
			"error":      "Scan already in progress",
			"command_id": existingCmd.ID.String(),
			"subsystem":  subsystem,
			"status":     existingCmd.Status,
			"created_at": existingCmd.CreatedAt,
		})
		return
	}

	// Generate idempotency key from request context
	idempotencyKey := h.generateIdempotencyKey(c, agentID, subsystem)

	// Create command using factory
	cmd, err := h.commandFactory.CreateWithIdempotency(
		agentID,
		"scan_"+subsystem,
		map[string]interface{}{"subsystem": subsystem},
		idempotencyKey,
	)
	if err != nil {
		log.Printf("[ERROR] [server] [subsystem] command_creation_failed agent_id=%s subsystem=%s error=%v", agentID, subsystem, err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to create command"})
		return
	}

	// Store command in database
	if err := h.storeCommand(cmd); err != nil {
		log.Printf("[ERROR] [server] [subsystem] command_store_failed agent_id=%s command_id=%s error=%v", agentID, cmd.ID, err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to store command"})
		return
	}

	log.Printf("[INFO] [server] [subsystem] command_created agent_id=%s command_id=%s subsystem=%s", agentID, cmd.ID, subsystem)
	log.Printf("[HISTORY] [server] [scan_%s] command_created agent_id=%s command_id=%s source=manual timestamp=%s",
		subsystem, agentID, cmd.ID, time.Now().Format(time.RFC3339))

	c.JSON(http.StatusOK, gin.H{
		"message":    "Command created successfully",
		"command_id": cmd.ID.String(),
		"subsystem":  subsystem,
	})
}

// getPendingScanCommand checks for existing pending scan commands
func (h *SubsystemHandler) getPendingScanCommand(agentID uuid.UUID, subsystem string) (*models.AgentCommand, error) {
	var cmd models.AgentCommand
	query := `
		SELECT id, command_type, status, created_at
		FROM agent_commands
		WHERE agent_id = $1
		  AND command_type = $2
		  AND status = 'pending'
		LIMIT 1`

	commandType := "scan_" + subsystem
	err := h.db.Get(&cmd, query, agentID, commandType)
	if err != nil {
		if err == sql.ErrNoRows {
			return nil, nil // No pending command found
		}
		return nil, fmt.Errorf("query failed: %w", err)
	}
	return &cmd, nil
}

// validateSubsystem checks if subsystem is recognized
func (h *SubsystemHandler) validateSubsystem(subsystem string) error {
	validSubsystems := []string{"apt", "dnf", "windows", "winget", "storage", "system", "docker"}
	for _, valid := range validSubsystems {
		if subsystem == valid {
			return nil
		}
	}
	return fmt.Errorf("unknown subsystem: %s", subsystem)
}

// generateIdempotencyKey creates a key to prevent duplicate submissions
func (h *SubsystemHandler) generateIdempotencyKey(c *gin.Context, agentID uuid.UUID, subsystem string) string {
	// Use timestamp rounded to nearest minute for idempotency window.
	// This allows retries within the same minute but prevents duplicates across minutes.
	timestampWindow := time.Now().Unix() / 60 // Round to minute
	return fmt.Sprintf("%s:%s:%d", agentID.String(), subsystem, timestampWindow)
}

// storeCommand persists command to database
func (h *SubsystemHandler) storeCommand(cmd *models.AgentCommand) error {
	// Implementation depends on your command storage layer.
	// NOTE: the params map must be bound as JSON (e.g. via a JSONB wrapper
	// type) when inserting; shown simplified here.
	query := `
		INSERT INTO agent_commands (id, agent_id, command_type, status, source, params, created_at)
		VALUES (:id, :agent_id, :command_type, :status, :source, :params, NOW())`
	_, err := h.db.NamedExec(query, cmd)
	return err
}
```

### Time Required: 30 minutes

---

## Phase 2: Database Schema

### Migration 023a: Command Deduplication

**File**: `aggregator-server/internal/database/migrations/023a_command_deduplication.up.sql`

```sql
-- Command Deduplication Schema
-- Prevents multiple pending scan commands per subsystem per agent

-- Add unique constraint to enforce single pending command per subsystem
CREATE UNIQUE INDEX idx_agent_pending_subsystem
ON agent_commands(agent_id, command_type, status)
WHERE status = 'pending';

-- Add idempotency key support for retry scenarios
ALTER TABLE agent_commands
ADD COLUMN idempotency_key VARCHAR(64) UNIQUE NULL;

CREATE INDEX idx_agent_commands_idempotency_key
ON agent_commands(idempotency_key);

COMMENT ON COLUMN agent_commands.idempotency_key IS 'Prevents duplicate command creation from retry logic.
Based on (agent_id + subsystem + timestamp window).';
```

**File**: `aggregator-server/internal/database/migrations/023a_command_deduplication.down.sql`

```sql
DROP INDEX IF EXISTS idx_agent_pending_subsystem;

ALTER TABLE agent_commands
DROP COLUMN IF EXISTS idempotency_key;
```

### Migration 023: Client Error Logging Table

**File**: `aggregator-server/internal/database/migrations/023_client_error_logging.up.sql`

```sql
-- Client Error Logging Schema
-- Implements ETHOS #1: Errors are History, Not /dev/null

CREATE TABLE client_errors (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    agent_id UUID REFERENCES agents(id) ON DELETE SET NULL,
    subsystem VARCHAR(50) NOT NULL,
    error_type VARCHAR(50) NOT NULL,
    message TEXT NOT NULL,
    stack_trace TEXT,
    metadata JSONB,
    url TEXT NOT NULL,
    user_agent TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Indexes for efficient querying
CREATE INDEX idx_client_errors_agent_time ON client_errors(agent_id, created_at DESC);
CREATE INDEX idx_client_errors_subsystem_time ON client_errors(subsystem, created_at DESC);
CREATE INDEX idx_client_errors_error_type_time ON client_errors(error_type, created_at DESC);
CREATE INDEX idx_client_errors_created_at ON client_errors(created_at DESC);

-- Comments for documentation
COMMENT ON TABLE client_errors IS 'Frontend error logs for debugging and auditing.
Implements ETHOS #1.';
COMMENT ON COLUMN client_errors.agent_id IS 'Agent active when error occurred (NULL for pre-auth errors)';
COMMENT ON COLUMN client_errors.subsystem IS 'RedFlag subsystem being used (storage, system, docker, etc.)';
COMMENT ON COLUMN client_errors.error_type IS 'Error category: javascript_error, api_error, ui_error, validation_error';
COMMENT ON COLUMN client_errors.metadata IS 'Additional context (component, API response, user actions)';

-- Note: idempotency support on agent_commands is added by migration 023a
-- (023a_command_deduplication.up.sql). It is intentionally not re-added here;
-- a second ADD COLUMN would fail with a duplicate-column error.
```

**File**: `aggregator-server/internal/database/migrations/023_client_error_logging.down.sql`

```sql
DROP TABLE IF EXISTS client_errors;
```

### Time Required: 5 minutes

---

## Phase 3: Backend Error Handler

### Files to Create

#### 3.1 Error Handler

**File**: `aggregator-server/internal/api/handlers/client_errors.go`

```go
package handlers

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
	"github.com/jmoiron/sqlx"
)

// ClientErrorHandler handles frontend error logging per ETHOS #1
type ClientErrorHandler struct {
	db *sqlx.DB
}

// NewClientErrorHandler creates a new error handler
func NewClientErrorHandler(db *sqlx.DB) *ClientErrorHandler {
	return &ClientErrorHandler{db: db}
}

// LogErrorRequest represents a client error log entry
type LogErrorRequest struct {
	Subsystem  string                 `json:"subsystem" binding:"required"`
	ErrorType  string                 `json:"error_type" binding:"required,oneof=javascript_error api_error ui_error validation_error"`
	Message    string                 `json:"message" binding:"required,max=10000"`
	StackTrace string                 `json:"stack_trace,omitempty"`
	Metadata   map[string]interface{} `json:"metadata,omitempty"`
	URL        string                 `json:"url" binding:"required"`
}

// LogError processes and stores frontend errors
func (h *ClientErrorHandler) LogError(c *gin.Context) {
	var req LogErrorRequest
	if err := c.ShouldBindJSON(&req); err != nil {
		log.Printf("[ERROR] [server] [client_error] validation_failed error=\"%v\"", err)
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid request data"})
		return
	}

	// Extract agent ID from auth middleware if available
	var agentID interface{}
	if agentIDValue, exists := c.Get("agentID"); exists {
		if id, ok := agentIDValue.(uuid.UUID); ok {
			agentID = id
		}
	}

	// Log to console with HISTORY prefix
	log.Printf("[ERROR] [server] [client] [%s] agent_id=%v subsystem=%s message=\"%s\"",
		req.ErrorType, agentID, req.Subsystem, truncate(req.Message, 200))
	log.Printf("[HISTORY] [server] [client_error] agent_id=%v subsystem=%s type=%s url=\"%s\" message=\"%s\" timestamp=%s",
		agentID, req.Subsystem, req.ErrorType, req.URL, req.Message, time.Now().Format(time.RFC3339))

	// Store in database with retry logic
	if err := h.storeError(agentID, req, c.GetHeader("User-Agent")); err != nil {
		log.Printf("[ERROR] [server] [client_error] store_failed error=\"%v\"", err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to store error"})
		return
	}

	c.JSON(http.StatusOK, gin.H{"logged": true})
}

// storeError persists error to database with retry. The user agent is passed
// in explicitly because the gin context is not in scope here.
func (h *ClientErrorHandler) storeError(agentID interface{}, req LogErrorRequest, userAgent string) error {
	const maxRetries = 3
	var lastErr error

	// Metadata is a free-form JSON object; marshal it for the JSONB column.
	metadataJSON, err := json.Marshal(req.Metadata)
	if err != nil {
		return fmt.Errorf("failed to marshal metadata: %w", err)
	}

	for attempt := 1; attempt <= maxRetries; attempt++ {
		query := `INSERT INTO client_errors
			(agent_id, subsystem, error_type, message, stack_trace, metadata, url, user_agent)
			VALUES (:agent_id, :subsystem, :error_type, :message, :stack_trace, :metadata, :url, :user_agent)`

		_, err := h.db.NamedExec(query, map[string]interface{}{
			"agent_id":    agentID,
			"subsystem":   req.Subsystem,
			"error_type":  req.ErrorType,
			"message":     req.Message,
			"stack_trace": req.StackTrace,
			"metadata":    metadataJSON,
			"url":         req.URL,
			"user_agent":  userAgent,
		})
		if err == nil {
			return nil
		}
		lastErr = err
		if attempt < maxRetries {
			time.Sleep(time.Duration(attempt) * time.Second)
		}
	}
	return fmt.Errorf("failed after %d attempts: %w", maxRetries, lastErr)
}

func truncate(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	return s[:maxLen] + "..."
}

// hash is a simple helper for message deduplication detection
// (not yet wired up in this phase)
func hash(s string) string {
	h := sha256.Sum256([]byte(s))
	return fmt.Sprintf("%x", h)[:16]
}
```

#### 3.2 Query Client Errors

**File**: `aggregator-server/internal/database/queries/client_errors.sql`

```sql
-- name: GetClientErrorsByAgent :many
SELECT * FROM client_errors
WHERE agent_id = $1
ORDER BY created_at DESC
LIMIT $2;

-- name: GetClientErrorsBySubsystem :many
SELECT * FROM client_errors
WHERE subsystem = $1
ORDER BY created_at DESC
LIMIT $2;

-- name: GetClientErrorStats :many
SELECT
    subsystem,
    error_type,
    COUNT(*) AS count,
    MIN(created_at) AS first_occurrence,
    MAX(created_at) AS last_occurrence
FROM client_errors
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY subsystem, error_type
ORDER BY count DESC;
```

#### 3.3 Update Router

**File**: `aggregator-server/internal/api/router.go`

```go
// Add to router setup function
func SetupRouter(db *sqlx.DB, cfg *config.Config) *gin.Engine {
	// ... existing setup ...

	// Error logging endpoint (authenticated)
	errorHandler := handlers.NewClientErrorHandler(db)
	apiV1.POST("/logs/client-error",
		middleware.AuthMiddleware(),
		errorHandler.LogError,
	)

	// Admin endpoint for viewing errors
	// (GetErrors is a listing handler over the 3.2 queries; not shown here)
	apiV1.GET("/logs/client-errors",
		middleware.AuthMiddleware(),
		middleware.AdminMiddleware(),
		errorHandler.GetErrors,
	)

	// ... rest of setup ...
}
```

### Time Required: 20 minutes

---

## Phase 4: Frontend Error Logger

### Files to Create

#### 4.1 Client Error Logger

**File**: `aggregator-web/src/lib/client-error-logger.ts`

```typescript
import { api } from './api';

export interface ClientErrorLog {
  subsystem: string;
  error_type: 'javascript_error' | 'api_error' | 'ui_error' | 'validation_error';
  message: string;
  stack_trace?: string;
  metadata?: Record<string, unknown>;
  url: string;
  timestamp: string;
}

/**
 * ClientErrorLogger provides reliable frontend error logging with retry logic
 * Implements ETHOS #3: Assume Failure; Build for Resilience
 */
export class ClientErrorLogger {
  private maxRetries = 3;
  private baseDelayMs = 1000;
  private localStorageKey = 'redflag-error-queue';
  private offlineBuffer: ClientErrorLog[] = [];
  private isOnline = navigator.onLine;

  constructor() {
    // Listen for online/offline events
    window.addEventListener('online', () => {
      this.isOnline = true;
      this.flushOfflineBuffer();
    });
    window.addEventListener('offline', () => {
      this.isOnline = false;
    });
  }

  /**
   * Log an error with automatic retry and offline queuing
   */
  async logError(errorData: Omit<ClientErrorLog, 'url' | 'timestamp'>): Promise<void> {
    const fullError: ClientErrorLog = {
      ...errorData,
      url: window.location.href,
      timestamp: new Date().toISOString(),
    };

    // Try to send immediately
    try {
      await this.sendWithRetry(fullError);
      return;
    } catch (error) {
      // If failed after retries, queue for later
      this.queueForRetry(fullError);
    }
  }

  /**
   * Send error to backend with exponential backoff retry
   */
  private async sendWithRetry(error: ClientErrorLog): Promise<void> {
    for (let attempt = 1; attempt <= this.maxRetries; attempt++) {
      try {
        await api.post('/logs/client-error', error, {
          headers: { 'X-Error-Logger-Request': 'true' },
        });
        // Success, remove from queue if it was there
        this.removeFromQueue(error);
        return;
      } catch (sendError) {
        if (attempt === this.maxRetries) {
          throw sendError; // Rethrow after final attempt
        }
        // Exponential backoff
        await this.sleep(this.baseDelayMs * Math.pow(2, attempt - 1));
      }
    }
  }

  /**
   * Queue error for
   * retry when network is available
   */
  private queueForRetry(error: ClientErrorLog): void {
    try {
      const queue = this.getQueue();
      queue.push({
        ...error,
        retryCount: (error as any).retryCount || 0,
        queuedAt: new Date().toISOString(),
      });
      // Save to localStorage for persistence
      localStorage.setItem(this.localStorageKey, JSON.stringify(queue));
      // Also keep in memory buffer
      this.offlineBuffer.push(error);
    } catch (storageError) {
      // localStorage might be full or unavailable
      console.warn('Failed to queue error for retry:', storageError);
    }
  }

  /**
   * Flush offline buffer when coming back online
   */
  private async flushOfflineBuffer(): Promise<void> {
    if (!this.isOnline) return;

    const queue = this.getQueue();
    if (queue.length === 0) return;

    const failed: typeof queue = [];
    for (const queuedError of queue) {
      try {
        await this.sendWithRetry(queuedError);
      } catch (error) {
        failed.push(queuedError);
      }
    }

    // Update queue with remaining failed items
    if (failed.length < queue.length) {
      localStorage.setItem(this.localStorageKey, JSON.stringify(failed));
    }
  }

  /**
   * Get current error queue from localStorage
   */
  private getQueue(): any[] {
    try {
      const stored = localStorage.getItem(this.localStorageKey);
      return stored ?
        JSON.parse(stored) : [];
    } catch {
      return [];
    }
  }

  /**
   * Remove successfully sent error from queue
   */
  private removeFromQueue(sentError: ClientErrorLog): void {
    try {
      const queue = this.getQueue();
      const filtered = queue.filter(
        (queued) =>
          queued.timestamp !== sentError.timestamp ||
          queued.message !== sentError.message
      );
      if (filtered.length < queue.length) {
        localStorage.setItem(this.localStorageKey, JSON.stringify(filtered));
      }
    } catch {
      // Best effort cleanup
    }
  }

  /**
   * Capture unhandled JavaScript errors
   */
  captureUnhandledErrors(): void {
    // Global error handler
    window.addEventListener('error', (event) => {
      this.logError({
        subsystem: 'global',
        error_type: 'javascript_error',
        message: event.message,
        stack_trace: event.error?.stack,
        metadata: {
          filename: event.filename,
          lineno: event.lineno,
          colno: event.colno,
        },
      }).catch(() => {
        // Silently ignore logging failures
      });
    });

    // Unhandled promise rejections
    window.addEventListener('unhandledrejection', (event) => {
      this.logError({
        subsystem: 'global',
        error_type: 'javascript_error',
        message: event.reason?.message || String(event.reason),
        stack_trace: event.reason?.stack,
      }).catch(() => {});
    });
  }

  private sleep(ms: number): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, ms));
  }
}

// Singleton instance
export const clientErrorLogger = new ClientErrorLogger();

// Auto-capture unhandled errors
if (typeof window !== 'undefined') {
  clientErrorLogger.captureUnhandledErrors();
}
```

#### 4.2 Toast Wrapper with Logging

**File**: `aggregator-web/src/lib/toast-with-logging.ts`

```typescript
import toast, { ToastOptions } from 'react-hot-toast';
import { clientErrorLogger } from './client-error-logger';
import { useLocation } from 'react-router-dom';

/**
 * Extract subsystem from current route
 */
function getCurrentSubsystem(): string {
  if (typeof window === 'undefined') return 'unknown';

  const path = window.location.pathname;

  // Map routes to subsystems
  if (path.includes('/storage')) return 'storage';
  if
 (path.includes('/system')) return 'system';
  if (path.includes('/docker')) return 'docker';
  if (path.includes('/updates')) return 'updates';
  if (path.includes('/agent/')) return 'agent';
  return 'unknown';
}

/**
 * Wrap toast.error to automatically log errors to backend
 * Implements ETHOS #1: Errors are History
 */
export const toastWithLogging = {
  error: (message: string, options?: ToastOptions & { subsystem?: string }) => {
    const subsystem = options?.subsystem || getCurrentSubsystem();

    // Log to backend asynchronously - don't block UI
    clientErrorLogger.logError({
      subsystem,
      error_type: 'ui_error',
      message: message.substring(0, 5000), // Prevent excessively long messages
      metadata: {
        component: options?.id,
        duration: options?.duration,
        position: options?.position,
        timestamp: new Date().toISOString(),
      },
    }).catch(() => {
      // Silently ignore logging failures - don't crash the UI
    });

    // Show toast to user
    return toast.error(message, options);
  },

  // Passthrough methods.
  // Note: react-hot-toast has no built-in info/warning variants, so the
  // plain toast() call stands in for them here.
  success: toast.success,
  info: (message: string, options?: ToastOptions) => toast(message, options),
  warning: (message: string, options?: ToastOptions) => toast(message, options),
  loading: toast.loading,
  dismiss: toast.dismiss,
  remove: toast.remove,
  promise: toast.promise,
};

/**
 * React hook for toast with automatic subsystem detection
 */
export function useToastWithLogging() {
  const location = useLocation();

  return {
    error: (message: string, options?: ToastOptions) => {
      return toastWithLogging.error(message, {
        ...options,
        subsystem: getSubsystemFromPath(location.pathname),
      });
    },
    success: toast.success,
    info: toastWithLogging.info,
    warning: toastWithLogging.warning,
    loading: toast.loading,
    dismiss: toast.dismiss,
  };
}

function getSubsystemFromPath(pathname: string): string {
  const matches = pathname.match(/\/(storage|system|docker|updates|agent)/);
  return matches ?
    matches[1] : 'unknown';
}
```

#### 4.3 API Integration

**Update**: `aggregator-web/src/lib/api.ts`

```typescript
import { clientErrorLogger } from './client-error-logger';

// Add error logging to axios interceptor
api.interceptors.response.use(
  (response) => response,
  async (error) => {
    // Don't log errors from the error logger itself
    if (error.config?.headers?.['X-Error-Logger-Request']) {
      return Promise.reject(error);
    }

    // Extract subsystem from URL
    const subsystem = extractSubsystem(error.config?.url);

    // Log API errors
    clientErrorLogger.logError({
      subsystem,
      error_type: 'api_error',
      message: error.message,
      metadata: {
        status_code: error.response?.status,
        endpoint: error.config?.url,
        method: error.config?.method,
        response_data: error.response?.data,
      },
    }).catch(() => {
      // Don't let logging errors hide the original error
    });

    return Promise.reject(error);
  }
);

function extractSubsystem(url: string = ''): string {
  const matches = url.match(/\/(storage|system|docker|updates|agent)/);
  return matches ? matches[1] : 'unknown';
}
```

### Time Required: 20 minutes

---

## Phase 5: Toast Integration

### Update Existing Error Calls

**Pattern**: Update error toast calls to use new logger

**Before**:

```typescript
import toast from 'react-hot-toast';

toast.error(`Failed to trigger scan: ${error.message}`);
```

**After**:

```typescript
import { toastWithLogging } from '@/lib/toast-with-logging';

toastWithLogging.error(`Failed to trigger scan: ${error.message}`, {
  subsystem: 'storage', // Specify subsystem
  id: 'trigger-scan-error', // Optional component ID
});
```

### Priority Files to Update

#### 5.1 React State Management for Scan Buttons

**File**: Create `aggregator-web/src/hooks/useScanState.ts`

```typescript
import { useState, useCallback } from 'react';
import { api } from '@/lib/api';
import { toastWithLogging } from '@/lib/toast-with-logging';

interface ScanState {
  isScanning: boolean;
  commandId?: string;
  error?: string;
}

/**
 * Hook for managing scan button state and preventing duplicate scans
 */
export function
useScanState(agentId: string, subsystem: string) {
  const [state, setState] = useState<ScanState>({
    isScanning: false,
  });

  const triggerScan = useCallback(async () => {
    if (state.isScanning) {
      toastWithLogging.info('Scan already in progress', { subsystem });
      return;
    }

    setState({ isScanning: true, commandId: undefined, error: undefined });

    try {
      const result = await api.post(`/agents/${agentId}/subsystems/${subsystem}/trigger`);

      setState(prev => ({
        ...prev,
        commandId: result.data.command_id,
      }));

      // Poll for completion or wait for subscription update
      await waitForScanComplete(agentId, result.data.command_id);

      setState({ isScanning: false, commandId: result.data.command_id });
      toastWithLogging.success(`${subsystem} scan completed`, { subsystem });
    } catch (error: any) {
      const isAlreadyRunning = error.response?.status === 409;

      if (isAlreadyRunning) {
        const existingCommandId = error.response?.data?.command_id;
        setState({
          isScanning: false,
          commandId: existingCommandId,
          error: 'Scan already in progress',
        });
        toastWithLogging.info(`Scan already running (command: ${existingCommandId})`, { subsystem });
      } else {
        const errorMessage = error.response?.data?.error || error.message;
        setState({
          isScanning: false,
          error: errorMessage,
        });
        toastWithLogging.error(`Failed to trigger scan: ${errorMessage}`, { subsystem });
      }
    }
  }, [agentId, subsystem, state.isScanning]);

  const reset = useCallback(() => {
    setState({ isScanning: false, commandId: undefined, error: undefined });
  }, []);

  return {
    isScanning: state.isScanning,
    commandId: state.commandId,
    error: state.error,
    triggerScan,
    reset,
  };
}

/**
 * Wait for scan to complete by polling command status
 */
async function waitForScanComplete(agentId: string, commandId: string): Promise<void> {
  const maxWaitMs = 300000; // 5 minutes max
  const startTime = Date.now();
  const pollInterval = 2000; // Poll every 2 seconds

  return new Promise((resolve, reject) => {
    const interval = setInterval(async () => {
      try {
        const result = await
api.get(`/agents/${agentId}/commands/${commandId}`);

        if (result.data.status === 'completed' || result.data.status === 'failed') {
          clearInterval(interval);
          resolve();
        }
      } catch (error) {
        clearInterval(interval);
        reject(error);
      }

      if (Date.now() - startTime > maxWaitMs) {
        clearInterval(interval);
        reject(new Error('Scan timeout'));
      }
    }, pollInterval);
  });
}
```

**Usage Example in Component** (the button body was lost in formatting; reconstructed here from the tests in 5.3, which expect a disabled button showing "Scanning..."):

```typescript
import { useScanState } from '@/hooks/useScanState';

function ScanButton({ agentId, subsystem }: { agentId: string; subsystem: string }) {
  const { isScanning, triggerScan } = useScanState(agentId, subsystem);

  return (
    <button onClick={triggerScan} disabled={isScanning}>
      {isScanning ? 'Scanning...' : `Scan ${subsystem}`}
    </button>
  );
}
```

#### 5.2 Update Existing Error Calls

**Priority Files to Update**:

1. **Agent Subsystem Actions** - `/src/components/AgentSubsystems.tsx`
2. **Command Retry Logic** - `/src/hooks/useCommands.ts`
3. **Authentication Errors** - `/src/lib/auth.ts`
4. **API Error Boundaries** - `/src/components/ErrorBoundary.tsx`

### Example Complete Integration

**File**: `aggregator-web/src/components/AgentSubsystems.tsx` (example update)

```typescript
import { toastWithLogging } from '@/lib/toast-with-logging';

const handleTrigger = async (subsystem: string) => {
  try {
    await triggerSubsystem(agentId, subsystem);
  } catch (error) {
    toastWithLogging.error(
      `Failed to trigger ${subsystem} scan: ${error.message}`,
      {
        subsystem,
        id: `trigger-${subsystem}`,
      }
    );
  }
};
```

### Time Required: 15 minutes

#### 5.3 Deduplication Testing

**Test Cases**:

```typescript
// Test 1: Rapid clicking prevention
test('clicking scan button 10 times creates only 1 command', async () => {
  const button = screen.getByText('Scan APT');

  // Click 10 times rapidly
  for (let i = 0; i < 10; i++) {
    fireEvent.click(button);
  }

  // Should only create 1 command
  expect(api.post).toHaveBeenCalledTimes(1);
  expect(api.post).toHaveBeenCalledWith('/agents/123/subsystems/apt/trigger');
});

// Test 2: Button disabled while scanning
test('button disabled during scan', async () => {
  const button =
screen.getByText('Scan APT');

  fireEvent.click(button);

  // Button should be disabled immediately
  expect(button).toBeDisabled();
  expect(screen.getByText('Scanning...')).toBeInTheDocument();

  await waitFor(() => {
    expect(button).not.toBeDisabled();
  });
});

// Test 3: 409 Conflict returns existing command
mock.onPost().reply(409, {
  error: 'Scan already in progress',
  command_id: 'existing-id',
});

expect(await triggerScan()).toEqual({ command_id: 'existing-id' });
expect(toast).toHaveBeenCalledWith('Scan already running');
```

---

## Phase 6: Verification & Testing

### Manual Testing Checklist

#### 6.1 Migration Testing

- [ ] Run migration 023 successfully
- [ ] Verify `client_errors` table exists
- [ ] Verify `idempotency_key` column added to `agent_commands`
- [ ] Test on fresh database (no duplicate key errors)

#### 6.2 Command Factory Testing

- [ ] Rapid-fire scan button clicks (10+ clicks in 2 seconds)
- [ ] Verify all commands created with unique IDs
- [ ] Check no duplicate key violations in logs
- [ ] Verify commands appear in database correctly

#### 6.3 Error Logging Testing

- [ ] Trigger UI error (e.g., invalid input)
- [ ] Verify error appears in toast
- [ ] Check database - error should be stored in `client_errors`
- [ ] Trigger API error (e.g., network timeout)
- [ ] Verify exponential backoff retry works
- [ ] Disconnect network, trigger error, reconnect
- [ ] Verify error is queued and sent when back online

#### 6.4 Integration Testing

- [ ] Full user workflow: login → trigger scan → view results
- [ ] Verify all errors logged with [HISTORY] prefix
- [ ] Check logs are queryable by subsystem
- [ ] Verify error logging doesn't block UI

### Automated Test Cases

#### 6.5 Backend Tests

**File**: `aggregator-server/internal/command/factory_test.go`

```go
package command

import (
	"testing"

	"github.com/google/uuid"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"

	"aggregator-server/internal/models" // adjust to the project's module path
)

func TestFactory_Create(t *testing.T) {
	factory := NewFactory()
	agentID := uuid.New()

	cmd, err := factory.Create(agentID, "scan_storage", map[string]interface{}{"path": "/"})

	require.NoError(t, err)
	assert.NotEqual(t, uuid.Nil, cmd.ID, "ID must be generated")
	assert.Equal(t, agentID, cmd.AgentID)
	assert.Equal(t, "scan_storage", cmd.CommandType)
	assert.Equal(t, "pending", cmd.Status)
	assert.Equal(t, "manual", cmd.Source)
}

func TestFactory_CreateWithIdempotency(t *testing.T) {
	factory := NewFactory()
	agentID := uuid.New()
	idempotencyKey := "test-key-123"

	// Create first command
	cmd1, err := factory.CreateWithIdempotency(agentID, "scan_system", nil, idempotencyKey)
	require.NoError(t, err)

	// Create "duplicate" command with same idempotency key
	cmd2, err := factory.CreateWithIdempotency(agentID, "scan_system", nil, idempotencyKey)
	require.NoError(t, err)

	// Should return same command
	assert.Equal(t, cmd1.ID, cmd2.ID, "Idempotency key should return same command")
}

func TestFactory_Validate(t *testing.T) {
	tests := []struct {
		name    string
		cmd     *models.AgentCommand
		wantErr bool
	}{
		{
			name: "valid command",
			cmd: &models.AgentCommand{
				ID:          uuid.New(),
				AgentID:     uuid.New(),
				CommandType: "scan_storage",
				Status:      "pending",
				Source:      "manual",
			},
			wantErr: false,
		},
		{
			name: "missing ID",
			cmd: &models.AgentCommand{
				ID:          uuid.Nil,
				AgentID:     uuid.New(),
				CommandType: "scan_storage",
				Status:      "pending",
				Source:      "manual",
			},
			wantErr: true,
		},
	}

	factory := NewFactory()
	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			_, err := factory.Create(tt.cmd.AgentID, tt.cmd.CommandType, nil)
			if tt.wantErr {
				assert.Error(t, err)
			} else {
				assert.NoError(t, err)
			}
		})
	}
}
```

#### 6.6 Frontend Tests

**File**: `aggregator-web/src/lib/client-error-logger.test.ts`

```typescript
import { clientErrorLogger } from './client-error-logger';
import { api } from './api';

jest.mock('./api');

describe('ClientErrorLogger', () => {
  beforeEach(() => {
    localStorage.clear();
    jest.clearAllMocks();
  });

  test('logs error successfully on first attempt', async () => {
    (api.post as
jest.Mock).mockResolvedValue({});

    await clientErrorLogger.logError({
      subsystem: 'storage',
      error_type: 'api_error',
      message: 'Test error',
    });

    expect(api.post).toHaveBeenCalledTimes(1);
    expect(api.post).toHaveBeenCalledWith(
      '/logs/client-error',
      expect.objectContaining({
        subsystem: 'storage',
        error_type: 'api_error',
        message: 'Test error',
      }),
      expect.any(Object)
    );
  });

  test('retries on failure then saves to localStorage', async () => {
    (api.post as jest.Mock)
      .mockRejectedValueOnce(new Error('Network error'))
      .mockRejectedValueOnce(new Error('Network error'))
      .mockRejectedValueOnce(new Error('Network error'));

    await clientErrorLogger.logError({
      subsystem: 'storage',
      error_type: 'api_error',
      message: 'Test error',
    });

    expect(api.post).toHaveBeenCalledTimes(3);

    // Should be saved to localStorage
    const queue = localStorage.getItem('redflag-error-queue');
    expect(queue).toBeTruthy();
    expect(JSON.parse(queue!).length).toBe(1);
  });

  test('flushes queue when coming back online', async () => {
    // Pre-populate queue
    const queuedError = {
      subsystem: 'storage',
      error_type: 'api_error',
      message: 'Queued error',
      timestamp: new Date().toISOString(),
    };
    localStorage.setItem('redflag-error-queue', JSON.stringify([queuedError]));

    (api.post as jest.Mock).mockResolvedValue({});

    // Trigger online event
    window.dispatchEvent(new Event('online'));

    // Wait for flush
    await new Promise(resolve => setTimeout(resolve, 100));

    expect(api.post).toHaveBeenCalled();
    expect(localStorage.getItem('redflag-error-queue')).toBe('[]');
  });
});
```

### Time Required: 30 minutes

---

## Implementation Checklist

### Pre-Implementation

- [ ] ✅ Migration system bug fixed (lines 103 & 116 in db.go)
- [ ] ✅ Database wiped and fresh instance ready
- [ ] ✅ Test agents available for rapid scan testing
- [ ] ✅ Development environment ready (all 3 components)

### Phase 1: Command Factory (25 min)

- [ ] Create `aggregator-server/internal/command/factory.go`
- [ ] Create `aggregator-server/internal/command/validator.go`
- [ ] Update `aggregator-server/internal/models/command.go`
- [ ] Update `aggregator-server/internal/api/handlers/subsystems.go`
- [ ] Test: Verify rapid scan clicks work

### Phase 2: Database Schema (5 min)

- [ ] Create migration `023_client_error_logging.up.sql`
- [ ] Create migration `023_client_error_logging.down.sql`
- [ ] Run migration and verify table creation
- [ ] Verify indexes created

### Phase 3: Backend Handler (20 min)

- [ ] Create `aggregator-server/internal/api/handlers/client_errors.go`
- [ ] Create `aggregator-server/internal/database/queries/client_errors.sql`
- [ ] Update `aggregator-server/internal/api/router.go`
- [ ] Test API endpoint with curl

### Phase 4: Frontend Logger (20 min)

- [ ] Create `aggregator-web/src/lib/client-error-logger.ts`
- [ ] Create `aggregator-web/src/lib/toast-with-logging.ts`
- [ ] Update `aggregator-web/src/lib/api.ts`
- [ ] Test offline/online queue behavior

### Phase 5: Toast Integration (15 min)

- [ ] Create `useScanState` hook for button state management
- [ ] Update scan buttons to use `useScanState`
- [ ] Test button disabling during scan
- [ ] Update 3-5 critical error locations to use `toastWithLogging`
- [ ] Verify errors appear in both toast and database
- [ ] Test in multiple subsystems
- [ ] **Test deduplication**: Rapid clicking creates only 1 command
- [ ] **Test 409 response**: Returns existing command when scan running

### Phase 6: Verification (30 min)

- [ ] Run all test cases
- [ ] Verify ETHOS compliance checklist
- [ ] Test rapid scan clicking (no duplicates)
- [ ] Test error persistence across page reloads
- [ ] Verify [HISTORY] logs in server output

### Documentation

- [ ] Update session documentation
- [ ] Create testing summary
- [ ] Document any issues encountered
- [ ] Update architecture documentation

---

## Risk Mitigation

### Risk 1: Migration Failures

**Probability**: Medium | **Impact**: High | **Severity**: 🔴 Critical

**Mitigation**:
- Fix migration runner bug FIRST (before this
implementation)
- Test migration on fresh database
- Keep database backups
- Have rollback script ready

**Contingency**: If migration fails, manually apply SQL and continue

---

### Risk 2: Performance Impact

**Probability**: Low | **Impact**: Medium | **Severity**: 🟡 Medium

**Mitigation**:
- Async error logging (non-blocking)
- LocalStorage queue with size limit (max 50 errors)
- Database indexes for fast queries
- Batch insert if needed in future

**Contingency**: If performance degrades, add sampling (log 1 in 10 errors)

---

### Risk 3: Infinite Error Loops

**Probability**: Low | **Impact**: High | **Severity**: 🟡 Medium

**Mitigation**:
- `X-Error-Logger-Request` header prevents recursive logging
- Max retry count (3 attempts)
- Exponential backoff prevents thundering herd

**Contingency**: If loop detected, check for missing header and fix

---

### Risk 4: Privacy Concerns

**Probability**: Low | **Impact**: High | **Severity**: 🟡 Medium

**Mitigation**:
- No PII in error messages (validate during logging)
- User agent stored but can be anonymized
- Stack traces only from our code (not user code)

**Contingency**: Add privacy filter to scrub sensitive data

---

## Post-Implementation Review

### Success Criteria

- [ ] No duplicate key violations during rapid clicking
- [ ] All errors persist in database
- [ ] Error logs queryable and useful for debugging
- [ ] No performance degradation observed
- [ ] System handles offline/online transitions gracefully
- [ ] All tests pass

### Performance Benchmarks

- Command creation: < 10ms per command
- Error logging: < 50ms per error (async)
- Database queries: < 100ms for common queries
- Bundle size increase: < 5KB gzipped

### Known Limitations

- Error logs don't include full request payloads (privacy)
- localStorage queue limited by browser storage (~5MB)
- Retries happen in foreground (could be moved to background)

### Future Enhancements (Post v0.1.27)

- Error aggregation and deduplication
- Error rate alerting
- Error
analytics dashboard
- Automatic error categorization
- Integration with notification system

---

## Rollback Plan

If critical issues arise:

1. **Revert Code Changes**:
   ```bash
   git revert HEAD~6..HEAD  # Revert last 6 commits
   ```

2. **Rollback Database**:
   ```bash
   cd aggregator-server
   # Run down migration
   go run cmd/migrate/main.go -migrate-down 1
   ```

3. **Rebuild and Deploy**:
   ```bash
   docker-compose build --no-cache
   docker-compose up -d
   ```

---

## Additional Notes

**Team Coordination**:
- Coordinate with frontend team if they're working on error handling
- Notify QA about new error logging features for testing
- Update documentation team about database schema changes

**Monitoring**:
- Monitor `client_errors` table growth
- Set up alerts for error rate spikes
- Track failed error logging attempts

**Documentation Updates**:
- Update API documentation for `/logs/client-error` endpoint
- Document error log query patterns for support team
- Add troubleshooting guide for common errors

---

**Plan Created By**: Ani (AI Assistant)
**Reviewed By**: Casey Tunturi
**Status**: 🟢 APPROVED FOR IMPLEMENTATION
**Next Step**: Begin Phase 1 (Command Factory)

**Estimated Timeline**:
- Start: Immediately
- Complete: ~2-3 hours
- Test: 30 minutes
- Deploy: After verification

This is a complete, production-ready implementation plan. Each phase builds on the previous one, with full error handling, testing, and rollback procedures included. Let's build this right. 💪
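
---

## Appendix: Retry/Backoff Policy Sketch

Risk 3's mitigation relies on a max retry count of 3 with exponential backoff. As a reference for implementers, here is a minimal, self-contained sketch of that policy. The names (`backoffDelay`, `sendWithRetry`) and the 500 ms base delay are illustrative assumptions, not the actual `client-error-logger.ts` implementation.

```typescript
// Hypothetical sketch of bounded retry with exponential backoff.
// Function names and the base delay are assumptions, not the real logger code.
function backoffDelay(attempt: number, baseMs = 500): number {
  // attempt 0 -> 500 ms, attempt 1 -> 1000 ms, attempt 2 -> 2000 ms
  return baseMs * 2 ** attempt;
}

type Sleep = (ms: number) => Promise<void>;
const realSleep: Sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

/**
 * Try `send` up to `maxAttempts` times, sleeping between failures.
 * Returns true on delivery; false means the caller should queue the
 * error (e.g., to localStorage) for a later flush.
 */
async function sendWithRetry(
  send: () => Promise<void>,
  maxAttempts = 3,
  sleep: Sleep = realSleep,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      await send();
      return true; // delivered
    } catch {
      // Back off only between attempts, not after the final failure
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelay(attempt));
      }
    }
  }
  return false; // exhausted - queue for later
}
```

Injecting `sleep` keeps the policy testable without real timers: a test can pass a no-op sleep and a `send` that fails a fixed number of times before succeeding.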