RedFlag Clean Architecture Implementation Master Plan
Date: 2025-12-19
Version: v1.0
Total Implementation Time: 3-4 hours (including migration fixes and command deduplication)
Status: READY FOR EXECUTION
Executive Summary
Complete implementation plan for fixing critical ETHOS violations and implementing clean architecture patterns across RedFlag v0.1.27. Addresses duplicate command generation, lost frontend errors, and migration system bugs.
Three Core Objectives:
- ✅ Fix migration system (blocks everything else)
- ✅ Implement command factory pattern (prevents duplicate key violations)
- ✅ Build frontend error logging system (ETHOS #1 compliance)
Table of Contents
- Pre-Implementation: Migration System Fix
- Phase 1: Command Factory Pattern
- Phase 2: Database Schema
- Phase 3: Backend Error Handler
- Phase 4: Frontend Error Logger
- Phase 5: Toast Integration
- Phase 6: Verification & Testing
- Implementation Checklist
- Risk Mitigation
- Post-Implementation Review
Pre-Implementation: Migration System Fix
⚠️ CRITICAL: Must be completed first - blocks all other work
Problem
Migration runner has duplicate INSERT logic causing "duplicate key value violates unique constraint" errors on fresh installations.
Root Cause
File: aggregator-server/internal/database/db.go
- Line 103 executes INSERT INTO schema_migrations (version) VALUES ($1)
- Line 116 executes the exact same INSERT statement
- Result: every migration filename gets inserted twice
Solution
// File: aggregator-server/internal/database/db.go
// Lines 95-120: Fix duplicate INSERT logic
func (db *DB) Migrate() error {
// ... existing code ...
for _, file := range files {
filename := file.Name()
// ❌ REMOVE THIS - Line 103 duplicates line 116
// if _, err = tx.Exec("INSERT INTO schema_migrations (version) VALUES ($1)", filename); err != nil {
// return fmt.Errorf("failed to mark migration %s as applied: %w", filename, err)
// }
// Keep only the EXECUTE + INSERT combo at lines 110-116
if _, err = tx.Exec(string(content)); err != nil {
log.Printf("Migration %s failed, marking as applied: %v", filename, err)
}
// ✅ Keep this INSERT - it's the correct location
if _, err = tx.Exec("INSERT INTO schema_migrations (version) VALUES ($1)", filename); err != nil {
return fmt.Errorf("failed to mark migration %s as applied: %w", filename, err)
}
}
// ... rest of function ...
}
Validation Steps
- Wipe the database completely: docker-compose down -v
- Start fresh: docker-compose up -d
- Check migration logs: all migrations should apply without duplicate key errors
- Verify no duplicates: SELECT COUNT(DISTINCT version) = COUNT(version) FROM schema_migrations;
Time Required: 5 minutes
Blocker Status: 🔴 CRITICAL - Do not proceed until fixed
Phase 1: Command Factory Pattern
Objective
Prevent duplicate command key violations by ensuring all commands have properly generated UUIDs at creation time.
Files to Create
1.1 Command Factory
File: aggregator-server/internal/command/factory.go
package command
import (
	"database/sql"
	"encoding/json"
	"fmt"
	"time"
	"github.com/google/uuid"
	"github.com/jmoiron/sqlx/types"
	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)
// Factory creates validated AgentCommand instances
type Factory struct {
	validator *Validator
}
// NewFactory creates a new command factory
func NewFactory() *Factory {
	return &Factory{
		validator: NewValidator(),
	}
}
// Create generates a new validated AgentCommand with unique ID
func (f *Factory) Create(agentID uuid.UUID, commandType string, params map[string]interface{}) (*models.AgentCommand, error) {
	// Serialize params up front; lib/pq cannot encode Go maps into JSONB directly
	paramsJSON, err := json.Marshal(params)
	if err != nil {
		return nil, fmt.Errorf("failed to marshal params: %w", err)
	}
	cmd := &models.AgentCommand{
		ID:          uuid.New(), // Immediate, explicit generation
		AgentID:     agentID,
		CommandType: commandType,
		Status:      "pending",
		Source:      determineSource(commandType),
		Params:      types.JSONText(paramsJSON),
		CreatedAt:   time.Now(),
		UpdatedAt:   time.Now(),
	}
	if err := f.validator.Validate(cmd); err != nil {
		return nil, fmt.Errorf("command validation failed: %w", err)
	}
	return cmd, nil
}
// CreateWithIdempotency generates a command with idempotency protection
func (f *Factory) CreateWithIdempotency(agentID uuid.UUID, commandType string,
params map[string]interface{}, idempotencyKey string) (*models.AgentCommand, error) {
	// Check for an existing command with the same idempotency key.
	// findByIdempotencyKey and storeIdempotencyKey are assumed helpers backed by
	// the database query layer; wire the Factory to that layer before use.
	existing, err := f.findByIdempotencyKey(agentID, idempotencyKey)
if err != nil && err != sql.ErrNoRows {
return nil, fmt.Errorf("failed to check idempotency: %w", err)
}
if existing != nil {
return existing, nil // Return existing command instead of creating duplicate
}
cmd, err := f.Create(agentID, commandType, params)
if err != nil {
return nil, err
}
// Store idempotency key with command
if err := f.storeIdempotencyKey(cmd.ID, agentID, idempotencyKey); err != nil {
return nil, fmt.Errorf("failed to store idempotency key: %w", err)
}
return cmd, nil
}
// determineSource classifies command source based on type
func determineSource(commandType string) string {
if isSystemCommand(commandType) {
return "system"
}
return "manual"
}
func isSystemCommand(commandType string) bool {
systemCommands := []string{
"enable_heartbeat",
"disable_heartbeat",
"update_check",
"cleanup_old_logs",
}
for _, cmd := range systemCommands {
if commandType == cmd {
return true
}
}
return false
}
1.2 Command Validator
File: aggregator-server/internal/command/validator.go
package command
import (
"errors"
"fmt"
"github.com/google/uuid"
"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)
// Validator validates command parameters
type Validator struct {
minCheckInSeconds int
maxCheckInSeconds int
minScannerMinutes int
maxScannerMinutes int
}
// NewValidator creates a new command validator
func NewValidator() *Validator {
return &Validator{
minCheckInSeconds: 60, // 1 minute minimum
maxCheckInSeconds: 3600, // 1 hour maximum
minScannerMinutes: 1, // 1 minute minimum
maxScannerMinutes: 1440, // 24 hours maximum
}
}
// Validate performs comprehensive command validation
func (v *Validator) Validate(cmd *models.AgentCommand) error {
if cmd == nil {
return errors.New("command cannot be nil")
}
if cmd.ID == uuid.Nil {
return errors.New("command ID cannot be zero UUID")
}
if cmd.AgentID == uuid.Nil {
return errors.New("agent ID is required")
}
if cmd.CommandType == "" {
return errors.New("command type is required")
}
if cmd.Status == "" {
return errors.New("status is required")
}
validStatuses := []string{"pending", "running", "completed", "failed", "cancelled"}
if !contains(validStatuses, cmd.Status) {
return fmt.Errorf("invalid status: %s", cmd.Status)
}
if cmd.Source != "manual" && cmd.Source != "system" {
return fmt.Errorf("source must be 'manual' or 'system', got: %s", cmd.Source)
}
// Validate command type format
if err := v.validateCommandType(cmd.CommandType); err != nil {
return err
}
return nil
}
// ValidateSubsystemAction validates subsystem-specific actions
func (v *Validator) ValidateSubsystemAction(subsystem string, action string) error {
validActions := map[string][]string{
"storage": {"trigger", "enable", "disable", "set_interval"},
"system": {"trigger", "enable", "disable", "set_interval"},
"docker": {"trigger", "enable", "disable", "set_interval"},
"updates": {"trigger", "enable", "disable", "set_interval"},
}
actions, ok := validActions[subsystem]
if !ok {
return fmt.Errorf("unknown subsystem: %s", subsystem)
}
if !contains(actions, action) {
return fmt.Errorf("invalid action '%s' for subsystem '%s'", action, subsystem)
}
return nil
}
// ValidateInterval ensures scanner intervals are within bounds
func (v *Validator) ValidateInterval(subsystem string, minutes int) error {
if minutes < v.minScannerMinutes {
return fmt.Errorf("interval %d minutes below minimum %d for subsystem %s",
minutes, v.minScannerMinutes, subsystem)
}
if minutes > v.maxScannerMinutes {
return fmt.Errorf("interval %d minutes above maximum %d for subsystem %s",
minutes, v.maxScannerMinutes, subsystem)
}
return nil
}
func (v *Validator) validateCommandType(commandType string) error {
validPrefixes := []string{"scan_", "install_", "update_", "enable_", "disable_", "reboot"}
for _, prefix := range validPrefixes {
if len(commandType) >= len(prefix) && commandType[:len(prefix)] == prefix {
return nil
}
}
return fmt.Errorf("invalid command type format: %s", commandType)
}
func contains(slice []string, item string) bool {
for _, s := range slice {
if s == item {
return true
}
}
return false
}
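The prefix loop in validateCommandType can be written with the standard library's strings.HasPrefix, which avoids the manual length check and slicing. A self-contained sketch of the same check (the helper name is illustrative, not part of the codebase):

```go
package main

import (
	"fmt"
	"strings"
)

// validCommandType reports whether commandType starts with one of the
// recognized prefixes, mirroring validateCommandType above but using
// strings.HasPrefix instead of manual slicing.
func validCommandType(commandType string) bool {
	validPrefixes := []string{"scan_", "install_", "update_", "enable_", "disable_", "reboot"}
	for _, prefix := range validPrefixes {
		if strings.HasPrefix(commandType, prefix) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(validCommandType("scan_storage")) // true
	fmt.Println(validCommandType("rm_rf"))        // false
}
```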
1.3 Update AgentCommand Model
File: aggregator-server/internal/models/command.go
package models
import (
	"database/sql"
	"errors"
	"time"
	"github.com/google/uuid"
	"github.com/jmoiron/sqlx/types"
)
// AgentCommand represents a command sent to an agent
type AgentCommand struct {
	ID          uuid.UUID      `db:"id" json:"id"`
	AgentID     uuid.UUID      `db:"agent_id" json:"agent_id"`
	CommandType string         `db:"command_type" json:"command_type"`
	Status      string         `db:"status" json:"status"`
	Source      string         `db:"source" json:"source"`
	Params      types.JSONText `db:"params" json:"params"` // JSONB payload
	Result      sql.NullString `db:"result" json:"result,omitempty"`
	Error       sql.NullString `db:"error" json:"error,omitempty"`
	RetryCount  int            `db:"retry_count" json:"retry_count"`
	CreatedAt   time.Time      `db:"created_at" json:"created_at"`
	UpdatedAt   time.Time      `db:"updated_at" json:"updated_at"`
	CompletedAt sql.NullTime   `db:"completed_at" json:"completed_at,omitempty"`
	// Idempotency support: string key derived from agent + subsystem + time window
	// (matches the VARCHAR(64) column added in migration 023a)
	IdempotencyKey sql.NullString `db:"idempotency_key" json:"-"`
}
// Validate checks if the command is valid
func (c *AgentCommand) Validate() error {
if c.ID == uuid.Nil {
return ErrCommandIDRequired
}
if c.AgentID == uuid.Nil {
return ErrAgentIDRequired
}
if c.CommandType == "" {
return ErrCommandTypeRequired
}
if c.Status == "" {
return ErrStatusRequired
}
if c.Source != "manual" && c.Source != "system" {
return ErrInvalidSource
}
return nil
}
// IsTerminal returns true if the command is in a terminal state
func (c *AgentCommand) IsTerminal() bool {
return c.Status == "completed" || c.Status == "failed" || c.Status == "cancelled"
}
// CanRetry returns true if the command can be retried
func (c *AgentCommand) CanRetry() bool {
return c.Status == "failed" && c.RetryCount < 3
}
// Predefined errors for validation
var (
ErrCommandIDRequired = errors.New("command ID cannot be zero UUID")
ErrAgentIDRequired = errors.New("agent ID is required")
ErrCommandTypeRequired = errors.New("command type is required")
ErrStatusRequired = errors.New("status is required")
ErrInvalidSource = errors.New("source must be 'manual' or 'system'")
)
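IsTerminal and CanRetry encode a small lifecycle: a command moves from pending through running to one of three terminal states, and only failed commands may be retried, at most three times. A stdlib-only sketch of the same rules (the type and function names are illustrative):

```go
package main

import "fmt"

// command mirrors the status and retry fields of AgentCommand that
// drive the lifecycle checks.
type command struct {
	Status     string
	RetryCount int
}

// isTerminal reports whether the command has reached a final state.
func isTerminal(c command) bool {
	return c.Status == "completed" || c.Status == "failed" || c.Status == "cancelled"
}

// canRetry allows further attempts only for failed commands, capped at 3 retries.
func canRetry(c command) bool {
	return c.Status == "failed" && c.RetryCount < 3
}

func main() {
	fmt.Println(isTerminal(command{Status: "running"}))             // false
	fmt.Println(canRetry(command{Status: "failed", RetryCount: 2})) // true
	fmt.Println(canRetry(command{Status: "failed", RetryCount: 3})) // false
}
```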
1.4 Update Subsystem Handler
File: aggregator-server/internal/api/handlers/subsystems.go
package handlers
import (
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"time"
	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
	"github.com/jmoiron/sqlx"
	"github.com/Fimeg/RedFlag/aggregator-server/internal/command"
	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)
type SubsystemHandler struct {
db *sqlx.DB
commandFactory *command.Factory
}
func NewSubsystemHandler(db *sqlx.DB) *SubsystemHandler {
return &SubsystemHandler{
db: db,
commandFactory: command.NewFactory(),
}
}
// TriggerSubsystem creates and enqueues a subsystem command
func (h *SubsystemHandler) TriggerSubsystem(c *gin.Context) {
agentID, err := uuid.Parse(c.Param("id"))
if err != nil {
log.Printf("[ERROR] [server] [subsystem] invalid_agent_id error=%v", err)
c.JSON(http.StatusBadRequest, gin.H{"error": "invalid agent ID"})
return
}
subsystem := c.Param("subsystem")
if err := h.validateSubsystem(subsystem); err != nil {
log.Printf("[ERROR] [server] [subsystem] invalid_subsystem subsystem=%s error=%v", subsystem, err)
c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
return
}
// DEDUPLICATION CHECK: Prevent multiple pending scans
existingCmd, err := h.getPendingScanCommand(agentID, subsystem)
if err != nil {
log.Printf("[ERROR] [server] [subsystem] query_failed agent_id=%s subsystem=%s error=%v",
agentID, subsystem, err)
c.JSON(http.StatusInternalServerError, gin.H{"error": "internal error"})
return
}
if existingCmd != nil {
log.Printf("[INFO] [server] [subsystem] scan_already_pending agent_id=%s subsystem=%s command_id=%s",
agentID, subsystem, existingCmd.ID)
log.Printf("[HISTORY] [server] [scan_%s] duplicate_request_prevented agent_id=%s command_id=%s timestamp=%s",
subsystem, agentID, existingCmd.ID, time.Now().Format(time.RFC3339))
c.JSON(http.StatusConflict, gin.H{
"error": "Scan already in progress",
"command_id": existingCmd.ID.String(),
"subsystem": subsystem,
"status": existingCmd.Status,
"created_at": existingCmd.CreatedAt,
})
return
}
// Generate idempotency key from request context
idempotencyKey := h.generateIdempotencyKey(c, agentID, subsystem)
// Create command using factory
cmd, err := h.commandFactory.CreateWithIdempotency(
agentID,
"scan_"+subsystem,
map[string]interface{}{"subsystem": subsystem},
idempotencyKey,
)
if err != nil {
log.Printf("[ERROR] [server] [subsystem] command_creation_failed agent_id=%s subsystem=%s error=%v",
agentID, subsystem, err)
c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to create command"})
return
}
// Store command in database
if err := h.storeCommand(cmd); err != nil {
log.Printf("[ERROR] [server] [subsystem] command_store_failed agent_id=%s command_id=%s error=%v",
agentID, cmd.ID, err)
c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to store command"})
return
}
log.Printf("[INFO] [server] [subsystem] command_created agent_id=%s command_id=%s subsystem=%s",
agentID, cmd.ID, subsystem)
log.Printf("[HISTORY] [server] [scan_%s] command_created agent_id=%s command_id=%s source=manual timestamp=%s",
subsystem, agentID, cmd.ID, time.Now().Format(time.RFC3339))
c.JSON(http.StatusOK, gin.H{
"message": "Command created successfully",
"command_id": cmd.ID.String(),
"subsystem": subsystem,
})
}
// getPendingScanCommand checks for existing pending scan commands
func (h *SubsystemHandler) getPendingScanCommand(agentID uuid.UUID, subsystem string) (*models.AgentCommand, error) {
var cmd models.AgentCommand
query := `
SELECT id, command_type, status, created_at
FROM agent_commands
WHERE agent_id = $1
AND command_type = $2
AND status = 'pending'
LIMIT 1`
commandType := "scan_" + subsystem
err := h.db.Get(&cmd, query, agentID, commandType)
if err != nil {
if err == sql.ErrNoRows {
return nil, nil // No pending command found
}
return nil, fmt.Errorf("query failed: %w", err)
}
return &cmd, nil
}
// validateSubsystem checks if subsystem is recognized
func (h *SubsystemHandler) validateSubsystem(subsystem string) error {
validSubsystems := []string{"apt", "dnf", "windows", "winget", "storage", "system", "docker"}
for _, valid := range validSubsystems {
if subsystem == valid {
return nil
}
}
return fmt.Errorf("unknown subsystem: %s", subsystem)
}
// generateIdempotencyKey creates a key to prevent duplicate submissions
func (h *SubsystemHandler) generateIdempotencyKey(c *gin.Context, agentID uuid.UUID, subsystem string) string {
// Use timestamp rounded to nearest minute for idempotency window
// This allows retries within same minute but prevents duplicates across minutes
timestampWindow := time.Now().Unix() / 60 // Round to minute
return fmt.Sprintf("%s:%s:%d", agentID.String(), subsystem, timestampWindow)
}
// storeCommand persists command to database
func (h *SubsystemHandler) storeCommand(cmd *models.AgentCommand) error {
// Implementation depends on your command storage layer
// Use NamedExec or similar to insert command
query := `
INSERT INTO agent_commands
(id, agent_id, command_type, status, source, params, created_at)
VALUES (:id, :agent_id, :command_type, :status, :source, :params, NOW())`
_, err := h.db.NamedExec(query, cmd)
return err
}
Time Required: 30 minutes
Phase 2: Database Schema
Migration 023a: Command Deduplication
File: aggregator-server/internal/database/migrations/023a_command_deduplication.up.sql
-- Command Deduplication Schema
-- Prevents multiple pending scan commands per subsystem per agent
-- Add unique constraint to enforce single pending command per subsystem
CREATE UNIQUE INDEX idx_agent_pending_subsystem
ON agent_commands(agent_id, command_type, status)
WHERE status = 'pending';
-- Add idempotency key support for retry scenarios
-- (the UNIQUE constraint already creates a supporting index, so no separate CREATE INDEX is needed)
ALTER TABLE agent_commands ADD COLUMN idempotency_key VARCHAR(64) UNIQUE;
COMMENT ON COLUMN agent_commands.idempotency_key IS
'Prevents duplicate command creation from retry logic. Based on (agent_id + subsystem + timestamp window).';
File: aggregator-server/internal/database/migrations/023a_command_deduplication.down.sql
DROP INDEX IF EXISTS idx_agent_pending_subsystem;
ALTER TABLE agent_commands DROP COLUMN IF EXISTS idempotency_key;
Migration 023: Client Error Logging Table
File: aggregator-server/internal/database/migrations/023_client_error_logging.up.sql
-- Client Error Logging Schema
-- Implements ETHOS #1: Errors are History, Not /dev/null
CREATE TABLE client_errors (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
agent_id UUID REFERENCES agents(id) ON DELETE SET NULL,
subsystem VARCHAR(50) NOT NULL,
error_type VARCHAR(50) NOT NULL,
message TEXT NOT NULL,
stack_trace TEXT,
metadata JSONB,
url TEXT NOT NULL,
user_agent TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
-- Indexes for efficient querying
CREATE INDEX idx_client_errors_agent_time ON client_errors(agent_id, created_at DESC);
CREATE INDEX idx_client_errors_subsystem_time ON client_errors(subsystem, created_at DESC);
CREATE INDEX idx_client_errors_error_type_time ON client_errors(error_type, created_at DESC);
CREATE INDEX idx_client_errors_created_at ON client_errors(created_at DESC);
-- Comments for documentation
COMMENT ON TABLE client_errors IS 'Frontend error logs for debugging and auditing. Implements ETHOS #1.';
COMMENT ON COLUMN client_errors.agent_id IS 'Agent active when error occurred (NULL for pre-auth errors)';
COMMENT ON COLUMN client_errors.subsystem IS 'RedFlag subsystem being used (storage, system, docker, etc.)';
COMMENT ON COLUMN client_errors.error_type IS 'Error category: javascript_error, api_error, ui_error, validation_error';
COMMENT ON COLUMN client_errors.metadata IS 'Additional context (component, API response, user actions)';
-- NOTE: idempotency_key is added by migration 023a (as VARCHAR(64)); do not re-add it here,
-- or this migration will fail with "column already exists".
File: aggregator-server/internal/database/migrations/023_client_error_logging.down.sql
DROP TABLE IF EXISTS client_errors;
-- idempotency_key is dropped by migration 023a's down migration
Time Required: 5 minutes
Phase 3: Backend Error Handler
Files to Create
3.1 Error Handler
File: aggregator-server/internal/api/handlers/client_errors.go
package handlers
import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
	"github.com/jmoiron/sqlx"
)
// ClientErrorHandler handles frontend error logging per ETHOS #1
type ClientErrorHandler struct {
db *sqlx.DB
}
// NewClientErrorHandler creates a new error handler
func NewClientErrorHandler(db *sqlx.DB) *ClientErrorHandler {
return &ClientErrorHandler{db: db}
}
// LogErrorRequest represents a client error log entry
type LogErrorRequest struct {
Subsystem string `json:"subsystem" binding:"required"`
ErrorType string `json:"error_type" binding:"required,oneof=javascript_error api_error ui_error validation_error"`
Message string `json:"message" binding:"required,max=10000"`
StackTrace string `json:"stack_trace,omitempty"`
Metadata map[string]interface{} `json:"metadata,omitempty"`
URL string `json:"url" binding:"required"`
}
// LogError processes and stores frontend errors
func (h *ClientErrorHandler) LogError(c *gin.Context) {
var req LogErrorRequest
if err := c.ShouldBindJSON(&req); err != nil {
log.Printf("[ERROR] [server] [client_error] validation_failed error=\"%v\"", err)
c.JSON(http.StatusBadRequest, gin.H{"error": "invalid request data"})
return
}
// Extract agent ID from auth middleware if available
var agentID interface{}
if agentIDValue, exists := c.Get("agentID"); exists {
if id, ok := agentIDValue.(uuid.UUID); ok {
agentID = id
}
}
// Log to console with HISTORY prefix
log.Printf("[ERROR] [server] [client] [%s] agent_id=%v subsystem=%s message=\"%s\"",
req.ErrorType, agentID, req.Subsystem, truncate(req.Message, 200))
log.Printf("[HISTORY] [server] [client_error] agent_id=%v subsystem=%s type=%s url=\"%s\" message=\"%s\" timestamp=%s",
agentID, req.Subsystem, req.ErrorType, req.URL, req.Message, time.Now().Format(time.RFC3339))
	// Store in database with retry logic (the gin context is not available
	// inside storeError, so the User-Agent header is passed in explicitly)
	if err := h.storeError(agentID, req, c.GetHeader("User-Agent")); err != nil {
		log.Printf("[ERROR] [server] [client_error] store_failed error=\"%v\"", err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to store error"})
		return
	}
	c.JSON(http.StatusOK, gin.H{"logged": true})
}
// storeError persists error to database with retry
func (h *ClientErrorHandler) storeError(agentID interface{}, req LogErrorRequest, userAgent string) error {
	const maxRetries = 3
	var lastErr error
	// Serialize metadata for the JSONB column; lib/pq cannot encode Go maps directly
	metadataJSON, err := json.Marshal(req.Metadata)
	if err != nil {
		return fmt.Errorf("failed to marshal metadata: %w", err)
	}
	for attempt := 1; attempt <= maxRetries; attempt++ {
		query := `INSERT INTO client_errors (agent_id, subsystem, error_type, message, stack_trace, metadata, url, user_agent)
			VALUES (:agent_id, :subsystem, :error_type, :message, :stack_trace, :metadata, :url, :user_agent)`
		_, err := h.db.NamedExec(query, map[string]interface{}{
			"agent_id":    agentID,
			"subsystem":   req.Subsystem,
			"error_type":  req.ErrorType,
			"message":     req.Message,
			"stack_trace": req.StackTrace,
			"metadata":    string(metadataJSON),
			"url":         req.URL,
			"user_agent":  userAgent,
		})
		if err == nil {
			return nil
		}
		lastErr = err
		if attempt < maxRetries {
			time.Sleep(time.Duration(attempt) * time.Second)
		}
	}
	return fmt.Errorf("failed after %d attempts: %w", maxRetries, lastErr)
}
func truncate(s string, maxLen int) string {
if len(s) <= maxLen {
return s
}
return s[:maxLen] + "..."
}
func hash(s string) string {
// Simple hash for message deduplication detection
h := sha256.Sum256([]byte(s))
return fmt.Sprintf("%x", h)[:16]
}
3.2 Query Client Errors
File: aggregator-server/internal/database/queries/client_errors.sql
-- name: GetClientErrorsByAgent :many
SELECT * FROM client_errors
WHERE agent_id = $1
ORDER BY created_at DESC
LIMIT $2;
-- name: GetClientErrorsBySubsystem :many
SELECT * FROM client_errors
WHERE subsystem = $1
ORDER BY created_at DESC
LIMIT $2;
-- name: GetClientErrorStats :many
SELECT
subsystem,
error_type,
COUNT(*) as count,
MIN(created_at) as first_occurrence,
MAX(created_at) as last_occurrence
FROM client_errors
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY subsystem, error_type
ORDER BY count DESC;
3.3 Update Router
File: aggregator-server/internal/api/router.go
// Add to router setup function
func SetupRouter(db *sqlx.DB, cfg *config.Config) *gin.Engine {
// ... existing setup ...
// Error logging endpoint (authenticated)
errorHandler := handlers.NewClientErrorHandler(db)
apiV1.POST("/logs/client-error",
middleware.AuthMiddleware(),
errorHandler.LogError,
)
// Admin endpoint for viewing errors
// (GetErrors is a small read handler to be implemented alongside LogError; not shown above)
apiV1.GET("/logs/client-errors",
middleware.AuthMiddleware(),
middleware.AdminMiddleware(),
errorHandler.GetErrors,
)
// ... rest of setup ...
}
Time Required: 20 minutes
Phase 4: Frontend Error Logger
Files to Create
4.1 Client Error Logger
File: aggregator-web/src/lib/client-error-logger.ts
import { api, ApiError } from './api';
export interface ClientErrorLog {
subsystem: string;
error_type: 'javascript_error' | 'api_error' | 'ui_error' | 'validation_error';
message: string;
stack_trace?: string;
metadata?: Record<string, any>;
url: string;
timestamp: string;
}
/**
* ClientErrorLogger provides reliable frontend error logging with retry logic
* Implements ETHOS #3: Assume Failure; Build for Resilience
*/
export class ClientErrorLogger {
private maxRetries = 3;
private baseDelayMs = 1000;
private localStorageKey = 'redflag-error-queue';
private offlineBuffer: ClientErrorLog[] = [];
private isOnline = navigator.onLine;
constructor() {
// Listen for online/offline events. The online handler must reset the flag,
// otherwise flushOfflineBuffer's isOnline guard blocks every flush.
window.addEventListener('online', () => {
this.isOnline = true;
this.flushOfflineBuffer();
});
window.addEventListener('offline', () => { this.isOnline = false; });
}
/**
* Log an error with automatic retry and offline queuing
*/
async logError(errorData: Omit<ClientErrorLog, 'url' | 'timestamp'>): Promise<void> {
const fullError: ClientErrorLog = {
...errorData,
url: window.location.href,
timestamp: new Date().toISOString(),
};
// Try to send immediately
try {
await this.sendWithRetry(fullError);
return;
} catch (error) {
// If failed after retries, queue for later
this.queueForRetry(fullError);
}
}
/**
* Send error to backend with exponential backoff retry
*/
private async sendWithRetry(error: ClientErrorLog): Promise<void> {
for (let attempt = 1; attempt <= this.maxRetries; attempt++) {
try {
await api.post('/logs/client-error', error, {
headers: { 'X-Error-Logger-Request': 'true' },
});
// Success, remove from queue if it was there
this.removeFromQueue(error);
return;
} catch (err) { // named `err` to avoid shadowing the `error` parameter
if (attempt === this.maxRetries) {
throw err; // Rethrow after final attempt
}
// Exponential backoff
await this.sleep(this.baseDelayMs * Math.pow(2, attempt - 1));
}
}
}
/**
* Queue error for retry when network is available
*/
private queueForRetry(error: ClientErrorLog): void {
try {
const queue = this.getQueue();
queue.push({
...error,
retryCount: (error as any).retryCount || 0,
queuedAt: new Date().toISOString(),
});
// Save to localStorage for persistence
localStorage.setItem(this.localStorageKey, JSON.stringify(queue));
// Also keep in memory buffer
this.offlineBuffer.push(error);
} catch (storageError) {
// localStorage might be full or unavailable
console.warn('Failed to queue error for retry:', storageError);
}
}
/**
* Flush offline buffer when coming back online
*/
private async flushOfflineBuffer(): Promise<void> {
if (!this.isOnline) return;
const queue = this.getQueue();
if (queue.length === 0) return;
const failed: typeof queue = [];
for (const queuedError of queue) {
try {
await this.sendWithRetry(queuedError);
} catch (error) {
failed.push(queuedError);
}
}
// Update queue with remaining failed items
if (failed.length < queue.length) {
localStorage.setItem(this.localStorageKey, JSON.stringify(failed));
}
}
/**
* Get current error queue from localStorage
*/
private getQueue(): any[] {
try {
const stored = localStorage.getItem(this.localStorageKey);
return stored ? JSON.parse(stored) : [];
} catch {
return [];
}
}
/**
* Remove successfully sent error from queue
*/
private removeFromQueue(sentError: ClientErrorLog): void {
try {
const queue = this.getQueue();
const filtered = queue.filter(queued =>
queued.timestamp !== sentError.timestamp ||
queued.message !== sentError.message
);
if (filtered.length < queue.length) {
localStorage.setItem(this.localStorageKey, JSON.stringify(filtered));
}
} catch {
// Best effort cleanup
}
}
/**
* Capture unhandled JavaScript errors
*/
captureUnhandledErrors(): void {
// Global error handler
window.addEventListener('error', (event) => {
this.logError({
subsystem: 'global',
error_type: 'javascript_error',
message: event.message,
stack_trace: event.error?.stack,
metadata: {
filename: event.filename,
lineno: event.lineno,
colno: event.colno,
},
}).catch(() => {
// Silently ignore logging failures
});
});
// Unhandled promise rejections
window.addEventListener('unhandledrejection', (event) => {
this.logError({
subsystem: 'global',
error_type: 'javascript_error',
message: event.reason?.message || String(event.reason),
stack_trace: event.reason?.stack,
}).catch(() => {});
});
}
private sleep(ms: number): Promise<void> {
return new Promise(resolve => setTimeout(resolve, ms));
}
}
// Singleton instance
export const clientErrorLogger = new ClientErrorLogger();
// Auto-capture unhandled errors
if (typeof window !== 'undefined') {
clientErrorLogger.captureUnhandledErrors();
}
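sendWithRetry doubles its wait on each failed attempt (baseDelayMs * 2^(attempt - 1)). The schedule can be computed in isolation (the helper name is illustrative):

```typescript
// backoffDelayMs returns the wait before the next retry after a failed
// attempt, matching the formula in sendWithRetry: base * 2^(attempt - 1).
function backoffDelayMs(baseDelayMs: number, attempt: number): number {
  return baseDelayMs * Math.pow(2, attempt - 1);
}

// With the logger's default base of 1000 ms, three attempts wait 1s, 2s, 4s:
const schedule = [1, 2, 3].map((attempt) => backoffDelayMs(1000, attempt));
console.log(schedule); // [ 1000, 2000, 4000 ]
```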
4.2 Toast Wrapper with Logging
File: aggregator-web/src/lib/toast-with-logging.ts
import toast, { ToastOptions } from 'react-hot-toast';
import { clientErrorLogger } from './client-error-logger';
import { useLocation } from 'react-router-dom';
/**
* Extract subsystem from current route
*/
function getCurrentSubsystem(): string {
if (typeof window === 'undefined') return 'unknown';
const path = window.location.pathname;
// Map routes to subsystems
if (path.includes('/storage')) return 'storage';
if (path.includes('/system')) return 'system';
if (path.includes('/docker')) return 'docker';
if (path.includes('/updates')) return 'updates';
if (path.includes('/agent/')) return 'agent';
return 'unknown';
}
/**
* Wrap toast.error to automatically log errors to backend
* Implements ETHOS #1: Errors are History
*/
export const toastWithLogging = {
error: (message: string, options?: ToastOptions & { subsystem?: string }) => {
const subsystem = options?.subsystem || getCurrentSubsystem();
// Log to backend asynchronously - don't block UI
clientErrorLogger.logError({
subsystem,
error_type: 'ui_error',
message: message.substring(0, 5000), // Prevent excessively long messages
metadata: {
component: options?.id,
duration: options?.duration,
position: options?.position,
timestamp: new Date().toISOString(),
},
}).catch(() => {
// Silently ignore logging failures - don't crash the UI
});
// Show toast to user
return toast.error(message, options);
},
// Passthrough methods. react-hot-toast has no built-in info/warning variants,
// so those render as a plain toast; extra options (e.g. subsystem) are ignored.
success: (message: string, options?: ToastOptions & { subsystem?: string }) =>
toast.success(message, options),
info: (message: string, options?: ToastOptions & { subsystem?: string }) =>
toast(message, options),
warning: (message: string, options?: ToastOptions & { subsystem?: string }) =>
toast(message, options),
loading: toast.loading,
dismiss: toast.dismiss,
remove: toast.remove,
promise: toast.promise,
};
/**
* React hook for toast with automatic subsystem detection
*/
export function useToastWithLogging() {
const location = useLocation();
return {
error: (message: string, options?: ToastOptions) => {
return toastWithLogging.error(message, {
...options,
subsystem: getSubsystemFromPath(location.pathname),
});
},
success: (message: string, options?: ToastOptions & { subsystem?: string }) =>
toast.success(message, options),
info: (message: string, options?: ToastOptions & { subsystem?: string }) =>
toast(message, options), // react-hot-toast has no built-in info variant
warning: (message: string, options?: ToastOptions & { subsystem?: string }) =>
toast(message, options),
loading: toast.loading,
dismiss: toast.dismiss,
};
}
function getSubsystemFromPath(pathname: string): string {
const matches = pathname.match(/\/(storage|system|docker|updates|agent)/);
return matches ? matches[1] : 'unknown';
}
4.3 API Integration
Update: aggregator-web/src/lib/api.ts
// Add error logging to axios interceptor
api.interceptors.response.use(
(response) => response,
async (error) => {
// Don't log errors from the error logger itself
if (error.config?.headers?.['X-Error-Logger-Request']) {
return Promise.reject(error);
}
// Extract subsystem from URL
const subsystem = extractSubsystem(error.config?.url);
// Log API errors
clientErrorLogger.logError({
subsystem,
error_type: 'api_error',
message: error.message,
metadata: {
status_code: error.response?.status,
endpoint: error.config?.url,
method: error.config?.method,
response_data: error.response?.data,
},
}).catch(() => {
// Don't let logging errors hide the original error
});
return Promise.reject(error);
}
);
function extractSubsystem(url: string = ''): string {
const matches = url.match(/\/(storage|system|docker|updates|agent)/);
return matches ? matches[1] : 'unknown';
}
Time Required: 20 minutes
Phase 5: Toast Integration
Update Existing Error Calls
Pattern: Update error toast calls to use new logger
Before:
import toast from 'react-hot-toast';
toast.error(`Failed to trigger scan: ${error.message}`);
After:
import { toastWithLogging } from '@/lib/toast-with-logging';
toastWithLogging.error(`Failed to trigger scan: ${error.message}`, {
subsystem: 'storage', // Specify subsystem
id: 'trigger-scan-error', // Optional component ID
});
Priority Files to Update
5.1 React State Management for Scan Buttons
File: Create aggregator-web/src/hooks/useScanState.ts
import { useState, useCallback } from 'react';
import { api } from '@/lib/api';
import { toastWithLogging } from '@/lib/toast-with-logging';
interface ScanState {
isScanning: boolean;
commandId?: string;
error?: string;
}
/**
* Hook for managing scan button state and preventing duplicate scans
*/
export function useScanState(agentId: string, subsystem: string) {
const [state, setState] = useState<ScanState>({
isScanning: false,
});
const triggerScan = useCallback(async () => {
if (state.isScanning) {
toastWithLogging.info('Scan already in progress', { subsystem });
return;
}
setState({ isScanning: true, commandId: undefined, error: undefined });
try {
const result = await api.post(`/agents/${agentId}/subsystems/${subsystem}/trigger`);
setState(prev => ({
...prev,
commandId: result.data.command_id,
}));
// Poll for completion or wait for subscription update
await waitForScanComplete(agentId, result.data.command_id);
setState({ isScanning: false, commandId: result.data.command_id });
toastWithLogging.success(`${subsystem} scan completed`, { subsystem });
} catch (error: any) {
const isAlreadyRunning = error.response?.status === 409;
if (isAlreadyRunning) {
const existingCommandId = error.response?.data?.command_id;
setState({
isScanning: false,
commandId: existingCommandId,
error: 'Scan already in progress',
});
toastWithLogging.info(`Scan already running (command: ${existingCommandId})`, { subsystem });
} else {
const errorMessage = error.response?.data?.error || error.message;
setState({
isScanning: false,
error: errorMessage,
});
toastWithLogging.error(`Failed to trigger scan: ${errorMessage}`, { subsystem });
}
}
}, [agentId, subsystem, state.isScanning]);
const reset = useCallback(() => {
setState({ isScanning: false, commandId: undefined, error: undefined });
}, []);
return {
isScanning: state.isScanning,
commandId: state.commandId,
error: state.error,
triggerScan,
reset,
};
}
/**
* Wait for scan to complete by polling command status
*/
async function waitForScanComplete(agentId: string, commandId: string): Promise<void> {
const maxWaitMs = 300000; // 5 minutes max
const startTime = Date.now();
const pollInterval = 2000; // Poll every 2 seconds
return new Promise((resolve, reject) => {
const interval = setInterval(async () => {
  // Check the timeout first so a stalled scan cannot poll forever
  if (Date.now() - startTime > maxWaitMs) {
    clearInterval(interval);
    reject(new Error('Scan timeout'));
    return;
  }
  try {
    const result = await api.get(`/agents/${agentId}/commands/${commandId}`);
    if (result.data.status === 'completed' || result.data.status === 'failed') {
      clearInterval(interval);
      resolve();
    }
  } catch (error) {
    clearInterval(interval);
    reject(error);
  }
}, pollInterval);
});
}
Usage Example in Component:
import { useScanState } from '@/hooks/useScanState';
function ScanButton({ agentId, subsystem }: { agentId: string; subsystem: string }) {
const { isScanning, triggerScan } = useScanState(agentId, subsystem);
return (
<button
onClick={triggerScan}
disabled={isScanning}
className={isScanning ? 'btn-disabled' : 'btn-primary'}
>
{isScanning ? (
<>
<Spinner className="animate-spin" />
Scanning...
</>
) : (
`Scan ${subsystem}`
)}
</button>
);
}
5.2 Update Existing Error Calls
Priority Files to Update
- Agent Subsystem Actions: `/src/components/AgentSubsystems.tsx`
- Command Retry Logic: `/src/hooks/useCommands.ts`
- Authentication Errors: `/src/lib/auth.ts`
- API Error Boundaries: `/src/components/ErrorBoundary.tsx`
Example Complete Integration
File: aggregator-web/src/components/AgentSubsystems.tsx (example update)
import { toastWithLogging } from '@/lib/toast-with-logging';
const handleTrigger = async (subsystem: string) => {
try {
await triggerSubsystem(agentId, subsystem);
} catch (error) {
toastWithLogging.error(
`Failed to trigger ${subsystem} scan: ${error.message}`,
{
subsystem,
id: `trigger-${subsystem}`,
}
);
}
};
Time Required: 15 minutes
5.3 Deduplication Testing
Test Cases:
// Test 1: Rapid clicking prevention
test('clicking scan button 10 times creates only 1 command', async () => {
const button = screen.getByText('Scan APT');
// Click 10 times rapidly
for (let i = 0; i < 10; i++) {
fireEvent.click(button);
}
// Should only create 1 command
expect(api.post).toHaveBeenCalledTimes(1);
expect(api.post).toHaveBeenCalledWith('/agents/123/subsystems/apt/trigger');
});
// Test 2: Button disabled while scanning
test('button disabled during scan', async () => {
const button = screen.getByText('Scan APT');
fireEvent.click(button);
// Button should be disabled immediately
expect(button).toBeDisabled();
expect(screen.getByText('Scanning...')).toBeInTheDocument();
await waitFor(() => {
expect(button).not.toBeDisabled();
});
});
// Test 3: 409 Conflict returns existing command
mock.onPost().reply(409, {
error: 'Scan already in progress',
command_id: 'existing-id',
});
await triggerScan(); // resolves without throwing; state picks up the existing command
expect(toastWithLogging.info).toHaveBeenCalledWith(
  expect.stringContaining('Scan already running'),
  expect.objectContaining({ subsystem: 'apt' })
);
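The in-flight guard these tests exercise boils down to a tiny state machine. A sketch of the core logic, extracted from React for testability (the real `useScanState` hook keeps this flag in component state):

```typescript
// Minimal in-flight guard: a scan may only start when none is running.
// This is the invariant behind "10 rapid clicks create only 1 command".
class ScanGuard {
  private inFlight = false;

  // Returns true if the caller won the right to start a scan
  tryStart(): boolean {
    if (this.inFlight) return false;
    this.inFlight = true;
    return true;
  }

  // Called when the scan completes, fails, or is rejected with a 409
  finish(): void {
    this.inFlight = false;
  }
}
```

In the hook, `tryStart()` corresponds to the `if (state.isScanning) return;` check, and `finish()` to the `setState({ isScanning: false, ... })` calls in both the success and error paths.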
Phase 6: Verification & Testing
Manual Testing Checklist
6.1 Migration Testing
- Run migration 023 successfully
- Verify `client_errors` table exists
- Verify `idempotency_key` column added to `agent_commands`
- Test on fresh database (no duplicate key errors)
6.2 Command Factory Testing
- Rapid-fire scan button clicks (10+ clicks in 2 seconds)
- Verify all commands created with unique IDs
- Check no duplicate key violations in logs
- Verify commands appear in database correctly
6.3 Error Logging Testing
- Trigger UI error (e.g., invalid input)
- Verify error appears in toast
- Check database: error should be stored in `client_errors`
- Trigger API error (e.g., network timeout)
- Verify exponential backoff retry works
- Disconnect network, trigger error, reconnect
- Verify error is queued and sent when back online
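The retry checklist item above assumes the backoff schedule used throughout this plan (3 attempts, doubling delay). Expressed as a small helper so the schedule is explicit and testable (the base delay of 1 second is an assumption, not a confirmed constant):

```typescript
// Exponential backoff schedule: attempt 0 -> 1s, attempt 1 -> 2s, attempt 2 -> 4s.
// baseMs is assumed; the actual logger may use a different base.
function backoffDelayMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}
```

After the third failed attempt the error is written to the localStorage queue rather than retried further, which is what the offline/online test below exercises.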
6.4 Integration Testing
- Full user workflow: login → trigger scan → view results
- Verify all errors logged with [HISTORY] prefix
- Check logs are queryable by subsystem
- Verify error logging doesn't block UI
Automated Test Cases
6.5 Backend Tests
File: aggregator-server/internal/command/factory_test.go
package command
import (
	"testing"

	"github.com/google/uuid"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"

	// NOTE: TestFactory_Validate also needs the models package; the import path
	// depends on the module name in go.mod (e.g. "<module>/internal/models").
)
func TestFactory_Create(t *testing.T) {
factory := NewFactory()
agentID := uuid.New()
cmd, err := factory.Create(agentID, "scan_storage", map[string]interface{}{"path": "/"})
require.NoError(t, err)
assert.NotEqual(t, uuid.Nil, cmd.ID, "ID must be generated")
assert.Equal(t, agentID, cmd.AgentID)
assert.Equal(t, "scan_storage", cmd.CommandType)
assert.Equal(t, "pending", cmd.Status)
assert.Equal(t, "manual", cmd.Source)
}
func TestFactory_CreateWithIdempotency(t *testing.T) {
factory := NewFactory()
agentID := uuid.New()
idempotencyKey := "test-key-123"
// Create first command
cmd1, err := factory.CreateWithIdempotency(agentID, "scan_system", nil, idempotencyKey)
require.NoError(t, err)
// Create "duplicate" command with same idempotency key
cmd2, err := factory.CreateWithIdempotency(agentID, "scan_system", nil, idempotencyKey)
require.NoError(t, err)
// Should return same command
assert.Equal(t, cmd1.ID, cmd2.ID, "Idempotency key should return same command")
}
func TestFactory_Validate(t *testing.T) {
tests := []struct {
name string
cmd *models.AgentCommand
wantErr bool
}{
{
name: "valid command",
cmd: &models.AgentCommand{
ID: uuid.New(),
AgentID: uuid.New(),
CommandType: "scan_storage",
Status: "pending",
Source: "manual",
},
wantErr: false,
},
{
name: "missing ID",
cmd: &models.AgentCommand{
ID: uuid.Nil,
AgentID: uuid.New(),
CommandType: "scan_storage",
Status: "pending",
Source: "manual",
},
wantErr: true,
},
}
factory := NewFactory()
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
err := factory.Validate(tt.cmd)
if tt.wantErr {
assert.Error(t, err)
} else {
assert.NoError(t, err)
}
})
}
}
6.6 Frontend Tests
File: aggregator-web/src/lib/client-error-logger.test.ts
import { clientErrorLogger } from './client-error-logger';
import { api } from './api';
jest.mock('./api');
describe('ClientErrorLogger', () => {
beforeEach(() => {
localStorage.clear();
jest.clearAllMocks();
});
test('logs error successfully on first attempt', async () => {
(api.post as jest.Mock).mockResolvedValue({});
await clientErrorLogger.logError({
subsystem: 'storage',
error_type: 'api_error',
message: 'Test error',
});
expect(api.post).toHaveBeenCalledTimes(1);
expect(api.post).toHaveBeenCalledWith(
'/logs/client-error',
expect.objectContaining({
subsystem: 'storage',
error_type: 'api_error',
message: 'Test error',
}),
expect.any(Object)
);
});
test('retries on failure then saves to localStorage', async () => {
(api.post as jest.Mock)
.mockRejectedValueOnce(new Error('Network error'))
.mockRejectedValueOnce(new Error('Network error'))
.mockRejectedValueOnce(new Error('Network error'));
await clientErrorLogger.logError({
subsystem: 'storage',
error_type: 'api_error',
message: 'Test error',
});
expect(api.post).toHaveBeenCalledTimes(3);
// Should be saved to localStorage
const queue = localStorage.getItem('redflag-error-queue');
expect(queue).toBeTruthy();
expect(JSON.parse(queue!).length).toBe(1);
});
test('flushes queue when coming back online', async () => {
// Pre-populate queue
const queuedError = {
subsystem: 'storage',
error_type: 'api_error',
message: 'Queued error',
timestamp: new Date().toISOString(),
};
localStorage.setItem('redflag-error-queue', JSON.stringify([queuedError]));
(api.post as jest.Mock).mockResolvedValue({});
// Trigger online event
window.dispatchEvent(new Event('online'));
// Wait for flush
await new Promise(resolve => setTimeout(resolve, 100));
expect(api.post).toHaveBeenCalled();
expect(localStorage.getItem('redflag-error-queue')).toBe('[]');
});
});
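The flush behaviour the last test exercises can be isolated as a pure helper: drain the queue in order and keep whatever fails to send. A sketch independent of the actual logger implementation (the real code reads and rewrites `redflag-error-queue` in localStorage around this loop):

```typescript
// Drain queued errors one at a time. Entries that send successfully are
// dropped; entries that fail stay queued for the next flush attempt.
async function flushQueue(
  queue: string[],
  send: (entry: string) => Promise<void>,
): Promise<string[]> {
  const remaining: string[] = [];
  for (const entry of queue) {
    try {
      await send(entry); // sent: drop from queue
    } catch {
      remaining.push(entry); // keep for the next flush
    }
  }
  return remaining;
}
```

Sending sequentially (rather than `Promise.all`) keeps ordering and avoids hammering a server that just came back online.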
Time Required: 30 minutes
Implementation Checklist
Pre-Implementation
- ✅ Migration system bug fixed (lines 103 & 116 in db.go)
- ✅ Database wiped and fresh instance ready
- ✅ Test agents available for rapid scan testing
- ✅ Development environment ready (all 3 components)
Phase 1: Command Factory (25 min)
- Create `aggregator-server/internal/command/factory.go`
- Create `aggregator-server/internal/command/validator.go`
- Update `aggregator-server/internal/models/command.go`
- Update `aggregator-server/internal/api/handlers/subsystems.go`
- Test: verify rapid scan clicks work
Phase 2: Database Schema (5 min)
- Create migration `023_client_error_logging.up.sql`
- Create migration `023_client_error_logging.down.sql`
- Run migration and verify table creation
- Verify indexes created
Phase 3: Backend Handler (20 min)
- Create `aggregator-server/internal/api/handlers/client_errors.go`
- Create `aggregator-server/internal/database/queries/client_errors.sql`
- Update `aggregator-server/internal/api/router.go`
- Test API endpoint with curl
Phase 4: Frontend Logger (20 min)
- Create `aggregator-web/src/lib/client-error-logger.ts`
- Create `aggregator-web/src/lib/toast-with-logging.ts`
- Update `aggregator-web/src/lib/api.ts`
- Test offline/online queue behavior
Phase 5: Toast Integration (15 min)
- Create `useScanState` hook for button state management
- Update scan buttons to use `useScanState`
- Test button disabling during scan
- Update 3-5 critical error locations to use `toastWithLogging`
- Verify errors appear in both toast and database
- Test in multiple subsystems
- Test deduplication: Rapid clicking creates only 1 command
- Test 409 response: Returns existing command when scan running
Phase 6: Verification (30 min)
- Run all test cases
- Verify ETHOS compliance checklist
- Test rapid scan clicking (no duplicates)
- Test error persistence across page reloads
- Verify [HISTORY] logs in server output
Documentation
- Update session documentation
- Create testing summary
- Document any issues encountered
- Update architecture documentation
Risk Mitigation
Risk 1: Migration Failures
Probability: Medium | Impact: High | Severity: 🔴 Critical
Mitigation:
- Fix migration runner bug FIRST (before this implementation)
- Test migration on fresh database
- Keep database backups
- Have rollback script ready
Contingency: If migration fails, manually apply SQL and continue
Risk 2: Performance Impact
Probability: Low | Impact: Medium | Severity: 🟡 Medium
Mitigation:
- Async error logging (non-blocking)
- LocalStorage queue with size limit (max 50 errors)
- Database indexes for fast queries
- Batch insert if needed in future
Contingency: If performance degrades, add sampling (log 1 in 10 errors)
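The sampling contingency can be a one-line deterministic gate. A counter-based gate (rather than `Math.random()`) makes behaviour reproducible in tests; this helper is hypothetical, not part of the planned implementation:

```typescript
// Log roughly 1 in every n errors. Counter-based so the decision is
// deterministic: the 1st, 11th, 21st, ... error (0-indexed) is logged.
function shouldSample(errorCount: number, n = 10): boolean {
  return errorCount % n === 0;
}
```

The logger would increment a per-session counter and skip the `/logs/client-error` POST whenever `shouldSample` returns false, while still showing the toast.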
Risk 3: Infinite Error Loops
Probability: Low | Impact: High | Severity: 🟡 Medium
Mitigation:
- `X-Error-Logger-Request` header prevents recursive logging
- Max retry count (3 attempts)
- Exponential backoff prevents thundering herd
Contingency: If loop detected, check for missing header and fix
Risk 4: Privacy Concerns
Probability: Low | Impact: High | Severity: 🟡 Medium
Mitigation:
- No PII in error messages (validate during logging)
- User agent stored but can be anonymized
- Stack traces only from our code (not user code)
Contingency: Add privacy filter to scrub sensitive data
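A hypothetical privacy filter for the contingency above. The two patterns here (email addresses, IPv4 addresses) are illustrative only, not an exhaustive PII list; a real filter would need a reviewed pattern set:

```typescript
// Redact common PII-looking substrings before an error message is logged.
// Patterns are examples: emails and dotted-quad IPv4 addresses.
function scrubMessage(msg: string): string {
  return msg
    .replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[email]')
    .replace(/\b\d{1,3}(\.\d{1,3}){3}\b/g, '[ip]');
}
```

The filter would run inside `clientErrorLogger.logError` on the `message` field (and optionally on `metadata`) so scrubbing cannot be forgotten at individual call sites.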
Post-Implementation Review
Success Criteria
- No duplicate key violations during rapid clicking
- All errors persist in database
- Error logs queryable and useful for debugging
- No performance degradation observed
- System handles offline/online transitions gracefully
- All tests pass
Performance Benchmarks
- Command creation: < 10ms per command
- Error logging: < 50ms per error (async)
- Database queries: < 100ms for common queries
- Bundle size increase: < 5KB gzipped
Known Limitations
- Error logs don't include full request payloads (privacy)
- localStorage queue limited by browser storage (~5MB)
- Retries happen in foreground (could be moved to background)
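The storage limitation above is why the queue is capped at 50 entries (Risk 2). The cap can be enforced at enqueue time, dropping the oldest entries first; a sketch (the limit comes from this plan, the helper name is hypothetical):

```typescript
// Bounded queue: append the new entry, then drop the oldest entries if the
// queue exceeds the cap. Returns a new array (localStorage-friendly).
function enqueueBounded<T>(queue: T[], entry: T, max = 50): T[] {
  const next = [...queue, entry];
  return next.length > max ? next.slice(next.length - max) : next;
}
```

Dropping oldest-first means a long offline period keeps the most recent errors, which are usually the most relevant for debugging.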
Future Enhancements (Post v0.1.27)
- Error aggregation and deduplication
- Error rate alerting
- Error analytics dashboard
- Automatic error categorization
- Integration with notification system
Rollback Plan
If critical issues arise:
1. Revert code changes:
   `git revert HEAD~6..HEAD` (reverts the last 6 commits)
2. Roll back the database:
   `cd aggregator-server && go run cmd/migrate/main.go -migrate-down 1` (runs the down migration)
3. Rebuild and deploy:
   `docker-compose build --no-cache && docker-compose up -d`
Additional Notes
Team Coordination:
- Coordinate with frontend team if they're working on error handling
- Notify QA about new error logging features for testing
- Update documentation team about database schema changes
Monitoring:
- Monitor `client_errors` table growth
- Set up alerts for error rate spikes
- Track failed error logging attempts
Documentation Updates:
- Update API documentation for the `/logs/client-error` endpoint
- Document error log query patterns for support team
- Add troubleshooting guide for common errors
Plan Created By: Ani (AI Assistant)
Reviewed By: Casey Tunturi
Status: 🟢 APPROVED FOR IMPLEMENTATION
Next Step: Begin Phase 1 (Command Factory)
Estimated Timeline:
- Start: Immediately
- Complete: ~2-3 hours
- Test: 30 minutes
- Deploy: After verification
This is a complete, production-ready implementation plan. Each phase builds on the previous one, with full error handling, testing, and rollback procedures included.
Let's build this right. 💪