# RedFlag Clean Architecture Implementation Master Plan
**Date**: 2025-12-19
**Version**: v1.0
**Total Implementation Time**: 3-4 hours (including migration fixes and command deduplication)
**Status**: READY FOR EXECUTION
---
## Executive Summary
Complete implementation plan for fixing critical ETHOS violations and implementing clean architecture patterns across RedFlag v0.1.27. Addresses duplicate command generation, lost frontend errors, and migration system bugs.
**Three Core Objectives:**
1. ✅ Fix migration system (blocks everything else)
2. ✅ Implement command factory pattern (prevents duplicate key violations)
3. ✅ Build frontend error logging system (ETHOS #1 compliance)
---
## Table of Contents
1. [Pre-Implementation: Migration System Fix](#pre-implementation-migration-system-fix)
2. [Phase 1: Command Factory Pattern](#phase-1-command-factory-pattern)
3. [Phase 2: Database Schema](#phase-2-database-schema)
4. [Phase 3: Backend Error Handler](#phase-3-backend-error-handler)
5. [Phase 4: Frontend Error Logger](#phase-4-frontend-error-logger)
6. [Phase 5: Toast Integration](#phase-5-toast-integration)
7. [Phase 6: Verification & Testing](#phase-6-verification-and-testing)
8. [Implementation Checklist](#implementation-checklist)
9. [Risk Mitigation](#risk-mitigation)
10. [Post-Implementation Review](#post-implementation-review)
---
## Pre-Implementation: Migration System Fix
**⚠️ CRITICAL: Must be completed first - blocks all other work**
### Problem
Migration runner has duplicate INSERT logic causing "duplicate key value violates unique constraint" errors on fresh installations.
### Root Cause
File: `aggregator-server/internal/database/db.go`
- Line 103: Executes `INSERT INTO schema_migrations (version) VALUES ($1)`
- Line 116: Executes the exact same INSERT statement
- Result: Every migration filename gets inserted twice
### Solution
```go
// File: aggregator-server/internal/database/db.go
// Lines 95-120: Fix duplicate INSERT logic
func (db *DB) Migrate() error {
	// ... existing code ...
	for _, file := range files {
		filename := file.Name()

		// ❌ REMOVE THIS - Line 103 duplicates line 116
		// if _, err = tx.Exec("INSERT INTO schema_migrations (version) VALUES ($1)", filename); err != nil {
		//     return fmt.Errorf("failed to mark migration %s as applied: %w", filename, err)
		// }

		// Keep only the EXECUTE + INSERT combo at lines 110-116
		if _, err = tx.Exec(string(content)); err != nil {
			log.Printf("Migration %s failed, marking as applied: %v", filename, err)
		}

		// ✅ Keep this INSERT - it's the correct location
		if _, err = tx.Exec("INSERT INTO schema_migrations (version) VALUES ($1)", filename); err != nil {
			return fmt.Errorf("failed to mark migration %s as applied: %w", filename, err)
		}
	}
	// ... rest of function ...
}
```
### Validation Steps
1. Wipe database completely: `docker-compose down -v`
2. Start fresh: `docker-compose up -d`
3. Check migration logs: Should see all migrations apply without duplicate key errors
4. Verify: `SELECT COUNT(DISTINCT version) = COUNT(version) FROM schema_migrations`
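Step 4 can also be written to list any offending rows directly; an empty result means the fix worked:

```sql
-- Any rows returned here indicate the duplicate-INSERT bug is still present
SELECT version, COUNT(*) AS copies
FROM schema_migrations
GROUP BY version
HAVING COUNT(*) > 1;
```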
### Time Required: 5 minutes
**Blocker Status**: 🔴 CRITICAL - Do not proceed until fixed
---
## Phase 1: Command Factory Pattern
### Objective
Prevent duplicate command key violations by ensuring all commands have properly generated UUIDs at creation time.
### Files to Create
#### 1.1 Command Factory
**File**: `aggregator-server/internal/command/factory.go`
```go
package command

import (
	"database/sql"
	"encoding/json"
	"fmt"
	"time"

	"github.com/google/uuid"

	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)

// Factory creates validated AgentCommand instances
type Factory struct {
	validator *Validator
}

// NewFactory creates a new command factory
func NewFactory() *Factory {
	return &Factory{
		validator: NewValidator(),
	}
}

// Create generates a new validated AgentCommand with unique ID
func (f *Factory) Create(agentID uuid.UUID, commandType string, params map[string]interface{}) (*models.AgentCommand, error) {
	// Serialize params for the JSONB params column
	paramsJSON, err := json.Marshal(params)
	if err != nil {
		return nil, fmt.Errorf("failed to serialize command params: %w", err)
	}
	cmd := &models.AgentCommand{
		ID:          uuid.New(), // Immediate, explicit generation
		AgentID:     agentID,
		CommandType: commandType,
		Status:      "pending",
		Source:      determineSource(commandType),
		Params:      paramsJSON,
		CreatedAt:   time.Now(),
		UpdatedAt:   time.Now(),
	}
	if err := f.validator.Validate(cmd); err != nil {
		return nil, fmt.Errorf("command validation failed: %w", err)
	}
	return cmd, nil
}

// CreateWithIdempotency generates a command with idempotency protection
func (f *Factory) CreateWithIdempotency(agentID uuid.UUID, commandType string,
	params map[string]interface{}, idempotencyKey string) (*models.AgentCommand, error) {
	// Check for existing command with same idempotency key
	// (findByIdempotencyKey/storeIdempotencyKey depend on the database query layer)
	existing, err := f.findByIdempotencyKey(agentID, idempotencyKey)
	if err != nil && err != sql.ErrNoRows {
		return nil, fmt.Errorf("failed to check idempotency: %w", err)
	}
	if existing != nil {
		return existing, nil // Return existing command instead of creating duplicate
	}
	cmd, err := f.Create(agentID, commandType, params)
	if err != nil {
		return nil, err
	}
	// Store idempotency key with command
	if err := f.storeIdempotencyKey(cmd.ID, agentID, idempotencyKey); err != nil {
		return nil, fmt.Errorf("failed to store idempotency key: %w", err)
	}
	return cmd, nil
}

// determineSource classifies command source based on type
func determineSource(commandType string) string {
	if isSystemCommand(commandType) {
		return "system"
	}
	return "manual"
}

func isSystemCommand(commandType string) bool {
	systemCommands := []string{
		"enable_heartbeat",
		"disable_heartbeat",
		"update_check",
		"cleanup_old_logs",
	}
	for _, cmd := range systemCommands {
		if commandType == cmd {
			return true
		}
	}
	return false
}
```
#### 1.2 Command Validator
**File**: `aggregator-server/internal/command/validator.go`
```go
package command

import (
	"errors"
	"fmt"
	"strings"

	"github.com/google/uuid"

	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)

// Validator validates command parameters
type Validator struct {
	minCheckInSeconds int
	maxCheckInSeconds int
	minScannerMinutes int
	maxScannerMinutes int
}

// NewValidator creates a new command validator
func NewValidator() *Validator {
	return &Validator{
		minCheckInSeconds: 60,   // 1 minute minimum
		maxCheckInSeconds: 3600, // 1 hour maximum
		minScannerMinutes: 1,    // 1 minute minimum
		maxScannerMinutes: 1440, // 24 hours maximum
	}
}

// Validate performs comprehensive command validation
func (v *Validator) Validate(cmd *models.AgentCommand) error {
	if cmd == nil {
		return errors.New("command cannot be nil")
	}
	if cmd.ID == uuid.Nil {
		return errors.New("command ID cannot be zero UUID")
	}
	if cmd.AgentID == uuid.Nil {
		return errors.New("agent ID is required")
	}
	if cmd.CommandType == "" {
		return errors.New("command type is required")
	}
	if cmd.Status == "" {
		return errors.New("status is required")
	}
	validStatuses := []string{"pending", "running", "completed", "failed", "cancelled"}
	if !contains(validStatuses, cmd.Status) {
		return fmt.Errorf("invalid status: %s", cmd.Status)
	}
	if cmd.Source != "manual" && cmd.Source != "system" {
		return fmt.Errorf("source must be 'manual' or 'system', got: %s", cmd.Source)
	}
	// Validate command type format
	if err := v.validateCommandType(cmd.CommandType); err != nil {
		return err
	}
	return nil
}

// ValidateSubsystemAction validates subsystem-specific actions
func (v *Validator) ValidateSubsystemAction(subsystem string, action string) error {
	validActions := map[string][]string{
		"storage": {"trigger", "enable", "disable", "set_interval"},
		"system":  {"trigger", "enable", "disable", "set_interval"},
		"docker":  {"trigger", "enable", "disable", "set_interval"},
		"updates": {"trigger", "enable", "disable", "set_interval"},
	}
	actions, ok := validActions[subsystem]
	if !ok {
		return fmt.Errorf("unknown subsystem: %s", subsystem)
	}
	if !contains(actions, action) {
		return fmt.Errorf("invalid action '%s' for subsystem '%s'", action, subsystem)
	}
	return nil
}

// ValidateInterval ensures scanner intervals are within bounds
func (v *Validator) ValidateInterval(subsystem string, minutes int) error {
	if minutes < v.minScannerMinutes {
		return fmt.Errorf("interval %d minutes below minimum %d for subsystem %s",
			minutes, v.minScannerMinutes, subsystem)
	}
	if minutes > v.maxScannerMinutes {
		return fmt.Errorf("interval %d minutes above maximum %d for subsystem %s",
			minutes, v.maxScannerMinutes, subsystem)
	}
	return nil
}

func (v *Validator) validateCommandType(commandType string) error {
	// "cleanup_" is included so system commands like cleanup_old_logs pass validation
	validPrefixes := []string{"scan_", "install_", "update_", "enable_", "disable_", "cleanup_", "reboot"}
	for _, prefix := range validPrefixes {
		if strings.HasPrefix(commandType, prefix) {
			return nil
		}
	}
	return fmt.Errorf("invalid command type format: %s", commandType)
}

func contains(slice []string, item string) bool {
	for _, s := range slice {
		if s == item {
			return true
		}
	}
	return false
}
```
#### 1.3 Update AgentCommand Model
**File**: `aggregator-server/internal/models/command.go`
```go
package models

import (
	"database/sql"
	"encoding/json"
	"errors"
	"time"

	"github.com/google/uuid"
)

// AgentCommand represents a command sent to an agent
type AgentCommand struct {
	ID          uuid.UUID       `db:"id" json:"id"`
	AgentID     uuid.UUID       `db:"agent_id" json:"agent_id"`
	CommandType string          `db:"command_type" json:"command_type"`
	Status      string          `db:"status" json:"status"`
	Source      string          `db:"source" json:"source"`
	Params      json.RawMessage `db:"params" json:"params"` // serialized params for the JSONB column
	Result      sql.NullString  `db:"result" json:"result,omitempty"`
	Error       sql.NullString  `db:"error" json:"error,omitempty"`
	RetryCount  int             `db:"retry_count" json:"retry_count"`
	CreatedAt   time.Time       `db:"created_at" json:"created_at"`
	UpdatedAt   time.Time       `db:"updated_at" json:"updated_at"`
	CompletedAt sql.NullTime    `db:"completed_at" json:"completed_at,omitempty"`

	// Idempotency support (VARCHAR(64) column; see migration 023a)
	IdempotencyKey sql.NullString `db:"idempotency_key" json:"-"`
}

// Validate checks if the command is valid
func (c *AgentCommand) Validate() error {
	if c.ID == uuid.Nil {
		return ErrCommandIDRequired
	}
	if c.AgentID == uuid.Nil {
		return ErrAgentIDRequired
	}
	if c.CommandType == "" {
		return ErrCommandTypeRequired
	}
	if c.Status == "" {
		return ErrStatusRequired
	}
	if c.Source != "manual" && c.Source != "system" {
		return ErrInvalidSource
	}
	return nil
}

// IsTerminal returns true if the command is in a terminal state
func (c *AgentCommand) IsTerminal() bool {
	return c.Status == "completed" || c.Status == "failed" || c.Status == "cancelled"
}

// CanRetry returns true if the command can be retried
func (c *AgentCommand) CanRetry() bool {
	return c.Status == "failed" && c.RetryCount < 3
}

// Predefined errors for validation
var (
	ErrCommandIDRequired   = errors.New("command ID cannot be zero UUID")
	ErrAgentIDRequired     = errors.New("agent ID is required")
	ErrCommandTypeRequired = errors.New("command type is required")
	ErrStatusRequired      = errors.New("status is required")
	ErrInvalidSource       = errors.New("source must be 'manual' or 'system'")
)
```
#### 1.4 Update Subsystem Handler
**File**: `aggregator-server/internal/api/handlers/subsystems.go`
```go
package handlers

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
	"github.com/jmoiron/sqlx"

	"github.com/Fimeg/RedFlag/aggregator-server/internal/command"
	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)

type SubsystemHandler struct {
	db             *sqlx.DB
	commandFactory *command.Factory
}

func NewSubsystemHandler(db *sqlx.DB) *SubsystemHandler {
	return &SubsystemHandler{
		db:             db,
		commandFactory: command.NewFactory(),
	}
}

// TriggerSubsystem creates and enqueues a subsystem command
func (h *SubsystemHandler) TriggerSubsystem(c *gin.Context) {
	agentID, err := uuid.Parse(c.Param("id"))
	if err != nil {
		log.Printf("[ERROR] [server] [subsystem] invalid_agent_id error=%v", err)
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid agent ID"})
		return
	}

	subsystem := c.Param("subsystem")
	if err := h.validateSubsystem(subsystem); err != nil {
		log.Printf("[ERROR] [server] [subsystem] invalid_subsystem subsystem=%s error=%v", subsystem, err)
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}

	// DEDUPLICATION CHECK: Prevent multiple pending scans
	existingCmd, err := h.getPendingScanCommand(agentID, subsystem)
	if err != nil {
		log.Printf("[ERROR] [server] [subsystem] query_failed agent_id=%s subsystem=%s error=%v",
			agentID, subsystem, err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "internal error"})
		return
	}
	if existingCmd != nil {
		log.Printf("[INFO] [server] [subsystem] scan_already_pending agent_id=%s subsystem=%s command_id=%s",
			agentID, subsystem, existingCmd.ID)
		log.Printf("[HISTORY] [server] [scan_%s] duplicate_request_prevented agent_id=%s command_id=%s timestamp=%s",
			subsystem, agentID, existingCmd.ID, time.Now().Format(time.RFC3339))
		c.JSON(http.StatusConflict, gin.H{
			"error":      "Scan already in progress",
			"command_id": existingCmd.ID.String(),
			"subsystem":  subsystem,
			"status":     existingCmd.Status,
			"created_at": existingCmd.CreatedAt,
		})
		return
	}

	// Generate idempotency key from request context
	idempotencyKey := h.generateIdempotencyKey(c, agentID, subsystem)

	// Create command using factory
	cmd, err := h.commandFactory.CreateWithIdempotency(
		agentID,
		"scan_"+subsystem,
		map[string]interface{}{"subsystem": subsystem},
		idempotencyKey,
	)
	if err != nil {
		log.Printf("[ERROR] [server] [subsystem] command_creation_failed agent_id=%s subsystem=%s error=%v",
			agentID, subsystem, err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to create command"})
		return
	}

	// Store command in database
	if err := h.storeCommand(cmd); err != nil {
		log.Printf("[ERROR] [server] [subsystem] command_store_failed agent_id=%s command_id=%s error=%v",
			agentID, cmd.ID, err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to store command"})
		return
	}

	log.Printf("[INFO] [server] [subsystem] command_created agent_id=%s command_id=%s subsystem=%s",
		agentID, cmd.ID, subsystem)
	log.Printf("[HISTORY] [server] [scan_%s] command_created agent_id=%s command_id=%s source=manual timestamp=%s",
		subsystem, agentID, cmd.ID, time.Now().Format(time.RFC3339))
	c.JSON(http.StatusOK, gin.H{
		"message":    "Command created successfully",
		"command_id": cmd.ID.String(),
		"subsystem":  subsystem,
	})
}

// getPendingScanCommand checks for existing pending scan commands
func (h *SubsystemHandler) getPendingScanCommand(agentID uuid.UUID, subsystem string) (*models.AgentCommand, error) {
	var cmd models.AgentCommand
	query := `
		SELECT id, command_type, status, created_at
		FROM agent_commands
		WHERE agent_id = $1
		  AND command_type = $2
		  AND status = 'pending'
		LIMIT 1`
	commandType := "scan_" + subsystem
	err := h.db.Get(&cmd, query, agentID, commandType)
	if err != nil {
		if err == sql.ErrNoRows {
			return nil, nil // No pending command found
		}
		return nil, fmt.Errorf("query failed: %w", err)
	}
	return &cmd, nil
}

// validateSubsystem checks if subsystem is recognized
func (h *SubsystemHandler) validateSubsystem(subsystem string) error {
	validSubsystems := []string{"apt", "dnf", "windows", "winget", "storage", "system", "docker"}
	for _, valid := range validSubsystems {
		if subsystem == valid {
			return nil
		}
	}
	return fmt.Errorf("unknown subsystem: %s", subsystem)
}

// generateIdempotencyKey creates a key to prevent duplicate submissions
func (h *SubsystemHandler) generateIdempotencyKey(c *gin.Context, agentID uuid.UUID, subsystem string) string {
	// Use the timestamp rounded down to the minute as the idempotency window.
	// Retries within the same minute map to the same key (and are deduplicated);
	// a new scan can be created once the minute window rolls over.
	timestampWindow := time.Now().Unix() / 60 // Round down to minute
	return fmt.Sprintf("%s:%s:%d", agentID.String(), subsystem, timestampWindow)
}

// storeCommand persists command to database
func (h *SubsystemHandler) storeCommand(cmd *models.AgentCommand) error {
	// Implementation depends on your command storage layer
	// Use NamedExec or similar to insert command
	query := `
		INSERT INTO agent_commands
			(id, agent_id, command_type, status, source, params, created_at)
		VALUES (:id, :agent_id, :command_type, :status, :source, :params, NOW())`
	_, err := h.db.NamedExec(query, cmd)
	return err
}
```
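The minute-window key can be sketched as a pure function (TypeScript here for a runnable illustration; the Go `generateIdempotencyKey` above is authoritative). Two requests inside the same clock minute yield the same key, so `CreateWithIdempotency` returns the existing command instead of inserting a duplicate:

```typescript
// Hypothetical standalone sketch of the idempotency window
function idempotencyKey(agentId: string, subsystem: string, unixSeconds: number): string {
  const window = Math.floor(unixSeconds / 60); // round down to the minute
  return `${agentId}:${subsystem}:${window}`;
}
```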
### Time Required: 30 minutes
---
## Phase 2: Database Schema
### Migration 023a: Command Deduplication
**File**: `aggregator-server/internal/database/migrations/023a_command_deduplication.up.sql`
```sql
-- Command Deduplication Schema
-- Prevents multiple pending scan commands per subsystem per agent
-- Add unique constraint to enforce single pending command per subsystem
CREATE UNIQUE INDEX idx_agent_pending_subsystem
    ON agent_commands (agent_id, command_type, status)
    WHERE status = 'pending';
-- Add idempotency key support for retry scenarios
ALTER TABLE agent_commands ADD COLUMN idempotency_key VARCHAR(64) UNIQUE NULL;
CREATE INDEX idx_agent_commands_idempotency_key ON agent_commands(idempotency_key);
COMMENT ON COLUMN agent_commands.idempotency_key IS
'Prevents duplicate command creation from retry logic. Based on (agent_id + subsystem + timestamp window).';
```
**File**: `aggregator-server/internal/database/migrations/023a_command_deduplication.down.sql`
```sql
DROP INDEX IF EXISTS idx_agent_pending_subsystem;
ALTER TABLE agent_commands DROP COLUMN IF EXISTS idempotency_key;
```
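For illustration only (not part of either migration file), the partial unique index rejects a second pending scan at the database level; the agent UUID below is a placeholder:

```sql
-- First pending scan inserts normally:
INSERT INTO agent_commands (id, agent_id, command_type, status, source)
VALUES (gen_random_uuid(), '00000000-0000-0000-0000-000000000001', 'scan_storage', 'pending', 'manual');

-- Repeating the same INSERT while the first is still pending fails:
-- ERROR: duplicate key value violates unique constraint "idx_agent_pending_subsystem"
```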
### Migration 023: Client Error Logging Table
**File**: `aggregator-server/internal/database/migrations/023_client_error_logging.up.sql`
```sql
-- Client Error Logging Schema
-- Implements ETHOS #1: Errors are History, Not /dev/null
CREATE TABLE client_errors (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    agent_id    UUID REFERENCES agents(id) ON DELETE SET NULL,
    subsystem   VARCHAR(50) NOT NULL,
    error_type  VARCHAR(50) NOT NULL,
    message     TEXT NOT NULL,
    stack_trace TEXT,
    metadata    JSONB,
    url         TEXT NOT NULL,
    user_agent  TEXT,
    created_at  TIMESTAMP DEFAULT NOW()
);
-- Indexes for efficient querying
CREATE INDEX idx_client_errors_agent_time ON client_errors(agent_id, created_at DESC);
CREATE INDEX idx_client_errors_subsystem_time ON client_errors(subsystem, created_at DESC);
CREATE INDEX idx_client_errors_error_type_time ON client_errors(error_type, created_at DESC);
CREATE INDEX idx_client_errors_created_at ON client_errors(created_at DESC);
-- Comments for documentation
COMMENT ON TABLE client_errors IS 'Frontend error logs for debugging and auditing. Implements ETHOS #1.';
COMMENT ON COLUMN client_errors.agent_id IS 'Agent active when error occurred (NULL for pre-auth errors)';
COMMENT ON COLUMN client_errors.subsystem IS 'RedFlag subsystem being used (storage, system, docker, etc.)';
COMMENT ON COLUMN client_errors.error_type IS 'Error category: javascript_error, api_error, ui_error, validation_error';
COMMENT ON COLUMN client_errors.metadata IS 'Additional context (component, API response, user actions)';
-- NOTE: idempotency_key on agent_commands is added by migration 023a
-- (as VARCHAR(64), matching the factory's composite key format), so it is
-- deliberately not re-added here. Adding it twice with different types would
-- fail whichever migration runs second.
```
**File**: `aggregator-server/internal/database/migrations/023_client_error_logging.down.sql`
```sql
DROP TABLE IF EXISTS client_errors;
```
### Time Required: 5 minutes
---
## Phase 3: Backend Error Handler
### Files to Create
#### 3.1 Error Handler
**File**: `aggregator-server/internal/api/handlers/client_errors.go`
```go
package handlers

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
	"github.com/jmoiron/sqlx"
)

// ClientErrorHandler handles frontend error logging per ETHOS #1
type ClientErrorHandler struct {
	db *sqlx.DB
}

// NewClientErrorHandler creates a new error handler
func NewClientErrorHandler(db *sqlx.DB) *ClientErrorHandler {
	return &ClientErrorHandler{db: db}
}

// LogErrorRequest represents a client error log entry
type LogErrorRequest struct {
	Subsystem  string                 `json:"subsystem" binding:"required"`
	ErrorType  string                 `json:"error_type" binding:"required,oneof=javascript_error api_error ui_error validation_error"`
	Message    string                 `json:"message" binding:"required,max=10000"`
	StackTrace string                 `json:"stack_trace,omitempty"`
	Metadata   map[string]interface{} `json:"metadata,omitempty"`
	URL        string                 `json:"url" binding:"required"`
}

// LogError processes and stores frontend errors
func (h *ClientErrorHandler) LogError(c *gin.Context) {
	var req LogErrorRequest
	if err := c.ShouldBindJSON(&req); err != nil {
		log.Printf("[ERROR] [server] [client_error] validation_failed error=\"%v\"", err)
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid request data"})
		return
	}

	// Extract agent ID from auth middleware if available
	var agentID interface{}
	if agentIDValue, exists := c.Get("agentID"); exists {
		if id, ok := agentIDValue.(uuid.UUID); ok {
			agentID = id
		}
	}

	// Log to console with HISTORY prefix
	log.Printf("[ERROR] [server] [client] [%s] agent_id=%v subsystem=%s message=\"%s\"",
		req.ErrorType, agentID, req.Subsystem, truncate(req.Message, 200))
	log.Printf("[HISTORY] [server] [client_error] agent_id=%v subsystem=%s type=%s url=\"%s\" message=\"%s\" timestamp=%s",
		agentID, req.Subsystem, req.ErrorType, req.URL, req.Message, time.Now().Format(time.RFC3339))

	// Store in database with retry logic (user agent is captured here,
	// since the gin.Context is not available inside storeError)
	if err := h.storeError(agentID, c.GetHeader("User-Agent"), req); err != nil {
		log.Printf("[ERROR] [server] [client_error] store_failed error=\"%v\"", err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to store error"})
		return
	}
	c.JSON(http.StatusOK, gin.H{"logged": true})
}

// storeError persists error to database with retry
func (h *ClientErrorHandler) storeError(agentID interface{}, userAgent string, req LogErrorRequest) error {
	// Serialize metadata for the JSONB column
	metadataJSON, err := json.Marshal(req.Metadata)
	if err != nil {
		return fmt.Errorf("failed to serialize metadata: %w", err)
	}

	const maxRetries = 3
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		query := `INSERT INTO client_errors (agent_id, subsystem, error_type, message, stack_trace, metadata, url, user_agent)
			VALUES (:agent_id, :subsystem, :error_type, :message, :stack_trace, :metadata, :url, :user_agent)`
		_, err := h.db.NamedExec(query, map[string]interface{}{
			"agent_id":    agentID,
			"subsystem":   req.Subsystem,
			"error_type":  req.ErrorType,
			"message":     req.Message,
			"stack_trace": req.StackTrace,
			"metadata":    metadataJSON,
			"url":         req.URL,
			"user_agent":  userAgent,
		})
		if err == nil {
			return nil
		}
		lastErr = err
		if attempt < maxRetries {
			time.Sleep(time.Duration(attempt) * time.Second)
		}
	}
	return fmt.Errorf("failed after %d attempts: %w", maxRetries, lastErr)
}

func truncate(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	return s[:maxLen] + "..."
}

func hash(s string) string {
	// Simple hash for message deduplication detection
	h := sha256.Sum256([]byte(s))
	return fmt.Sprintf("%x", h)[:16]
}
```
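An example request body accepted by `LogError` (all values are illustrative; the host name is a placeholder):

```json
{
  "subsystem": "storage",
  "error_type": "api_error",
  "message": "Failed to trigger scan: Request failed with status code 500",
  "metadata": {
    "status_code": 500,
    "endpoint": "/agents/42/subsystems/storage/trigger",
    "method": "post"
  },
  "url": "https://redflag.example.com/agents/42"
}
```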
#### 3.2 Query Client Errors
**File**: `aggregator-server/internal/database/queries/client_errors.sql`
```sql
-- name: GetClientErrorsByAgent :many
SELECT * FROM client_errors
WHERE agent_id = $1
ORDER BY created_at DESC
LIMIT $2;
-- name: GetClientErrorsBySubsystem :many
SELECT * FROM client_errors
WHERE subsystem = $1
ORDER BY created_at DESC
LIMIT $2;
-- name: GetClientErrorStats :many
SELECT
    subsystem,
    error_type,
    COUNT(*) AS count,
    MIN(created_at) AS first_occurrence,
    MAX(created_at) AS last_occurrence
FROM client_errors
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY subsystem, error_type
ORDER BY count DESC;
```
#### 3.3 Update Router
**File**: `aggregator-server/internal/api/router.go`
```go
// Add to router setup function
func SetupRouter(db *sqlx.DB, cfg *config.Config) *gin.Engine {
	// ... existing setup ...

	// Error logging endpoint (authenticated)
	errorHandler := handlers.NewClientErrorHandler(db)
	apiV1.POST("/logs/client-error",
		middleware.AuthMiddleware(),
		errorHandler.LogError,
	)

	// Admin endpoint for viewing errors
	// (GetErrors is not shown in 3.1; implement it in client_errors.go
	// on top of the queries in 3.2)
	apiV1.GET("/logs/client-errors",
		middleware.AuthMiddleware(),
		middleware.AdminMiddleware(),
		errorHandler.GetErrors,
	)

	// ... rest of setup ...
}
```
### Time Required: 20 minutes
---
## Phase 4: Frontend Error Logger
### Files to Create
#### 4.1 Client Error Logger
**File**: `aggregator-web/src/lib/client-error-logger.ts`
```typescript
import { api } from './api';

export interface ClientErrorLog {
  subsystem: string;
  error_type: 'javascript_error' | 'api_error' | 'ui_error' | 'validation_error';
  message: string;
  stack_trace?: string;
  metadata?: Record<string, any>;
  url: string;
  timestamp: string;
}

/**
 * ClientErrorLogger provides reliable frontend error logging with retry logic
 * Implements ETHOS #3: Assume Failure; Build for Resilience
 */
export class ClientErrorLogger {
  private maxRetries = 3;
  private baseDelayMs = 1000;
  private localStorageKey = 'redflag-error-queue';
  private offlineBuffer: ClientErrorLog[] = [];
  private isOnline = navigator.onLine;

  constructor() {
    // Listen for online/offline events
    window.addEventListener('online', () => {
      this.isOnline = true; // must be set before flushing, or the flush is skipped
      this.flushOfflineBuffer();
    });
    window.addEventListener('offline', () => { this.isOnline = false; });
  }

  /**
   * Log an error with automatic retry and offline queuing
   */
  async logError(errorData: Omit<ClientErrorLog, 'url' | 'timestamp'>): Promise<void> {
    const fullError: ClientErrorLog = {
      ...errorData,
      url: window.location.href,
      timestamp: new Date().toISOString(),
    };

    // Try to send immediately
    try {
      await this.sendWithRetry(fullError);
    } catch {
      // If failed after retries, queue for later
      this.queueForRetry(fullError);
    }
  }

  /**
   * Send error to backend with exponential backoff retry
   */
  private async sendWithRetry(error: ClientErrorLog): Promise<void> {
    for (let attempt = 1; attempt <= this.maxRetries; attempt++) {
      try {
        await api.post('/logs/client-error', error, {
          headers: { 'X-Error-Logger-Request': 'true' },
        });
        // Success, remove from queue if it was there
        this.removeFromQueue(error);
        return;
      } catch (err) { // named err so it does not shadow the error being sent
        if (attempt === this.maxRetries) {
          throw err; // Rethrow after final attempt
        }
        // Exponential backoff: 1s, 2s, 4s, ...
        await this.sleep(this.baseDelayMs * Math.pow(2, attempt - 1));
      }
    }
  }

  /**
   * Queue error for retry when network is available
   */
  private queueForRetry(error: ClientErrorLog): void {
    try {
      const queue = this.getQueue();
      queue.push({
        ...error,
        retryCount: (error as any).retryCount || 0,
        queuedAt: new Date().toISOString(),
      });
      // Save to localStorage for persistence
      localStorage.setItem(this.localStorageKey, JSON.stringify(queue));
      // Also keep in memory buffer
      this.offlineBuffer.push(error);
    } catch (storageError) {
      // localStorage might be full or unavailable
      console.warn('Failed to queue error for retry:', storageError);
    }
  }

  /**
   * Flush offline buffer when coming back online
   */
  private async flushOfflineBuffer(): Promise<void> {
    if (!this.isOnline) return;
    const queue = this.getQueue();
    if (queue.length === 0) return;

    const failed: typeof queue = [];
    for (const queuedError of queue) {
      try {
        await this.sendWithRetry(queuedError);
      } catch {
        failed.push(queuedError);
      }
    }
    // Update queue with remaining failed items
    if (failed.length < queue.length) {
      localStorage.setItem(this.localStorageKey, JSON.stringify(failed));
    }
  }

  /**
   * Get current error queue from localStorage
   */
  private getQueue(): any[] {
    try {
      const stored = localStorage.getItem(this.localStorageKey);
      return stored ? JSON.parse(stored) : [];
    } catch {
      return [];
    }
  }

  /**
   * Remove successfully sent error from queue
   */
  private removeFromQueue(sentError: ClientErrorLog): void {
    try {
      const queue = this.getQueue();
      const filtered = queue.filter(queued =>
        queued.timestamp !== sentError.timestamp ||
        queued.message !== sentError.message
      );
      if (filtered.length < queue.length) {
        localStorage.setItem(this.localStorageKey, JSON.stringify(filtered));
      }
    } catch {
      // Best effort cleanup
    }
  }

  /**
   * Capture unhandled JavaScript errors
   */
  captureUnhandledErrors(): void {
    // Global error handler
    window.addEventListener('error', (event) => {
      this.logError({
        subsystem: 'global',
        error_type: 'javascript_error',
        message: event.message,
        stack_trace: event.error?.stack,
        metadata: {
          filename: event.filename,
          lineno: event.lineno,
          colno: event.colno,
        },
      }).catch(() => {
        // Silently ignore logging failures
      });
    });

    // Unhandled promise rejections
    window.addEventListener('unhandledrejection', (event) => {
      this.logError({
        subsystem: 'global',
        error_type: 'javascript_error',
        message: event.reason?.message || String(event.reason),
        stack_trace: event.reason?.stack,
      }).catch(() => {});
    });
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Singleton instance
export const clientErrorLogger = new ClientErrorLogger();

// Auto-capture unhandled errors
if (typeof window !== 'undefined') {
  clientErrorLogger.captureUnhandledErrors();
}
```
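The retry schedule in `sendWithRetry` is easier to see as a pure function (values assume the class defaults, `baseDelayMs = 1000` and `maxRetries = 3`):

```typescript
// Delay before the next attempt, after attempt N failed (N starts at 1)
function backoffDelayMs(baseDelayMs: number, attempt: number): number {
  return baseDelayMs * Math.pow(2, attempt - 1);
}
// With the defaults, failures wait 1000 ms then 2000 ms; the third failure rethrows.
```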
#### 4.2 Toast Wrapper with Logging
**File**: `aggregator-web/src/lib/toast-with-logging.ts`
```typescript
import toast, { ToastOptions } from 'react-hot-toast';
import { clientErrorLogger } from './client-error-logger';
import { useLocation } from 'react-router-dom';

type LoggingToastOptions = ToastOptions & { subsystem?: string };

/**
 * Extract subsystem from current route
 */
function getCurrentSubsystem(): string {
  if (typeof window === 'undefined') return 'unknown';
  const path = window.location.pathname;
  // Map routes to subsystems
  if (path.includes('/storage')) return 'storage';
  if (path.includes('/system')) return 'system';
  if (path.includes('/docker')) return 'docker';
  if (path.includes('/updates')) return 'updates';
  if (path.includes('/agent/')) return 'agent';
  return 'unknown';
}

/**
 * Wrap toast.error to automatically log errors to backend
 * Implements ETHOS #1: Errors are History
 */
export const toastWithLogging = {
  error: (message: string, options?: LoggingToastOptions) => {
    const { subsystem: explicitSubsystem, ...toastOptions } = options ?? {};
    const subsystem = explicitSubsystem || getCurrentSubsystem();
    // Log to backend asynchronously - don't block UI
    clientErrorLogger.logError({
      subsystem,
      error_type: 'ui_error',
      message: message.substring(0, 5000), // Prevent excessively long messages
      metadata: {
        component: toastOptions.id,
        duration: toastOptions.duration,
        position: toastOptions.position,
        timestamp: new Date().toISOString(),
      },
    }).catch(() => {
      // Silently ignore logging failures - don't crash the UI
    });
    // Show toast to user
    return toast.error(message, toastOptions);
  },

  // react-hot-toast has no toast.info/toast.warning; fall back to the base toast()
  success: (message: string, options?: LoggingToastOptions) => {
    const { subsystem, ...toastOptions } = options ?? {};
    return toast.success(message, toastOptions);
  },
  info: (message: string, options?: LoggingToastOptions) => {
    const { subsystem, ...toastOptions } = options ?? {};
    return toast(message, toastOptions);
  },
  warning: (message: string, options?: LoggingToastOptions) => {
    const { subsystem, ...toastOptions } = options ?? {};
    return toast(message, { icon: '⚠️', ...toastOptions });
  },

  // Passthrough methods
  loading: toast.loading,
  dismiss: toast.dismiss,
  remove: toast.remove,
  promise: toast.promise,
};

/**
 * React hook for toast with automatic subsystem detection
 */
export function useToastWithLogging() {
  const location = useLocation();
  return {
    error: (message: string, options?: ToastOptions) => {
      return toastWithLogging.error(message, {
        ...options,
        subsystem: getSubsystemFromPath(location.pathname),
      });
    },
    success: toastWithLogging.success,
    info: toastWithLogging.info,
    warning: toastWithLogging.warning,
    loading: toast.loading,
    dismiss: toast.dismiss,
  };
}

function getSubsystemFromPath(pathname: string): string {
  const matches = pathname.match(/\/(storage|system|docker|updates|agent)/);
  return matches ? matches[1] : 'unknown';
}
```
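The route-to-subsystem mapping can be exercised in isolation. Note one subtlety of the unanchored regex: paths under `/agents/…` match the `agent` alternative before any later segment is considered:

```typescript
// Standalone copy of the mapping used by getSubsystemFromPath (for illustration)
function subsystemFromPath(pathname: string): string {
  const matches = pathname.match(/\/(storage|system|docker|updates|agent)/);
  return matches ? matches[1] : 'unknown';
}
```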
#### 4.3 API Integration
**Update**: `aggregator-web/src/lib/api.ts`
```typescript
import { clientErrorLogger } from './client-error-logger';

// Add error logging to the axios response interceptor
api.interceptors.response.use(
  (response) => response,
  async (error) => {
    // Don't log errors from the error logger itself
    if (error.config?.headers?.['X-Error-Logger-Request']) {
      return Promise.reject(error);
    }
    // Extract subsystem from URL
    const subsystem = extractSubsystem(error.config?.url);
    // Log API errors
    clientErrorLogger.logError({
      subsystem,
      error_type: 'api_error',
      message: error.message,
      metadata: {
        status_code: error.response?.status,
        endpoint: error.config?.url,
        method: error.config?.method,
        response_data: error.response?.data,
      },
    }).catch(() => {
      // Don't let logging errors hide the original error
    });
    return Promise.reject(error);
  }
);

function extractSubsystem(url: string = ''): string {
  const matches = url.match(/\/(storage|system|docker|updates|agent)/);
  return matches ? matches[1] : 'unknown';
}
```
### Time Required: 20 minutes
---
## Phase 5: Toast Integration
### Update Existing Error Calls
**Pattern**: Update error toast calls to use new logger
**Before**:
```typescript
import toast from 'react-hot-toast';
toast.error(`Failed to trigger scan: ${error.message}`);
```
**After**:
```typescript
import { toastWithLogging } from '@/lib/toast-with-logging';
toastWithLogging.error(`Failed to trigger scan: ${error.message}`, {
  subsystem: 'storage', // Specify subsystem
  id: 'trigger-scan-error', // Optional component ID
});
```
#### 5.1 React State Management for Scan Buttons
**File**: Create `aggregator-web/src/hooks/useScanState.ts`
```typescript
import { useState, useCallback } from 'react';
import { api } from '@/lib/api';
import { toastWithLogging } from '@/lib/toast-with-logging';

interface ScanState {
  isScanning: boolean;
  commandId?: string;
  error?: string;
}

/**
 * Hook for managing scan button state and preventing duplicate scans
 */
export function useScanState(agentId: string, subsystem: string) {
  const [state, setState] = useState<ScanState>({
    isScanning: false,
  });

  const triggerScan = useCallback(async () => {
    if (state.isScanning) {
      toastWithLogging.info('Scan already in progress', { subsystem });
      return;
    }

    setState({ isScanning: true, commandId: undefined, error: undefined });

    try {
      const result = await api.post(`/agents/${agentId}/subsystems/${subsystem}/trigger`);
      setState(prev => ({
        ...prev,
        commandId: result.data.command_id,
      }));

      // Poll for completion or wait for subscription update
      await waitForScanComplete(agentId, result.data.command_id);

      setState({ isScanning: false, commandId: result.data.command_id });
      toastWithLogging.success(`${subsystem} scan completed`, { subsystem });
    } catch (error: any) {
      const isAlreadyRunning = error.response?.status === 409;

      if (isAlreadyRunning) {
        const existingCommandId = error.response?.data?.command_id;
        setState({
          isScanning: false,
          commandId: existingCommandId,
          error: 'Scan already in progress',
        });
        toastWithLogging.info(`Scan already running (command: ${existingCommandId})`, { subsystem });
      } else {
        const errorMessage = error.response?.data?.error || error.message;
        setState({
          isScanning: false,
          error: errorMessage,
        });
        toastWithLogging.error(`Failed to trigger scan: ${errorMessage}`, { subsystem });
      }
    }
  }, [agentId, subsystem, state.isScanning]);

  const reset = useCallback(() => {
    setState({ isScanning: false, commandId: undefined, error: undefined });
  }, []);

  return {
    isScanning: state.isScanning,
    commandId: state.commandId,
    error: state.error,
    triggerScan,
    reset,
  };
}

/**
 * Wait for scan to complete by polling command status
 */
async function waitForScanComplete(agentId: string, commandId: string): Promise<void> {
  const maxWaitMs = 300000; // 5 minutes max
  const startTime = Date.now();
  const pollInterval = 2000; // Poll every 2 seconds

  return new Promise((resolve, reject) => {
    const interval = setInterval(async () => {
      // Check the timeout before polling so we never issue a request
      // after the deadline has passed
      if (Date.now() - startTime > maxWaitMs) {
        clearInterval(interval);
        reject(new Error('Scan timeout'));
        return;
      }
      try {
        const result = await api.get(`/agents/${agentId}/commands/${commandId}`);
        if (result.data.status === 'completed' || result.data.status === 'failed') {
          clearInterval(interval);
          resolve();
        }
      } catch (error) {
        clearInterval(interval);
        reject(error);
      }
    }, pollInterval);
  });
}
```
**Usage Example in Component**:
```typescript
import { useScanState } from '@/hooks/useScanState';

function ScanButton({ agentId, subsystem }: { agentId: string; subsystem: string }) {
  const { isScanning, triggerScan } = useScanState(agentId, subsystem);

  return (
    <button
      onClick={triggerScan}
      disabled={isScanning}
      className={isScanning ? 'btn-disabled' : 'btn-primary'}
    >
      {isScanning ? (
        <>
          <Spinner className="animate-spin" />
          Scanning...
        </>
      ) : (
        `Scan ${subsystem}`
      )}
    </button>
  );
}
```
#### 5.2 Update Existing Error Calls
**Priority Files to Update**
1. **Agent Subsystem Actions** - `/src/components/AgentSubsystems.tsx`
2. **Command Retry Logic** - `/src/hooks/useCommands.ts`
3. **Authentication Errors** - `/src/lib/auth.ts`
4. **API Error Boundaries** - `/src/components/ErrorBoundary.tsx`
### Example Complete Integration
**File**: `aggregator-web/src/components/AgentSubsystems.tsx` (example update)
```typescript
import { toastWithLogging } from '@/lib/toast-with-logging';

const handleTrigger = async (subsystem: string) => {
  try {
    await triggerSubsystem(agentId, subsystem);
  } catch (error: any) {
    toastWithLogging.error(
      `Failed to trigger ${subsystem} scan: ${error.message}`,
      {
        subsystem,
        id: `trigger-${subsystem}`,
      }
    );
  }
};
```
### Time Required: 15 minutes
#### 5.3 Deduplication Testing
**Test Cases**:
```typescript
// Test 1: Rapid clicking prevention
test('clicking scan button 10 times creates only 1 command', async () => {
  const button = screen.getByText('Scan APT');

  // Click 10 times rapidly
  for (let i = 0; i < 10; i++) {
    fireEvent.click(button);
  }

  // Should only create 1 command
  expect(api.post).toHaveBeenCalledTimes(1);
  expect(api.post).toHaveBeenCalledWith('/agents/123/subsystems/apt/trigger');
});

// Test 2: Button disabled while scanning
test('button disabled during scan', async () => {
  const button = screen.getByText('Scan APT');
  fireEvent.click(button);

  // Button should be disabled immediately
  expect(button).toBeDisabled();
  expect(screen.getByText('Scanning...')).toBeInTheDocument();

  await waitFor(() => {
    expect(button).not.toBeDisabled();
  });
});

// Test 3: 409 Conflict surfaces the existing command
test('409 conflict surfaces the existing command', async () => {
  mock.onPost().reply(409, {
    error: 'Scan already in progress',
    command_id: 'existing-id',
  });

  await triggerScan();

  // The hook reports the already-running scan via an info toast
  expect(toastWithLogging.info).toHaveBeenCalledWith(
    expect.stringContaining('Scan already running'),
    expect.any(Object)
  );
});
```
---
## Phase 6: Verification & Testing
### Manual Testing Checklist
#### 6.1 Migration Testing
- [ ] Run migration 023 successfully
- [ ] Verify `client_errors` table exists
- [ ] Verify `idempotency_key` column added to `agent_commands`
- [ ] Test on fresh database (no duplicate key errors)
#### 6.2 Command Factory Testing
- [ ] Rapid-fire scan button clicks (10+ clicks in 2 seconds)
- [ ] Verify all commands created with unique IDs
- [ ] Check no duplicate key violations in logs
- [ ] Verify commands appear in database correctly
#### 6.3 Error Logging Testing
- [ ] Trigger UI error (e.g., invalid input)
- [ ] Verify error appears in toast
- [ ] Check database - error should be stored in `client_errors`
- [ ] Trigger API error (e.g., network timeout)
- [ ] Verify exponential backoff retry works
- [ ] Disconnect network, trigger error, reconnect
- [ ] Verify error is queued and sent when back online
#### 6.4 Integration Testing
- [ ] Full user workflow: login → trigger scan → view results
- [ ] Verify all errors logged with [HISTORY] prefix
- [ ] Check logs are queryable by subsystem
- [ ] Verify error logging doesn't block UI
### Automated Test Cases
#### 6.5 Backend Tests
**File**: `aggregator-server/internal/command/factory_test.go`
```go
package command

import (
    "testing"

    "github.com/google/uuid"
    "github.com/stretchr/testify/assert"
    "github.com/stretchr/testify/require"

    // Adjust this import to the project's actual module path.
    "aggregator-server/internal/models"
)

func TestFactory_Create(t *testing.T) {
    factory := NewFactory()
    agentID := uuid.New()

    cmd, err := factory.Create(agentID, "scan_storage", map[string]interface{}{"path": "/"})
    require.NoError(t, err)

    assert.NotEqual(t, uuid.Nil, cmd.ID, "ID must be generated")
    assert.Equal(t, agentID, cmd.AgentID)
    assert.Equal(t, "scan_storage", cmd.CommandType)
    assert.Equal(t, "pending", cmd.Status)
    assert.Equal(t, "manual", cmd.Source)
}

func TestFactory_CreateWithIdempotency(t *testing.T) {
    factory := NewFactory()
    agentID := uuid.New()
    idempotencyKey := "test-key-123"

    // Create first command
    cmd1, err := factory.CreateWithIdempotency(agentID, "scan_system", nil, idempotencyKey)
    require.NoError(t, err)

    // Create "duplicate" command with same idempotency key
    cmd2, err := factory.CreateWithIdempotency(agentID, "scan_system", nil, idempotencyKey)
    require.NoError(t, err)

    // Should return same command
    assert.Equal(t, cmd1.ID, cmd2.ID, "Idempotency key should return same command")
}

func TestFactory_Validate(t *testing.T) {
    tests := []struct {
        name    string
        cmd     *models.AgentCommand
        wantErr bool
    }{
        {
            name: "valid command",
            cmd: &models.AgentCommand{
                ID:          uuid.New(),
                AgentID:     uuid.New(),
                CommandType: "scan_storage",
                Status:      "pending",
                Source:      "manual",
            },
            wantErr: false,
        },
        {
            name: "missing ID",
            cmd: &models.AgentCommand{
                ID:          uuid.Nil,
                AgentID:     uuid.New(),
                CommandType: "scan_storage",
                Status:      "pending",
                Source:      "manual",
            },
            wantErr: true,
        },
    }

    factory := NewFactory()
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            // Validate the table's command directly; calling Create here
            // would always generate a fresh ID, so the "missing ID" case
            // could never fail.
            err := factory.Validate(tt.cmd)
            if tt.wantErr {
                assert.Error(t, err)
            } else {
                assert.NoError(t, err)
            }
        })
    }
}
```
#### 6.6 Frontend Tests
**File**: `aggregator-web/src/lib/client-error-logger.test.ts`
```typescript
import { clientErrorLogger } from './client-error-logger';
import { api } from './api';

jest.mock('./api');

describe('ClientErrorLogger', () => {
  beforeEach(() => {
    localStorage.clear();
    jest.clearAllMocks();
  });

  test('logs error successfully on first attempt', async () => {
    (api.post as jest.Mock).mockResolvedValue({});

    await clientErrorLogger.logError({
      subsystem: 'storage',
      error_type: 'api_error',
      message: 'Test error',
    });

    expect(api.post).toHaveBeenCalledTimes(1);
    expect(api.post).toHaveBeenCalledWith(
      '/logs/client-error',
      expect.objectContaining({
        subsystem: 'storage',
        error_type: 'api_error',
        message: 'Test error',
      }),
      expect.any(Object)
    );
  });

  test('retries on failure then saves to localStorage', async () => {
    (api.post as jest.Mock)
      .mockRejectedValueOnce(new Error('Network error'))
      .mockRejectedValueOnce(new Error('Network error'))
      .mockRejectedValueOnce(new Error('Network error'));

    await clientErrorLogger.logError({
      subsystem: 'storage',
      error_type: 'api_error',
      message: 'Test error',
    });

    expect(api.post).toHaveBeenCalledTimes(3);

    // Should be saved to localStorage
    const queue = localStorage.getItem('redflag-error-queue');
    expect(queue).toBeTruthy();
    expect(JSON.parse(queue!).length).toBe(1);
  });

  test('flushes queue when coming back online', async () => {
    // Pre-populate queue
    const queuedError = {
      subsystem: 'storage',
      error_type: 'api_error',
      message: 'Queued error',
      timestamp: new Date().toISOString(),
    };
    localStorage.setItem('redflag-error-queue', JSON.stringify([queuedError]));
    (api.post as jest.Mock).mockResolvedValue({});

    // Trigger online event
    window.dispatchEvent(new Event('online'));

    // Wait for flush
    await new Promise(resolve => setTimeout(resolve, 100));

    expect(api.post).toHaveBeenCalled();
    expect(localStorage.getItem('redflag-error-queue')).toBe('[]');
  });
});
```
### Time Required: 30 minutes
---
## Implementation Checklist
### Pre-Implementation
- [ ] ✅ Migration system bug fixed (lines 103 & 116 in db.go)
- [ ] ✅ Database wiped and fresh instance ready
- [ ] ✅ Test agents available for rapid scan testing
- [ ] ✅ Development environment ready (all 3 components)
### Phase 1: Command Factory (25 min)
- [ ] Create `aggregator-server/internal/command/factory.go`
- [ ] Create `aggregator-server/internal/command/validator.go`
- [ ] Update `aggregator-server/internal/models/command.go`
- [ ] Update `aggregator-server/internal/api/handlers/subsystems.go`
- [ ] Test: Verify rapid scan clicks work
### Phase 2: Database Schema (5 min)
- [ ] Create migration `023_client_error_logging.up.sql`
- [ ] Create migration `023_client_error_logging.down.sql`
- [ ] Run migration and verify table creation
- [ ] Verify indexes created
### Phase 3: Backend Handler (20 min)
- [ ] Create `aggregator-server/internal/api/handlers/client_errors.go`
- [ ] Create `aggregator-server/internal/database/queries/client_errors.sql`
- [ ] Update `aggregator-server/internal/api/router.go`
- [ ] Test API endpoint with curl
### Phase 4: Frontend Logger (20 min)
- [ ] Create `aggregator-web/src/lib/client-error-logger.ts`
- [ ] Create `aggregator-web/src/lib/toast-with-logging.ts`
- [ ] Update `aggregator-web/src/lib/api.ts`
- [ ] Test offline/online queue behavior
### Phase 5: Toast Integration (15 min)
- [ ] Create `useScanState` hook for button state management
- [ ] Update scan buttons to use `useScanState`
- [ ] Test button disabling during scan
- [ ] Update 3-5 critical error locations to use `toastWithLogging`
- [ ] Verify errors appear in both toast and database
- [ ] Test in multiple subsystems
- [ ] **Test deduplication**: Rapid clicking creates only 1 command
- [ ] **Test 409 response**: Returns existing command when scan running
### Phase 6: Verification (30 min)
- [ ] Run all test cases
- [ ] Verify ETHOS compliance checklist
- [ ] Test rapid scan clicking (no duplicates)
- [ ] Test error persistence across page reloads
- [ ] Verify [HISTORY] logs in server output
### Documentation
- [ ] Update session documentation
- [ ] Create testing summary
- [ ] Document any issues encountered
- [ ] Update architecture documentation
---
## Risk Mitigation
### Risk 1: Migration Failures
**Probability**: Medium | **Impact**: High | **Severity**: 🔴 Critical
**Mitigation**:
- Fix migration runner bug FIRST (before this implementation)
- Test migration on fresh database
- Keep database backups
- Have rollback script ready
**Contingency**: If migration fails, manually apply SQL and continue
---
### Risk 2: Performance Impact
**Probability**: Low | **Impact**: Medium | **Severity**: 🟡 Medium
**Mitigation**:
- Async error logging (non-blocking)
- LocalStorage queue with size limit (max 50 errors)
- Database indexes for fast queries
- Batch insert if needed in future
**Contingency**: If performance degrades, add sampling (log 1 in 10 errors)
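If sampling ever becomes necessary, a guard like the following could sit in front of `clientErrorLogger.logError`. Both `SAMPLE_RATE` and `shouldLogError` are illustrative names, not part of the existing codebase; the injectable `random` parameter just makes the guard testable:

```typescript
// Hypothetical sampling contingency: log roughly 1 in 10 errors.
const SAMPLE_RATE = 0.1;

// Accepts an injectable random source so the guard can be unit-tested
// deterministically; defaults to Math.random in production.
function shouldLogError(random: () => number = Math.random): boolean {
  return random() < SAMPLE_RATE;
}
```

Callers would wrap the existing call: `if (shouldLogError()) clientErrorLogger.logError(...)`.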
---
### Risk 3: Infinite Error Loops
**Probability**: Low | **Impact**: High | **Severity**: 🟡 Medium
**Mitigation**:
- `X-Error-Logger-Request` header prevents recursive logging
- Max retry count (3 attempts)
- Exponential backoff prevents thundering herd
**Contingency**: If loop detected, check for missing header and fix
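A minimal sketch of the two halves of that guard, assuming the header name already used in Phase 4.3 (`loggerRequestConfig` and `isLoggerRequest` are hypothetical helper names used here only for illustration):

```typescript
// Recursion guard sketch: the error logger tags its own POST with this
// header, and the axios response interceptor skips any tagged request.
const LOGGER_HEADER = 'X-Error-Logger-Request';

// Config the logger would pass to api.post('/logs/client-error', ...).
function loggerRequestConfig() {
  return { headers: { [LOGGER_HEADER]: 'true' } };
}

// Check the interceptor performs before logging a failed request.
function isLoggerRequest(config?: { headers?: Record<string, string> }): boolean {
  return config?.headers?.[LOGGER_HEADER] === 'true';
}
```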
---
### Risk 4: Privacy Concerns
**Probability**: Low | **Impact**: High | **Severity**: 🟡 Medium
**Mitigation**:
- No PII in error messages (validate during logging)
- User agent stored but can be anonymized
- Stack traces only from our code (not user code)
**Contingency**: Add privacy filter to scrub sensitive data
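One possible shape for that contingency filter. The two patterns below are illustrative only (bearer tokens and email addresses) and would need review before being trusted as a real PII scrub:

```typescript
// Hypothetical privacy filter: redact obvious secrets and emails from a
// message before it is sent to the error logger.
function scrubMessage(message: string): string {
  return message
    // Redact bearer tokens that leaked into an error string
    .replace(/Bearer\s+[A-Za-z0-9._-]+/g, 'Bearer [REDACTED]')
    // Redact email addresses
    .replace(/[\w.+-]+@[\w-]+(\.[\w-]+)+/g, '[EMAIL]');
}
```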
---
## Post-Implementation Review
### Success Criteria
- [ ] No duplicate key violations during rapid clicking
- [ ] All errors persist in database
- [ ] Error logs queryable and useful for debugging
- [ ] No performance degradation observed
- [ ] System handles offline/online transitions gracefully
- [ ] All tests pass
### Performance Benchmarks
- Command creation: < 10ms per command
- Error logging: < 50ms per error (async)
- Database queries: < 100ms for common queries
- Bundle size increase: < 5KB gzipped
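For manual spot checks of these budgets, a small timing wrapper is enough (`measure` is a hypothetical helper, not part of the codebase):

```typescript
// Wrap any async call and report its wall-clock duration, e.g.
// measure('logError', () => clientErrorLogger.logError({...})).
async function measure<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = Date.now();
  try {
    return await fn();
  } finally {
    // Logged even when fn rejects, so failures are timed too
    console.log(`${label} took ${Date.now() - start}ms`);
  }
}
```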
### Known Limitations
- Error logs don't include full request payloads (privacy)
- localStorage queue limited by browser storage (~5MB)
- Retries happen in foreground (could be moved to background)
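The 50-entry cap mentioned under Risk 2 is what keeps the localStorage limitation manageable. A sketch of the bounding logic (`capQueue` and `MAX_QUEUE` are illustrative names):

```typescript
// Keep the error queue bounded: append the new entry, then keep only
// the newest MAX_QUEUE entries so localStorage can never fill up.
const MAX_QUEUE = 50;

function capQueue<T>(queue: T[], entry: T): T[] {
  return [...queue, entry].slice(-MAX_QUEUE);
}
```

The logger would apply this before each `localStorage.setItem` of the serialized queue.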
### Future Enhancements (Post v0.1.27)
- Error aggregation and deduplication
- Error rate alerting
- Error analytics dashboard
- Automatic error categorization
- Integration with notification system
---
## Rollback Plan
If critical issues arise:
1. **Revert Code Changes**:
```bash
git revert HEAD~6..HEAD # Revert last 6 commits
```
2. **Rollback Database**:
```bash
cd aggregator-server
# Run down migration
go run cmd/migrate/main.go -migrate-down 1
```
3. **Rebuild and Deploy**:
```bash
docker-compose build --no-cache
docker-compose up -d
```
---
## Additional Notes
**Team Coordination**:
- Coordinate with frontend team if they're working on error handling
- Notify QA about new error logging features for testing
- Update documentation team about database schema changes
**Monitoring**:
- Monitor `client_errors` table growth
- Set up alerts for error rate spikes
- Track failed error logging attempts
**Documentation Updates**:
- Update API documentation for `/logs/client-error` endpoint
- Document error log query patterns for support team
- Add troubleshooting guide for common errors
---
**Plan Created By**: Ani (AI Assistant)
**Reviewed By**: Casey Tunturi
**Status**: 🟢 APPROVED FOR IMPLEMENTATION
**Next Step**: Begin Phase 1 (Command Factory)
**Estimated Timeline**:
- Start: Immediately
- Complete: ~2-3 hours
- Test: 30 minutes
- Deploy: After verification
This is a complete, production-ready implementation plan. Each phase builds on the previous one, with full error handling, testing, and rollback procedures included.
Let's build this right. 💪