# RedFlag Clean Architecture Implementation Master Plan

**Date**: 2025-12-19
**Version**: v1.0
**Total Implementation Time**: 3-4 hours (including migration fixes and command deduplication)
**Status**: READY FOR EXECUTION

---

## Executive Summary

Complete implementation plan for fixing critical ETHOS violations and implementing clean architecture patterns across RedFlag v0.1.27. Addresses duplicate command generation, lost frontend errors, and migration system bugs.

**Three Core Objectives:**

1. ✅ Fix migration system (blocks everything else)
2. ✅ Implement command factory pattern (prevents duplicate key violations)
3. ✅ Build frontend error logging system (ETHOS #1 compliance)

---

## Table of Contents

1. [Pre-Implementation: Migration System Fix](#pre-implementation-migration-system-fix)
2. [Phase 1: Command Factory Pattern](#phase-1-command-factory-pattern)
3. [Phase 2: Database Schema](#phase-2-database-schema)
4. [Phase 3: Backend Error Handler](#phase-3-backend-error-handler)
5. [Phase 4: Frontend Error Logger](#phase-4-frontend-error-logger)
6. [Phase 5: Toast Integration](#phase-5-toast-integration)
7. [Phase 6: Verification & Testing](#phase-6-verification-and-testing)
8. [Implementation Checklist](#implementation-checklist)
9. [Risk Mitigation](#risk-mitigation)
10. [Post-Implementation Review](#post-implementation-review)

---

## Pre-Implementation: Migration System Fix

**⚠️ CRITICAL: Must be completed first - blocks all other work**

### Problem

Migration runner has duplicate INSERT logic causing "duplicate key value violates unique constraint" errors on fresh installations.

### Root Cause

File: `aggregator-server/internal/database/db.go`

- Line 103: Executes `INSERT INTO schema_migrations (version) VALUES ($1)`
- Line 116: Executes the exact same INSERT statement
- Result: Every migration filename gets inserted twice

### Solution

```go
// File: aggregator-server/internal/database/db.go

// Lines 95-120: Fix duplicate INSERT logic
func (db *DB) Migrate() error {
	// ... existing code ...

	for _, file := range files {
		filename := file.Name()

		// ❌ REMOVE THIS - Line 103 duplicates line 116
		// if _, err = tx.Exec("INSERT INTO schema_migrations (version) VALUES ($1)", filename); err != nil {
		//     return fmt.Errorf("failed to mark migration %s as applied: %w", filename, err)
		// }

		// Keep only the EXECUTE + INSERT combo at lines 110-116
		if _, err = tx.Exec(string(content)); err != nil {
			log.Printf("Migration %s failed, marking as applied: %v", filename, err)
		}

		// ✅ Keep this INSERT - it's the correct location
		if _, err = tx.Exec("INSERT INTO schema_migrations (version) VALUES ($1)", filename); err != nil {
			return fmt.Errorf("failed to mark migration %s as applied: %w", filename, err)
		}
	}

	// ... rest of function ...
}
```

### Validation Steps

1. Wipe database completely: `docker-compose down -v`
2. Start fresh: `docker-compose up -d`
3. Check migration logs: all migrations should apply without duplicate key errors
4. Verify: `SELECT COUNT(DISTINCT version) = COUNT(version) FROM schema_migrations;`
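
The invariant behind step 4 can also be expressed as a small standalone Go helper (a hypothetical sketch, not one of the plan's files) over the list of applied migration versions:

```go
package main

// duplicateVersions returns every migration version that appears more than
// once in the applied list. After the fix, this should always be empty.
func duplicateVersions(applied []string) []string {
	seen := map[string]int{}
	for _, v := range applied {
		seen[v]++
	}
	var dups []string
	for v, n := range seen {
		if n > 1 {
			dups = append(dups, v)
		}
	}
	return dups
}
```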

### Time Required: 5 minutes

**Blocker Status**: 🔴 CRITICAL - Do not proceed until fixed

---

## Phase 1: Command Factory Pattern

### Objective

Prevent duplicate command key violations by ensuring all commands have properly generated UUIDs at creation time.

### Files to Create

#### 1.1 Command Factory

**File**: `aggregator-server/internal/command/factory.go`

```go
package command

import (
	"database/sql"
	"fmt"
	"time"

	"github.com/google/uuid"

	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)

// Factory creates validated AgentCommand instances
type Factory struct {
	validator *Validator
}

// NewFactory creates a new command factory
func NewFactory() *Factory {
	return &Factory{
		validator: NewValidator(),
	}
}

// Create generates a new validated AgentCommand with unique ID
func (f *Factory) Create(agentID uuid.UUID, commandType string, params map[string]interface{}) (*models.AgentCommand, error) {
	cmd := &models.AgentCommand{
		ID:          uuid.New(), // Immediate, explicit generation
		AgentID:     agentID,
		CommandType: commandType,
		Status:      "pending",
		Source:      determineSource(commandType),
		Params:      params,
		CreatedAt:   time.Now(),
		UpdatedAt:   time.Now(),
	}

	if err := f.validator.Validate(cmd); err != nil {
		return nil, fmt.Errorf("command validation failed: %w", err)
	}

	return cmd, nil
}

// CreateWithIdempotency generates a command with idempotency protection.
// findByIdempotencyKey and storeIdempotencyKey are implemented against the
// command storage layer (implementation depends on the database query layer).
func (f *Factory) CreateWithIdempotency(agentID uuid.UUID, commandType string,
	params map[string]interface{}, idempotencyKey string) (*models.AgentCommand, error) {

	// Check for existing command with same idempotency key
	existing, err := f.findByIdempotencyKey(agentID, idempotencyKey)
	if err != nil && err != sql.ErrNoRows {
		return nil, fmt.Errorf("failed to check idempotency: %w", err)
	}

	if existing != nil {
		return existing, nil // Return existing command instead of creating duplicate
	}

	cmd, err := f.Create(agentID, commandType, params)
	if err != nil {
		return nil, err
	}

	// Store idempotency key with command
	if err := f.storeIdempotencyKey(cmd.ID, agentID, idempotencyKey); err != nil {
		return nil, fmt.Errorf("failed to store idempotency key: %w", err)
	}

	return cmd, nil
}

// determineSource classifies command source based on type
func determineSource(commandType string) string {
	if isSystemCommand(commandType) {
		return "system"
	}
	return "manual"
}

func isSystemCommand(commandType string) bool {
	systemCommands := []string{
		"enable_heartbeat",
		"disable_heartbeat",
		"update_check",
		"cleanup_old_logs",
	}

	for _, cmd := range systemCommands {
		if commandType == cmd {
			return true
		}
	}
	return false
}
```
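
A quick check of the source-classification helpers: the sketch below copies `isSystemCommand` and `determineSource` verbatim into `package main` so it compiles on its own, outside the `command` package.

```go
package main

// isSystemCommand and determineSource are verbatim copies of the
// helpers in factory.go, lifted out for a standalone check.
func isSystemCommand(commandType string) bool {
	systemCommands := []string{
		"enable_heartbeat",
		"disable_heartbeat",
		"update_check",
		"cleanup_old_logs",
	}
	for _, cmd := range systemCommands {
		if commandType == cmd {
			return true
		}
	}
	return false
}

func determineSource(commandType string) string {
	if isSystemCommand(commandType) {
		return "system"
	}
	return "manual"
}
```

Anything not in the fixed `systemCommands` list, including every `scan_*` command, classifies as `manual`.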

#### 1.2 Command Validator

**File**: `aggregator-server/internal/command/validator.go`

```go
package command

import (
	"errors"
	"fmt"

	"github.com/google/uuid"

	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)

// Validator validates command parameters
type Validator struct {
	minCheckInSeconds int
	maxCheckInSeconds int
	minScannerMinutes int
	maxScannerMinutes int
}

// NewValidator creates a new command validator
func NewValidator() *Validator {
	return &Validator{
		minCheckInSeconds: 60,   // 1 minute minimum
		maxCheckInSeconds: 3600, // 1 hour maximum
		minScannerMinutes: 1,    // 1 minute minimum
		maxScannerMinutes: 1440, // 24 hours maximum
	}
}

// Validate performs comprehensive command validation
func (v *Validator) Validate(cmd *models.AgentCommand) error {
	if cmd == nil {
		return errors.New("command cannot be nil")
	}

	if cmd.ID == uuid.Nil {
		return errors.New("command ID cannot be zero UUID")
	}

	if cmd.AgentID == uuid.Nil {
		return errors.New("agent ID is required")
	}

	if cmd.CommandType == "" {
		return errors.New("command type is required")
	}

	if cmd.Status == "" {
		return errors.New("status is required")
	}

	validStatuses := []string{"pending", "running", "completed", "failed", "cancelled"}
	if !contains(validStatuses, cmd.Status) {
		return fmt.Errorf("invalid status: %s", cmd.Status)
	}

	if cmd.Source != "manual" && cmd.Source != "system" {
		return fmt.Errorf("source must be 'manual' or 'system', got: %s", cmd.Source)
	}

	// Validate command type format
	if err := v.validateCommandType(cmd.CommandType); err != nil {
		return err
	}

	return nil
}

// ValidateSubsystemAction validates subsystem-specific actions
func (v *Validator) ValidateSubsystemAction(subsystem string, action string) error {
	validActions := map[string][]string{
		"storage": {"trigger", "enable", "disable", "set_interval"},
		"system":  {"trigger", "enable", "disable", "set_interval"},
		"docker":  {"trigger", "enable", "disable", "set_interval"},
		"updates": {"trigger", "enable", "disable", "set_interval"},
	}

	actions, ok := validActions[subsystem]
	if !ok {
		return fmt.Errorf("unknown subsystem: %s", subsystem)
	}

	if !contains(actions, action) {
		return fmt.Errorf("invalid action '%s' for subsystem '%s'", action, subsystem)
	}

	return nil
}

// ValidateInterval ensures scanner intervals are within bounds
func (v *Validator) ValidateInterval(subsystem string, minutes int) error {
	if minutes < v.minScannerMinutes {
		return fmt.Errorf("interval %d minutes below minimum %d for subsystem %s",
			minutes, v.minScannerMinutes, subsystem)
	}

	if minutes > v.maxScannerMinutes {
		return fmt.Errorf("interval %d minutes above maximum %d for subsystem %s",
			minutes, v.maxScannerMinutes, subsystem)
	}

	return nil
}

func (v *Validator) validateCommandType(commandType string) error {
	validPrefixes := []string{"scan_", "install_", "update_", "enable_", "disable_", "reboot"}

	for _, prefix := range validPrefixes {
		if len(commandType) >= len(prefix) && commandType[:len(prefix)] == prefix {
			return nil
		}
	}

	return fmt.Errorf("invalid command type format: %s", commandType)
}

func contains(slice []string, item string) bool {
	for _, s := range slice {
		if s == item {
			return true
		}
	}
	return false
}
```
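
The prefix loop in `validateCommandType` is easy to pin down in isolation. The sketch below mirrors it as a boolean helper (note that bare `reboot` passes because the "prefix" equals the whole string):

```go
package main

// isValidCommandType mirrors the prefix loop in validateCommandType,
// returning a bool instead of an error for easy checking.
func isValidCommandType(commandType string) bool {
	validPrefixes := []string{"scan_", "install_", "update_", "enable_", "disable_", "reboot"}
	for _, prefix := range validPrefixes {
		if len(commandType) >= len(prefix) && commandType[:len(prefix)] == prefix {
			return true
		}
	}
	return false
}
```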

#### 1.3 Update AgentCommand Model

**File**: `aggregator-server/internal/models/command.go`

```go
package models

import (
	"database/sql"
	"errors"
	"time"

	"github.com/google/uuid"
	"github.com/lib/pq"
)

// AgentCommand represents a command sent to an agent
type AgentCommand struct {
	ID          uuid.UUID              `db:"id" json:"id"`
	AgentID     uuid.UUID              `db:"agent_id" json:"agent_id"`
	CommandType string                 `db:"command_type" json:"command_type"`
	Status      string                 `db:"status" json:"status"`
	Source      string                 `db:"source" json:"source"`
	Params      map[string]interface{} `db:"params" json:"params"` // marshalled to JSONB by the storage layer
	Result      sql.NullString         `db:"result" json:"result,omitempty"`
	Error       sql.NullString         `db:"error" json:"error,omitempty"`
	RetryCount  int                    `db:"retry_count" json:"retry_count"`
	CreatedAt   time.Time              `db:"created_at" json:"created_at"`
	UpdatedAt   time.Time              `db:"updated_at" json:"updated_at"`
	CompletedAt pq.NullTime            `db:"completed_at" json:"completed_at,omitempty"`

	// Idempotency support (VARCHAR(64) column added in migration 023a)
	IdempotencyKey sql.NullString `db:"idempotency_key" json:"-"`
}

// Validate checks if the command is valid
func (c *AgentCommand) Validate() error {
	if c.ID == uuid.Nil {
		return ErrCommandIDRequired
	}
	if c.AgentID == uuid.Nil {
		return ErrAgentIDRequired
	}
	if c.CommandType == "" {
		return ErrCommandTypeRequired
	}
	if c.Status == "" {
		return ErrStatusRequired
	}
	if c.Source != "manual" && c.Source != "system" {
		return ErrInvalidSource
	}

	return nil
}

// IsTerminal returns true if the command is in a terminal state
func (c *AgentCommand) IsTerminal() bool {
	return c.Status == "completed" || c.Status == "failed" || c.Status == "cancelled"
}

// CanRetry returns true if the command can be retried
func (c *AgentCommand) CanRetry() bool {
	return c.Status == "failed" && c.RetryCount < 3
}

// Predefined errors for validation
var (
	ErrCommandIDRequired   = errors.New("command ID cannot be zero UUID")
	ErrAgentIDRequired     = errors.New("agent ID is required")
	ErrCommandTypeRequired = errors.New("command type is required")
	ErrStatusRequired      = errors.New("status is required")
	ErrInvalidSource       = errors.New("source must be 'manual' or 'system'")
)
```

#### 1.4 Update Subsystem Handler

**File**: `aggregator-server/internal/api/handlers/subsystems.go`

```go
package handlers

import (
	"database/sql"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
	"github.com/jmoiron/sqlx"

	"github.com/Fimeg/RedFlag/aggregator-server/internal/command"
	"github.com/Fimeg/RedFlag/aggregator-server/internal/models"
)

type SubsystemHandler struct {
	db             *sqlx.DB
	commandFactory *command.Factory
}

func NewSubsystemHandler(db *sqlx.DB) *SubsystemHandler {
	return &SubsystemHandler{
		db:             db,
		commandFactory: command.NewFactory(),
	}
}

// TriggerSubsystem creates and enqueues a subsystem command
func (h *SubsystemHandler) TriggerSubsystem(c *gin.Context) {
	agentID, err := uuid.Parse(c.Param("id"))
	if err != nil {
		log.Printf("[ERROR] [server] [subsystem] invalid_agent_id error=%v", err)
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid agent ID"})
		return
	}

	subsystem := c.Param("subsystem")
	if err := h.validateSubsystem(subsystem); err != nil {
		log.Printf("[ERROR] [server] [subsystem] invalid_subsystem subsystem=%s error=%v", subsystem, err)
		c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
		return
	}

	// DEDUPLICATION CHECK: Prevent multiple pending scans
	existingCmd, err := h.getPendingScanCommand(agentID, subsystem)
	if err != nil {
		log.Printf("[ERROR] [server] [subsystem] query_failed agent_id=%s subsystem=%s error=%v",
			agentID, subsystem, err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "internal error"})
		return
	}

	if existingCmd != nil {
		log.Printf("[INFO] [server] [subsystem] scan_already_pending agent_id=%s subsystem=%s command_id=%s",
			agentID, subsystem, existingCmd.ID)
		log.Printf("[HISTORY] [server] [scan_%s] duplicate_request_prevented agent_id=%s command_id=%s timestamp=%s",
			subsystem, agentID, existingCmd.ID, time.Now().Format(time.RFC3339))

		c.JSON(http.StatusConflict, gin.H{
			"error":      "Scan already in progress",
			"command_id": existingCmd.ID.String(),
			"subsystem":  subsystem,
			"status":     existingCmd.Status,
			"created_at": existingCmd.CreatedAt,
		})
		return
	}

	// Generate idempotency key from request context
	idempotencyKey := h.generateIdempotencyKey(c, agentID, subsystem)

	// Create command using factory
	cmd, err := h.commandFactory.CreateWithIdempotency(
		agentID,
		"scan_"+subsystem,
		map[string]interface{}{"subsystem": subsystem},
		idempotencyKey,
	)
	if err != nil {
		log.Printf("[ERROR] [server] [subsystem] command_creation_failed agent_id=%s subsystem=%s error=%v",
			agentID, subsystem, err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to create command"})
		return
	}

	// Store command in database
	if err := h.storeCommand(cmd); err != nil {
		log.Printf("[ERROR] [server] [subsystem] command_store_failed agent_id=%s command_id=%s error=%v",
			agentID, cmd.ID, err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to store command"})
		return
	}

	log.Printf("[INFO] [server] [subsystem] command_created agent_id=%s command_id=%s subsystem=%s",
		agentID, cmd.ID, subsystem)
	log.Printf("[HISTORY] [server] [scan_%s] command_created agent_id=%s command_id=%s source=manual timestamp=%s",
		subsystem, agentID, cmd.ID, time.Now().Format(time.RFC3339))

	c.JSON(http.StatusOK, gin.H{
		"message":    "Command created successfully",
		"command_id": cmd.ID.String(),
		"subsystem":  subsystem,
	})
}

// getPendingScanCommand checks for existing pending scan commands
func (h *SubsystemHandler) getPendingScanCommand(agentID uuid.UUID, subsystem string) (*models.AgentCommand, error) {
	var cmd models.AgentCommand
	query := `
		SELECT id, command_type, status, created_at
		FROM agent_commands
		WHERE agent_id = $1
		  AND command_type = $2
		  AND status = 'pending'
		LIMIT 1`

	commandType := "scan_" + subsystem
	err := h.db.Get(&cmd, query, agentID, commandType)
	if err != nil {
		if err == sql.ErrNoRows {
			return nil, nil // No pending command found
		}
		return nil, fmt.Errorf("query failed: %w", err)
	}

	return &cmd, nil
}

// validateSubsystem checks if subsystem is recognized
func (h *SubsystemHandler) validateSubsystem(subsystem string) error {
	validSubsystems := []string{"apt", "dnf", "windows", "winget", "storage", "system", "docker"}
	for _, valid := range validSubsystems {
		if subsystem == valid {
			return nil
		}
	}
	return fmt.Errorf("unknown subsystem: %s", subsystem)
}

// generateIdempotencyKey creates a key to prevent duplicate submissions
func (h *SubsystemHandler) generateIdempotencyKey(c *gin.Context, agentID uuid.UUID, subsystem string) string {
	// Use timestamp rounded to nearest minute for idempotency window
	// This allows retries within same minute but prevents duplicates across minutes
	timestampWindow := time.Now().Unix() / 60 // Round to minute
	return fmt.Sprintf("%s:%s:%d", agentID.String(), subsystem, timestampWindow)
}

// storeCommand persists command to database
func (h *SubsystemHandler) storeCommand(cmd *models.AgentCommand) error {
	// Implementation depends on your command storage layer
	// Use NamedExec or similar to insert command
	query := `
		INSERT INTO agent_commands
		(id, agent_id, command_type, status, source, params, created_at)
		VALUES (:id, :agent_id, :command_type, :status, :source, :params, NOW())`

	_, err := h.db.NamedExec(query, cmd)
	return err
}
```
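
The minute-window behaviour of `generateIdempotencyKey` is worth pinning down: requests inside the same wall-clock minute collapse to one key, while the next minute opens a new one. A standalone sketch with the agent ID as a plain string and the Unix timestamp injected for testability (the real handler uses `time.Now()` and a UUID):

```go
package main

import "fmt"

// idempotencyKey mirrors generateIdempotencyKey, parameterised on the
// timestamp so the windowing is easy to exercise.
func idempotencyKey(agentID, subsystem string, unixSeconds int64) string {
	window := unixSeconds / 60 // round down to the minute
	return fmt.Sprintf("%s:%s:%d", agentID, subsystem, window)
}
```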

### Time Required: 30 minutes

---
## Phase 2: Database Schema

### Migration 023a: Command Deduplication

**File**: `aggregator-server/internal/database/migrations/023a_command_deduplication.up.sql`

```sql
-- Command Deduplication Schema
-- Prevents multiple pending scan commands per subsystem per agent

-- Add unique constraint to enforce single pending command per subsystem
CREATE UNIQUE INDEX idx_agent_pending_subsystem
ON agent_commands(agent_id, command_type, status)
WHERE status = 'pending';

-- Add idempotency key support for retry scenarios
ALTER TABLE agent_commands ADD COLUMN idempotency_key VARCHAR(64) UNIQUE NULL;
CREATE INDEX idx_agent_commands_idempotency_key ON agent_commands(idempotency_key);

COMMENT ON COLUMN agent_commands.idempotency_key IS
    'Prevents duplicate command creation from retry logic. Based on (agent_id + subsystem + timestamp window).';
```

**File**: `aggregator-server/internal/database/migrations/023a_command_deduplication.down.sql`

```sql
DROP INDEX IF EXISTS idx_agent_pending_subsystem;
ALTER TABLE agent_commands DROP COLUMN IF EXISTS idempotency_key;
```

### Migration 023: Client Error Logging Table

**File**: `aggregator-server/internal/database/migrations/023_client_error_logging.up.sql`

```sql
-- Client Error Logging Schema
-- Implements ETHOS #1: Errors are History, Not /dev/null

CREATE TABLE client_errors (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    agent_id UUID REFERENCES agents(id) ON DELETE SET NULL,
    subsystem VARCHAR(50) NOT NULL,
    error_type VARCHAR(50) NOT NULL,
    message TEXT NOT NULL,
    stack_trace TEXT,
    metadata JSONB,
    url TEXT NOT NULL,
    user_agent TEXT,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Indexes for efficient querying
CREATE INDEX idx_client_errors_agent_time ON client_errors(agent_id, created_at DESC);
CREATE INDEX idx_client_errors_subsystem_time ON client_errors(subsystem, created_at DESC);
CREATE INDEX idx_client_errors_error_type_time ON client_errors(error_type, created_at DESC);
CREATE INDEX idx_client_errors_created_at ON client_errors(created_at DESC);

-- Comments for documentation
COMMENT ON TABLE client_errors IS 'Frontend error logs for debugging and auditing. Implements ETHOS #1.';
COMMENT ON COLUMN client_errors.agent_id IS 'Agent active when error occurred (NULL for pre-auth errors)';
COMMENT ON COLUMN client_errors.subsystem IS 'RedFlag subsystem being used (storage, system, docker, etc.)';
COMMENT ON COLUMN client_errors.error_type IS 'Error category: javascript_error, api_error, ui_error, validation_error';
COMMENT ON COLUMN client_errors.metadata IS 'Additional context (component, API response, user actions)';

-- NOTE: idempotency_key on agent_commands is added by migration 023a;
-- adding it here as well would fail with "column already exists".
```

**File**: `aggregator-server/internal/database/migrations/023_client_error_logging.down.sql`

```sql
DROP TABLE IF EXISTS client_errors;
-- idempotency_key is dropped by migration 023a's down migration
```

### Time Required: 5 minutes

---

## Phase 3: Backend Error Handler

### Files to Create

#### 3.1 Error Handler

**File**: `aggregator-server/internal/api/handlers/client_errors.go`

```go
package handlers

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"

	"github.com/gin-gonic/gin"
	"github.com/google/uuid"
	"github.com/jmoiron/sqlx"
)

// ClientErrorHandler handles frontend error logging per ETHOS #1
type ClientErrorHandler struct {
	db *sqlx.DB
}

// NewClientErrorHandler creates a new error handler
func NewClientErrorHandler(db *sqlx.DB) *ClientErrorHandler {
	return &ClientErrorHandler{db: db}
}

// LogErrorRequest represents a client error log entry
type LogErrorRequest struct {
	Subsystem  string                 `json:"subsystem" binding:"required"`
	ErrorType  string                 `json:"error_type" binding:"required,oneof=javascript_error api_error ui_error validation_error"`
	Message    string                 `json:"message" binding:"required,max=10000"`
	StackTrace string                 `json:"stack_trace,omitempty"`
	Metadata   map[string]interface{} `json:"metadata,omitempty"`
	URL        string                 `json:"url" binding:"required"`
}

// LogError processes and stores frontend errors
func (h *ClientErrorHandler) LogError(c *gin.Context) {
	var req LogErrorRequest
	if err := c.ShouldBindJSON(&req); err != nil {
		log.Printf("[ERROR] [server] [client_error] validation_failed error=\"%v\"", err)
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid request data"})
		return
	}

	// Extract agent ID from auth middleware if available
	var agentID interface{}
	if agentIDValue, exists := c.Get("agentID"); exists {
		if id, ok := agentIDValue.(uuid.UUID); ok {
			agentID = id
		}
	}

	// Log to console with HISTORY prefix
	log.Printf("[ERROR] [server] [client] [%s] agent_id=%v subsystem=%s message=\"%s\"",
		req.ErrorType, agentID, req.Subsystem, truncate(req.Message, 200))
	log.Printf("[HISTORY] [server] [client_error] agent_id=%v subsystem=%s type=%s url=\"%s\" message=\"%s\" timestamp=%s",
		agentID, req.Subsystem, req.ErrorType, req.URL, req.Message, time.Now().Format(time.RFC3339))

	// Store in database with retry logic
	if err := h.storeError(agentID, c.GetHeader("User-Agent"), req); err != nil {
		log.Printf("[ERROR] [server] [client_error] store_failed error=\"%v\"", err)
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to store error"})
		return
	}

	c.JSON(http.StatusOK, gin.H{"logged": true})
}

// storeError persists error to database with retry
func (h *ClientErrorHandler) storeError(agentID interface{}, userAgent string, req LogErrorRequest) error {
	const maxRetries = 3
	var lastErr error

	// The JSONB column expects serialized JSON, not a Go map
	metadataJSON, err := json.Marshal(req.Metadata)
	if err != nil {
		return fmt.Errorf("failed to marshal metadata: %w", err)
	}

	for attempt := 1; attempt <= maxRetries; attempt++ {
		query := `INSERT INTO client_errors (agent_id, subsystem, error_type, message, stack_trace, metadata, url, user_agent)
		          VALUES (:agent_id, :subsystem, :error_type, :message, :stack_trace, :metadata, :url, :user_agent)`

		_, err := h.db.NamedExec(query, map[string]interface{}{
			"agent_id":    agentID,
			"subsystem":   req.Subsystem,
			"error_type":  req.ErrorType,
			"message":     req.Message,
			"stack_trace": req.StackTrace,
			"metadata":    metadataJSON,
			"url":         req.URL,
			"user_agent":  userAgent,
		})

		if err == nil {
			return nil
		}

		lastErr = err
		if attempt < maxRetries {
			time.Sleep(time.Duration(attempt) * time.Second)
		}
	}

	return fmt.Errorf("failed after %d attempts: %w", maxRetries, lastErr)
}

func truncate(s string, maxLen int) string {
	if len(s) <= maxLen {
		return s
	}
	return s[:maxLen] + "..."
}

// hash is kept for planned message-deduplication detection
func hash(s string) string {
	h := sha256.Sum256([]byte(s))
	return fmt.Sprintf("%x", h)[:16]
}
```

#### 3.2 Query Client Errors

**File**: `aggregator-server/internal/database/queries/client_errors.sql`

```sql
-- name: GetClientErrorsByAgent :many
SELECT * FROM client_errors
WHERE agent_id = $1
ORDER BY created_at DESC
LIMIT $2;

-- name: GetClientErrorsBySubsystem :many
SELECT * FROM client_errors
WHERE subsystem = $1
ORDER BY created_at DESC
LIMIT $2;

-- name: GetClientErrorStats :many
SELECT
    subsystem,
    error_type,
    COUNT(*) as count,
    MIN(created_at) as first_occurrence,
    MAX(created_at) as last_occurrence
FROM client_errors
WHERE created_at > NOW() - INTERVAL '24 hours'
GROUP BY subsystem, error_type
ORDER BY count DESC;
```

#### 3.3 Update Router

**File**: `aggregator-server/internal/api/router.go`

```go
// Add to router setup function
func SetupRouter(db *sqlx.DB, cfg *config.Config) *gin.Engine {
	// ... existing setup ...

	// Error logging endpoint (authenticated)
	errorHandler := handlers.NewClientErrorHandler(db)
	apiV1.POST("/logs/client-error",
		middleware.AuthMiddleware(),
		errorHandler.LogError,
	)

	// Admin endpoint for viewing errors
	// (GetErrors listing handler lives alongside LogError in client_errors.go;
	// its implementation is not shown in this plan)
	apiV1.GET("/logs/client-errors",
		middleware.AuthMiddleware(),
		middleware.AdminMiddleware(),
		errorHandler.GetErrors,
	)

	// ... rest of setup ...
}
```

### Time Required: 20 minutes

---

## Phase 4: Frontend Error Logger

### Files to Create

#### 4.1 Client Error Logger

**File**: `aggregator-web/src/lib/client-error-logger.ts`

```typescript
import { api } from './api';

export interface ClientErrorLog {
  subsystem: string;
  error_type: 'javascript_error' | 'api_error' | 'ui_error' | 'validation_error';
  message: string;
  stack_trace?: string;
  metadata?: Record<string, any>;
  url: string;
  timestamp: string;
}

/**
 * ClientErrorLogger provides reliable frontend error logging with retry logic
 * Implements ETHOS #3: Assume Failure; Build for Resilience
 */
export class ClientErrorLogger {
  private maxRetries = 3;
  private baseDelayMs = 1000;
  private localStorageKey = 'redflag-error-queue';
  private offlineBuffer: ClientErrorLog[] = [];
  private isOnline = navigator.onLine;

  constructor() {
    // Listen for online/offline events
    window.addEventListener('online', () => {
      this.isOnline = true;
      this.flushOfflineBuffer();
    });
    window.addEventListener('offline', () => { this.isOnline = false; });
  }

  /**
   * Log an error with automatic retry and offline queuing
   */
  async logError(errorData: Omit<ClientErrorLog, 'url' | 'timestamp'>): Promise<void> {
    const fullError: ClientErrorLog = {
      ...errorData,
      url: window.location.href,
      timestamp: new Date().toISOString(),
    };

    // Try to send immediately
    try {
      await this.sendWithRetry(fullError);
      return;
    } catch {
      // If failed after retries, queue for later
      this.queueForRetry(fullError);
    }
  }

  /**
   * Send error to backend with exponential backoff retry
   */
  private async sendWithRetry(error: ClientErrorLog): Promise<void> {
    for (let attempt = 1; attempt <= this.maxRetries; attempt++) {
      try {
        await api.post('/logs/client-error', error, {
          headers: { 'X-Error-Logger-Request': 'true' },
        });

        // Success, remove from queue if it was there
        this.removeFromQueue(error);
        return;
      } catch (err) {
        if (attempt === this.maxRetries) {
          throw err; // Rethrow after final attempt
        }

        // Exponential backoff
        await this.sleep(this.baseDelayMs * Math.pow(2, attempt - 1));
      }
    }
  }

  /**
   * Queue error for retry when network is available
   */
  private queueForRetry(error: ClientErrorLog): void {
    try {
      const queue = this.getQueue();
      queue.push({
        ...error,
        retryCount: (error as any).retryCount || 0,
        queuedAt: new Date().toISOString(),
      });

      // Save to localStorage for persistence
      localStorage.setItem(this.localStorageKey, JSON.stringify(queue));

      // Also keep in memory buffer
      this.offlineBuffer.push(error);
    } catch (storageError) {
      // localStorage might be full or unavailable
      console.warn('Failed to queue error for retry:', storageError);
    }
  }

  /**
   * Flush offline buffer when coming back online
   */
  private async flushOfflineBuffer(): Promise<void> {
    if (!this.isOnline) return;

    const queue = this.getQueue();
    if (queue.length === 0) return;

    const failed: typeof queue = [];

    for (const queuedError of queue) {
      try {
        await this.sendWithRetry(queuedError);
      } catch {
        failed.push(queuedError);
      }
    }

    // Update queue with remaining failed items
    if (failed.length < queue.length) {
      localStorage.setItem(this.localStorageKey, JSON.stringify(failed));
    }
  }

  /**
   * Get current error queue from localStorage
   */
  private getQueue(): any[] {
    try {
      const stored = localStorage.getItem(this.localStorageKey);
      return stored ? JSON.parse(stored) : [];
    } catch {
      return [];
    }
  }

  /**
   * Remove successfully sent error from queue
   */
  private removeFromQueue(sentError: ClientErrorLog): void {
    try {
      const queue = this.getQueue();
      const filtered = queue.filter(queued =>
        queued.timestamp !== sentError.timestamp ||
        queued.message !== sentError.message
      );

      if (filtered.length < queue.length) {
        localStorage.setItem(this.localStorageKey, JSON.stringify(filtered));
      }
    } catch {
      // Best effort cleanup
    }
  }

  /**
   * Capture unhandled JavaScript errors
   */
  captureUnhandledErrors(): void {
    // Global error handler
    window.addEventListener('error', (event) => {
      this.logError({
        subsystem: 'global',
        error_type: 'javascript_error',
        message: event.message,
        stack_trace: event.error?.stack,
        metadata: {
          filename: event.filename,
          lineno: event.lineno,
          colno: event.colno,
        },
      }).catch(() => {
        // Silently ignore logging failures
      });
    });

    // Unhandled promise rejections
    window.addEventListener('unhandledrejection', (event) => {
      this.logError({
        subsystem: 'global',
        error_type: 'javascript_error',
        message: event.reason?.message || String(event.reason),
        stack_trace: event.reason?.stack,
      }).catch(() => {});
    });
  }

  private sleep(ms: number): Promise<void> {
    return new Promise(resolve => setTimeout(resolve, ms));
  }
}

// Singleton instance
export const clientErrorLogger = new ClientErrorLogger();

// Auto-capture unhandled errors
if (typeof window !== 'undefined') {
  clientErrorLogger.captureUnhandledErrors();
}
```
|
|
|
|
#### 4.2 Toast Wrapper with Logging
**File**: `aggregator-web/src/lib/toast-with-logging.ts`

```typescript
import toast, { ToastOptions } from 'react-hot-toast';
import { clientErrorLogger } from './client-error-logger';
import { useLocation } from 'react-router-dom';

/**
 * Extract subsystem from current route
 */
function getCurrentSubsystem(): string {
  if (typeof window === 'undefined') return 'unknown';

  const path = window.location.pathname;

  // Map routes to subsystems
  if (path.includes('/storage')) return 'storage';
  if (path.includes('/system')) return 'system';
  if (path.includes('/docker')) return 'docker';
  if (path.includes('/updates')) return 'updates';
  if (path.includes('/agent/')) return 'agent';

  return 'unknown';
}

/**
 * Wrap toast.error to automatically log errors to backend
 * Implements ETHOS #1: Errors are History
 */
export const toastWithLogging = {
  error: (message: string, options?: ToastOptions & { subsystem?: string }) => {
    const subsystem = options?.subsystem || getCurrentSubsystem();

    // Log to backend asynchronously - don't block UI
    clientErrorLogger.logError({
      subsystem,
      error_type: 'ui_error',
      message: message.substring(0, 5000), // Prevent excessively long messages
      metadata: {
        component: options?.id,
        duration: options?.duration,
        position: options?.position,
        timestamp: new Date().toISOString(),
      },
    }).catch(() => {
      // Silently ignore logging failures - don't crash the UI
    });

    // Show toast to user
    return toast.error(message, options);
  },

  // Passthrough methods. Note: react-hot-toast exposes no info/warning
  // variants, so those map to the neutral toast().
  success: toast.success,
  info: toast,
  warning: toast,
  loading: toast.loading,
  dismiss: toast.dismiss,
  remove: toast.remove,
  promise: toast.promise,
};

/**
 * React hook for toast with automatic subsystem detection
 */
export function useToastWithLogging() {
  const location = useLocation();

  return {
    error: (message: string, options?: ToastOptions) => {
      return toastWithLogging.error(message, {
        ...options,
        subsystem: getSubsystemFromPath(location.pathname),
      });
    },
    success: toast.success,
    info: toast,
    warning: toast,
    loading: toast.loading,
    dismiss: toast.dismiss,
  };
}

function getSubsystemFromPath(pathname: string): string {
  const matches = pathname.match(/\/(storage|system|docker|updates|agent)/);
  return matches ? matches[1] : 'unknown';
}
```

#### 4.3 API Integration
**Update**: `aggregator-web/src/lib/api.ts`

```typescript
// Add error logging to the axios response interceptor
import { clientErrorLogger } from './client-error-logger';

api.interceptors.response.use(
  (response) => response,
  async (error) => {
    // Don't log errors from the error logger itself
    if (error.config?.headers?.['X-Error-Logger-Request']) {
      return Promise.reject(error);
    }

    // Extract subsystem from URL
    const subsystem = extractSubsystem(error.config?.url);

    // Log API errors
    clientErrorLogger.logError({
      subsystem,
      error_type: 'api_error',
      message: error.message,
      metadata: {
        status_code: error.response?.status,
        endpoint: error.config?.url,
        method: error.config?.method,
        response_data: error.response?.data,
      },
    }).catch(() => {
      // Don't let logging errors hide the original error
    });

    return Promise.reject(error);
  }
);

function extractSubsystem(url: string = ''): string {
  const matches = url.match(/\/(storage|system|docker|updates|agent)/);
  return matches ? matches[1] : 'unknown';
}
```

### Time Required: 20 minutes

---

## Phase 5: Toast Integration

### Update Existing Error Calls

**Pattern**: Update error toast calls to use the new logger

**Before**:
```typescript
import toast from 'react-hot-toast';

toast.error(`Failed to trigger scan: ${error.message}`);
```

**After**:
```typescript
import { toastWithLogging } from '@/lib/toast-with-logging';

toastWithLogging.error(`Failed to trigger scan: ${error.message}`, {
  subsystem: 'storage', // Specify subsystem
  id: 'trigger-scan-error', // Optional component ID
});
```

#### 5.1 React State Management for Scan Buttons
**File**: Create `aggregator-web/src/hooks/useScanState.ts`

```typescript
import { useState, useCallback } from 'react';
import { api } from '@/lib/api';
import { toastWithLogging } from '@/lib/toast-with-logging';

interface ScanState {
  isScanning: boolean;
  commandId?: string;
  error?: string;
}

/**
 * Hook for managing scan button state and preventing duplicate scans
 */
export function useScanState(agentId: string, subsystem: string) {
  const [state, setState] = useState<ScanState>({
    isScanning: false,
  });

  const triggerScan = useCallback(async () => {
    if (state.isScanning) {
      toastWithLogging.info('Scan already in progress', { subsystem });
      return;
    }

    setState({ isScanning: true, commandId: undefined, error: undefined });

    try {
      const result = await api.post(`/agents/${agentId}/subsystems/${subsystem}/trigger`);

      setState(prev => ({
        ...prev,
        commandId: result.data.command_id,
      }));

      // Poll for completion or wait for subscription update
      await waitForScanComplete(agentId, result.data.command_id);

      setState({ isScanning: false, commandId: result.data.command_id });

      toastWithLogging.success(`${subsystem} scan completed`, { subsystem });
    } catch (error: any) {
      const isAlreadyRunning = error.response?.status === 409;

      if (isAlreadyRunning) {
        const existingCommandId = error.response?.data?.command_id;
        setState({
          isScanning: false,
          commandId: existingCommandId,
          error: 'Scan already in progress',
        });

        toastWithLogging.info(`Scan already running (command: ${existingCommandId})`, { subsystem });
      } else {
        const errorMessage = error.response?.data?.error || error.message;
        setState({
          isScanning: false,
          error: errorMessage,
        });

        toastWithLogging.error(`Failed to trigger scan: ${errorMessage}`, { subsystem });
      }
    }
  }, [agentId, subsystem, state.isScanning]);

  const reset = useCallback(() => {
    setState({ isScanning: false, commandId: undefined, error: undefined });
  }, []);

  return {
    isScanning: state.isScanning,
    commandId: state.commandId,
    error: state.error,
    triggerScan,
    reset,
  };
}

/**
 * Wait for scan to complete by polling command status
 */
async function waitForScanComplete(agentId: string, commandId: string): Promise<void> {
  const maxWaitMs = 300000; // 5 minutes max
  const startTime = Date.now();
  const pollInterval = 2000; // Poll every 2 seconds

  return new Promise((resolve, reject) => {
    const interval = setInterval(async () => {
      try {
        const result = await api.get(`/agents/${agentId}/commands/${commandId}`);

        if (result.data.status === 'completed' || result.data.status === 'failed') {
          clearInterval(interval);
          resolve();
        }
      } catch (error) {
        clearInterval(interval);
        reject(error);
      }

      if (Date.now() - startTime > maxWaitMs) {
        clearInterval(interval);
        reject(new Error('Scan timeout'));
      }
    }, pollInterval);
  });
}
```

**Usage Example in Component**:
```typescript
import { useScanState } from '@/hooks/useScanState';

function ScanButton({ agentId, subsystem }: { agentId: string; subsystem: string }) {
  const { isScanning, triggerScan } = useScanState(agentId, subsystem);

  return (
    <button
      onClick={triggerScan}
      disabled={isScanning}
      className={isScanning ? 'btn-disabled' : 'btn-primary'}
    >
      {isScanning ? (
        <>
          <Spinner className="animate-spin" />
          Scanning...
        </>
      ) : (
        `Scan ${subsystem}`
      )}
    </button>
  );
}
```

#### 5.2 Update Existing Error Calls
**Priority Files to Update**

1. **Agent Subsystem Actions** - `/src/components/AgentSubsystems.tsx`
2. **Command Retry Logic** - `/src/hooks/useCommands.ts`
3. **Authentication Errors** - `/src/lib/auth.ts`
4. **API Error Boundaries** - `/src/components/ErrorBoundary.tsx`

### Example Complete Integration

**File**: `aggregator-web/src/components/AgentSubsystems.tsx` (example update)
```typescript
import { toastWithLogging } from '@/lib/toast-with-logging';

const handleTrigger = async (subsystem: string) => {
  try {
    await triggerSubsystem(agentId, subsystem);
  } catch (error) {
    toastWithLogging.error(
      `Failed to trigger ${subsystem} scan: ${error.message}`,
      {
        subsystem,
        id: `trigger-${subsystem}`,
      }
    );
  }
};
```

### Time Required: 15 minutes

#### 5.3 Deduplication Testing
**Test Cases**:
```typescript
// Test 1: Rapid clicking prevention
test('clicking scan button 10 times creates only 1 command', async () => {
  const button = screen.getByText('Scan APT');

  // Click 10 times rapidly
  for (let i = 0; i < 10; i++) {
    fireEvent.click(button);
  }

  // Should only create 1 command
  expect(api.post).toHaveBeenCalledTimes(1);
  expect(api.post).toHaveBeenCalledWith('/agents/123/subsystems/apt/trigger');
});

// Test 2: Button disabled while scanning
test('button disabled during scan', async () => {
  const button = screen.getByText('Scan APT');

  fireEvent.click(button);

  // Button should be disabled immediately
  expect(button).toBeDisabled();
  expect(screen.getByText('Scanning...')).toBeInTheDocument();

  await waitFor(() => {
    expect(button).not.toBeDisabled();
  });
});

// Test 3: 409 Conflict returns existing command
test('409 response surfaces the existing command', async () => {
  mock.onPost().reply(409, {
    error: 'Scan already in progress',
    command_id: 'existing-id',
  });

  expect(await triggerScan()).toEqual({ command_id: 'existing-id' });
  expect(toast).toHaveBeenCalledWith('Scan already running');
});
```

---

## Phase 6: Verification & Testing

### Manual Testing Checklist

#### 6.1 Migration Testing
- [ ] Run migration 023 successfully
- [ ] Verify `client_errors` table exists
- [ ] Verify `idempotency_key` column added to `agent_commands`
- [ ] Test on fresh database (no duplicate key errors)

#### 6.2 Command Factory Testing
- [ ] Rapid-fire scan button clicks (10+ clicks in 2 seconds)
- [ ] Verify all commands created with unique IDs
- [ ] Check no duplicate key violations in logs
- [ ] Verify commands appear in database correctly

#### 6.3 Error Logging Testing
- [ ] Trigger UI error (e.g., invalid input)
- [ ] Verify error appears in toast
- [ ] Check database - error should be stored in `client_errors`
- [ ] Trigger API error (e.g., network timeout)
- [ ] Verify exponential backoff retry works
- [ ] Disconnect network, trigger error, reconnect
- [ ] Verify error is queued and sent when back online

#### 6.4 Integration Testing
- [ ] Full user workflow: login → trigger scan → view results
- [ ] Verify all errors logged with [HISTORY] prefix
- [ ] Check logs are queryable by subsystem
- [ ] Verify error logging doesn't block UI

### Automated Test Cases

#### 6.5 Backend Tests
**File**: `aggregator-server/internal/command/factory_test.go`
```go
package command

import (
	"testing"

	"github.com/google/uuid"
	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"

	// Adjust this import to the repository's actual module path
	"aggregator-server/internal/models"
)

func TestFactory_Create(t *testing.T) {
	factory := NewFactory()
	agentID := uuid.New()

	cmd, err := factory.Create(agentID, "scan_storage", map[string]interface{}{"path": "/"})

	require.NoError(t, err)
	assert.NotEqual(t, uuid.Nil, cmd.ID, "ID must be generated")
	assert.Equal(t, agentID, cmd.AgentID)
	assert.Equal(t, "scan_storage", cmd.CommandType)
	assert.Equal(t, "pending", cmd.Status)
	assert.Equal(t, "manual", cmd.Source)
}

func TestFactory_CreateWithIdempotency(t *testing.T) {
	factory := NewFactory()
	agentID := uuid.New()
	idempotencyKey := "test-key-123"

	// Create first command
	cmd1, err := factory.CreateWithIdempotency(agentID, "scan_system", nil, idempotencyKey)
	require.NoError(t, err)

	// Create "duplicate" command with same idempotency key
	cmd2, err := factory.CreateWithIdempotency(agentID, "scan_system", nil, idempotencyKey)
	require.NoError(t, err)

	// Should return same command
	assert.Equal(t, cmd1.ID, cmd2.ID, "Idempotency key should return same command")
}

func TestFactory_Validate(t *testing.T) {
	tests := []struct {
		name    string
		cmd     *models.AgentCommand
		wantErr bool
	}{
		{
			name: "valid command",
			cmd: &models.AgentCommand{
				ID:          uuid.New(),
				AgentID:     uuid.New(),
				CommandType: "scan_storage",
				Status:      "pending",
				Source:      "manual",
			},
			wantErr: false,
		},
		{
			name: "missing ID",
			cmd: &models.AgentCommand{
				ID:          uuid.Nil,
				AgentID:     uuid.New(),
				CommandType: "scan_storage",
				Status:      "pending",
				Source:      "manual",
			},
			wantErr: true,
		},
	}

	factory := NewFactory()

	for _, tt := range tests {
		t.Run(tt.name, func(t *testing.T) {
			// Validate the prebuilt command directly so the "missing ID"
			// case actually exercises validation (Create generates a fresh ID)
			err := factory.Validate(tt.cmd)
			if tt.wantErr {
				assert.Error(t, err)
			} else {
				assert.NoError(t, err)
			}
		})
	}
}
```

#### 6.6 Frontend Tests
**File**: `aggregator-web/src/lib/client-error-logger.test.ts`
```typescript
import { clientErrorLogger } from './client-error-logger';
import { api } from './api';

jest.mock('./api');

describe('ClientErrorLogger', () => {
  beforeEach(() => {
    localStorage.clear();
    jest.clearAllMocks();
  });

  test('logs error successfully on first attempt', async () => {
    (api.post as jest.Mock).mockResolvedValue({});

    await clientErrorLogger.logError({
      subsystem: 'storage',
      error_type: 'api_error',
      message: 'Test error',
    });

    expect(api.post).toHaveBeenCalledTimes(1);
    expect(api.post).toHaveBeenCalledWith(
      '/logs/client-error',
      expect.objectContaining({
        subsystem: 'storage',
        error_type: 'api_error',
        message: 'Test error',
      }),
      expect.any(Object)
    );
  });

  test('retries on failure then saves to localStorage', async () => {
    (api.post as jest.Mock)
      .mockRejectedValueOnce(new Error('Network error'))
      .mockRejectedValueOnce(new Error('Network error'))
      .mockRejectedValueOnce(new Error('Network error'));

    await clientErrorLogger.logError({
      subsystem: 'storage',
      error_type: 'api_error',
      message: 'Test error',
    });

    expect(api.post).toHaveBeenCalledTimes(3);

    // Should be saved to localStorage
    const queue = localStorage.getItem('redflag-error-queue');
    expect(queue).toBeTruthy();
    expect(JSON.parse(queue!).length).toBe(1);
  });

  test('flushes queue when coming back online', async () => {
    // Pre-populate queue
    const queuedError = {
      subsystem: 'storage',
      error_type: 'api_error',
      message: 'Queued error',
      timestamp: new Date().toISOString(),
    };
    localStorage.setItem('redflag-error-queue', JSON.stringify([queuedError]));

    (api.post as jest.Mock).mockResolvedValue({});

    // Trigger online event
    window.dispatchEvent(new Event('online'));

    // Wait for flush
    await new Promise(resolve => setTimeout(resolve, 100));

    expect(api.post).toHaveBeenCalled();
    expect(localStorage.getItem('redflag-error-queue')).toBe('[]');
  });
});
```

### Time Required: 30 minutes

---

## Implementation Checklist

### Pre-Implementation
- [ ] ✅ Migration system bug fixed (lines 103 & 116 in db.go)
- [ ] ✅ Database wiped and fresh instance ready
- [ ] ✅ Test agents available for rapid scan testing
- [ ] ✅ Development environment ready (all 3 components)

### Phase 1: Command Factory (25 min)
- [ ] Create `aggregator-server/internal/command/factory.go`
- [ ] Create `aggregator-server/internal/command/validator.go`
- [ ] Update `aggregator-server/internal/models/command.go`
- [ ] Update `aggregator-server/internal/api/handlers/subsystems.go`
- [ ] Test: Verify rapid scan clicks work

### Phase 2: Database Schema (5 min)
- [ ] Create migration `023_client_error_logging.up.sql`
- [ ] Create migration `023_client_error_logging.down.sql`
- [ ] Run migration and verify table creation
- [ ] Verify indexes created

### Phase 3: Backend Handler (20 min)
- [ ] Create `aggregator-server/internal/api/handlers/client_errors.go`
- [ ] Create `aggregator-server/internal/database/queries/client_errors.sql`
- [ ] Update `aggregator-server/internal/api/router.go`
- [ ] Test API endpoint with curl

### Phase 4: Frontend Logger (20 min)
- [ ] Create `aggregator-web/src/lib/client-error-logger.ts`
- [ ] Create `aggregator-web/src/lib/toast-with-logging.ts`
- [ ] Update `aggregator-web/src/lib/api.ts`
- [ ] Test offline/online queue behavior

### Phase 5: Toast Integration (15 min)
- [ ] Create `useScanState` hook for button state management
- [ ] Update scan buttons to use `useScanState`
- [ ] Test button disabling during scan
- [ ] Update 3-5 critical error locations to use `toastWithLogging`
- [ ] Verify errors appear in both toast and database
- [ ] Test in multiple subsystems
- [ ] **Test deduplication**: Rapid clicking creates only 1 command
- [ ] **Test 409 response**: Returns existing command when scan running

### Phase 6: Verification (30 min)
- [ ] Run all test cases
- [ ] Verify ETHOS compliance checklist
- [ ] Test rapid scan clicking (no duplicates)
- [ ] Test error persistence across page reloads
- [ ] Verify [HISTORY] logs in server output

### Documentation
- [ ] Update session documentation
- [ ] Create testing summary
- [ ] Document any issues encountered
- [ ] Update architecture documentation

---

## Risk Mitigation

### Risk 1: Migration Failures
**Probability**: Medium | **Impact**: High | **Severity**: 🔴 Critical

**Mitigation**:
- Fix migration runner bug FIRST (before this implementation)
- Test migration on fresh database
- Keep database backups
- Have rollback script ready

**Contingency**: If migration fails, manually apply SQL and continue

---

### Risk 2: Performance Impact
**Probability**: Low | **Impact**: Medium | **Severity**: 🟡 Medium

**Mitigation**:
- Async error logging (non-blocking)
- LocalStorage queue with size limit (max 50 errors)
- Database indexes for fast queries
- Batch insert if needed in future

**Contingency**: If performance degrades, add sampling (log 1 in 10 errors)

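If sampling ever becomes necessary, a deterministic counter-based sampler is easier to reason about than random sampling. A minimal sketch (the `ErrorSampler` class and its name are hypothetical, not part of the current logger):

```typescript
// Counter-based error sampler: passes every Nth error per subsystem.
// Deterministic (unlike Math.random()), so rates are predictable under load.
class ErrorSampler {
  private counts = new Map<string, number>();

  constructor(private readonly rate: number = 10) {}

  // Returns true when this error should be logged (1 in `rate`).
  shouldLog(subsystem: string): boolean {
    const n = (this.counts.get(subsystem) ?? 0) + 1;
    this.counts.set(subsystem, n);
    return n % this.rate === 1; // log the 1st, 11th, 21st, ...
  }
}

const sampler = new ErrorSampler(10);
const logged = Array.from({ length: 20 }, () => sampler.shouldLog('storage'));
const loggedCount = logged.filter(Boolean).length;
console.log(loggedCount); // 2
```

The guard would sit at the top of `logError`, so the first error in each burst is always captured and the rest are dropped before any network work happens.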
---

### Risk 3: Infinite Error Loops
**Probability**: Low | **Impact**: High | **Severity**: 🟡 Medium

**Mitigation**:
- `X-Error-Logger-Request` header prevents recursive logging
- Max retry count (3 attempts)
- Exponential backoff prevents thundering herd

**Contingency**: If loop detected, check for missing header and fix

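The retry cap and exponential backoff above can be expressed as a pure helper, which also makes the schedule trivial to unit-test. A sketch with illustrative constants (`backoffDelay` is a hypothetical name; the logger's actual base delay may differ):

```typescript
// Exponential backoff with a hard attempt cap: delay = base * 2^attempt,
// clamped to maxDelayMs. Returns null once maxAttempts is exhausted,
// signaling the caller to stop retrying and queue to localStorage instead.
function backoffDelay(
  attempt: number, // 0-based retry attempt
  maxAttempts = 3,
  baseMs = 1000,
  maxDelayMs = 30000,
): number | null {
  if (attempt >= maxAttempts) return null; // give up; queue for later
  return Math.min(baseMs * 2 ** attempt, maxDelayMs);
}

// First three delays double each time; the fourth attempt is refused.
console.log([0, 1, 2, 3].map(a => backoffDelay(a)));
```

Keeping the schedule pure means the retry loop in `sendWithRetry` only has to `await sleep(delay)` and bail out on `null`.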
---

### Risk 4: Privacy Concerns
**Probability**: Low | **Impact**: High | **Severity**: 🟡 Medium

**Mitigation**:
- No PII in error messages (validate during logging)
- User agent stored but can be anonymized
- Stack traces only from our code (not user code)

**Contingency**: Add privacy filter to scrub sensitive data

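A minimal scrubber for this contingency might redact obvious PII patterns before a message leaves the browser. This is a sketch only; the patterns are illustrative and far from exhaustive:

```typescript
// Scrub common PII patterns from an error message before logging.
const REDACTIONS: Array<[RegExp, string]> = [
  [/[\w.+-]+@[\w-]+\.[\w.]+/g, '[email]'],      // email addresses
  [/\b(?:\d{1,3}\.){3}\d{1,3}\b/g, '[ip]'],     // IPv4 addresses
  [/(Bearer\s+)[A-Za-z0-9._-]+/g, '$1[token]'], // bearer tokens
];

function scrubMessage(message: string): string {
  return REDACTIONS.reduce((msg, [pattern, repl]) => msg.replace(pattern, repl), message);
}

console.log(scrubMessage('auth failed for bob@example.com from 10.0.0.5'));
// → auth failed for [email] from [ip]
```

The natural hook is the top of `logError`, so every caller (toast wrapper, axios interceptor, global handlers) is covered by a single filter.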
---

## Post-Implementation Review

### Success Criteria
- [ ] No duplicate key violations during rapid clicking
- [ ] All errors persist in database
- [ ] Error logs queryable and useful for debugging
- [ ] No performance degradation observed
- [ ] System handles offline/online transitions gracefully
- [ ] All tests pass

### Performance Benchmarks
- Command creation: < 10ms per command
- Error logging: < 50ms per error (async)
- Database queries: < 100ms for common queries
- Bundle size increase: < 5KB gzipped

### Known Limitations
- Error logs don't include full request payloads (privacy)
- localStorage queue limited by browser storage (~5MB)
- Retries happen in foreground (could be moved to background)

### Future Enhancements (Post v0.1.27)
- Error aggregation and deduplication
- Error rate alerting
- Error analytics dashboard
- Automatic error categorization
- Integration with notification system

---

## Rollback Plan

If critical issues arise:

1. **Revert Code Changes**:
   ```bash
   git revert HEAD~6..HEAD  # Revert last 6 commits
   ```

2. **Rollback Database**:
   ```bash
   cd aggregator-server
   # Run down migration
   go run cmd/migrate/main.go -migrate-down 1
   ```

3. **Rebuild and Deploy**:
   ```bash
   docker-compose build --no-cache
   docker-compose up -d
   ```

---

## Additional Notes

**Team Coordination**:
- Coordinate with frontend team if they're working on error handling
- Notify QA about new error logging features for testing
- Update documentation team about database schema changes

**Monitoring**:
- Monitor `client_errors` table growth
- Set up alerts for error rate spikes
- Track failed error logging attempts

**Documentation Updates**:
- Update API documentation for `/logs/client-error` endpoint
- Document error log query patterns for support team
- Add troubleshooting guide for common errors

---

**Plan Created By**: Ani (AI Assistant)
**Reviewed By**: Casey Tunturi
**Status**: 🟢 APPROVED FOR IMPLEMENTATION
**Next Step**: Begin Phase 1 (Command Factory)

**Estimated Timeline**:
- Start: Immediately
- Complete: ~2-3 hours
- Test: 30 minutes
- Deploy: After verification

This is a complete, production-ready implementation plan. Each phase builds on the previous one, with full error handling, testing, and rollback procedures included.

Let's build this right. 💪