RedFlag Error Handling and Event Flow Audit

Overview

This audit maps error handling and event flow across the RedFlag system, based on analysis of the actual code. The goal is to identify gaps where critical events are lost and to define a systematic approach for logging every important event and making it visible in the UI.

Section 1: Agent-Side Error Sources

1.1 Command Entry Point

File: aggregator-agent/cmd/agent/main.go

Critical Startup Failures (Lines 259-262)

cfg, err := config.Load(configPath, cliFlags)
if err != nil {
    log.Fatal("Failed to load configuration:", err)
}
  • Current Logging: log.Fatal() - exits immediately, not reported to server
  • Server Reporting: None - agent dies silently from server perspective
  • Gap: Critical configuration failures never reach server

Registration Failures (Lines 305-307)

if err := registerAgent(cfg, *serverURL); err != nil {
    log.Fatal("Registration failed:", err)
}
  • Current Logging: log.Fatal() - exits immediately
  • Server Reporting: None - server sees registration as incomplete but doesn't know why
  • Gap: Registration failure details lost

Scan Command Failures (Lines 323-325, 330-332, 337-339)

if err := handleScanCommand(cfg, *exportFormat); err != nil {
    log.Fatal("Scan failed:", err)
}
  • Current Logging: log.Fatal() - exits immediately for local operations
  • Server Reporting: Not applicable (local command)
  • Gap: Local scan failures not trackable

Agent Runtime Failures (Lines 360-362)

if err := runAgent(cfg); err != nil {
    log.Fatal("Agent failed:", err)
}
  • Current Logging: log.Fatal() - exits immediately
  • Server Reporting: None - server sees agent as offline with no context
  • Gap: Agent startup failures completely invisible to server
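
One way to narrow this gap is to attempt a best-effort report before the process exits. The sketch below is illustrative and not taken from the RedFlag codebase: FatalEvent, Reporter, reportFatal, and discardReporter are hypothetical names, and it assumes the server URL is already known when the failure happens, which may not hold if configuration loading itself failed.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// FatalEvent is a hypothetical payload describing why the agent is exiting.
type FatalEvent struct {
	EventType string // e.g. "agent_startup"
	Severity  string // e.g. "critical"
	Message   string
	Occurred  time.Time
}

// Reporter is a hypothetical best-effort sender; a real one would POST to the server.
type Reporter func(ctx context.Context, ev FatalEvent) error

// reportFatal builds a critical event from cause and tries to deliver it
// within timeout. It never exits the process itself, so the caller stays in
// control of log.Fatal / os.Exit.
func reportFatal(report Reporter, timeout time.Duration, eventType string, cause error) FatalEvent {
	ev := FatalEvent{
		EventType: eventType,
		Severity:  "critical",
		Message:   cause.Error(),
		Occurred:  time.Now().UTC(),
	}
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	if err := report(ctx, ev); err != nil {
		// Reporting is best effort: the local log remains the fallback sink.
		log.Printf("could not report fatal event: %v", err)
	}
	return ev
}

// discardReporter is a stand-in that accepts every event and does nothing.
func discardReporter(ctx context.Context, ev FatalEvent) error { return nil }

// errDemo is a stand-in for a real configuration error.
var errDemo = errors.New("config file not found")

func main() {
	// Stand-in reporter that only prints; no network access is attempted.
	printer := func(ctx context.Context, ev FatalEvent) error {
		fmt.Printf("would report %s/%s: %s\n", ev.EventType, ev.Severity, ev.Message)
		return nil
	}
	reportFatal(printer, 2*time.Second, "agent_startup", errDemo)
	// A real caller would follow this with log.Fatal(...) as it does today.
}
```

The short timeout matters here: a dying agent should not hang on an unreachable server, so the report is bounded and the existing log.Fatal path still runs afterwards.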

1.2 Configuration System

File: aggregator-agent/internal/config/config.go

Configuration Load Failures (Lines 115-117)

if err := validateConfig(config); err != nil {
    return nil, fmt.Errorf("invalid configuration: %w", err)
}
  • Current Logging: Error returned to caller
  • Server Reporting: None - handled at higher level
  • Gap: Configuration validation errors may not reach server

File System Errors (Lines 166-168, 414-416)

if err := os.MkdirAll(dir, 0755); err != nil {
    return nil, fmt.Errorf("failed to create config directory: %w", err)
}
  • Current Logging: Wrapped error returned to caller (fmt.Errorf with %w)
  • Server Reporting: None
  • Gap: File system permission errors lost to stdout

Configuration Migration (Lines 207-230)

func migrateConfig(cfg *Config) {
    if cfg.Version != "5" {
        fmt.Printf("[CONFIG] Migrating config schema from version %s to 5\n", cfg.Version)
        cfg.Version = "5"
    }
    // ... other migrations
}
  • Current Logging: fmt.Printf() to stdout only
  • Server Reporting: None
  • Gap: Configuration migration success/failure not tracked

1.3 Migration System

File: aggregator-agent/internal/migration/executor.go

Migration Execution Failures (Lines 60-62, 67-69, 75-77, 96-98)

if err := e.createBackups(); err != nil {
    return e.completeMigration(false, fmt.Errorf("backup creation failed: %w", err))
}
if err := e.migrateDirectories(); err != nil {
    return e.completeMigration(false, fmt.Errorf("directory migration failed: %w", err))
}
  • Current Logging: Detailed migration logs via fmt.Printf()
  • Server Reporting: None - migration is pre-startup
  • Gap: Migration results visible only in local logs
  • Success Case: Lines 348-352 log success but no server reporting

Validation Failures (Lines 105-107)

if err := e.validateMigration(); err != nil {
    return e.completeMigration(false, fmt.Errorf("migration validation failed: %w", err))
}
  • Current Logging: Validation errors to stdout
  • Server Reporting: None
  • Gap: Migration validation failures not tracked centrally

1.4 Client Communication

File: aggregator-agent/internal/client/client.go

HTTP Request Failures (Lines 114-122, 172-175, 261-264, 329-332)

if resp.StatusCode != http.StatusOK {
    bodyBytes, _ := io.ReadAll(resp.Body)
    return nil, fmt.Errorf("registration failed: %s - %s", resp.Status, string(bodyBytes))
}
  • Current Logging: Error returned to caller
  • Server Reporting: None - this IS the server communication
  • Gap: Communication failures logged locally but not categorized

Network Timeout Failures (Lines 42-45)

http: &http.Client{
    Timeout: 30 * time.Second,
}
  • Current Logging: None by default - Go's http.Client does not log timeouts; it returns an error to the caller, which must log it
  • Server Reporting: None - agent can't communicate
  • Gap: Network connectivity issues lost
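
Because a timeout surfaces only as a returned error, the agent has to classify it before it can be recorded as an event. The following is a minimal sketch, assuming the subtype names proposed in Section 4.2; classifyTransportError, fakeTimeout, and errRefused are hypothetical names used only for illustration. It relies on the standard net.Error interface, which the errors returned by http.Client on timeout satisfy.

```go
package main

import (
	"errors"
	"fmt"
	"net"
)

// Subtype strings mirror the taxonomy proposed in Section 4.2.
const (
	subtypeNetworkTimeout     = "network_timeout"
	subtypeServiceUnavailable = "service_unavailable"
)

// classifyTransportError maps a transport-level error to an event subtype.
// It returns "" for a nil error.
func classifyTransportError(err error) string {
	if err == nil {
		return ""
	}
	var netErr net.Error
	// errors.As walks the %w chain, so wrapped errors are classified too.
	if errors.As(err, &netErr) && netErr.Timeout() {
		return subtypeNetworkTimeout
	}
	return subtypeServiceUnavailable
}

// fakeTimeout is a stand-in for the timeout error an HTTP client returns.
type fakeTimeout struct{}

func (fakeTimeout) Error() string   { return "Client.Timeout exceeded while awaiting headers" }
func (fakeTimeout) Timeout() bool   { return true }
func (fakeTimeout) Temporary() bool { return true }

// errRefused is a stand-in for a non-timeout connectivity failure.
var errRefused = errors.New("connection refused")

func main() {
	wrapped := fmt.Errorf("registration request failed: %w", fakeTimeout{})
	fmt.Println(classifyTransportError(wrapped))
	fmt.Println(classifyTransportError(errRefused))
}
```

Events classified this way cannot be sent while the network is down, so they would need to be buffered locally and delivered once connectivity returns.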

Token Renewal Failures (Lines 167-175, 499-503)

if err := tempClient.RenewToken(cfg.AgentID, cfg.RefreshToken); err != nil {
    log.Printf("❌ Refresh token renewal failed: %v", err)
    return nil, fmt.Errorf("refresh token renewal failed: %w - please re-register agent", err)
}
  • Current Logging: log.Printf() with emoji indicators
  • Server Reporting: None - agent can't authenticate
  • Gap: Token renewal failures cause agent death without server visibility

1.5 Scanner and Orchestrator Systems

Circuit Breaker Failures (Multiple scanner wrappers)

Pattern found in: aggregator-agent/internal/orchestrator/*.go

  • Current Logging: Circuit breaker state changes logged locally
  • Server Reporting: None
  • Gap: Scanner reliability issues not tracked server-side

Scanner Timeouts (Lines in orchestrator files)

  • Current Logging: Timeout errors returned and logged
  • Server Reporting: None
  • Gap: Scanner performance issues invisible to server
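
To make breaker activity visible beyond local logs, state transitions could be exposed through a callback that enqueues an event. The sketch below is a deliberately minimal breaker written for illustration; it is not the implementation in the orchestrator package, and breaker, newBreaker, record, and errScan are hypothetical names. It omits the half-open state, reset timers, and locking that a production breaker would need.

```go
package main

import (
	"errors"
	"fmt"
)

type breakerState string

const (
	stateClosed breakerState = "closed"
	stateOpen   breakerState = "open"
)

// breaker opens after `threshold` consecutive failures and closes again on
// the next success. It is not safe for concurrent use.
type breaker struct {
	name      string
	threshold int
	failures  int
	state     breakerState
	onChange  func(name string, from, to breakerState) // hypothetical event hook
}

func newBreaker(name string, threshold int, onChange func(string, breakerState, breakerState)) *breaker {
	return &breaker{name: name, threshold: threshold, state: stateClosed, onChange: onChange}
}

// setState switches state and fires the hook only on an actual transition.
func (b *breaker) setState(to breakerState) {
	if b.state == to {
		return
	}
	from := b.state
	b.state = to
	if b.onChange != nil {
		b.onChange(b.name, from, to)
	}
}

// record updates the breaker with the outcome of one scanner run.
func (b *breaker) record(err error) {
	if err == nil {
		b.failures = 0
		b.setState(stateClosed)
		return
	}
	b.failures++
	if b.failures >= b.threshold {
		b.setState(stateOpen)
	}
}

// errScan is a stand-in for a scanner failure.
var errScan = errors.New("scan failed")

func main() {
	b := newBreaker("apt", 3, func(name string, from, to breakerState) {
		// A real hook would enqueue a system event instead of printing.
		fmt.Printf("scanner %s: breaker %s -> %s\n", name, from, to)
	})
	for i := 0; i < 3; i++ {
		b.record(errScan)
	}
	b.record(nil)
}
```

Firing the hook only on transitions, rather than on every failure, keeps event volume low while still recording when a scanner became unreliable and when it recovered.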

Section 2: Server-Side Error Sources

2.1 API Handlers

2.1.1 Agent Registration Handler

File: aggregator-server/internal/api/handlers/agents.go (Lines 40-100)

Invalid Registration Token (Lines 64-67):

if registrationToken == "" {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "registration token required"})
    return
}
  • Current Logging: HTTP 401 response only
  • Database Persistence: No event logged
  • Gap: Failed registration attempts not tracked

Token Validation Failures (Lines 70-74):

tokenInfo, err := h.registrationTokenQueries.ValidateRegistrationToken(registrationToken)
if err != nil || tokenInfo == nil {
    c.JSON(http.StatusUnauthorized, gin.H{"error": "invalid or expired registration token"})
    return
}
  • Current Logging: HTTP 401 response only
  • Database Persistence: No event logged
  • Gap: Security events (invalid tokens) not audited

Machine ID Conflicts (Lines 77-84):

existingAgent, err := h.agentQueries.GetAgentByMachineID(req.MachineID)
if err == nil && existingAgent != nil && existingAgent.ID.String() != "" {
    c.JSON(http.StatusConflict, gin.H{"error": "machine ID already registered to another agent"})
    return
}
  • Current Logging: HTTP 409 response only
  • Database Persistence: No event logged
  • Gap: Security events (duplicate machine IDs) not audited
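
Rejected requests like these could be captured in one place rather than in each handler. The sketch below shows the idea with the standard net/http package so that it stays dependency-free; the RedFlag handlers use gin, so this would need adapting, and AuditEvent, auditRejections, demo, and the /register path are hypothetical. It records any response with a 4xx or 5xx status.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// AuditEvent is a hypothetical record of a rejected request.
type AuditEvent struct {
	Path       string
	Status     int
	RemoteAddr string
}

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// auditRejections calls sink for every response with a 4xx or 5xx status.
func auditRejections(sink func(AuditEvent), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, req)
		if rec.status >= 400 {
			sink(AuditEvent{Path: req.URL.Path, Status: rec.status, RemoteAddr: req.RemoteAddr})
		}
	})
}

// demo drives a stand-in registration handler that always answers 401 and
// returns the events captured by the middleware.
func demo() []AuditEvent {
	var events []AuditEvent
	reject := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		http.Error(w, "registration token required", http.StatusUnauthorized)
	})
	h := auditRejections(func(ev AuditEvent) { events = append(events, ev) }, reject)
	req := httptest.NewRequest(http.MethodPost, "/register", nil)
	h.ServeHTTP(httptest.NewRecorder(), req)
	return events
}

func main() {
	for _, ev := range demo() {
		fmt.Printf("audit: %s -> %d from %s\n", ev.Path, ev.Status, ev.RemoteAddr)
	}
}
```

A status-based hook cannot tell a missing token from an expired one, so handlers that need that distinction would still have to attach a reason explicitly.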

2.1.2 Download Handler

File: aggregator-server/internal/api/handlers/downloads.go (Already analyzed in previous fixes)

File Not Found (Lines 100-110):

info, err := os.Stat(agentPath)
if err != nil {
    c.JSON(http.StatusNotFound, gin.H{
        "error":    "Agent binary not found",
        "platform": platform,
        "version":  version,
    })
    return
}
  • Current Logging: HTTP 404 response only
  • Database Persistence: No event logged
  • Gap: Download failures not tracked

Empty File Handling (Lines 110-117):

if info.Size() == 0 {
    c.JSON(http.StatusNotFound, gin.H{
        "error":    "Agent binary not found (empty file)",
        "platform": platform,
        "version":  version,
    })
    return
}
  • Current Logging: HTTP 404 response only
  • Database Persistence: No event logged
  • Gap: File corruption/deployment issues not tracked

2.1.3 Agent Setup Handler

File: aggregator-server/internal/api/handlers/agent_setup.go

Invalid Request Binding (Lines 14-17):

if err := c.ShouldBindJSON(&req); err != nil {
    c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
    return
}
  • Current Logging: HTTP 400 response only
  • Database Persistence: No event logged
  • Gap: Malformed setup requests not tracked

Configuration Build Failures (Lines 23-27):

config, err := configBuilder.BuildAgentConfig(req)
if err != nil {
    c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
    return
}
  • Current Logging: HTTP 500 response only
  • Database Persistence: No event logged
  • Gap: Build system failures not tracked

2.2 Service Layer

2.2.1 Agent Lifecycle Service

File: aggregator-server/internal/services/agent_lifecycle.go

Validation Failures (Lines 73-75):

if err := s.validateOperation(op, agentCfg); err != nil {
    return nil, fmt.Errorf("validation failed: %w", err)
}
  • Current Logging: Error returned to caller
  • Database Persistence: No event logged
  • Gap: Agent lifecycle validation failures not tracked

Agent Not Found (Lines 78-81):

_, err := s.getAgent(ctx, agentCfg.AgentID)
if err != nil && op != OperationNew {
    return nil, fmt.Errorf("agent not found: %w", err)
}
  • Current Logging: Error returned to caller
  • Database Persistence: No event logged
  • Gap: Invalid upgrade/rebuild attempts not tracked

Configuration Generation Failures (Lines 90-92):

if err != nil {
    return nil, fmt.Errorf("config generation failed: %w", err)
}
  • Current Logging: Error returned to caller
  • Database Persistence: No event logged
  • Gap: Configuration system failures not tracked

2.2.2 Placeholder Services (Lines 270-315)

Build Service Operations:

func (s *BuildService) IsBuildRequired(cfg *AgentConfig) (bool, error) {
    // Placeholder: Always return false for now (use existing builds)
    return false, nil
}
  • Current Logging: None
  • Database Persistence: None
  • Gap: Build operations completely untracked

Artifact Service Operations:

func (s *ArtifactService) Store(ctx context.Context, artifacts *BuildArtifacts) error {
    // Placeholder: Do nothing for now
    return nil
}
  • Current Logging: None
  • Database Persistence: None
  • Gap: Artifact management completely untracked

2.3 Database Layer

2.3.1 Connection and Query Failures

Pattern: All database queries use standard Go error returns

  • Current Logging: Errors returned up the call stack
  • Database Persistence: Errors don't create audit trails
  • Gap: Database operational issues not tracked separately

Section 3: Database Schema Analysis

3.1 Current Schema (From Migration Files)

3.1.1 Core Tables (001_initial_schema.up.sql)

agents Table:

CREATE TABLE agents (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    hostname VARCHAR(255) NOT NULL,
    os_type VARCHAR(50) NOT NULL CHECK (os_type IN ('windows', 'linux', 'macos')),
    os_version VARCHAR(100),
    os_architecture VARCHAR(20),
    agent_version VARCHAR(20) NOT NULL,
    last_seen TIMESTAMP NOT NULL DEFAULT NOW(),
    status VARCHAR(20) DEFAULT 'online' CHECK (status IN ('online', 'offline', 'error')),
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);

update_logs Table:

CREATE TABLE update_logs (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    agent_id UUID REFERENCES agents(id) ON DELETE CASCADE,
    update_package_id UUID REFERENCES update_packages(id) ON DELETE SET NULL,
    action VARCHAR(50) NOT NULL,
    result VARCHAR(20) NOT NULL CHECK (result IN ('success', 'failed', 'partial')),
    stdout TEXT,
    stderr TEXT,
    exit_code INTEGER,
    duration_seconds INTEGER,
    executed_at TIMESTAMP DEFAULT NOW()
);

agent_commands Table:

CREATE TABLE agent_commands (
    id UUID PRIMARY KEY DEFAULT uuid_generate_v4(),
    agent_id UUID REFERENCES agents(id) ON DELETE CASCADE,
    command_type VARCHAR(50) NOT NULL,
    params JSONB,
    status VARCHAR(20) DEFAULT 'pending' CHECK (status IN ('pending', 'sent', 'completed', 'failed')),
    created_at TIMESTAMP DEFAULT NOW(),
    sent_at TIMESTAMP,
    completed_at TIMESTAMP,
    result JSONB
);

3.1.2 Update Events System (003_create_update_tables.up.sql)

update_events Table:

CREATE TABLE IF NOT EXISTS update_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    agent_id UUID NOT NULL REFERENCES agents(id) ON DELETE CASCADE,
    package_type VARCHAR(50) NOT NULL,
    package_name TEXT NOT NULL,
    version_from TEXT,
    version_to TEXT NOT NULL,
    severity VARCHAR(20) NOT NULL CHECK (severity IN ('critical', 'important', 'moderate', 'low')),
    repository_source TEXT,
    metadata JSONB DEFAULT '{}',
    event_type VARCHAR(20) NOT NULL CHECK (event_type IN ('discovered', 'updated', 'failed', 'ignored')),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

Problem: update_events is specific to package updates and does not cover other system events.

3.2 Proposed Schema: Unified System Events

CREATE TABLE system_events (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    agent_id UUID REFERENCES agents(id) ON DELETE CASCADE,
    event_type VARCHAR(50) NOT NULL,        -- 'agent_startup', 'agent_scan', 'server_build', 'download', etc.
    event_subtype VARCHAR(50) NOT NULL,     -- 'success', 'failed', 'info', 'warning', 'critical'
    severity VARCHAR(20) NOT NULL,          -- 'info', 'warning', 'error', 'critical'
    component VARCHAR(50) NOT NULL,         -- 'agent', 'server', 'build', 'download', 'config', 'migration'
    message TEXT,
    metadata JSONB,                         -- Structured error data, stack traces, HTTP status codes, etc.
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);

-- Performance indexes (PostgreSQL does not accept inline INDEX clauses in CREATE TABLE)
CREATE INDEX idx_system_events_agent_id ON system_events (agent_id);
CREATE INDEX idx_system_events_type ON system_events (event_type, event_subtype);
CREATE INDEX idx_system_events_created ON system_events (created_at);
CREATE INDEX idx_system_events_severity ON system_events (severity);
CREATE INDEX idx_system_events_component ON system_events (component);

Benefits:

  • Unified storage for all events (agent + server + system)
  • Rich metadata support for structured error information
  • Proper indexing for efficient queries and UI performance
  • Extensible for new event types without schema changes
  • Replaces multiple ad-hoc logging approaches
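
On the Go side, a row in this table could map to a struct like the one below. This is a sketch under stated assumptions, not existing RedFlag code: SystemEvent, insertSystemEvent, and insertArgs are hypothetical names, id and created_at are left to the column defaults, and the metadata is passed as a JSON string, which a PostgreSQL driver may or may not accept for a JSONB parameter without an explicit cast.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SystemEvent mirrors the columns of the proposed system_events table.
type SystemEvent struct {
	AgentID      *string // nil for server-only events (agent_id is nullable)
	EventType    string
	EventSubtype string
	Severity     string
	Component    string
	Message      string
	Metadata     map[string]any
}

// insertSystemEvent uses PostgreSQL-style positional placeholders.
const insertSystemEvent = `INSERT INTO system_events
    (agent_id, event_type, event_subtype, severity, component, message, metadata)
VALUES ($1, $2, $3, $4, $5, $6, $7)`

// insertArgs returns the positional arguments for insertSystemEvent.
func insertArgs(ev SystemEvent) ([]any, error) {
	meta := ev.Metadata
	if meta == nil {
		meta = map[string]any{} // store {} rather than JSON null
	}
	raw, err := json.Marshal(meta)
	if err != nil {
		return nil, fmt.Errorf("encode metadata: %w", err)
	}
	var agentID any // stays nil, which database/sql sends as NULL
	if ev.AgentID != nil {
		agentID = *ev.AgentID
	}
	return []any{agentID, ev.EventType, ev.EventSubtype, ev.Severity,
		ev.Component, ev.Message, string(raw)}, nil
}

func main() {
	args, err := insertArgs(SystemEvent{
		EventType:    "agent_startup",
		EventSubtype: "failed",
		Severity:     "critical",
		Component:    "agent",
		Message:      "failed to load configuration",
		Metadata:     map[string]any{"exit_code": 1},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(insertSystemEvent)
	fmt.Println(args...)
}
```

With a database/sql handle, the pair would be used as db.ExecContext(ctx, insertSystemEvent, args...).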

Section 4: Classification System

4.1 Event Type Taxonomy

const (
    // Agent lifecycle events
    EventTypeAgentStartup      = "agent_startup"
    EventTypeAgentRegistration = "agent_registration"
    EventTypeAgentCheckIn      = "agent_checkin"
    EventTypeAgentScan         = "agent_scan"
    EventTypeAgentUpdate       = "agent_update"
    EventTypeAgentConfig       = "agent_config"
    EventTypeAgentMigration    = "agent_migration"
    EventTypeAgentShutdown     = "agent_shutdown"

    // Server events
    EventTypeServerBuild    = "server_build"
    EventTypeServerDownload = "server_download"
    EventTypeServerConfig   = "server_config"
    EventTypeServerAuth     = "server_auth"

    // System events
    EventTypeDownload  = "download"
    EventTypeMigration = "migration"
    EventTypeError     = "error"
)

4.2 Event Subtype Taxonomy

const (
    // Status subtypes
    SubtypeSuccess     = "success"
    SubtypeFailed      = "failed"
    SubtypeInfo        = "info"
    SubtypeWarning     = "warning"
    SubtypeCritical    = "critical"

    // Specific subtypes for detailed classification
    SubtypeDownloadFailed     = "download_failed"
    SubtypeValidationFailed   = "validation_failed"
    SubtypeConfigCorrupted    = "config_corrupted"
    SubtypeMigrationNeeded    = "migration_needed"
    SubtypePanicRecovered     = "panic_recovered"
    SubtypeTokenExpired       = "token_expired"
    SubtypeNetworkTimeout     = "network_timeout"
    SubtypePermissionDenied   = "permission_denied"
    SubtypeServiceUnavailable = "service_unavailable"
)

4.3 Severity Levels

const (
    SeverityInfo     = "info"     // Normal operations, informational
    SeverityWarning  = "warning"  // Non-critical issues, degraded operation
    SeverityError    = "error"    // Failed operations, user action required
    SeverityCritical = "critical" // System-critical failures, immediate attention
)

4.4 Component Classification

const (
    ComponentAgent     = "agent"      // Agent-side events
    ComponentServer    = "server"     // Server-side events
    ComponentBuild     = "build"      // Build system events
    ComponentDownload  = "download"   // File download events
    ComponentConfig    = "config"     // Configuration events
    ComponentDatabase  = "database"   // Database events
    ComponentNetwork   = "network"    // Network/connectivity events
    ComponentSecurity  = "security"   // Security/authentication events
    ComponentMigration = "migration"  // Migration/update events
)
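
A taxonomy like this is only useful if events that fall outside it are rejected early. The following is a minimal validation sketch, assuming the severity and component values listed above; Classification, Validate, validSeverities, and validComponents are hypothetical names. Event types and subtypes are only checked for presence, since the document intends them to be extensible.

```go
package main

import "fmt"

// Allowed values copied from Sections 4.3 and 4.4.
var validSeverities = map[string]bool{
	"info": true, "warning": true, "error": true, "critical": true,
}

var validComponents = map[string]bool{
	"agent": true, "server": true, "build": true, "download": true,
	"config": true, "database": true, "network": true, "security": true,
	"migration": true,
}

// Classification groups the four classification fields of an event.
type Classification struct {
	EventType string
	Subtype   string
	Severity  string
	Component string
}

// Validate reports the first field that does not fit the taxonomy.
func (c Classification) Validate() error {
	if c.EventType == "" || c.Subtype == "" {
		return fmt.Errorf("event type and subtype are required")
	}
	if !validSeverities[c.Severity] {
		return fmt.Errorf("unknown severity %q", c.Severity)
	}
	if !validComponents[c.Component] {
		return fmt.Errorf("unknown component %q", c.Component)
	}
	return nil
}

func main() {
	good := Classification{"agent_scan", "failed", "error", "agent"}
	bad := Classification{"agent_scan", "failed", "fatal", "agent"}
	fmt.Println(good.Validate())
	fmt.Println(bad.Validate())
}
```

Running the check on both the agent and the server would catch a typo in a severity string before it reaches the database.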

Section 5: Integration Points Map

5.1 Agent-Side Integration Points

| Event Location | Current Sink | Target Sink | Missing Layer |
| --- | --- | --- | --- |
| cmd/main.go:261 (config load fail) | log.Fatal() | system_events table | EventService client |
| cmd/main.go:306 (registration fail) | log.Fatal() | system_events table | EventService client |
| cmd/main.go:361 (agent runtime fail) | log.Fatal() | system_events table | EventService client |
| config/config.go:115 (validation fail) | error return | system_events table | EventService client |
| migration/executor.go:61 (backup fail) | fmt.Printf() | system_events table | EventService client |
| migration/executor.go:67 (directory migration fail) | fmt.Printf() | system_events table | EventService client |
| migration/executor.go:105 (validation fail) | fmt.Printf() | system_events table | EventService client |
| client/client.go:121 (registration API fail) | error return | system_events table | EventService client |
| client/client.go:174 (token renewal fail) | log.Printf() | system_events table | EventService client |
| client/client.go:263 (command fetch fail) | error return | system_events table | EventService client |

5.2 Server-Side Integration Points

| Event Location | Current Sink | Target Sink | Missing Layer |
| --- | --- | --- | --- |
| handlers/agents.go:65 (no registration token) | HTTP 401 | system_events table | EventService |
| handlers/agents.go:72 (invalid token) | HTTP 401 | system_events table | EventService |
| handlers/agents.go:81 (machine ID conflict) | HTTP 409 | system_events table | EventService |
| handlers/downloads.go:105 (file not found) | HTTP 404 | system_events table | EventService |
| handlers/downloads.go:115 (empty file) | HTTP 404 | system_events table | EventService |
| handlers/agent_setup.go:25 (config build fail) | HTTP 500 | system_events table | EventService |
| services/agent_lifecycle.go:74 (validation fail) | error return | system_events table | EventService |
| services/agent_lifecycle.go:80 (agent not found) | error return | system_events table | EventService |
| services/agent_lifecycle.go:91 (config generation fail) | error return | system_events table | EventService |
| Database query failures | error return | system_events table | EventService |

5.3 Success Events (Currently Missing)

| Event Type | Current Status | Should Log |
| --- | --- | --- |
| Agent successful startup | Not logged | system_events |
| Agent successful registration | Not logged | system_events |
| Agent successful check-in | Not logged | system_events |
| Agent successful scan | Not logged | system_events |
| Agent successful update | Not logged | system_events |
| Agent successful migration | Not logged | system_events |
| Server successful build | Not logged | system_events |
| Successful configuration generation | Not logged | system_events |
| Successful download served | Not logged | system_events |
| Token renewal success | Not logged | system_events |

Section 6: Implementation Priority

6.1 Priority P0: Critical Errors Lost Completely

Impact: Server has no visibility into agent failures

  1. Agent Startup Failures (cmd/main.go:259-262)

    • Configuration load failures
    • Agent service startup failures
    • Effort: 2 hours
    • Risk: High (affects agent discovery and monitoring)
  2. Agent Runtime Failures (cmd/main.go:360-362)

    • Main agent loop failures
    • Service binding failures
    • Effort: 1 hour
    • Risk: High (agents disappear without explanation)
  3. Registration Failures (cmd/main.go:305-307, handlers/agents.go:64-84)

    • Invalid/expired tokens
    • Machine ID conflicts
    • Effort: 3 hours
    • Risk: High (security and onboarding issues)
  4. Token Renewal Failures (client/client.go:167-175)

    • Refresh token expiration
    • Network connectivity during renewal
    • Effort: 2 hours
    • Risk: High (agents become permanently offline)

6.2 Priority P1: Errors Logged to Wrong Place

Impact: Errors exist but not queryable in UI

  1. Migration Failures (migration/executor.go:60-108)

    • Backup creation failures
    • Directory migration failures
    • Validation failures
    • Effort: 3 hours
    • Risk: Medium (upgrade reliability)
  2. Download Failures (handlers/downloads.go:100-117)

    • Missing binaries
    • Corrupted files
    • Platform mismatches
    • Effort: 2 hours
    • Risk: Medium (installation failures)
  3. Configuration Generation Failures (services/agent_lifecycle.go:90-92)

    • Build service failures
    • Config template errors
    • Effort: 2 hours
    • Risk: Medium (agent deployment failures)
  4. Scanner/Orchestrator Failures

    • Circuit breaker activations
    • Scanner timeouts
    • Package manager failures
    • Effort: 4 hours
    • Risk: Medium (update reliability)

6.3 Priority P2: Success Events Not Logged

Impact: No visibility into successful operations

  1. Successful Agent Operations

    • Successful check-ins
    • Successful scans
    • Successful updates
    • Successful migrations
    • Effort: 4 hours
    • Risk: Low (operational visibility)
  2. Successful Server Operations

    • Build completions
    • Config generations
    • Download serves
    • Effort: 2 hours
    • Risk: Low (monitoring)

6.4 Priority P3: UI Integration

Impact: Events exist but not visible to users

  1. EventService Implementation

    • Database table creation
    • Event persistence layer
    • Query service
    • Effort: 6 hours
    • Risk: Low (user experience)
  2. UI Components

    • Event history display
    • Filtering and search
    • Real-time updates via WebSocket/SSE
    • Error detail views
    • Effort: 8 hours
    • Risk: Low (user experience)

Section 7: Implementation Strategy

7.1 Phase 1: Foundation (P0 + P1) - 19 hours total

Database Layer (2 hours)

  • Create system_events table migration
  • Add proper indexes for performance
  • Create EventService database queries

EventService Implementation (4 hours)

  • Server-side EventService for persistence
  • Event query and filtering service
  • Event metadata handling

Agent Event Client (3 hours)

  • Lightweight HTTP client for event reporting
  • Local event buffering for offline scenarios
  • Automatic retry with exponential backoff
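
The buffering and retry behaviour listed above can be sketched as follows. This is an illustration under assumptions, not a design taken from the codebase: backoffDelay, eventBuffer, and the constants are hypothetical, events are reduced to strings, and jitter, persistence across restarts, and locking are left out.

```go
package main

import (
	"fmt"
	"time"
)

const (
	baseDelay = 1 * time.Second  // first retry delay (assumed value)
	maxDelay  = 60 * time.Second // cap on the delay (assumed value)
)

// backoffDelay returns baseDelay doubled once per attempt, capped at maxDelay.
func backoffDelay(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// eventBuffer keeps the most recent events while the server is unreachable.
// A limit of zero or less means unbounded.
type eventBuffer struct {
	events []string
	limit  int
}

// push appends an event and drops the oldest one when the buffer is full.
func (b *eventBuffer) push(ev string) {
	if b.limit > 0 && len(b.events) >= b.limit {
		b.events = b.events[1:]
	}
	b.events = append(b.events, ev)
}

// flush sends buffered events in order and stops at the first failure,
// keeping the unsent events for the next attempt.
func (b *eventBuffer) flush(send func(string) error) error {
	for len(b.events) > 0 {
		if err := send(b.events[0]); err != nil {
			return err
		}
		b.events = b.events[1:]
	}
	return nil
}

func main() {
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Printf("attempt %d: wait %s\n", attempt, backoffDelay(attempt))
	}
	buf := &eventBuffer{limit: 2}
	buf.push("agent_startup/failed")
	buf.push("agent_scan/failed")
	buf.push("agent_checkin/success") // evicts the oldest entry
	fmt.Println(buf.events)
}
```

Dropping the oldest event when full is one possible policy; keeping the oldest instead would preserve the first failure in a sequence, which may matter more for diagnosis.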

Critical Error Integration (10 hours)

  • Agent startup/registration failures (5 hours)
  • Download/serve failures (2 hours)
  • Migration failures (3 hours)

7.2 Phase 2: Completion (P2 + P3) - 22 hours total

Success Event Logging (6 hours)

  • Add success event creation throughout codebase
  • Standardize event metadata structures
  • Add event creation to existing placeholder services

HistoryService and UI (8 hours)

  • Event history API endpoints
  • Filtering, pagination, and search
  • Real-time event streaming
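
The filtering and pagination semantics can be illustrated in memory before committing to a query design. In the sketch below, Event, Filter, page, and sampleEvents are hypothetical; a real HistoryService would presumably push the same logic into SQL (WHERE, LIMIT, OFFSET) against the system_events table rather than filter in Go.

```go
package main

import "fmt"

// Event carries only the fields needed for filtering in this sketch.
type Event struct {
	ID        int
	AgentID   string
	Severity  string
	Component string
}

// Filter matches on every non-empty field; an empty field matches anything.
type Filter struct {
	AgentID   string
	Severity  string
	Component string
}

func (f Filter) matches(ev Event) bool {
	return (f.AgentID == "" || f.AgentID == ev.AgentID) &&
		(f.Severity == "" || f.Severity == ev.Severity) &&
		(f.Component == "" || f.Component == ev.Component)
}

// page returns the matching events on the given 1-based page together with
// the total number of matches, which a UI needs to render page controls.
func page(events []Event, f Filter, pageNum, pageSize int) ([]Event, int) {
	var matched []Event
	for _, ev := range events {
		if f.matches(ev) {
			matched = append(matched, ev)
		}
	}
	total := len(matched)
	if pageNum < 1 || pageSize < 1 {
		return nil, total
	}
	start := (pageNum - 1) * pageSize
	if start >= total {
		return nil, total
	}
	end := start + pageSize
	if end > total {
		end = total
	}
	return matched[start:end], total
}

// sampleEvents is made-up data for the demo.
var sampleEvents = []Event{
	{1, "agent-a", "error", "agent"},
	{2, "agent-a", "info", "agent"},
	{3, "agent-b", "error", "network"},
	{4, "agent-b", "critical", "security"},
	{5, "agent-a", "error", "config"},
}

func main() {
	got, total := page(sampleEvents, Filter{Severity: "error"}, 1, 2)
	fmt.Printf("%d of %d matches: %+v\n", len(got), total, got)
}
```

Returning the total alongside the page avoids a second round trip when the UI renders "page 1 of N".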

Frontend Integration (8 hours)

  • Event history components
  • Agent event detail views
  • System event dashboard
  • Real-time event indicators

7.3 Development Checklist

Foundation Tasks (19 hours)

  • Create system_events table migration (2 hours)
  • Implement server-side EventService (4 hours)
  • Create agent EventClient (3 hours)
  • Add agent startup failure logging (1 hour)
  • Add agent runtime failure logging (1 hour)
  • Add registration failure logging (2 hours)
  • Add token renewal failure logging (2 hours)
  • Add download failure logging (2 hours)
  • Add migration failure logging (3 hours)

UI and Success Tasks (22 hours)

  • Add success event logging (6 hours)
  • Implement HistoryService (4 hours)
  • Create event history UI components (8 hours)
  • Add real-time event updates (4 hours)

Testing Tasks (4 hours)

  • Test error event propagation (1 hour)
  • Test success event propagation (1 hour)
  • Test UI event display (1 hour)
  • Test performance with high event volume (1 hour)

Section 8: Prevention of "12 Commits Later" Syndrome

8.1 Development Process Integration

Add this section to all future RedFlag features:

### Event Logging Requirements
- [ ] Error events identified with proper classification
- [ ] Success events identified and logged
- [ ] EventService integration implemented
- [ ] Event metadata includes relevant technical details
- [ ] UI can display the events with appropriate context
- [ ] Events added to EVENT_CLASSIFICATIONS.md
- [ ] Manual test verifies event propagation and UI display

8.2 Code Review Checklist

For all PRs, reviewers must verify:

  • All error paths create appropriate events
  • Success events created where operation succeeds
  • Event classifications follow established taxonomy
  • No stdout-only logging remaining for important events
  • UI can display the events with helpful context
  • Documentation updated with new event types
  • Performance impact considered for high-volume events

8.3 Automated Testing

Add to test suite:

  • Event creation verification for all error paths
  • Event persistence verification in database
  • UI event display verification
  • Event filtering and search verification
  • Performance benchmarks for high event volumes
  • Event metadata structure validation

8.4 Event Documentation Template

For each new event type, document:

### [Event Name]
**Classification:** agent_scan / failed (event type / subtype)
**Severity:** error
**Component:** agent
**Trigger:** Package manager scan failure
**Metadata:**
- scanner_type: "apt|dnf|windows|winget"
- error_type: "timeout|permission|corruption"
- duration_ms: scan execution time
- packages_count: packages scanned
**UI Display:** Agent details > Scan History
**User Action:** Check system logs or re-run scan

Conclusion

This audit reveals significant gaps in RedFlag's event visibility based on actual code analysis. 31 integration points were identified where critical events are being lost to stdout or HTTP responses instead of being persisted and made visible to users.

Critical Findings:

  1. Complete Event Loss: Agent startup, registration, and runtime failures exit with log.Fatal() without any server visibility
  2. Security Event Gap: Authentication failures, token issues, and machine ID conflicts return HTTP errors but create no audit trail
  3. Success Event Void: Successful operations are completely invisible, making it impossible to verify system health
  4. Placeholder Services: Build and artifact services have no event logging at all
  5. Migration Opacity: Complex migration operations log locally but server has no visibility into upgrade success/failure

Implementation Impact:

The proposed unified event system with proper classification will provide:

  • Complete operational visibility for all agent and server operations
  • Security audit trail for authentication and authorization events
  • System health monitoring through success/failure event ratios
  • Debugging capability with structured metadata and error details
  • Performance insights through event timing and frequency analysis

Total Implementation Effort: 41 hours across 2 phases, excluding the 4 hours of testing tasks listed in Section 7.3

  • Phase 1 (Foundation): 19 hours - Critical error visibility
  • Phase 2 (Completion): 22 hours - Success events and UI integration

This systematic approach ensures no events are missed and provides a complete audit trail for all RedFlag operations, preventing the current "silent failure" problem where critical issues are invisible to administrators.