Redflag/docs/4_LOG/December_2025/2025-12-16_Error-Logging-Implementation-Plan.md


RedFlag v0.1.23.5 → v0.2.0 Implementation Plan

Priority: CRITICAL. P0 error logging MUST be completed before the v0.2.0 release
Architecture: PULL ONLY (No WebSockets/Push mechanisms)
Timeline: 2-3 development sessions (15-17 hours)


Current State Assessment (v0.1.23.5)

What's Working

  1. Agent v0.1.23.5 running and checking in successfully
  2. Server config sync working (all subsystems configured with auto_run=true)
  3. Migration detection working properly (install.log shows proper behavior)
  4. Token preservation working (agent's built-in migration system)
  5. Install script idempotency implemented
  6. HistoryLog build failure fixed (system_events table created)
  7. Registration token expiration fixed (UI now shows correct status)
  8. Heartbeat implementation verified correct (with minor bug fixed)

Critical Gaps (P0 - Must Fix Before v0.2.0)

  1. Agent startup failures invisible to server (log.Fatal before server communication)
  2. Registration failures not logged (invalid tokens, machine ID conflicts)
  3. Token renewal failures cause silent agent death
  4. Migration failures only visible in local logs
  5. Subsystem scanner failures invisible (circuit breakers, timeouts)
  6. No event buffering for offline agents

Implementation Strategy: Phase-Based Approach

Phase 1: Foundation & Verification (2-3 hours)

Goal: Ensure infrastructure is ready before adding error logging

1.1 Verify System Events Table

  • Run migration: cd aggregator-server && go run cmd/server/main.go migrate
  • Verify system_events table created in database
  • Test CreateSystemEvent() query method
  • Confirm indexes are working properly

1.2 Verify Subsystem Configuration

  • Check agent_subsystems table has data for existing agents
  • Verify GetSubsystems() query returns correct data
  • Confirm heartbeat metadata storage working (rapid_polling_enabled, rapid_polling_until)

1.3 Update Documentation

  • Add ERROR_FLOW_AUDIT_CRITICAL_P0_PLAN.md to project roadmap
  • Update DEVELOPMENT_ETHOS.md with event logging requirements
  • Create EVENT_CLASSIFICATIONS.md for reference
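As a starting point for EVENT_CLASSIFICATIONS.md, the event types, subtypes, and severities that the patterns in this plan use can be collected in one place. This is a minimal sketch: the constant names and the two helper functions are hypothetical, and the real classification list may contain more values.

```go
package main

// Values taken from the event patterns in this plan.
// The constant names themselves are hypothetical.
const (
	EventTypeAgentStartup      = "agent_startup"
	EventTypeAgentRegistration = "agent_registration"
	EventTypeAgentMigration    = "agent_migration"
	EventTypeAgentScan         = "agent_scan"
	EventTypeServerAuth        = "server_auth"

	EventSubtypeFailed = "failed"

	SeverityWarning  = "warning"
	SeverityError    = "error"
	SeverityCritical = "critical"
)

// severityRank orders the severities from least to most urgent.
var severityRank = map[string]int{
	SeverityWarning:  1,
	SeverityError:    2,
	SeverityCritical: 3,
}

// IsKnownSeverity reports whether s is one of the severities listed above.
func IsKnownSeverity(s string) bool {
	_, ok := severityRank[s]
	return ok
}

// AtLeast reports whether severity s is at or above threshold.
// Unknown values never match.
func AtLeast(s, threshold string) bool {
	rs, okS := severityRank[s]
	rt, okT := severityRank[threshold]
	return okS && okT && rs >= rt
}
```

A helper like AtLeast would let the server or UI select "error and above" without hardcoding string comparisons.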

Deliverable: Verified infrastructure ready for P0 error logging


Phase 2: Agent-Side Event Buffering (3-4 hours)

Goal: Create local event buffering system for offline resilience

2.1 Create Event Buffer Package

File: aggregator-agent/internal/event/buffer.go (NEW)

Implementation:

// Event buffer with configurable path
type EventBuffer struct {
    filePath string
    maxSize  int
    mu       sync.Mutex
}

// Initialize with config-driven path
func NewEventBuffer(configPath string) *EventBuffer {
    return &EventBuffer{
        filePath: configPath,
        maxSize:  1000,
    }
}

// BufferEvent saves event to local file
func (b *EventBuffer) BufferEvent(event *models.SystemEvent) error

// GetBufferedEvents retrieves and clears buffer
func (b *EventBuffer) GetBufferedEvents() ([]*models.SystemEvent, error)

Key Features:

  • Configurable buffer path (not hardcoded)
  • Thread-safe (sync.Mutex)
  • Circular buffer (max 1000 events)
  • JSON serialization
  • Automatic directory creation

2.2 Integrate Buffer into Agent Config

File: aggregator-agent/internal/config/config.go

Add to Config struct:

type Config struct {
    // ... existing fields ...
    EventBufferPath string `json:"event_buffer_path"`
}

Default value: /var/lib/redflag/events_buffer.json
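One way to apply the default is to fill it in after decoding, so existing config files without the new field keep working. A minimal sketch, assuming the config is JSON (as the struct tag suggests); parseConfig and the constant name are hypothetical, and Config here carries only the new field.

```go
package main

import "encoding/json"

// defaultEventBufferPath is the default value stated in this plan.
const defaultEventBufferPath = "/var/lib/redflag/events_buffer.json"

// Config is a stand-in that mirrors only the field added in this section.
type Config struct {
	EventBufferPath string `json:"event_buffer_path"`
}

// parseConfig decodes the config and falls back to the default buffer path
// when the field is missing or empty.
func parseConfig(data []byte) (*Config, error) {
	cfg := &Config{}
	if err := json.Unmarshal(data, cfg); err != nil {
		return nil, err
	}
	if cfg.EventBufferPath == "" {
		cfg.EventBufferPath = defaultEventBufferPath
	}
	return cfg, nil
}
```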

2.3 Test Buffering System

  • Unit tests for buffer operations
  • Test concurrent writes
  • Test buffer overflow (circular behavior)
  • Test file permissions and directory creation

Deliverable: Working event buffer that survives agent restarts


Phase 3: Critical Error Logging Integration (6-7 hours)

Goal: Add P0 error logging to all critical failure points

3.1 Agent Startup Failures (1 hour)

File: aggregator-agent/cmd/agent/main.go

Locations:

  • Lines 259-262: Config load failure
  • Lines 305-307: Registration failure
  • Lines 360-362: Runtime failure

Implementation pattern:

cfg, err := config.Load(configPath, cliFlags)
if err != nil {
    event := &models.SystemEvent{
        EventType:    "agent_startup",
        EventSubtype: "failed",
        Severity:     "critical",
        Component:    "agent",
        Message:      fmt.Sprintf("Configuration load failed: %v", err),
        Metadata: map[string]interface{}{
            "error_type":    "config_load_failed",
            "error_details": err.Error(),
            "config_path":   configPath,
        },
    }
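    // eventBuffer must already exist here (built from a default path), because cfg failed to load
    if bufErr := eventBuffer.BufferEvent(event); bufErr != nil { // Buffer before fatal exit
        log.Printf("Warning: failed to buffer startup event: %v", bufErr)
    }
    log.Fatalf("Failed to load configuration: %v", err)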
}

3.2 Registration & Token Failures (1.5 hours)

File: aggregator-agent/internal/client/client.go

Locations:

  • Lines 121-125: Registration API failure
  • Lines 172-175: Token renewal failure
  • Lines 263-266: Command fetch failure

Implementation pattern:

if resp.StatusCode != http.StatusOK {
    bodyBytes, _ := io.ReadAll(resp.Body)
    
    event := &models.SystemEvent{
        EventType:    "agent_registration",
        EventSubtype: "failed",
        Severity:     "error",
        Component:    "agent",
        Message:      fmt.Sprintf("Registration failed: %s", resp.Status),
        Metadata: map[string]interface{}{
            "error_type":    "registration_failed",
            "http_status":   resp.StatusCode,
            "response_body": string(bodyBytes),
        },
    }
    eventBuffer.BufferEvent(event)
    
    return nil, fmt.Errorf("registration failed: %s - %s", resp.Status, string(bodyBytes))
}

3.3 Migration Failures (1 hour)

File: aggregator-agent/internal/migration/executor.go

Locations:

  • Lines 60-62: Backup creation failure
  • Lines 67-69: Directory migration failure
  • Lines 75-77: Configuration migration failure
  • Lines 96-98: Validation failure

Implementation pattern:

if err := e.createBackups(); err != nil {
    event := &models.SystemEvent{
        EventType:    "agent_migration",
        EventSubtype: "failed",
        Severity:     "error",
        Component:    "migration",
        Message:      fmt.Sprintf("Backup creation failed: %v", err),
        Metadata: map[string]interface{}{
            "error_type":     "backup_creation_failed",
            "migration_from": e.fromVersion,
            "migration_to":   e.toVersion,
        },
    }
    eventBuffer.BufferEvent(event)
    
    return e.completeMigration(false, fmt.Errorf("backup creation failed: %w", err))
}

3.4 Subsystem Scanner Failures (2 hours)

Files: aggregator-agent/internal/orchestrator/*.go

Circuit breaker activations:

// When circuit breaker opens
event := &models.SystemEvent{
    EventType:    "agent_scan",
    EventSubtype: "failed",
    Severity:     "warning",
    Component:    "scanner",
    Message:      fmt.Sprintf("Circuit breaker opened for %s scanner", scannerType),
    Metadata: map[string]interface{}{
        "scanner_type": scannerType,
        "error_type":   "circuit_breaker_activated",
        "failures":     failureCount,
    },
}
eventBuffer.BufferEvent(event)
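To avoid buffering one event per failed scan, the event above can be emitted only on the transition from closed to open. A minimal sketch, assuming a simple consecutive-failure breaker; the breaker type and its methods are hypothetical and not the orchestrator's actual implementation.

```go
package main

// breaker is a minimal consecutive-failure circuit breaker.
type breaker struct {
	threshold int  // failures needed to open the breaker
	failures  int  // consecutive failures seen so far
	open      bool // true once the threshold has been reached
}

// recordFailure counts a failure and returns true only on the call that
// opens the breaker, so the caller buffers exactly one event per opening.
func (b *breaker) recordFailure() bool {
	b.failures++
	if !b.open && b.failures >= b.threshold {
		b.open = true
		return true
	}
	return false
}

// recordSuccess closes the breaker and resets the failure count.
func (b *breaker) recordSuccess() {
	b.failures = 0
	b.open = false
}
```

The caller would build and buffer the agent_scan event only when recordFailure returns true, passing b.failures as the "failures" metadata value.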

Scanner timeouts:

// When scanner times out
event := &models.SystemEvent{
    EventType:    "agent_scan",
    EventSubtype: "failed",
    Severity:     "error",
    Component:    "scanner",
    Message:      fmt.Sprintf("%s scanner timed out after %v", scannerType, timeout),
    Metadata: map[string]interface{}{
        "scanner_type": scannerType,
        "error_type":   "timeout",
        "duration_ms":  duration.Milliseconds(),
    },
}
eventBuffer.BufferEvent(event)

3.5 Server-Side Auth Failures (0.5 hours)

File: aggregator-server/internal/api/handlers/agents.go

Locations:

  • Lines 64-67: Missing registration token
  • Lines 72-74: Invalid/expired token
  • Lines 81-84: Machine ID conflict

Implementation pattern:

if registrationToken == "" {
    event := &models.SystemEvent{
        EventType:    "server_auth",
        EventSubtype: "failed",
        Severity:     "warning",
        Component:    "security",
        Message:      "Registration attempt without token",
        Metadata: map[string]interface{}{
            "error_type": "missing_token",
            "client_ip":  c.ClientIP(),
        },
    }
    if err := h.agentQueries.CreateSystemEvent(event); err != nil {
        log.Printf("Warning: Failed to store auth failure event: %v", err) // Don't fail the request
    }
    
    c.JSON(http.StatusUnauthorized, gin.H{"error": "registration token required"})
    return
}

Deliverable: All P0 agent-side errors are buffered and reported at the next check-in; server-side auth failures are written to system_events directly


Phase 4: Event Reporting Integration (2-3 hours)

Goal: Report buffered events during agent check-in

4.1 Modify Agent Check-In

File: aggregator-agent/internal/client/client.go

In CheckIn() method:

func (c *Client) CheckIn(agentID string, metrics map[string]interface{}) (*CheckInResponse, error) {
    // ... existing code ...
    
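    // Add buffered events to request.
    // Note: GetBufferedEvents clears the buffer, so these events are lost if the
    // check-in request fails unless they are re-buffered.
    bufferedEvents, err := eventBuffer.GetBufferedEvents()
    if err != nil {
        log.Printf("Warning: Failed to get buffered events: %v", err)
    }
    
    if len(bufferedEvents) > 0 {
        if metrics == nil {
            metrics = make(map[string]interface{}) // writing to a nil map would panic
        }
        metrics["buffered_events"] = bufferedEvents
        log.Printf("Reporting %d buffered events to server", len(bufferedEvents))
    }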
    
    // ... rest of check-in code ...
}

4.2 Modify Server GetCommands

File: aggregator-server/internal/api/handlers/agents.go

In GetCommands() method:

// Process buffered events from agent
if bufferedEvents, exists := metrics.Metadata["buffered_events"]; exists {
    if events, ok := bufferedEvents.([]interface{}); ok && len(events) > 0 {
        stored := 0
        for _, e := range events {
            if eventMap, ok := e.(map[string]interface{}); ok {
                event := &models.SystemEvent{
                    AgentID:      &agentID,
                    EventType:    getString(eventMap, "event_type"),
                    EventSubtype: getString(eventMap, "event_subtype"),
                    Severity:     getString(eventMap, "severity"),
                    Component:    getString(eventMap, "component"),
                    Message:      getString(eventMap, "message"),
                    Metadata:     getJSONB(eventMap, "metadata"),
                    CreatedAt:    getTime(eventMap, "created_at"),
                }
                
                if event.EventType != "" && event.EventSubtype != "" && event.Severity != "" {
                    if err := h.agentQueries.CreateSystemEvent(event); err != nil {
                        log.Printf("Warning: Failed to store buffered event: %v", err)
                    } else {
                        stored++
                    }
                }
            }
        }
        if stored > 0 {
            log.Printf("Stored %d buffered events from agent %s", stored, agentID)
        }
    }
}

4.3 Test End-to-End Flow

  • Simulate agent startup failure → Verify event buffered
  • Start agent → Verify event reported in next check-in
  • Check server database → Verify event stored in system_events table
  • Test offline scenario → Verify events survive agent restart
  • Test multiple failures → Verify all events reported

Deliverable: Complete PULL ONLY event reporting pipeline


Phase 5: UI Integration (2-3 hours) - Optional for v0.2.0

Goal: Display critical errors in UI (can slip to v0.2.1)

5.1 Create Event History API Endpoint

File: aggregator-server/internal/api/handlers/events.go

// GetAgentEvents handles GET /api/v1/agents/:id/events
func (h *EventHandler) GetAgentEvents(c *gin.Context) {
    agentID := c.Param("id")
    
    // Query parameters
    limit := 50
    if l := c.Query("limit"); l != "" {
        if parsed, err := strconv.Atoi(l); err == nil && parsed > 0 && parsed <= 1000 {
            limit = parsed
        }
    }
    
    severity := c.Query("severity") // "error,critical" filter
    
    events, err := h.agentQueries.GetSystemEvents(agentID, severity, limit)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to retrieve events"})
        return
    }
    
    c.JSON(http.StatusOK, gin.H{
        "events": events,
        "total":  len(events),
    })
}

5.2 Add UI Polling

File: aggregator-web/src/hooks/useAgentEvents.ts

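import { useEffect, useState } from 'react';

// Poll for agent events every 30 seconds
export const useAgentEvents = (agentId: string) => {
  const [events, setEvents] = useState<SystemEvent[]>([]);
  
  useEffect(() => {
    let cancelled = false; // avoid setting state after unmount
    
    const fetchEvents = async () => {
      try {
        const data = await api.get(`/agents/${agentId}/events?severity=error,critical`);
        if (!cancelled) setEvents(data.events);
      } catch (err) {
        // Keep the last known events; the next poll retries
        console.warn('Failed to fetch agent events', err);
      }
    };
    
    // Initial fetch
    fetchEvents();
    
    // Poll every 30 seconds
    const interval = setInterval(fetchEvents, 30000);
    
    return () => {
      cancelled = true;
      clearInterval(interval);
    };
  }, [agentId]);
  
  return events;
};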

Testing Checklist

Unit Tests

  • Event buffer concurrent writes
  • Buffer overflow behavior (circular)
  • Event serialization/deserialization
  • GetBufferedEvents clears buffer

Integration Tests

  • Startup failure → event buffered → event reported → event stored
  • Registration failure → event appears in UI within 60 seconds
  • Token renewal failure → event logged → admin notified
  • Offline scenario → events survive restart → all reported when online
  • Multiple subsystem failures → all events captured with correct context

Manual Tests

  • Kill agent process mid-scan → verify event appears in UI
  • Use expired registration token → verify security event logged
  • Disconnect network during token renewal → verify event buffered
  • Trigger migration failure → verify event reported

Success Criteria

Must Have for v0.2.0

  • All 4 P0 error types logged (startup, registration, token, migration)
  • Events survive agent restart (buffered to disk)
  • Events reported within 1-2 check-in cycles (30-60 seconds)
  • PULL ONLY architecture (no WebSockets)
  • Server-side auth failures logged
  • Subsystem context captured in event metadata

Should Have for v0.2.0

  • Subsystem scanner failures logged
  • Basic UI displays critical errors
  • Event buffer path configurable (not hardcoded)

Can Wait for v0.2.1

  • Full event history UI with filtering
  • Success events logged
  • Event analytics and metrics

Risk Mitigation

| Risk | Mitigation |
|------|------------|
| Agent can't write buffer file | Log the failure to stdout and continue; never block startup |
| Buffer file grows too large | Circular buffer (max 1000 events), old events dropped |
| Server overwhelmed with events | Rate limiting in event ingestion, backpressure handling |
| Sensitive data in metadata | Sanitize before buffering, exclude secrets/tokens |
| Events lost during crash | Write buffer before fatal exit, fsync if possible |
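The "sanitize before buffering" mitigation could be a key-based redaction pass over event metadata. A minimal sketch: the list of sensitive key fragments is an assumption and would need to match what the agent actually puts into metadata, and sanitizeMetadata and isSensitiveKey are hypothetical names.

```go
package main

import "strings"

// sensitiveKeyParts is an assumed list of key fragments to redact.
var sensitiveKeyParts = []string{"token", "secret", "password", "authorization", "api_key"}

// sanitizeMetadata returns a copy of in with sensitive values replaced.
// Nested maps are sanitized recursively; the input is not modified.
func sanitizeMetadata(in map[string]interface{}) map[string]interface{} {
	out := make(map[string]interface{}, len(in))
	for k, v := range in {
		if isSensitiveKey(k) {
			out[k] = "[REDACTED]"
			continue
		}
		if nested, ok := v.(map[string]interface{}); ok {
			out[k] = sanitizeMetadata(nested)
			continue
		}
		out[k] = v
	}
	return out
}

// isSensitiveKey reports whether key contains any sensitive fragment,
// ignoring case.
func isSensitiveKey(key string) bool {
	lower := strings.ToLower(key)
	for _, part := range sensitiveKeyParts {
		if strings.Contains(lower, part) {
			return true
		}
	}
	return false
}
```

Key-based redaction does not catch secrets embedded in free-text values such as a response body or error string, so those fields may need separate handling.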

Timeline Estimate

Total: 15-17 hours over 2-3 sessions

Session 1 (5-6 hours):

  • Phase 1: Foundation verification (2 hours)
  • Phase 2: Event buffering system (3-4 hours)

Session 2 (6-7 hours):

  • Phase 3: Critical error integration (6-7 hours)

Session 3 (4-5 hours):

  • Phase 4: Event reporting integration (2-3 hours)
  • Testing and polish (2 hours)

Next Steps

  1. Verify current state (Run migration, check subsystem table)
  2. Implement event buffering (Create buffer.go package)
  3. Add error logging to critical failure points
  4. Test end-to-end flow
  5. Document and ship v0.2.0

Decision Point: Do we want to include subsystem scanner failures in v0.2.0 P0 scope, or push to v0.2.1? (Adds ~3 hours)