Files
Redflag/docs/historical/criticalissuesorted.md

12 KiB

Critical Issues Resolved - AgentHealth Scanner System

Date: 2025-12-18

Status: RESOLVED


Issue #1: Agent Check-in Interval Override

Problem Description

The agent's polling interval was being incorrectly overridden by scanner subsystem intervals. When the Update scanner was configured for 1440 minutes (24 hours), the agent would sleep for 24 hours instead of the default 5-minute check-in interval, appearing "stuck" and unresponsive.

Root Cause

In aggregator-agent/cmd/agent/main.go, the syncServerConfig() function was incorrectly applying scanner subsystem intervals to the agent's main check-in interval:

// BUGGY CODE (BEFORE)
if intervalMinutes > 0 && intervalMinutes != newCheckInInterval {
    log.Printf("  → %s: interval=%d minutes (changed)", subsystemName, intervalMinutes)
    changes = true
    newCheckInInterval = intervalMinutes // This overrode the agent's check-in interval!
}

Impact

  • Agents would stop checking in for extended periods (hours to days)
  • Appeared as "stuck" or "frozen" agents in the UI
  • Breaks the fundamental promise of 5-minute agent health monitoring
  • Violated ETHOS principle of honest, predictable behavior

Solution Implemented

Separated scanner frequencies from agent check-in frequency:

// FIXED CODE (AFTER)
// Check if interval actually changed (for logging only - don't override agent check-in interval)
if intervalMinutes > 0 {
    log.Printf("  → %s: interval=%d minutes (changed)", subsystemName, intervalMinutes)
    changes = true
    // NOTE: We do NOT update newCheckInInterval here - scanner intervals are 
    // separate from agent check-in interval
}

// NOTE: Server subsystem intervals control scanner frequency, NOT agent check-in frequency
// The agent check-in interval is controlled separately and should not be overridden by scanner intervals

Files Modified

  • aggregator-agent/cmd/agent/main.go:528-606 - syncServerConfig() function

Alternative Approaches Considered

  1. Separate config fields: Could have separate scanner_interval and checkin_interval fields in the server config
  2. Agent-side override: Could add agent-side logic to never allow check-in intervals > 15 minutes
  3. Server-side validation: Could prevent setting scanner intervals that match agent check-in intervals

Decision: Chose the simplest fix that maintains separation of concerns. Scanner intervals control when scanners run, agent check-in interval controls server communication frequency.


Issue #2: Storage/System/Docker Scanners Not Registered

Problem Description

Storage metrics were not appearing in the UI despite the storage scanner being configured and enabled. The agent logs showed:

Error scanning storage: failed to scan storage: scanner not found: storage

Root Cause

Only update scanners (APT, DNF, Windows Update, Winget) were registered with the orchestrator. Storage, System, and Docker scanners were created but never registered, causing orch.ScanSingle(ctx, "storage") to fail.

Registered (Working):

  • APT ✓
  • DNF ✓
  • Windows Update ✓
  • Winget ✓

Not Registered (Broken):

  • Storage ✗
  • System ✗
  • Docker ✗

Impact

  • Storage metrics not collected
  • System metrics not collected
  • Docker scans would fail if using orchestrator
  • Incomplete agent health monitoring
  • Circuit breaker protection missing for these scanners

Solution Implemented

Step 1: Created Scanner Wrappers

Added wrapper implementations in aggregator-agent/internal/orchestrator/scanner_wrappers.go:

// StorageScannerWrapper wraps the Storage scanner to implement the Scanner interface
type StorageScannerWrapper struct {
    scanner *StorageScanner
}

func NewStorageScannerWrapper(s *StorageScanner) *StorageScannerWrapper {
    return &StorageScannerWrapper{scanner: s}
}

func (w *StorageScannerWrapper) IsAvailable() bool {
    return w.scanner.IsAvailable()
}

func (w *StorageScannerWrapper) Scan() ([]client.UpdateReportItem, error) {
    // Storage scanner doesn't return UpdateReportItems, it returns storage metrics
    // This is a limitation of the current interface design
    // For now, return empty slice and handle storage scanning separately
    return []client.UpdateReportItem{}, nil
}

func (w *StorageScannerWrapper) Name() string {
    return w.scanner.Name()
}

Key Architectural Limitation Identified: The Scanner interface expects Scan() ([]client.UpdateReportItem, error), but storage/system scanners return different types (StorageMetric, SystemMetric). This is a fundamental interface design mismatch.

Step 2: Registered All Scanners with Circuit Breakers

In aggregator-agent/cmd/agent/main.go:

// Initialize scanners for storage, system, and docker
storageScanner := orchestrator.NewStorageScanner(version.Version)
systemScanner := orchestrator.NewSystemScanner(version.Version)
dockerScanner, _ := scanner.NewDockerScanner()

// Initialize circuit breakers for all subsystems
storageCB := circuitbreaker.New("Storage", circuitbreaker.Config{...})
systemCB := circuitbreaker.New("System", circuitbreaker.Config{...})
dockerCB := circuitbreaker.New("Docker", circuitbreaker.Config{...})

// Register ALL scanners with the orchestrator
// Update scanners (package management)
scanOrchestrator.RegisterScanner("apt", orchestrator.NewAPTScannerWrapper(aptScanner), aptCB, ...)
scanOrchestrator.RegisterScanner("dnf", orchestrator.NewDNFScannerWrapper(dnfScanner), dnfCB, ...)
// ...

// System scanners (metrics and monitoring) - NEWLY ADDED
scanOrchestrator.RegisterScanner("storage", orchestrator.NewStorageScannerWrapper(storageScanner), storageCB, ...)
scanOrchestrator.RegisterScanner("system", orchestrator.NewSystemScannerWrapper(systemScanner), systemCB, ...)
scanOrchestrator.RegisterScanner("docker", orchestrator.NewDockerScannerWrapper(dockerScanner), dockerCB, ...)

Files Modified

  • aggregator-agent/internal/orchestrator/scanner_wrappers.go:117-162 - Added StorageScannerWrapper and SystemScannerWrapper
  • aggregator-agent/cmd/agent/main.go:654-690 - Registered all scanners with circuit breakers

Architectural Limitation and Technical Debt

The Problem: The current Scanner interface is designed for update scanners that return []client.UpdateReportItem:

type Scanner interface {
    IsAvailable() bool
    Scan() ([]client.UpdateReportItem, error)  // ← Only works for update scanners
    Name() string
}

But storage and system scanners return different types:

  • StorageScanner.ScanStorage() ([]StorageMetric, error)
  • SystemScanner.ScanSystem() ([]SystemMetric, error)

Our Compromise Solution: Created wrappers that implement the Scanner interface but return empty []client.UpdateReportItem{}. The actual scanning is still done directly in the handlers (handleScanStorage, handleScanSystem) using the underlying scanner methods.

Why This Works:

  • Allows registration with orchestrator for ScanSingle() calls
  • Enables circuit breaker protection
  • Maintains existing dedicated reporting endpoints
  • Minimal code changes

Technical Debt Introduced:

  • Wrappers are essentially "shims" that don't perform actual scanning
  • Double initialization of scanners (in orchestrator AND in handlers)
  • Interface mismatch indicates architectural inconsistency

Alternative Approaches Considered

Option A: Generic Scanner Interface (Major Refactor)

type Scanner interface {
    IsAvailable() bool
    Name() string
    // Use generics or interface{} for scan results
    Scan() (interface{}, error)
}

Pros: Unified interface for all scanner types Cons: Major breaking change, requires refactoring all scanner implementations, type safety issues

Option B: Separate Orchestrators

updateOrchestrator := orchestrator.NewOrchestrator()  // For update scanners
metricsOrchestrator := orchestrator.NewOrchestrator() // For metrics scanners

Pros: Clean separation of concerns Cons: More complex agent initialization, duplicate orchestrator logic

Option C: Typed Scanner Registration

type Scanner interface { /* current interface */ }
type MetricsScanner interface {
    ScanMetrics() (interface{}, error)
}

// Register with type checking
scanOrchestrator.RegisterScanner("storage", storageScanner, ...)
scanOrchestrator.RegisterMetricsScanner("storage", storageScanner, ...)

Pros: Type-safe, clear separation Cons: Requires orchestrator to support multiple scanner types, more complex

Option D: Current Approach (Chosen)

  • Create wrappers that satisfy interface but return empty results
  • Keep actual scanning in dedicated handlers
  • Add circuit breaker protection

Pros: Minimal changes, maintains existing architecture, quick to implement Cons: Technical debt, interface mismatch, "shim" wrappers

Ramifications and Future Considerations

Immediate Benefits

Storage metrics now collect successfully
System metrics now collect successfully
All scanners have circuit breaker protection
Consistent error handling across all subsystems
Agent check-in schedule is independent

Technical Debt to Address

  1. Interface Redesign: Consider refactoring to a more flexible scanner interface that can handle different return types
  2. Unified Scanning: Could merge update scanning and metrics collection into a single orchestrated flow
  3. Type Safety: Current approach loses compile-time type safety for metrics scanners
  4. Code Duplication: Scanners are initialized in two places (orchestrator + handlers)

Testing Implications

  • Need to test circuit breaker behavior for storage/system scanners
  • Should verify that wrapper.Scan() is never actually called (should use direct methods)
  • Integration tests needed for full scan flow

Performance Impact

  • Minimal - wrappers are thin proxies
  • Circuit breakers add slight overhead but provide valuable protection
  • No change to actual scanning logic or performance

Recommendations for Future Refactoring

  1. Short Term (Next Release)

    • Add logging to verify wrappers are working as expected
    • Monitor circuit breaker triggers for metrics scanners
    • Document the architectural pattern for future contributors
  2. Medium Term (2-3 Releases)

    • Consider introducing a TypedScanner interface:
      type TypedScanner interface {
          Scanner
          ScanTyped() (TypedScannerResult, error)
      }
      
    • Gradually migrate scanners to new interface
    • Update orchestrator to support typed results
  3. Long Term (Major Version)

    • Complete interface redesign with generics (Go 1.18+)
    • Unified scanning pipeline for all subsystem types
    • Consolidate reporting endpoints

Verification Steps

To verify these fixes work:

  1. Check Agent Logs: Should see successful storage scans

    journalctl -u redflag-agent -f | grep -i storage
    
  2. Check API Response: Should return storage metrics

    curl http://localhost:8080/api/v1/agents/{agent-id}/storage
    
  3. Check UI: AgentStorage component should display metrics

    • Navigate to agent details page
    • Verify "System Resources" section shows disk usage
    • Check last updated timestamp is recent (< 5 minutes)
  4. Check Agent Check-in: Should see check-ins every ~5 minutes

    journalctl -u redflag-agent -f | grep "Checking in"
    

Conclusion

These fixes resolve critical functionality issues while identifying important architectural limitations. The chosen approach balances immediate needs (functionality, stability) with long-term maintainability (minimal changes, clear technical debt).

The interface mismatch between update scanners and metrics scanners represents a fundamental architectural decision point that should be revisited in a future major version. For now, the wrapper pattern provides a pragmatic solution that unblocks critical features while maintaining system stability.

Key Achievement: Agent now correctly separates concerns between check-in frequency (health monitoring) and scanner frequency (data collection), with all scanners properly registered and protected by circuit breakers.