Fimeg/Redflag

Fork 0

Files

Fimeg 484a7f77ce Add docs and project files - force for Culurien

2026-03-28 20:46:24 -04:00

12 KiB

Raw Permalink Blame History

Critical Issues Resolved - AgentHealth Scanner System

Date: 2025-12-18

Status: RESOLVED

Issue #1: Agent Check-in Interval Override

Problem Description

The agent's polling interval was being incorrectly overridden by scanner subsystem intervals. When the Update scanner was configured for 1440 minutes (24 hours), the agent would sleep for 24 hours instead of the default 5-minute check-in interval, appearing "stuck" and unresponsive.

Root Cause

In aggregator-agent/cmd/agent/main.go, the syncServerConfig() function was incorrectly applying scanner subsystem intervals to the agent's main check-in interval:

// BUGGY CODE (BEFORE)
if intervalMinutes > 0 && intervalMinutes != newCheckInInterval {
    log.Printf("  → %s: interval=%d minutes (changed)", subsystemName, intervalMinutes)
    changes = true
    newCheckInInterval = intervalMinutes // This overrode the agent's check-in interval!
}

Impact

Agents would stop checking in for extended periods (hours to days)
Appeared as "stuck" or "frozen" agents in the UI
Breaks the fundamental promise of 5-minute agent health monitoring
Violated ETHOS principle of honest, predictable behavior

Solution Implemented

Separated scanner frequencies from agent check-in frequency:

// FIXED CODE (AFTER)
// Check if interval actually changed (for logging only - don't override agent check-in interval)
if intervalMinutes > 0 {
    log.Printf("  → %s: interval=%d minutes (changed)", subsystemName, intervalMinutes)
    changes = true
    // NOTE: We do NOT update newCheckInInterval here - scanner intervals are 
    // separate from agent check-in interval
}

// NOTE: Server subsystem intervals control scanner frequency, NOT agent check-in frequency
// The agent check-in interval is controlled separately and should not be overridden by scanner intervals

Files Modified

aggregator-agent/cmd/agent/main.go:528-606 - syncServerConfig() function

Alternative Approaches Considered

Separate config fields: Could have separate scanner_interval and checkin_interval fields in the server config
Agent-side override: Could add agent-side logic to never allow check-in intervals > 15 minutes
Server-side validation: Could prevent setting scanner intervals that match agent check-in intervals

Decision: Chose the simplest fix that maintains separation of concerns. Scanner intervals control when scanners run, agent check-in interval controls server communication frequency.

Issue #2: Storage/System/Docker Scanners Not Registered

Problem Description

Storage metrics were not appearing in the UI despite the storage scanner being configured and enabled. The agent logs showed:

Error scanning storage: failed to scan storage: scanner not found: storage

Root Cause

Only update scanners (APT, DNF, Windows Update, Winget) were registered with the orchestrator. Storage, System, and Docker scanners were created but never registered, causing orch.ScanSingle(ctx, "storage") to fail.

Registered (Working):

APT ✓
DNF ✓
Windows Update ✓
Winget ✓

Not Registered (Broken):

Storage ✗
System ✗
Docker ✗

Impact

Storage metrics not collected
System metrics not collected
Docker scans would fail if using orchestrator
Incomplete agent health monitoring
Circuit breaker protection missing for these scanners

Solution Implemented

Step 1: Created Scanner Wrappers

Added wrapper implementations in aggregator-agent/internal/orchestrator/scanner_wrappers.go:

// StorageScannerWrapper wraps the Storage scanner to implement the Scanner interface
type StorageScannerWrapper struct {
    scanner *StorageScanner
}

func NewStorageScannerWrapper(s *StorageScanner) *StorageScannerWrapper {
    return &StorageScannerWrapper{scanner: s}
}

func (w *StorageScannerWrapper) IsAvailable() bool {
    return w.scanner.IsAvailable()
}

func (w *StorageScannerWrapper) Scan() ([]client.UpdateReportItem, error) {
    // Storage scanner doesn't return UpdateReportItems, it returns storage metrics
    // This is a limitation of the current interface design
    // For now, return empty slice and handle storage scanning separately
    return []client.UpdateReportItem{}, nil
}

func (w *StorageScannerWrapper) Name() string {
    return w.scanner.Name()
}

Key Architectural Limitation Identified: The Scanner interface expects Scan() ([]client.UpdateReportItem, error), but storage/system scanners return different types (StorageMetric, SystemMetric). This is a fundamental interface design mismatch.

Step 2: Registered All Scanners with Circuit Breakers

In aggregator-agent/cmd/agent/main.go:

// Initialize scanners for storage, system, and docker
storageScanner := orchestrator.NewStorageScanner(version.Version)
systemScanner := orchestrator.NewSystemScanner(version.Version)
dockerScanner, _ := scanner.NewDockerScanner()

// Initialize circuit breakers for all subsystems
storageCB := circuitbreaker.New("Storage", circuitbreaker.Config{...})
systemCB := circuitbreaker.New("System", circuitbreaker.Config{...})
dockerCB := circuitbreaker.New("Docker", circuitbreaker.Config{...})

// Register ALL scanners with the orchestrator
// Update scanners (package management)
scanOrchestrator.RegisterScanner("apt", orchestrator.NewAPTScannerWrapper(aptScanner), aptCB, ...)
scanOrchestrator.RegisterScanner("dnf", orchestrator.NewDNFScannerWrapper(dnfScanner), dnfCB, ...)
// ...

// System scanners (metrics and monitoring) - NEWLY ADDED
scanOrchestrator.RegisterScanner("storage", orchestrator.NewStorageScannerWrapper(storageScanner), storageCB, ...)
scanOrchestrator.RegisterScanner("system", orchestrator.NewSystemScannerWrapper(systemScanner), systemCB, ...)
scanOrchestrator.RegisterScanner("docker", orchestrator.NewDockerScannerWrapper(dockerScanner), dockerCB, ...)

Files Modified

aggregator-agent/internal/orchestrator/scanner_wrappers.go:117-162 - Added StorageScannerWrapper and SystemScannerWrapper
aggregator-agent/cmd/agent/main.go:654-690 - Registered all scanners with circuit breakers

Architectural Limitation and Technical Debt

The Problem: The current Scanner interface is designed for update scanners that return []client.UpdateReportItem:

type Scanner interface {
    IsAvailable() bool
    Scan() ([]client.UpdateReportItem, error)  // ← Only works for update scanners
    Name() string
}

But storage and system scanners return different types:

StorageScanner.ScanStorage() ([]StorageMetric, error)
SystemScanner.ScanSystem() ([]SystemMetric, error)

Our Compromise Solution: Created wrappers that implement the Scanner interface but return empty []client.UpdateReportItem{}. The actual scanning is still done directly in the handlers (handleScanStorage, handleScanSystem) using the underlying scanner methods.

Why This Works:

Allows registration with orchestrator for ScanSingle() calls
Enables circuit breaker protection
Maintains existing dedicated reporting endpoints
Minimal code changes

Technical Debt Introduced:

Wrappers are essentially "shims" that don't perform actual scanning
Double initialization of scanners (in orchestrator AND in handlers)
Interface mismatch indicates architectural inconsistency

Alternative Approaches Considered

Option A: Generic Scanner Interface (Major Refactor)

type Scanner interface {
    IsAvailable() bool
    Name() string
    // Use generics or interface{} for scan results
    Scan() (interface{}, error)
}

Pros: Unified interface for all scanner types Cons: Major breaking change, requires refactoring all scanner implementations, type safety issues

Option B: Separate Orchestrators

updateOrchestrator := orchestrator.NewOrchestrator()  // For update scanners
metricsOrchestrator := orchestrator.NewOrchestrator() // For metrics scanners

Pros: Clean separation of concerns Cons: More complex agent initialization, duplicate orchestrator logic

Option C: Typed Scanner Registration

type Scanner interface { /* current interface */ }
type MetricsScanner interface {
    ScanMetrics() (interface{}, error)
}

// Register with type checking
scanOrchestrator.RegisterScanner("storage", storageScanner, ...)
scanOrchestrator.RegisterMetricsScanner("storage", storageScanner, ...)

Pros: Type-safe, clear separation Cons: Requires orchestrator to support multiple scanner types, more complex

Option D: Current Approach (Chosen)

Create wrappers that satisfy interface but return empty results
Keep actual scanning in dedicated handlers
Add circuit breaker protection

Pros: Minimal changes, maintains existing architecture, quick to implement Cons: Technical debt, interface mismatch, "shim" wrappers

Ramifications and Future Considerations

Immediate Benefits

✅ Storage metrics now collect successfully
✅ System metrics now collect successfully
✅ All scanners have circuit breaker protection
✅ Consistent error handling across all subsystems
✅ Agent check-in schedule is independent

Technical Debt to Address

Interface Redesign: Consider refactoring to a more flexible scanner interface that can handle different return types
Unified Scanning: Could merge update scanning and metrics collection into a single orchestrated flow
Type Safety: Current approach loses compile-time type safety for metrics scanners
Code Duplication: Scanners are initialized in two places (orchestrator + handlers)

Testing Implications

Need to test circuit breaker behavior for storage/system scanners
Should verify that wrapper.Scan() is never actually called (should use direct methods)
Integration tests needed for full scan flow

Performance Impact

Minimal - wrappers are thin proxies
Circuit breakers add slight overhead but provide valuable protection
No change to actual scanning logic or performance

Recommendations for Future Refactoring

Short Term (Next Release)
- Add logging to verify wrappers are working as expected
- Monitor circuit breaker triggers for metrics scanners
- Document the architectural pattern for future contributors
Medium Term (2-3 Releases)
- Consider introducing a TypedScanner interface:
```
type TypedScanner interface {
    Scanner
    ScanTyped() (TypedScannerResult, error)
}
```
- Gradually migrate scanners to new interface
- Update orchestrator to support typed results
Long Term (Major Version)
- Complete interface redesign with generics (Go 1.18+)
- Unified scanning pipeline for all subsystem types
- Consolidate reporting endpoints

Verification Steps

To verify these fixes work:

Check Agent Logs: Should see successful storage scans
```
journalctl -u redflag-agent -f | grep -i storage
```

Check API Response: Should return storage metrics

curl http://localhost:8080/api/v1/agents/{agent-id}/storage

Check UI: AgentStorage component should display metrics
- Navigate to agent details page
- Verify "System Resources" section shows disk usage
- Check last updated timestamp is recent (< 5 minutes)
Check Agent Check-in: Should see check-ins every ~5 minutes
```
journalctl -u redflag-agent -f | grep "Checking in"
```

Conclusion

These fixes resolve critical functionality issues while identifying important architectural limitations. The chosen approach balances immediate needs (functionality, stability) with long-term maintainability (minimal changes, clear technical debt).

The interface mismatch between update scanners and metrics scanners represents a fundamental architectural decision point that should be revisited in a future major version. For now, the wrapper pattern provides a pragmatic solution that unblocks critical features while maintaining system stability.

Key Achievement: Agent now correctly separates concerns between check-in frequency (health monitoring) and scanner frequency (data collection), with all scanners properly registered and protected by circuit breakers.

12 KiB Raw Permalink Blame History

Critical Issues Resolved - AgentHealth Scanner System

Date: 2025-12-18

Status: RESOLVED

Issue #1: Agent Check-in Interval Override

Problem Description

Root Cause

Impact

Solution Implemented

Files Modified

Alternative Approaches Considered

Issue #2: Storage/System/Docker Scanners Not Registered

Problem Description

Root Cause

Impact

Solution Implemented

Step 1: Created Scanner Wrappers

Step 2: Registered All Scanners with Circuit Breakers

Files Modified

Architectural Limitation and Technical Debt

Alternative Approaches Considered

Option A: Generic Scanner Interface (Major Refactor)

Option B: Separate Orchestrators

Option C: Typed Scanner Registration

Option D: Current Approach (Chosen)

Ramifications and Future Considerations

Immediate Benefits

Technical Debt to Address

Testing Implications

Performance Impact

Recommendations for Future Refactoring

Verification Steps

Conclusion

12 KiB

Raw Permalink Blame History