Redflag/docs/3_BACKLOG/P0-003_Agent-Retry-Status-Analysis.md

P0-003 Status Analysis: Agent Retry Logic - OUTDATED

Current Implementation Status: PARTIALLY IMPLEMENTED (recommended downgrade from P0 to P1)

What EXISTS

  1. Basic Retry Loop: Agent continues checking in after failures

    • Location: aggregator-agent/cmd/agent/main.go lines 945-967
    • On error: logs error, sleeps polling interval, continues loop
  2. Token Renewal Retry: If check-in fails with 401:

    • Attempts token renewal
    • Retries check-in with new token
    • Falls back to normal retry if renewal fails
  3. Event Buffering: System events are buffered when send fails

    • Location: internal/acknowledgment/tracker.go
    • Saves to disk, retries with maxRetry=10
    • Persists across agent restarts
  4. Subsystem Circuit Breakers: Individual scanner protection

    • APT, DNF, Windows Update, Winget have circuit breakers
    • Prevents subsystem scanner failures from stopping agent

What is MISSING

  1. Exponential Backoff: The retry loop sleeps for a fixed period (5s or 5m)

    • Problem: a 5-minute interval is too long for quick recovery
    • Problem: the 5-second rapid polling mode could hammer the server
    • No progressive backoff based on failure count
  2. Circuit Breaker for Server Connection: Main agent-server connection has no circuit breaker

    • The agent keeps sending requests while the server is completely down, which can extend outages
    • No half-open state for gradual recovery
  3. Connection Health Checks: No /health endpoint check before operations

    • Would prevent unnecessary connection attempts
    • Could provide faster detection of server recovery
  4. Adaptive Polling: Polling interval doesn't adapt to failure patterns

    • Successful check-ins don't reset failure counters
    • No gradual backoff when failures persist

Documentation Status: OUTDATED

The P0-003_Agent-No-Retry-Logic.md file describes a scenario where agents permanently stop after the first failure. This is NO LONGER ACCURATE.

  • Current behavior: Agent retries every polling interval (fixed, no backoff)
  • Described in P0-003: Agent permanently stops after first failure (WRONG)

Documentation needs to be updated to reflect that basic retry exists but needs resilience improvements.

Comparison: What Was Planned vs What Exists

Planned (from P0-003 doc):

  • Exponential backoff: initialDelay=5s, maxDelay=5m, multiplier=2.0
  • Circuit breaker with explicit states (Open/HalfOpen/Closed)
  • Connection health checks before operations
  • Recovery logging

Current Implementation:

  • Fixed sleep: interval = polling_interval (5s or 300s, i.e. 5m)
  • No circuit breaker for main connection
  • Token renewal retry for 401s only
  • Basic error logging
  • Event buffering to disk

Is This Still a P0?

NO - Should be downgraded to P1:

  • Basic retry EXISTS, so agents do not stop permanently after a failure (not critical)
  • Exponential backoff is still needed as an enhancement
  • A circuit breaker is still needed as an enhancement
  • Fixed-interval retries could cause server overload (P1 concern, not P0)

Recommendation

Priority: Downgrade to P1 - Important Enhancement

Next Steps:

  1. Update P0-003_Agent-No-Retry-Logic.md documentation
  2. Add exponential backoff to main check-in loop
  3. Implement circuit breaker for agent-server connection
  4. Add /health endpoint validation
  5. Implement adaptive polling based on failure patterns

Priority Order:

  • Focus on REAL P0s: P0-004 (database), P0-001 (rate limiting), P0-002 (session loop)
  • Then: Build Orchestrator signing connection (critical for v0.2.x security)
  • Then: Enhance retry logic (currently works, just not optimal)