P0-003 Status Analysis: Agent Retry Logic - OUTDATED
Current Implementation Status: PARTIALLY IMPLEMENTED (recommended downgrade from P0 to P1)
What EXISTS ✅
- Basic Retry Loop: Agent continues checking in after failures
  - Location: aggregator-agent/cmd/agent/main.go, lines 945-967
  - On error: logs error, sleeps polling interval, continues loop
- Token Renewal Retry: If check-in fails with 401:
  - Attempts token renewal
  - Retries check-in with new token
  - Falls back to normal retry if renewal fails
- Event Buffering: System events are buffered when send fails
  - Location: internal/acknowledgment/tracker.go
  - Saves to disk, retries with maxRetry=10
  - Persists across agent restarts
- Subsystem Circuit Breakers: Individual scanner protection
  - APT, DNF, Windows Update, Winget have circuit breakers
  - Prevents subsystem scanner failures from stopping agent
What is MISSING ❌
- Exponential Backoff: Fixed sleep periods (5s or 5m)
  - Problem: 5 minutes is too long for quick recovery
  - Problem: 5 seconds rapid polling mode could hammer server
  - No progressive backoff based on failure count
- Circuit Breaker for Server Connection: Main agent-server connection has no circuit breaker
  - Extends outages by continuing to try when server is completely down
  - No half-open state for gradual recovery
- Connection Health Checks: No /health endpoint check before operations
  - Would prevent unnecessary connection attempts
  - Could provide faster detection of server recovery
- Adaptive Polling: Polling interval doesn't adapt to failure patterns
  - Successful check-ins don't reset failure counters
  - No gradual backoff when failures persist
Documentation Status: OUTDATED
The P0-003_Agent-No-Retry-Logic.md file describes a scenario where agents permanently stop after the first failure. This is NO LONGER ACCURATE.
Current behavior: Agent retries every polling interval (fixed, no backoff)
Described in P0-003: Agent permanently stops after first failure (WRONG)
Documentation needs to be updated to reflect that basic retry exists but needs resilience improvements.
Comparison: What Was Planned vs What Exists
Planned (from P0-003 doc):
- Exponential backoff: initialDelay=5s, maxDelay=5m, multiplier=2.0
- Circuit breaker with explicit states (Open/HalfOpen/Closed)
- Connection health checks before operations
- Recovery logging
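The planned backoff parameters can be turned into a small sketch. The values initialDelay=5s, maxDelay=5m, and multiplier=2.0 come from the plan above; the `backoff` type and its method names are hypothetical, and jitter is left out because the plan does not mention it.

```go
package main

import "time"

// backoff is a hypothetical helper that tracks consecutive check-in failures.
type backoff struct {
	initialDelay time.Duration
	maxDelay     time.Duration
	multiplier   float64
	failures     int
}

// newBackoff uses the parameters listed in the P0-003 plan.
func newBackoff() *backoff {
	return &backoff{
		initialDelay: 5 * time.Second,
		maxDelay:     5 * time.Minute,
		multiplier:   2.0,
	}
}

// next returns the delay to wait after a failed check-in and counts the
// failure. The delay is initialDelay * multiplier^failures, capped at maxDelay.
func (b *backoff) next() time.Duration {
	d := float64(b.initialDelay)
	for i := 0; i < b.failures; i++ {
		d *= b.multiplier
		if d >= float64(b.maxDelay) {
			d = float64(b.maxDelay)
			break
		}
	}
	b.failures++
	return time.Duration(d)
}

// reset clears the failure counter; call it after a successful check-in.
func (b *backoff) reset() { b.failures = 0 }
```

With these parameters the delays run 5s, 10s, 20s, 40s, 80s, 160s, and then stay at the 5m cap until a successful check-in calls `reset`.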
Current Implementation:
- Fixed sleep: interval = polling_interval (5s or 300s)
- No circuit breaker for main connection
- Token renewal retry for 401s only
- Basic error logging
- Event buffering to disk
Is This Still a P0?
NO - Should be downgraded to P1:
- Basic retry EXISTS, so a failure no longer stops the agent (not critical)
- Exponential backoff is still needed as an enhancement
- A circuit breaker is still needed as an enhancement
- Fixed-interval retries could overload the server (P1 concern, not P0)
Recommendation
Priority: Downgrade to P1 - Important Enhancement
Next Steps:
- Update P0-003_Agent-No-Retry-Logic.md documentation
- Add exponential backoff to main check-in loop
- Implement circuit breaker for agent-server connection
- Add /health endpoint validation
- Implement adaptive polling based on failure patterns
Priority Order:
- Focus on REAL P0s: P0-004 (database), P0-001 (rate limiting), P0-002 (session loop)
- Then: Build Orchestrator signing connection (critical for v0.2.x security)
- Then: Enhance retry logic (currently works, just not optimal)