# P0-003 Status Analysis: Agent Retry Logic - OUTDATED
## Current Implementation Status: PARTIALLY IMPLEMENTED (Recommended downgrade from P0 to P1)
### What EXISTS ✅
1. **Basic Retry Loop**: Agent continues checking in after failures
   - Location: `aggregator-agent/cmd/agent/main.go` lines 945-967
   - On error: logs error, sleeps polling interval, continues loop
2. **Token Renewal Retry**: If check-in fails with 401:
   - Attempts token renewal
   - Retries check-in with new token
   - Falls back to normal retry if renewal fails
3. **Event Buffering**: System events are buffered when send fails
   - Location: `internal/acknowledgment/tracker.go`
   - Saves to disk, retries with maxRetry=10
   - Persists across agent restarts
4. **Subsystem Circuit Breakers**: Individual scanner protection
   - APT, DNF, Windows Update, Winget have circuit breakers
   - Prevents subsystem scanner failures from stopping agent
### What is MISSING ❌
1. **Exponential Backoff**: Fixed sleep periods (5s or 5m)
   - Problem: a 5-minute wait is too long for quick recovery
   - Problem: the 5-second rapid polling mode could hammer the server
   - No progressive backoff based on failure count
2. **Circuit Breaker for Server Connection**: Main agent-server connection has no circuit breaker
   - Agent keeps sending requests while the server is completely down, which can prolong the outage
   - No half-open state for gradual recovery
3. **Connection Health Checks**: No /health endpoint check before operations
   - Would prevent unnecessary connection attempts
   - Could provide faster detection of server recovery
4. **Adaptive Polling**: Polling interval doesn't adapt to failure patterns
   - No failure counter exists for successful check-ins to reset
   - No gradual backoff when failures persist
## Documentation Status: OUTDATED
The `P0-003_Agent-No-Retry-Logic.md` file describes a scenario where agents **permanently stop** after the first failure. This is **NO LONGER ACCURATE**.

- **Current behavior**: Agent retries every polling interval (fixed, no backoff)
- **Described in P0-003**: Agent permanently stops after first failure (WRONG)

The documentation needs to be updated to reflect that basic retry exists but still needs resilience improvements.
## Comparison: What Was Planned vs What Exists
### Planned (from P0-003 doc):
- Exponential backoff: `initialDelay=5s, maxDelay=5m, multiplier=2.0`
- Circuit breaker with explicit states (Open/HalfOpen/Closed)
- Connection health checks before operations
- Recovery logging
### Current Implementation:
- Fixed sleep: `interval = polling_interval` (5s or 300s)
- No circuit breaker for main connection
- Token renewal retry for 401s only
- Basic error logging
- Event buffering to disk
## Is This Still a P0?
**NO** - Should be downgraded to **P1**:
- Basic retry EXISTS, so agents no longer stop permanently after a failure (the original critical concern)
- Enhancement needed for exponential backoff
- Enhancement needed for circuit breaker
- Fixed-interval retries could still cause server overload (P1 concern, not P0)
## Recommendation
**Priority**: Downgrade to **P1 - Important Enhancement**
**Next Steps**:
1. Update P0-003_Agent-No-Retry-Logic.md documentation
2. Add exponential backoff to main check-in loop
3. Implement circuit breaker for agent-server connection
4. Add /health endpoint validation
5. Implement adaptive polling based on failure patterns
**Priority Order**:
- Focus on REAL P0s: P0-004 (database), P0-001 (rate limiting), P0-002 (session loop)
- Then: Build Orchestrator signing connection (critical for v0.2.x security)
- Then: Enhance retry logic (currently works, just not optimal)