# P0-003 Status Analysis: Agent Retry Logic - OUTDATED

## Current Implementation Status: PARTIALLY IMPLEMENTED (Downgraded from P0 to P1)

### What EXISTS ✅

1. **Basic Retry Loop**: The agent keeps checking in after failures
   - Location: `aggregator-agent/cmd/agent/main.go`, lines 945-967
   - On error: logs the error, sleeps for the polling interval, then continues the loop

2. **Token Renewal Retry**: If a check-in fails with 401, the agent:
   - Attempts token renewal
   - Retries the check-in with the new token
   - Falls back to the normal retry loop if renewal fails

3. **Event Buffering**: System events are buffered when a send fails
   - Location: `internal/acknowledgment/tracker.go`
   - Saves to disk and retries with `maxRetry=10`
   - Persists across agent restarts

4. **Subsystem Circuit Breakers**: Individual scanners are protected
   - APT, DNF, Windows Update, and Winget have circuit breakers
   - Prevents a failing subsystem scanner from stopping the agent

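
The token renewal flow in item 2 can be sketched as follows. This is a minimal illustration, not the code in `main.go`: the `client` type, its `checkIn` and `renew` fields, and `errUnauthorized` are hypothetical stand-ins for whatever the agent actually uses.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnauthorized stands in for an HTTP 401 response (hypothetical name).
var errUnauthorized = errors.New("401 unauthorized")

// client is a hypothetical stand-in for the agent's API client.
type client struct {
	token   string
	checkIn func(token string) error // performs one check-in
	renew   func() (string, error)   // obtains a fresh token
}

// checkInOnce performs one check-in attempt. On a 401 it renews the token
// and retries once. Any other result is returned to the caller, whose loop
// would sleep for the polling interval and try again.
func (c *client) checkInOnce() error {
	err := c.checkIn(c.token)
	if !errors.Is(err, errUnauthorized) {
		return err // success (nil) or a non-auth failure
	}
	newToken, renewErr := c.renew()
	if renewErr != nil {
		return fmt.Errorf("token renewal failed: %w", renewErr)
	}
	c.token = newToken
	return c.checkIn(c.token)
}

// demoRenewal simulates a server that rejects the token "stale" and reports
// whether the retry succeeded and which token ended up in use.
func demoRenewal() (bool, string) {
	c := &client{
		token: "stale",
		checkIn: func(token string) error {
			if token == "stale" {
				return errUnauthorized
			}
			return nil
		},
		renew: func() (string, error) { return "fresh", nil },
	}
	err := c.checkInOnce()
	return err == nil, c.token
}

func main() {
	ok, token := demoRenewal()
	fmt.Println(ok, token) // prints "true fresh"
}
```

Only the 401 path gets an immediate second attempt; every other failure waits for the next polling interval.
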
### What is MISSING ❌

1. **Exponential Backoff**: Sleep periods are fixed (5s or 5m)
   - Problem: 5 minutes is too long for quick recovery
   - Problem: the 5-second rapid polling mode could hammer the server
   - No progressive backoff based on failure count

2. **Circuit Breaker for Server Connection**: The main agent-server connection has no circuit breaker
   - The agent keeps sending requests when the server is completely down, which can extend outages
   - No half-open state for gradual recovery

3. **Connection Health Checks**: No `/health` endpoint check before operations
   - Would prevent unnecessary connection attempts
   - Could provide faster detection of server recovery

4. **Adaptive Polling**: The polling interval doesn't adapt to failure patterns
   - Successful check-ins don't reset failure counters
   - No gradual backoff when failures persist

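
To make item 2 concrete, here is a minimal sketch of a circuit breaker with a half-open state for the agent-server connection. Everything in it is an assumption for illustration: the `breaker` type, the threshold of 3 failures, and the 30-second cooldown do not come from the codebase or the P0-003 plan.

```go
package main

import (
	"fmt"
	"time"
)

type state int

const (
	closed state = iota
	open
	halfOpen
)

func (s state) String() string {
	return [...]string{"closed", "open", "half-open"}[s]
}

// breaker is a hypothetical circuit breaker for the server connection.
type breaker struct {
	state     state
	failures  int              // consecutive failures
	threshold int              // failures that trip the breaker (assumed)
	cooldown  time.Duration    // wait before a half-open probe (assumed)
	openedAt  time.Time        // when the breaker last opened
	now       func() time.Time // injectable clock, for testing
}

// allow reports whether a request may be sent right now. Once the cooldown
// has elapsed, an open breaker moves to half-open and lets a probe through.
func (b *breaker) allow() bool {
	if b.state == open && b.now().Sub(b.openedAt) >= b.cooldown {
		b.state = halfOpen
	}
	return b.state != open
}

// record updates the breaker with the outcome of a request. A failed
// half-open probe reopens the breaker immediately.
func (b *breaker) record(success bool) {
	if success {
		b.state, b.failures = closed, 0
		return
	}
	b.failures++
	if b.state == halfOpen || b.failures >= b.threshold {
		b.state, b.openedAt = open, b.now()
	}
}

// demoBreaker walks through trip, cooldown, probe, and recovery.
func demoBreaker() []string {
	clock := time.Unix(0, 0)
	b := &breaker{threshold: 3, cooldown: 30 * time.Second,
		now: func() time.Time { return clock }}
	for i := 0; i < 3; i++ {
		b.allow()
		b.record(false)
	}
	trace := []string{b.state.String(), fmt.Sprint(b.allow())}
	clock = clock.Add(30 * time.Second) // cooldown elapses
	trace = append(trace, fmt.Sprint(b.allow()), b.state.String())
	b.record(true) // the probe succeeds
	return append(trace, b.state.String())
}

func main() {
	fmt.Println(demoBreaker()) // [open false true half-open closed]
}
```

The sketch assumes a sequential check-in loop, where each `allow` is followed by one `record`, so the half-open state admits a single probe. A concurrent caller would need a mutex and an explicit in-flight probe flag.
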
## Documentation Status: OUTDATED

The `P0-003_Agent-No-Retry-Logic.md` file describes a scenario where agents **permanently stop** after the first failure. This is **NO LONGER ACCURATE**.

- **Current behavior**: Agent retries every polling interval (fixed, no backoff)
- **Described in P0-003**: Agent permanently stops after first failure (WRONG)

The documentation needs to be updated to reflect that basic retry exists but needs resilience improvements.

## Comparison: What Was Planned vs What Exists

### Planned (from P0-003 doc):

- Exponential backoff: `initialDelay=5s, maxDelay=5m, multiplier=2.0`
- Circuit breaker with explicit states (Open/HalfOpen/Closed)
- Connection health checks before operations
- Recovery logging

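
The planned backoff parameters are enough to work out the resulting delay schedule. The sketch below uses the three values from the plan; the function names `backoffDelay` and `backoffSeconds` are made up for this example, and the choice to count failures from 1 is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Parameters quoted from the P0-003 plan.
const (
	initialDelay = 5 * time.Second
	maxDelay     = 5 * time.Minute
	multiplier   = 2.0
)

// backoffDelay returns the wait before the next check-in after the given
// number of consecutive failures (1 = first failure). The delay starts at
// initialDelay, is multiplied for each further failure, and is capped at
// maxDelay.
func backoffDelay(failures int) time.Duration {
	d := float64(initialDelay)
	for i := 1; i < failures; i++ {
		d *= multiplier
		if d >= float64(maxDelay) {
			return maxDelay
		}
	}
	return time.Duration(d)
}

// backoffSeconds is a convenience wrapper returning whole seconds.
func backoffSeconds(failures int) int {
	return int(backoffDelay(failures) / time.Second)
}

func main() {
	for n := 1; n <= 8; n++ {
		fmt.Printf("failure %d: wait %ds\n", n, backoffSeconds(n))
	}
}
```

With these values the waits are 5, 10, 20, 40, 80, and 160 seconds, then 300 seconds from the seventh consecutive failure onward. A successful check-in would reset the failure count, returning the agent to the 5-second delay.
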
### Current Implementation:

- Fixed sleep: `interval = polling_interval` (5s or 300s)
- No circuit breaker for main connection
- Token renewal retry for 401s only
- Basic error logging
- Event buffering to disk

## Is This Still a P0?

**NO** - Should be downgraded to **P1**:

- Basic retry EXISTS, so the issue is no longer critical
- Exponential backoff is a needed enhancement
- A circuit breaker is a needed enhancement
- Fixed-interval retries could cause server overload (P1 concern, not P0)

## Recommendation

**Priority**: Downgrade to **P1 - Important Enhancement**

**Next Steps**:

1. Update the `P0-003_Agent-No-Retry-Logic.md` documentation
2. Add exponential backoff to the main check-in loop
3. Implement a circuit breaker for the agent-server connection
4. Add `/health` endpoint validation
5. Implement adaptive polling based on failure patterns

**Priority Order**:

- Focus on the REAL P0s first: P0-004 (database), P0-001 (rate limiting), P0-002 (session loop)
- Then: Build Orchestrator signing connection (critical for v0.2.x security)
- Then: Enhance retry logic (currently works, just not optimal)
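
For the `/health` validation in the Next Steps, a probe before each check-in might look like the sketch below. It assumes the server answers `GET /health` with 200 OK when healthy; the source names the endpoint but not its response format. The `serverHealthy` helper and the 2-second timeout are likewise illustrative.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// serverHealthy probes baseURL + "/health" and reports whether the server
// answered 200 OK. Any transport error counts as unhealthy.
func serverHealthy(c *http.Client, baseURL string) bool {
	resp, err := c.Get(baseURL + "/health")
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// demoHealth starts a throwaway local server, probes it while it is up,
// shuts it down, and probes again.
func demoHealth() (up, down bool) {
	mux := http.NewServeMux()
	mux.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	srv := httptest.NewServer(mux)
	c := &http.Client{Timeout: 2 * time.Second}

	up = serverHealthy(c, srv.URL)
	srv.Close()
	down = serverHealthy(c, srv.URL)
	return up, down
}

func main() {
	up, down := demoHealth()
	fmt.Println(up, down) // prints "true false"
}
```

A short client timeout matters here: the probe is only useful if it fails faster than the full check-in would.
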