# P4-001: Agent Retry Logic and Resilience Implementation
**Priority:** P4 (Technical Debt)
**Source Reference:** From analysis of needsfixingbeforepush.md lines 147-186
**Date Identified:** 2025-11-12
## Problem Description
Agent has zero resilience for server connectivity failures. When the server becomes unavailable (502 Bad Gateway, connection refused, network issues), the agent permanently stops checking in and requires manual restart. This violates distributed system expectations and prevents production deployment.
## Impact
- **Production Blocking:** Server maintenance/upgrades break all agents permanently
- **Operational Burden:** Manual systemctl restart required on every agent after server issues
- **Reliability Violation:** No automatic recovery from transient failures
- **Distributed System Anti-Pattern:** Clients should handle server unavailability gracefully
## Current Behavior
1. Server rebuild/maintenance causes 502 responses
2. Agent receives connection error during check-in
3. Agent gives up permanently and stops all future check-ins
4. Agent process continues running but never recovers
5. Manual intervention required: `sudo systemctl restart redflag-agent`
## Proposed Solution
Implement comprehensive retry logic with exponential backoff:
### 1. Connection Retry Logic
```go
type RetryConfig struct {
	MaxRetries    int           // e.g. 5
	InitialDelay  time.Duration // e.g. 5 * time.Second
	MaxDelay      time.Duration // e.g. 5 * time.Minute
	BackoffFactor float64       // e.g. 2.0
}

func (a *Agent) checkInWithRetry() error {
	var lastErr error
	delay := a.retryConfig.InitialDelay
	for attempt := 0; attempt < a.retryConfig.MaxRetries; attempt++ {
		err := a.performCheckIn()
		if err == nil {
			return nil
		}
		lastErr = err
		// Retry on server errors, fail fast on client errors.
		if isClientError(err) {
			return fmt.Errorf("client error: %w", err)
		}
		log.Printf("[Connection] Server unavailable (attempt %d/%d), retrying in %v: %v",
			attempt+1, a.retryConfig.MaxRetries, delay, err)
		time.Sleep(delay)
		delay = time.Duration(float64(delay) * a.retryConfig.BackoffFactor)
		if delay > a.retryConfig.MaxDelay {
			delay = a.retryConfig.MaxDelay
		}
	}
	return fmt.Errorf("server unavailable after %d attempts: %w", a.retryConfig.MaxRetries, lastErr)
}
```
### 2. Circuit Breaker Pattern
```go
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

type CircuitBreaker struct {
	State        State
	Failures     int
	LastFailTime time.Time
	Timeout      time.Duration
	Threshold    int
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	if cb.State == Open {
		if time.Since(cb.LastFailTime) > cb.Timeout {
			cb.State = HalfOpen // allow one probe request through
		} else {
			return errors.New("circuit breaker is open")
		}
	}
	err := fn()
	if err != nil {
		cb.Failures++
		cb.LastFailTime = time.Now()
		// A failed half-open probe re-opens the breaker immediately.
		if cb.Failures >= cb.Threshold || cb.State == HalfOpen {
			cb.State = Open
		}
		return err
	}
	// Success: reset the breaker.
	cb.Failures = 0
	cb.State = Closed
	return nil
}
```
### 3. Connection Health Check
```go
func (a *Agent) healthCheck() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, a.serverURL+"/health", nil)
	if err != nil {
		return err
	}
	resp, err := a.httpClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		return nil
	}
	return fmt.Errorf("health check failed: HTTP %d", resp.StatusCode)
}
```
## Definition of Done
- [ ] Agent implements exponential backoff retry for server connection failures
- [ ] Circuit breaker pattern prevents cascading failures
- [ ] Connection health checks detect server availability before operations
- [ ] Agent recovers automatically after server comes back online
- [ ] Detailed logging for troubleshooting connection issues
- [ ] Retry configuration is tunable via agent config
- [ ] Integration tests verify recovery scenarios
## Implementation Details
### File Locations
- **Primary:** `aggregator-agent/cmd/agent/main.go` (check-in loop)
- **Supporting:** `aggregator-agent/internal/resilience/` (new package)
- **Config:** `aggregator-agent/internal/config/config.go`
### Configuration Options
```json
{
  "resilience": {
    "max_retries": 5,
    "initial_delay_seconds": 5,
    "max_delay_minutes": 5,
    "backoff_factor": 2.0,
    "circuit_breaker_threshold": 3,
    "circuit_breaker_timeout_minutes": 5
  }
}
```
### Error Classification
```go
func isClientError(err error) bool {
	var httpErr *HTTPError
	// errors.As also matches HTTPErrors wrapped with %w.
	if errors.As(err, &httpErr) {
		return httpErr.StatusCode >= 400 && httpErr.StatusCode < 500
	}
	return false
}

func isServerError(err error) bool {
	var httpErr *HTTPError
	if errors.As(err, &httpErr) {
		return httpErr.StatusCode >= 500
	}
	// Transport-level failures (no HTTP response at all) count as server-side.
	return strings.Contains(err.Error(), "connection refused")
}
```
## Testing Strategy
### Unit Tests
- Retry logic with exponential backoff
- Circuit breaker state transitions
- Health check timeout handling
- Error classification accuracy
### Integration Tests
- Server restart recovery scenarios
- Network partition simulation
- Long-running stability tests
- Configuration validation
### Manual Test Scenarios
1. **Server Restart Test:**
```bash
# Start with server running
docker-compose up -d
# Verify agent checking in
journalctl -u redflag-agent -f
# Restart server
docker-compose restart server
# Verify agent recovers without manual intervention
```
2. **Extended Downtime Test:**
```bash
# Stop server for 10 minutes
docker-compose stop server
sleep 600
# Start server
docker-compose start server
# Verify agent resumes check-ins
```
3. **Network Partition Test:**
```bash
# Block agent-to-server traffic temporarily (requires root)
sudo iptables -A OUTPUT -d localhost -p tcp --dport 8080 -j DROP
sleep 300
# Remove the block
sudo iptables -D OUTPUT -d localhost -p tcp --dport 8080 -j DROP
# Verify agent recovers
```
## Prerequisites
- Circuit breaker pattern implementation exists (`aggregator-agent/internal/circuitbreaker/`)
- HTTP client configuration supports timeouts
- Logging infrastructure supports structured output
## Effort Estimate
**Complexity:** Medium-High
**Effort:** 2-3 days
- Day 1: Retry logic implementation and basic testing
- Day 2: Circuit breaker integration and configuration
- Day 3: Integration testing and error handling refinement
## Success Metrics
- Agent uptime >99.9% during server maintenance windows
- Zero manual interventions required for server restarts
- Recovery time <30 seconds after server becomes available
- Clear error logs for troubleshooting
- No memory leaks in retry logic