# P4-001: Agent Retry Logic and Resilience Implementation

**Priority:** P4 (Technical Debt)
**Source Reference:** From analysis of needsfixingbeforepush.md lines 147-186
**Date Identified:** 2025-11-12

## Problem Description

The agent has no resilience against server connectivity failures. When the server becomes unavailable (502 Bad Gateway, connection refused, network outage), the agent permanently stops checking in and requires a manual restart. This violates basic distributed-system expectations and blocks production deployment.

## Impact

- **Production Blocking:** Server maintenance/upgrades break all agents permanently
- **Operational Burden:** Manual `systemctl restart` required on every agent after server issues
- **Reliability Violation:** No automatic recovery from transient failures
- **Distributed System Anti-Pattern:** Clients should handle server unavailability gracefully

## Current Behavior

1. Server rebuild/maintenance causes 502 responses
2. Agent receives connection error during check-in
3. Agent gives up permanently and stops all future check-ins
4. Agent process continues running but never recovers
5. Manual intervention required: `sudo systemctl restart redflag-agent`

## Proposed Solution

Implement comprehensive retry logic with exponential backoff:

### 1. Connection Retry Logic
```go
type RetryConfig struct {
	MaxRetries    int
	InitialDelay  time.Duration
	MaxDelay      time.Duration
	BackoffFactor float64
}

// Proposed defaults: 5 retries, 5s initial delay, 5m cap, doubling each attempt.
var defaultRetryConfig = RetryConfig{
	MaxRetries:    5,
	InitialDelay:  5 * time.Second,
	MaxDelay:      5 * time.Minute,
	BackoffFactor: 2.0,
}

func (a *Agent) checkInWithRetry() error {
	var lastErr error
	delay := a.retryConfig.InitialDelay

	for attempt := 0; attempt < a.retryConfig.MaxRetries; attempt++ {
		err := a.performCheckIn()
		if err == nil {
			return nil
		}

		lastErr = err

		// Retry on server errors, fail fast on client errors.
		if isClientError(err) {
			return fmt.Errorf("client error: %w", err)
		}

		log.Printf("[Connection] Server unavailable (attempt %d/%d), retrying in %v: %v",
			attempt+1, a.retryConfig.MaxRetries, delay, err)

		time.Sleep(delay)
		delay = time.Duration(float64(delay) * a.retryConfig.BackoffFactor)
		if delay > a.retryConfig.MaxDelay {
			delay = a.retryConfig.MaxDelay
		}
	}

	return fmt.Errorf("server unavailable after %d attempts: %w", a.retryConfig.MaxRetries, lastErr)
}
```

### 2. Circuit Breaker Pattern
```go
// State models the three breaker states.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

type CircuitBreaker struct {
	State        State
	Failures     int
	LastFailTime time.Time
	Timeout      time.Duration
	Threshold    int
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	if cb.State == Open {
		if time.Since(cb.LastFailTime) > cb.Timeout {
			cb.State = HalfOpen
		} else {
			return errors.New("circuit breaker is open")
		}
	}

	err := fn()
	if err != nil {
		cb.Failures++
		cb.LastFailTime = time.Now()
		if cb.Failures >= cb.Threshold {
			cb.State = Open
		}
		return err
	}

	// Success: reset breaker
	cb.Failures = 0
	cb.State = Closed
	return nil
}
```

### 3. Connection Health Check
```go
func (a *Agent) healthCheck() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, a.serverURL+"/health", nil)
	if err != nil {
		return err
	}
	resp, err := a.httpClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusOK {
		return nil
	}
	return fmt.Errorf("health check failed: HTTP %d", resp.StatusCode)
}
```

## Definition of Done

- [ ] Agent implements exponential backoff retry for server connection failures
- [ ] Circuit breaker pattern prevents cascading failures
- [ ] Connection health checks detect server availability before operations
- [ ] Agent recovers automatically after server comes back online
- [ ] Detailed logging for troubleshooting connection issues
- [ ] Retry configuration is tunable via agent config
- [ ] Integration tests verify recovery scenarios

## Implementation Details

### File Locations
- **Primary:** `aggregator-agent/cmd/agent/main.go` (check-in loop)
- **Supporting:** `aggregator-agent/internal/resilience/` (new package)
- **Config:** `aggregator-agent/internal/config/config.go`

### Configuration Options
```json
{
  "resilience": {
    "max_retries": 5,
    "initial_delay_seconds": 5,
    "max_delay_minutes": 5,
    "backoff_factor": 2.0,
    "circuit_breaker_threshold": 3,
    "circuit_breaker_timeout_minutes": 5
  }
}
```

### Error Classification
```go
// HTTPError is assumed to wrap non-2xx responses from the server.
type HTTPError struct {
	StatusCode int
}

func (e *HTTPError) Error() string {
	return fmt.Sprintf("HTTP %d", e.StatusCode)
}

func isClientError(err error) bool {
	var httpErr *HTTPError
	if errors.As(err, &httpErr) {
		return httpErr.StatusCode >= 400 && httpErr.StatusCode < 500
	}
	return false
}

func isServerError(err error) bool {
	var httpErr *HTTPError
	if errors.As(err, &httpErr) {
		return httpErr.StatusCode >= 500
	}
	// Treat raw connection failures as retryable server-side errors.
	return strings.Contains(err.Error(), "connection refused")
}
```

## Testing Strategy

### Unit Tests
- Retry logic with exponential backoff
- Circuit breaker state transitions
- Health check timeout handling
- Error classification accuracy

### Integration Tests
- Server restart recovery scenarios
- Network partition simulation
- Long-running stability tests
- Configuration validation

### Manual Test Scenarios
1. **Server Restart Test:**
   ```bash
   # Start with server running
   docker-compose up -d

   # Verify agent checking in
   journalctl -u redflag-agent -f

   # Restart server
   docker-compose restart server

   # Verify agent recovers without manual intervention
   ```

2. **Extended Downtime Test:**
   ```bash
   # Stop server for 10 minutes
   docker-compose stop server
   sleep 600

   # Start server
   docker-compose start server

   # Verify agent resumes check-ins
   ```

3. **Network Partition Test:**
   ```bash
   # Block network access temporarily
   iptables -A OUTPUT -d localhost -p tcp --dport 8080 -j DROP
   sleep 300

   # Remove block
   iptables -D OUTPUT -d localhost -p tcp --dport 8080 -j DROP

   # Verify agent recovers
   ```

## Prerequisites

- Circuit breaker pattern implementation exists (`aggregator-agent/internal/circuitbreaker/`)
- HTTP client configuration supports timeouts
- Logging infrastructure supports structured output

## Effort Estimate

**Complexity:** Medium-High
**Effort:** 2-3 days
- Day 1: Retry logic implementation and basic testing
- Day 2: Circuit breaker integration and configuration
- Day 3: Integration testing and error handling refinement

## Success Metrics

- Agent uptime >99.9% during server maintenance windows
- Zero manual interventions required for server restarts
- Recovery time <30 seconds after server becomes available
- Clear error logs for troubleshooting
- No memory leaks in retry logic