# P4-001: Agent Retry Logic and Resilience Implementation
**Priority:** P4 (Technical Debt)
**Source Reference:** From analysis of needsfixingbeforepush.md lines 147-186
**Date Identified:** 2025-11-12
## Problem Description
Agent has zero resilience for server connectivity failures. When the server becomes unavailable (502 Bad Gateway, connection refused, network issues), the agent permanently stops checking in and requires manual restart. This violates distributed system expectations and prevents production deployment.
## Impact
- **Production Blocking:** Server maintenance/upgrades break all agents permanently
- **Operational Burden:** Manual systemctl restart required on every agent after server issues
- **Reliability Violation:** No automatic recovery from transient failures
- **Distributed System Anti-Pattern:** Clients should handle server unavailability gracefully
## Current Behavior
1. Server rebuild/maintenance causes 502 responses
2. Agent receives connection error during check-in
3. Agent gives up permanently and stops all future check-ins
4. Agent process continues running but never recovers
5. Manual intervention required: `sudo systemctl restart redflag-agent`
## Proposed Solution
Implement comprehensive retry logic with exponential backoff:
### 1. Connection Retry Logic
```go
type RetryConfig struct {
	MaxRetries    int           // e.g. 5
	InitialDelay  time.Duration // e.g. 5 * time.Second
	MaxDelay      time.Duration // e.g. 5 * time.Minute
	BackoffFactor float64       // e.g. 2.0
}

func (a *Agent) checkInWithRetry() error {
	var lastErr error
	delay := a.retryConfig.InitialDelay
	for attempt := 0; attempt < a.retryConfig.MaxRetries; attempt++ {
		err := a.performCheckIn()
		if err == nil {
			return nil
		}
		lastErr = err
		// Retry on server errors, fail fast on client errors.
		if isClientError(err) {
			return fmt.Errorf("client error: %w", err)
		}
		log.Printf("[Connection] Server unavailable (attempt %d/%d), retrying in %v: %v",
			attempt+1, a.retryConfig.MaxRetries, delay, err)
		time.Sleep(delay)
		delay = time.Duration(float64(delay) * a.retryConfig.BackoffFactor)
		if delay > a.retryConfig.MaxDelay {
			delay = a.retryConfig.MaxDelay
		}
	}
	return fmt.Errorf("server unavailable after %d attempts: %w", a.retryConfig.MaxRetries, lastErr)
}
```
### 2. Circuit Breaker Pattern
```go
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

type CircuitBreaker struct {
	State        State
	Failures     int
	LastFailTime time.Time
	Timeout      time.Duration
	Threshold    int
}

func (cb *CircuitBreaker) Call(fn func() error) error {
	if cb.State == Open {
		if time.Since(cb.LastFailTime) > cb.Timeout {
			cb.State = HalfOpen // allow one probe request through
		} else {
			return errors.New("circuit breaker is open")
		}
	}
	err := fn()
	if err != nil {
		cb.Failures++
		cb.LastFailTime = time.Now()
		// A failed half-open probe re-opens the breaker immediately.
		if cb.Failures >= cb.Threshold || cb.State == HalfOpen {
			cb.State = Open
		}
		return err
	}
	// Success: reset the breaker.
	cb.Failures = 0
	cb.State = Closed
	return nil
}
```
### 3. Connection Health Check
```go
func (a *Agent) healthCheck() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, a.serverURL+"/health", nil)
	if err != nil {
		return err
	}
	resp, err := a.httpClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusOK {
		return nil
	}
	return fmt.Errorf("health check failed: HTTP %d", resp.StatusCode)
}
```
## Definition of Done
- [ ] Agent implements exponential backoff retry for server connection failures
- [ ] Circuit breaker pattern prevents cascading failures
- [ ] Connection health checks detect server availability before operations
- [ ] Agent recovers automatically after server comes back online
- [ ] Detailed logging for troubleshooting connection issues
- [ ] Retry configuration is tunable via agent config
- [ ] Integration tests verify recovery scenarios
## Implementation Details
### File Locations
- **Primary:** `aggregator-agent/cmd/agent/main.go` (check-in loop)
- **Supporting:** `aggregator-agent/internal/resilience/` (new package)
- **Config:** `aggregator-agent/internal/config/config.go`
### Configuration Options
```json
{
  "resilience": {
    "max_retries": 5,
    "initial_delay_seconds": 5,
    "max_delay_minutes": 5,
    "backoff_factor": 2.0,
    "circuit_breaker_threshold": 3,
    "circuit_breaker_timeout_minutes": 5
  }
}
```
### Error Classification
```go
func isClientError(err error) bool {
	var httpErr *HTTPError
	// errors.As also matches HTTPErrors wrapped with %w.
	if errors.As(err, &httpErr) {
		return httpErr.StatusCode >= 400 && httpErr.StatusCode < 500
	}
	return false
}

func isServerError(err error) bool {
	var httpErr *HTTPError
	if errors.As(err, &httpErr) {
		return httpErr.StatusCode >= 500
	}
	// Transport-level failures (no HTTP response at all) count as server-side.
	return strings.Contains(err.Error(), "connection refused")
}
```
## Testing Strategy
### Unit Tests
- Retry logic with exponential backoff
- Circuit breaker state transitions
- Health check timeout handling
- Error classification accuracy
### Integration Tests
- Server restart recovery scenarios
- Network partition simulation
- Long-running stability tests
- Configuration validation
### Manual Test Scenarios
1. **Server Restart Test:**
```bash
# Start with server running
docker-compose up -d
# Verify agent checking in
journalctl -u redflag-agent -f
# Restart server
docker-compose restart server
# Verify agent recovers without manual intervention
```
2. **Extended Downtime Test:**
```bash
# Stop server for 10 minutes
docker-compose stop server
sleep 600
# Start server
docker-compose start server
# Verify agent resumes check-ins
```
3. **Network Partition Test:**
```bash
# Block agent-to-server traffic temporarily (requires root)
sudo iptables -A OUTPUT -d localhost -p tcp --dport 8080 -j DROP
sleep 300
# Remove the block
sudo iptables -D OUTPUT -d localhost -p tcp --dport 8080 -j DROP
# Verify agent recovers
```
## Prerequisites
- Circuit breaker pattern implementation exists (`aggregator-agent/internal/circuitbreaker/`)
- HTTP client configuration supports timeouts
- Logging infrastructure supports structured output
## Effort Estimate
**Complexity:** Medium-High
**Effort:** 2-3 days
- Day 1: Retry logic implementation and basic testing
- Day 2: Circuit breaker integration and configuration
- Day 3: Integration testing and error handling refinement
## Success Metrics
- Agent uptime >99.9% during server maintenance windows
- Zero manual interventions required for server restarts
- Recovery time <30 seconds after server becomes available
- Clear error logs for troubleshooting
- No memory leaks in retry logic