# P0-003: Agent No Retry Logic

**Priority:** P0 (Critical)
**Source Reference:** From needsfixingbeforepush.md line 147
**Date Identified:** 2025-11-12

## Problem Description

The agent permanently stops checking in after encountering a server connection failure (502 Bad Gateway or connection refused). No retry logic, exponential backoff, or circuit breaker pattern is implemented, so a manual agent restart is required to recover.

## Reproduction Steps

1. Start the agent and server normally.
2. Trigger a server failure/rebuild:
   ```bash
   docker-compose restart server
   # OR rebuild the server, causing temporary 502 responses
   ```
3. The agent receives a connection error during check-in:
   ```
   Post "http://localhost:8080/api/v1/agents/.../commands": dial tcp 127.0.0.1:8080: connect: connection refused
   ```
4. **BUG:** The agent gives up permanently and stops all future check-ins.
5. The agent process continues running but never recovers.
6. Manual intervention is required:
   ```bash
   sudo systemctl restart redflag-agent
   ```

## Root Cause Analysis

The agent's check-in loop lacks resilience patterns for handling temporary server failures:

1. **No Retry Logic:** A single failure causes a permanent stop.
2. **No Exponential Backoff:** No progressive delay between retry attempts.
3. **No Circuit Breaker:** No pattern for handling repeated failures.
4. **No Connection Health Checks:** No pre-request connectivity validation.
5. **No Recovery Logging:** No visibility into recovery attempts.

## Current Vulnerable Code Pattern

```go
// Current vulnerable implementation (hypothetical)
func (a *Agent) checkIn() {
	for {
		// Make request to server
		resp, err := http.Post(serverURL+"/commands", ...)
		if err != nil {
			log.Printf("Check-in failed: %v", err)
			return // ❌ Gives up immediately
		}
		processResponse(resp)
		time.Sleep(5 * time.Minute)
	}
}
```

## Proposed Solution

Implement comprehensive resilience patterns:
### 1. Exponential Backoff Retry

```go
type RetryConfig struct {
	InitialDelay      time.Duration
	MaxDelay          time.Duration
	MaxRetries        int
	BackoffMultiplier float64
}

func (a *Agent) checkInWithRetry() {
	retryConfig := RetryConfig{
		InitialDelay:      5 * time.Second,
		MaxDelay:          5 * time.Minute,
		MaxRetries:        10,
		BackoffMultiplier: 2.0,
	}

	for {
		// withRetry applies the backoff internally; each call starts
		// again from InitialDelay, so no per-loop reset is needed.
		err := a.withRetry(func() error {
			return a.performCheckIn()
		}, retryConfig)
		if err != nil {
			log.Printf("Check-in failed after %d attempts: %v", retryConfig.MaxRetries, err)
		}

		time.Sleep(5 * time.Minute) // Normal check-in interval
	}
}
```

### 2. Circuit Breaker Pattern

```go
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

var ErrCircuitBreakerOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
	State            State // Closed, Open, HalfOpen
	Failures         int
	FailureThreshold int
	Timeout          time.Duration
	LastFailureTime  time.Time
}

func (cb *CircuitBreaker) Call(operation func() error) error {
	if cb.State == Open {
		if time.Since(cb.LastFailureTime) > cb.Timeout {
			cb.State = HalfOpen
		} else {
			return ErrCircuitBreakerOpen
		}
	}

	err := operation()
	if err != nil {
		cb.Failures++
		if cb.Failures >= cb.FailureThreshold {
			cb.State = Open
			cb.LastFailureTime = time.Now()
		}
		return err
	}

	// Success: reset the failure count and close the circuit
	cb.Failures = 0
	cb.State = Closed
	return nil
}
```
### 3. Connection Health Check

```go
func (a *Agent) healthCheck() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, "GET", a.serverURL+"/health", nil)
	if err != nil {
		return fmt.Errorf("building health check request: %w", err)
	}

	resp, err := a.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("health check failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health check returned: %d", resp.StatusCode)
	}
	return nil
}
```

## Definition of Done

- [ ] Agent automatically retries failed check-ins with exponential backoff
- [ ] Circuit breaker prevents overwhelming a struggling server
- [ ] Connection health checks validate server availability before operations
- [ ] Recovery attempts are logged for debugging
- [ ] Agent resumes normal operation when the server comes back online
- [ ] Configurable retry parameters for different environments

## Test Plan

1. **Basic Recovery Test:**
   ```bash
   # Start agent and monitor logs
   sudo journalctl -u redflag-agent -f

   # In another terminal, restart server
   docker-compose restart server

   # Expected: Agent logs show retry attempts with backoff
   # Expected: Agent resumes check-ins when server recovers
   # Expected: No manual intervention required
   ```
2. **Extended Failure Test:**
   ```bash
   # Stop server for extended period
   docker-compose stop server
   sleep 10  # Agent should try multiple times

   # Start server
   docker-compose start server

   # Expected: Agent detects server recovery and resumes
   # Expected: No manual systemctl restart needed
   ```
3. **Circuit Breaker Test:**
   ```bash
   # Simulate repeated failures
   for i in {1..20}; do
     docker-compose restart server
     sleep 2
   done

   # Expected: Circuit breaker opens after threshold
   # Expected: Agent stops trying for configured timeout period
   # Expected: Circuit breaker enters half-open state after timeout
   ```
4. **Configuration Test:**
   ```bash
   # Test with different retry configurations
   # Verify configurable parameters work correctly
   # Test edge cases (max retries = 0, very long delays, etc.)
   ```

## Files to Modify

- `aggregator-agent/cmd/agent/main.go` (check-in loop logic)
- `aggregator-agent/internal/resilience/` (new package for retry/circuit breaker)
- `aggregator-agent/internal/health/` (new package for health checks)
- Agent configuration files for retry parameters

## Impact

- **Production Reliability:** Enables automatic recovery from server maintenance
- **Operational Efficiency:** Eliminates the need for manual agent restarts
- **User Experience:** Transparent handling of server issues
- **Scalability:** Supports large deployments with automatic recovery
- **Monitoring:** Provides visibility into recovery attempts

## Configuration Options

```yaml
# Agent config additions
resilience:
  retry:
    initial_delay: 5s
    max_delay: 5m
    max_retries: 10
    backoff_multiplier: 2.0
  circuit_breaker:
    failure_threshold: 5
    timeout: 30s
    half_open_max_calls: 3
  health_check:
    enabled: true
    interval: 30s
    timeout: 5s
```

## Monitoring and Observability

### Metrics to Track

- Retry attempt counts
- Circuit breaker state changes
- Connection failure rates
- Recovery time statistics
- Health check success/failure rates

### Log Examples

```
2025/11/12 14:30:15 [RETRY] Check-in failed, retry 1/10 in 5s: connection refused
2025/11/12 14:30:20 [RETRY] Check-in failed, retry 2/10 in 10s: connection refused
2025/11/12 14:30:35 [CIRCUIT_BREAKER] Opening circuit after 5 consecutive failures
2025/11/12 14:31:05 [CIRCUIT_BREAKER] Entering half-open state
2025/11/12 14:31:05 [RECOVERY] Health check passed, resuming normal operations
2025/11/12 14:31:05 [CHECKIN] Successfully checked in after server recovery
```

## Verification Commands

After implementation:

```bash
# Monitor agent during server restart
sudo journalctl -u redflag-agent -f | grep -E "(RETRY|CIRCUIT|RECOVERY|HEALTH)"

# Test recovery without manual intervention
docker-compose stop server
sleep 15
docker-compose start server
# Agent should recover automatically
```