
P4-001: Agent Retry Logic and Resilience Implementation

Priority: P4 (Technical Debt)
Source Reference: From analysis of needsfixingbeforepush.md, lines 147-186
Date Identified: 2025-11-12

Problem Description

The agent has no resilience to server connectivity failures. When the server becomes unavailable (502 Bad Gateway, connection refused, network outage), the agent permanently stops checking in and requires a manual restart. This violates basic distributed-system expectations and blocks production deployment.

Impact

  • Production Blocking: Server maintenance/upgrades break all agents permanently
  • Operational Burden: Manual systemctl restart required on every agent after server issues
  • Reliability Violation: No automatic recovery from transient failures
  • Distributed System Anti-Pattern: Clients should handle server unavailability gracefully

Current Behavior

  1. Server rebuild/maintenance causes 502 responses
  2. Agent receives connection error during check-in
  3. Agent gives up permanently and stops all future check-ins
  4. Agent process continues running but never recovers
  5. Manual intervention required: sudo systemctl restart redflag-agent

Proposed Solution

Implement comprehensive retry logic with exponential backoff:

1. Connection Retry Logic

type RetryConfig struct {
    MaxRetries    int           // proposed default: 5
    InitialDelay  time.Duration // proposed default: 5 * time.Second
    MaxDelay      time.Duration // proposed default: 5 * time.Minute
    BackoffFactor float64       // proposed default: 2.0
}

func (a *Agent) checkInWithRetry() error {
    var lastErr error
    delay := a.retryConfig.InitialDelay

    for attempt := 0; attempt < a.retryConfig.MaxRetries; attempt++ {
        err := a.performCheckIn()
        if err == nil {
            return nil
        }

        lastErr = err

        // Retry on server errors, fail fast on client errors
        if isClientError(err) {
            return fmt.Errorf("client error: %w", err)
        }

        log.Printf("[Connection] Server unavailable (attempt %d/%d), retrying in %v: %v",
            attempt+1, a.retryConfig.MaxRetries, delay, err)

        time.Sleep(delay)
        delay = time.Duration(float64(delay) * a.retryConfig.BackoffFactor)
        if delay > a.retryConfig.MaxDelay {
            delay = a.retryConfig.MaxDelay
        }
    }

    return fmt.Errorf("server unavailable after %d attempts: %w", a.retryConfig.MaxRetries, lastErr)
}

2. Circuit Breaker Pattern

type CircuitBreaker struct {
    State         State // Closed, Open, HalfOpen
    Failures      int
    LastFailTime  time.Time
    Timeout       time.Duration
    Threshold     int
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.State == Open {
        if time.Since(cb.LastFailTime) > cb.Timeout {
            cb.State = HalfOpen
        } else {
            return errors.New("circuit breaker is open")
        }
    }

    err := fn()
    if err != nil {
        cb.Failures++
        cb.LastFailTime = time.Now()
        if cb.Failures >= cb.Threshold {
            cb.State = Open
        }
        return err
    }

    // Success: reset breaker
    cb.Failures = 0
    cb.State = Closed
    return nil
}
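The State type referenced above (Closed, Open, HalfOpen) is not defined in this ticket; a minimal sketch using iota constants, with a String method for readable log output:

```go
package main

import "fmt"

// State models the three circuit breaker states.
type State int

const (
	Closed   State = iota // requests flow normally
	Open                  // requests are rejected immediately
	HalfOpen              // a single trial request is allowed through
)

// String makes states readable in log lines.
func (s State) String() string {
	switch s {
	case Closed:
		return "closed"
	case Open:
		return "open"
	case HalfOpen:
		return "half-open"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(Closed, Open, HalfOpen) // closed open half-open
}
```

Note the sketch above is not goroutine-safe; the real implementation should guard state transitions with a mutex if Call can be invoked concurrently.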

3. Connection Health Check

func (a *Agent) healthCheck() error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, a.serverURL+"/health", nil)
    if err != nil {
        return err
    }
    resp, err := a.httpClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode == http.StatusOK {
        return nil
    }
    return fmt.Errorf("health check failed: HTTP %d", resp.StatusCode)
}

Definition of Done

  • Agent implements exponential backoff retry for server connection failures
  • Circuit breaker pattern prevents cascading failures
  • Connection health checks detect server availability before operations
  • Agent recovers automatically after server comes back online
  • Detailed logging for troubleshooting connection issues
  • Retry configuration is tunable via agent config
  • Integration tests verify recovery scenarios

Implementation Details

File Locations

  • Primary: aggregator-agent/cmd/agent/main.go (check-in loop)
  • Supporting: aggregator-agent/internal/resilience/ (new package)
  • Config: aggregator-agent/internal/config/config.go

Configuration Options

{
  "resilience": {
    "max_retries": 5,
    "initial_delay_seconds": 5,
    "max_delay_minutes": 5,
    "backoff_factor": 2.0,
    "circuit_breaker_threshold": 3,
    "circuit_breaker_timeout_minutes": 5
  }
}

Error Classification

func isClientError(err error) bool {
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        return httpErr.StatusCode >= 400 && httpErr.StatusCode < 500
    }
    return false
}

func isServerError(err error) bool {
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        return httpErr.StatusCode >= 500
    }
    // Transport-level failures (no HTTP response at all) count as server-side.
    return strings.Contains(err.Error(), "connection refused")
}
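The HTTPError type these helpers match on is not defined in this ticket; a minimal sketch of what it might look like (the real agent type may differ):

```go
package main

import "fmt"

// HTTPError carries the HTTP status of a failed request so callers
// can classify it as a client (4xx) or server (5xx) failure.
type HTTPError struct {
	StatusCode int
	Status     string
}

// Error satisfies the error interface.
func (e *HTTPError) Error() string {
	return fmt.Sprintf("server returned HTTP %d %s", e.StatusCode, e.Status)
}

func main() {
	err := &HTTPError{StatusCode: 502, Status: "Bad Gateway"}
	fmt.Println(err) // server returned HTTP 502 Bad Gateway
}
```

The check-in code should wrap this with %w so classification via errors.As still works through wrapped errors.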

Testing Strategy

Unit Tests

  • Retry logic with exponential backoff
  • Circuit breaker state transitions
  • Health check timeout handling
  • Error classification accuracy

Integration Tests

  • Server restart recovery scenarios
  • Network partition simulation
  • Long-running stability tests
  • Configuration validation

Manual Test Scenarios

  1. Server Restart Test:

    # Start with server running
    docker-compose up -d
    
    # Verify agent checking in
    journalctl -u redflag-agent -f
    
    # Restart server
    docker-compose restart server
    
    # Verify agent recovers without manual intervention
    
  2. Extended Downtime Test:

    # Stop server for 10 minutes
    docker-compose stop server
    sleep 600
    
    # Start server
    docker-compose start server
    
    # Verify agent resumes check-ins
    
  3. Network Partition Test:

    # Block network access temporarily (requires root)
    sudo iptables -A OUTPUT -d localhost -p tcp --dport 8080 -j DROP
    sleep 300
    
    # Remove block
    sudo iptables -D OUTPUT -d localhost -p tcp --dport 8080 -j DROP
    
    # Verify agent recovers
    

Prerequisites

  • Circuit breaker pattern implementation exists (aggregator-agent/internal/circuitbreaker/)
  • HTTP client configuration supports timeouts
  • Logging infrastructure supports structured output

Effort Estimate

Complexity: Medium-High
Effort: 2-3 days

  • Day 1: Retry logic implementation and basic testing
  • Day 2: Circuit breaker integration and configuration
  • Day 3: Integration testing and error handling refinement

Success Metrics

  • Agent uptime >99.9% during server maintenance windows
  • Zero manual interventions required for server restarts
  • Recovery time <30 seconds after server becomes available
  • Clear error logs for troubleshooting
  • No memory leaks in retry logic