
P0-003: Agent No Retry Logic

Priority: P0 (Critical)
Source Reference: needsfixingbeforepush.md line 147
Date Identified: 2025-11-12

Problem Description

The agent permanently stops checking in after encountering a single server connection failure (502 Bad Gateway or connection refused). No retry logic, exponential backoff, or circuit breaker pattern is implemented, so recovery requires a manual agent restart.

Reproduction Steps

  1. Start agent and server normally
  2. Trigger server failure/rebuild:
    docker-compose restart server
    # OR rebuild server causing temporary 502 responses
    
  3. Agent receives connection error during check-in:
    Post "http://localhost:8080/api/v1/agents/.../commands": dial tcp 127.0.0.1:8080: connect: connection refused
    
  4. BUG: Agent gives up permanently and stops all future check-ins
  5. Agent process continues running but never recovers
  6. Manual intervention required:
    sudo systemctl restart redflag-agent
    

Root Cause Analysis

The agent's check-in loop lacks resilience patterns for handling temporary server failures:

  1. No Retry Logic: Single failure causes permanent stop
  2. No Exponential Backoff: No progressive delay between retry attempts
  3. No Circuit Breaker: No pattern for handling repeated failures
  4. No Connection Health Checks: No pre-request connectivity validation
  5. No Recovery Logging: No visibility into recovery attempts

Current Vulnerable Code Pattern

// Current vulnerable implementation (hypothetical)
func (a *Agent) checkIn() {
    for {
        // Make the check-in request to the server
        resp, err := http.Post(a.serverURL+"/commands", "application/json", nil)
        if err != nil {
            log.Printf("Check-in failed: %v", err)
            return // ❌ Gives up immediately; the loop never runs again
        }
        a.processResponse(resp)
        resp.Body.Close()
        time.Sleep(5 * time.Minute)
    }
}

Proposed Solution

Implement comprehensive resilience patterns:

1. Exponential Backoff Retry

type RetryConfig struct {
    InitialDelay      time.Duration
    MaxDelay          time.Duration
    MaxRetries        int
    BackoffMultiplier float64
}

func (a *Agent) checkInWithRetry() {
    retryConfig := RetryConfig{
        InitialDelay:      5 * time.Second,
        MaxDelay:          5 * time.Minute,
        MaxRetries:        10,
        BackoffMultiplier: 2.0,
    }

    for {
        // withRetry re-attempts the check-in with exponential backoff;
        // backoff state is scoped to each call, so every cycle starts
        // fresh at InitialDelay.
        if err := a.withRetry(a.performCheckIn, retryConfig); err != nil {
            log.Printf("Check-in still failing after %d retries: %v",
                retryConfig.MaxRetries, err)
        }

        time.Sleep(5 * time.Minute) // Normal check-in interval
    }
}
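
The withRetry helper used above is not shown in this ticket; a minimal sketch, assuming the RetryConfig fields defined here (all names are illustrative):

// withRetry runs op, retrying failed attempts with exponential backoff
// capped at cfg.MaxDelay. The first attempt is made immediately; the
// last error is returned if all retries are exhausted.
func (a *Agent) withRetry(op func() error, cfg RetryConfig) error {
    delay := cfg.InitialDelay
    err := op()
    for attempt := 1; err != nil && attempt <= cfg.MaxRetries; attempt++ {
        log.Printf("[RETRY] Check-in failed, retry %d/%d in %s: %v",
            attempt, cfg.MaxRetries, delay, err)
        time.Sleep(delay)
        err = op()
        delay = time.Duration(float64(delay) * cfg.BackoffMultiplier)
        if delay > cfg.MaxDelay {
            delay = cfg.MaxDelay
        }
    }
    return err
}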

2. Circuit Breaker Pattern

type State int

const (
    Closed State = iota
    Open
    HalfOpen
)

var ErrCircuitBreakerOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    State            State
    Failures         int
    FailureThreshold int
    Timeout          time.Duration
    LastFailureTime  time.Time
}

func (cb *CircuitBreaker) Call(operation func() error) error {
    if cb.State == Open {
        if time.Since(cb.LastFailureTime) > cb.Timeout {
            // Timeout elapsed: allow a single probe attempt through
            cb.State = HalfOpen
        } else {
            return ErrCircuitBreakerOpen
        }
    }

    err := operation()
    if err != nil {
        cb.Failures++
        if cb.Failures >= cb.FailureThreshold {
            cb.State = Open
            cb.LastFailureTime = time.Now()
        }
        return err
    }

    // Success: close the circuit and reset the failure count
    cb.Failures = 0
    cb.State = Closed
    return nil
}
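
Hypothetical wiring for the check-in path (performCheckIn is an assumed method; the thresholds mirror the configuration options below):

// Route each check-in attempt through the breaker; when open, skip
// the cycle instead of hammering a struggling server.
cb := &CircuitBreaker{
    FailureThreshold: 5,
    Timeout:          30 * time.Second,
}

if err := cb.Call(a.performCheckIn); errors.Is(err, ErrCircuitBreakerOpen) {
    log.Printf("[CIRCUIT_BREAKER] Circuit open, skipping check-in this cycle")
}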

3. Connection Health Check

func (a *Agent) healthCheck() error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, a.serverURL+"/health", nil)
    if err != nil {
        return fmt.Errorf("building health check request: %w", err)
    }

    resp, err := a.httpClient.Do(req)
    if err != nil {
        return fmt.Errorf("health check failed: %w", err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("health check returned: %d", resp.StatusCode)
    }

    return nil
}
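
One way the probe could feed into recovery (a sketch; tryRecover and performCheckIn are assumed names): validate /health before resuming full check-ins, so the half-open probe stays cheap.

// Hypothetical recovery path: probe /health before resuming check-ins.
func (a *Agent) tryRecover() error {
    if err := a.healthCheck(); err != nil {
        return fmt.Errorf("server still unavailable: %w", err)
    }
    log.Printf("[RECOVERY] Health check passed, resuming normal operations")
    return a.performCheckIn()
}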

Definition of Done

  • Agent automatically retries failed check-ins with exponential backoff
  • Circuit breaker prevents overwhelming struggling server
  • Connection health checks validate server availability before operations
  • Recovery attempts are logged for debugging
  • Agent resumes normal operation when server comes back online
  • Configurable retry parameters for different environments

Test Plan

  1. Basic Recovery Test:

    # Start agent and monitor logs
    sudo journalctl -u redflag-agent -f
    
    # In another terminal, restart server
    docker-compose restart server
    
    # Expected: Agent logs show retry attempts with backoff
    # Expected: Agent resumes check-ins when server recovers
    # Expected: No manual intervention required
    
  2. Extended Failure Test:

    # Stop server for an extended period
    docker-compose stop server
    sleep 60  # long enough for several backoff retries
    
    # Start server
    docker-compose start server
    
    # Expected: Agent detects server recovery and resumes
    # Expected: No manual systemctl restart needed
    
  3. Circuit Breaker Test:

    # Simulate repeated failures
    for i in {1..20}; do
      docker-compose restart server
      sleep 2
    done
    
    # Expected: Circuit breaker opens after threshold
    # Expected: Agent stops trying for configured timeout period
    # Expected: Circuit breaker enters half-open state after timeout
    
  4. Configuration Test:

    # Test with different retry configurations
    # Verify configurable parameters work correctly
    # Test edge cases (max retries = 0, very long delays, etc.)
    

Files to Modify

  • aggregator-agent/cmd/agent/main.go (check-in loop logic)
  • aggregator-agent/internal/resilience/ (new package for retry/circuit breaker)
  • aggregator-agent/internal/health/ (new package for health checks)
  • Agent configuration files for retry parameters

Impact

  • Production Reliability: Enables automatic recovery from server maintenance
  • Operational Efficiency: Eliminates need for manual agent restarts
  • User Experience: Transparent handling of server issues
  • Scalability: Supports large deployments with automatic recovery
  • Monitoring: Provides visibility into recovery attempts

Configuration Options

# Agent config additions
resilience:
  retry:
    initial_delay: 5s
    max_delay: 5m
    max_retries: 10
    backoff_multiplier: 2.0

  circuit_breaker:
    failure_threshold: 5
    timeout: 30s
    half_open_max_calls: 3

  health_check:
    enabled: true
    interval: 30s
    timeout: 5s
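
These options could map onto Go config structs roughly as follows (a sketch; field names are assumptions, and durations are kept as strings to be parsed with time.ParseDuration, since gopkg.in/yaml.v3 has no built-in decoding for values like "5s"):

type ResilienceConfig struct {
    Retry struct {
        InitialDelay      string  `yaml:"initial_delay"` // e.g. "5s"
        MaxDelay          string  `yaml:"max_delay"`     // e.g. "5m"
        MaxRetries        int     `yaml:"max_retries"`
        BackoffMultiplier float64 `yaml:"backoff_multiplier"`
    } `yaml:"retry"`
    CircuitBreaker struct {
        FailureThreshold int    `yaml:"failure_threshold"`
        Timeout          string `yaml:"timeout"` // e.g. "30s"
        HalfOpenMaxCalls int    `yaml:"half_open_max_calls"`
    } `yaml:"circuit_breaker"`
    HealthCheck struct {
        Enabled  bool   `yaml:"enabled"`
        Interval string `yaml:"interval"` // e.g. "30s"
        Timeout  string `yaml:"timeout"`  // e.g. "5s"
    } `yaml:"health_check"`
}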

Monitoring and Observability

Metrics to Track

  • Retry attempt counts
  • Circuit breaker state changes
  • Connection failure rates
  • Recovery time statistics
  • Health check success/failure rates
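
If the agent exposes Prometheus metrics, these could be implemented with counters and a gauge along these lines (a sketch assuming github.com/prometheus/client_golang; metric names are illustrative):

var (
    retryAttempts = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "agent_checkin_retry_attempts_total",
        Help: "Total check-in retry attempts.",
    })
    circuitBreakerState = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "agent_circuit_breaker_state",
        Help: "Circuit breaker state: 0=closed, 1=open, 2=half-open.",
    })
    healthCheckFailures = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "agent_health_check_failures_total",
        Help: "Total failed health checks.",
    })
)

func init() {
    prometheus.MustRegister(retryAttempts, circuitBreakerState, healthCheckFailures)
}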

Log Examples

2025/11/12 14:30:15 [RETRY] Check-in failed, retry 1/10 in 5s: connection refused
2025/11/12 14:30:20 [RETRY] Check-in failed, retry 2/10 in 10s: connection refused
2025/11/12 14:30:35 [CIRCUIT_BREAKER] Opening circuit after 5 consecutive failures
2025/11/12 14:31:05 [CIRCUIT_BREAKER] Entering half-open state
2025/11/12 14:31:05 [RECOVERY] Health check passed, resuming normal operations
2025/11/12 14:31:05 [CHECKIN] Successfully checked in after server recovery

Verification Commands

After implementation:

# Monitor agent during server restart
sudo journalctl -u redflag-agent -f | grep -E "(RETRY|CIRCUIT|RECOVERY|HEALTH)"

# Test recovery without manual intervention
docker-compose stop server
sleep 15
docker-compose start server

# Agent should recover automatically