
P0-003: Agent No Retry Logic

Priority: P0 (Critical)
Source Reference: needsfixingbeforepush.md line 147
Date Identified: 2025-11-12

Problem Description

The agent permanently stops checking in after encountering a single server connection failure (502 Bad Gateway or connection refused). No retry logic, exponential backoff, or circuit breaker pattern is implemented, so recovery requires a manual agent restart.

Reproduction Steps

  1. Start agent and server normally
  2. Trigger server failure/rebuild:
    docker-compose restart server
    # OR rebuild server causing temporary 502 responses
    
  3. Agent receives connection error during check-in:
    Post "http://localhost:8080/api/v1/agents/.../commands": dial tcp 127.0.0.1:8080: connect: connection refused
    
  4. BUG: Agent gives up permanently and stops all future check-ins
  5. Agent process continues running but never recovers
  6. Manual intervention required:
    sudo systemctl restart redflag-agent
    

Root Cause Analysis

The agent's check-in loop lacks resilience patterns for handling temporary server failures:

  1. No Retry Logic: Single failure causes permanent stop
  2. No Exponential Backoff: No progressive delay between retry attempts
  3. No Circuit Breaker: No pattern for handling repeated failures
  4. No Connection Health Checks: No pre-request connectivity validation
  5. No Recovery Logging: No visibility into recovery attempts

Current Vulnerable Code Pattern

// Current vulnerable implementation (hypothetical)
func (a *Agent) checkIn() {
    for {
        // Make the check-in request to the server
        resp, err := http.Post(a.serverURL+"/commands", "application/json", nil)
        if err != nil {
            log.Printf("Check-in failed: %v", err)
            return // ❌ Gives up immediately; the loop never runs again
        }
        a.processResponse(resp)
        resp.Body.Close()
        time.Sleep(5 * time.Minute)
    }
}

Proposed Solution

Implement comprehensive resilience patterns:

1. Exponential Backoff Retry

type RetryConfig struct {
    InitialDelay      time.Duration
    MaxDelay          time.Duration
    MaxRetries        int
    BackoffMultiplier float64
}

func (a *Agent) checkInWithRetry() {
    retryConfig := RetryConfig{
        InitialDelay:      5 * time.Second,
        MaxDelay:          5 * time.Minute,
        MaxRetries:        10,
        BackoffMultiplier: 2.0,
    }

    for {
        // withRetry re-attempts the check-in with exponential backoff;
        // backoff state is scoped to each call, so every cycle starts
        // fresh at InitialDelay.
        if err := a.withRetry(a.performCheckIn, retryConfig); err != nil {
            log.Printf("Check-in still failing after %d retries: %v",
                retryConfig.MaxRetries, err)
        }

        time.Sleep(5 * time.Minute) // Normal check-in interval
    }
}
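
The withRetry helper used above is not shown in this ticket; a minimal sketch, assuming the RetryConfig fields defined here (all names are illustrative):

// withRetry runs op, retrying failed attempts with exponential backoff
// capped at cfg.MaxDelay. The first attempt is made immediately; the
// last error is returned if all retries are exhausted.
func (a *Agent) withRetry(op func() error, cfg RetryConfig) error {
    delay := cfg.InitialDelay
    err := op()
    for attempt := 1; err != nil && attempt <= cfg.MaxRetries; attempt++ {
        log.Printf("[RETRY] Check-in failed, retry %d/%d in %s: %v",
            attempt, cfg.MaxRetries, delay, err)
        time.Sleep(delay)
        err = op()
        delay = time.Duration(float64(delay) * cfg.BackoffMultiplier)
        if delay > cfg.MaxDelay {
            delay = cfg.MaxDelay
        }
    }
    return err
}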

2. Circuit Breaker Pattern

type State int

const (
    Closed State = iota
    Open
    HalfOpen
)

var ErrCircuitBreakerOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    State            State
    Failures         int
    FailureThreshold int
    Timeout          time.Duration
    LastFailureTime  time.Time
}

func (cb *CircuitBreaker) Call(operation func() error) error {
    if cb.State == Open {
        if time.Since(cb.LastFailureTime) > cb.Timeout {
            // Timeout elapsed: allow a single probe attempt through
            cb.State = HalfOpen
        } else {
            return ErrCircuitBreakerOpen
        }
    }

    err := operation()
    if err != nil {
        cb.Failures++
        if cb.Failures >= cb.FailureThreshold {
            cb.State = Open
            cb.LastFailureTime = time.Now()
        }
        return err
    }

    // Success: close the circuit and reset the failure count
    cb.Failures = 0
    cb.State = Closed
    return nil
}
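
Hypothetical wiring for the check-in path (performCheckIn is an assumed method; the thresholds mirror the configuration options below):

// Route each check-in attempt through the breaker; when open, skip
// the cycle instead of hammering a struggling server.
cb := &CircuitBreaker{
    FailureThreshold: 5,
    Timeout:          30 * time.Second,
}

if err := cb.Call(a.performCheckIn); errors.Is(err, ErrCircuitBreakerOpen) {
    log.Printf("[CIRCUIT_BREAKER] Circuit open, skipping check-in this cycle")
}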

3. Connection Health Check

func (a *Agent) healthCheck() error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, a.serverURL+"/health", nil)
    if err != nil {
        return fmt.Errorf("building health check request: %w", err)
    }

    resp, err := a.httpClient.Do(req)
    if err != nil {
        return fmt.Errorf("health check failed: %w", err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("health check returned: %d", resp.StatusCode)
    }

    return nil
}
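
One way the probe could feed into recovery (a sketch; tryRecover and performCheckIn are assumed names): validate /health before resuming full check-ins, so the half-open probe stays cheap.

// Hypothetical recovery path: probe /health before resuming check-ins.
func (a *Agent) tryRecover() error {
    if err := a.healthCheck(); err != nil {
        return fmt.Errorf("server still unavailable: %w", err)
    }
    log.Printf("[RECOVERY] Health check passed, resuming normal operations")
    return a.performCheckIn()
}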

Definition of Done

  • Agent automatically retries failed check-ins with exponential backoff
  • Circuit breaker prevents overwhelming struggling server
  • Connection health checks validate server availability before operations
  • Recovery attempts are logged for debugging
  • Agent resumes normal operation when server comes back online
  • Configurable retry parameters for different environments

Test Plan

  1. Basic Recovery Test:

    # Start agent and monitor logs
    sudo journalctl -u redflag-agent -f
    
    # In another terminal, restart server
    docker-compose restart server
    
    # Expected: Agent logs show retry attempts with backoff
    # Expected: Agent resumes check-ins when server recovers
    # Expected: No manual intervention required
    
  2. Extended Failure Test:

    # Stop server for an extended period
    docker-compose stop server
    sleep 60  # long enough for several backoff retries
    
    # Start server
    docker-compose start server
    
    # Expected: Agent detects server recovery and resumes
    # Expected: No manual systemctl restart needed
    
  3. Circuit Breaker Test:

    # Simulate repeated failures
    for i in {1..20}; do
      docker-compose restart server
      sleep 2
    done
    
    # Expected: Circuit breaker opens after threshold
    # Expected: Agent stops trying for configured timeout period
    # Expected: Circuit breaker enters half-open state after timeout
    
  4. Configuration Test:

    # Test with different retry configurations
    # Verify configurable parameters work correctly
    # Test edge cases (max retries = 0, very long delays, etc.)
    

Files to Modify

  • aggregator-agent/cmd/agent/main.go (check-in loop logic)
  • aggregator-agent/internal/resilience/ (new package for retry/circuit breaker)
  • aggregator-agent/internal/health/ (new package for health checks)
  • Agent configuration files for retry parameters

Impact

  • Production Reliability: Enables automatic recovery from server maintenance
  • Operational Efficiency: Eliminates need for manual agent restarts
  • User Experience: Transparent handling of server issues
  • Scalability: Supports large deployments with automatic recovery
  • Monitoring: Provides visibility into recovery attempts

Configuration Options

# Agent config additions
resilience:
  retry:
    initial_delay: 5s
    max_delay: 5m
    max_retries: 10
    backoff_multiplier: 2.0

  circuit_breaker:
    failure_threshold: 5
    timeout: 30s
    half_open_max_calls: 3

  health_check:
    enabled: true
    interval: 30s
    timeout: 5s
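
These options could map onto Go config structs roughly as follows (a sketch; field names are assumptions, and durations are kept as strings to be parsed with time.ParseDuration, since gopkg.in/yaml.v3 has no built-in decoding for values like "5s"):

type ResilienceConfig struct {
    Retry struct {
        InitialDelay      string  `yaml:"initial_delay"` // e.g. "5s"
        MaxDelay          string  `yaml:"max_delay"`     // e.g. "5m"
        MaxRetries        int     `yaml:"max_retries"`
        BackoffMultiplier float64 `yaml:"backoff_multiplier"`
    } `yaml:"retry"`
    CircuitBreaker struct {
        FailureThreshold int    `yaml:"failure_threshold"`
        Timeout          string `yaml:"timeout"` // e.g. "30s"
        HalfOpenMaxCalls int    `yaml:"half_open_max_calls"`
    } `yaml:"circuit_breaker"`
    HealthCheck struct {
        Enabled  bool   `yaml:"enabled"`
        Interval string `yaml:"interval"` // e.g. "30s"
        Timeout  string `yaml:"timeout"`  // e.g. "5s"
    } `yaml:"health_check"`
}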

Monitoring and Observability

Metrics to Track

  • Retry attempt counts
  • Circuit breaker state changes
  • Connection failure rates
  • Recovery time statistics
  • Health check success/failure rates
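
If the agent exposes Prometheus metrics, these could be implemented with counters and a gauge along these lines (a sketch assuming github.com/prometheus/client_golang; metric names are illustrative):

var (
    retryAttempts = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "agent_checkin_retry_attempts_total",
        Help: "Total check-in retry attempts.",
    })
    circuitBreakerState = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "agent_circuit_breaker_state",
        Help: "Circuit breaker state: 0=closed, 1=open, 2=half-open.",
    })
    healthCheckFailures = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "agent_health_check_failures_total",
        Help: "Total failed health checks.",
    })
)

func init() {
    prometheus.MustRegister(retryAttempts, circuitBreakerState, healthCheckFailures)
}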

Log Examples

2025/11/12 14:30:15 [RETRY] Check-in failed, retry 1/10 in 5s: connection refused
2025/11/12 14:30:20 [RETRY] Check-in failed, retry 2/10 in 10s: connection refused
2025/11/12 14:30:35 [CIRCUIT_BREAKER] Opening circuit after 5 consecutive failures
2025/11/12 14:31:05 [CIRCUIT_BREAKER] Entering half-open state
2025/11/12 14:31:05 [RECOVERY] Health check passed, resuming normal operations
2025/11/12 14:31:05 [CHECKIN] Successfully checked in after server recovery

Verification Commands

After implementation:

# Monitor agent during server restart
sudo journalctl -u redflag-agent -f | grep -E "(RETRY|CIRCUIT|RECOVERY|HEALTH)"

# Test recovery without manual intervention
docker-compose stop server
sleep 15
docker-compose start server

# Agent should recover automatically