
P4-001: Agent Retry Logic and Resilience Implementation

Priority: P4 (Technical Debt)
Source Reference: From analysis of needsfixingbeforepush.md, lines 147-186
Date Identified: 2025-11-12

Problem Description

The agent has no resilience to server connectivity failures. When the server becomes unavailable (502 Bad Gateway, connection refused, network outage), the agent permanently stops checking in and requires a manual restart. This violates basic distributed-system expectations and blocks production deployment.

Impact

  • Production Blocking: Server maintenance/upgrades break all agents permanently
  • Operational Burden: Manual systemctl restart required on every agent after server issues
  • Reliability Violation: No automatic recovery from transient failures
  • Distributed System Anti-Pattern: Clients should handle server unavailability gracefully

Current Behavior

  1. Server rebuild/maintenance causes 502 responses
  2. Agent receives connection error during check-in
  3. Agent gives up permanently and stops all future check-ins
  4. Agent process continues running but never recovers
  5. Manual intervention required: sudo systemctl restart redflag-agent

Proposed Solution

Implement comprehensive retry logic with exponential backoff:

1. Connection Retry Logic

type RetryConfig struct {
    MaxRetries    int           // proposed default: 5
    InitialDelay  time.Duration // proposed default: 5 * time.Second
    MaxDelay      time.Duration // proposed default: 5 * time.Minute
    BackoffFactor float64       // proposed default: 2.0
}

func (a *Agent) checkInWithRetry() error {
    var lastErr error
    delay := a.retryConfig.InitialDelay

    for attempt := 0; attempt < a.retryConfig.MaxRetries; attempt++ {
        err := a.performCheckIn()
        if err == nil {
            return nil
        }

        lastErr = err

        // Retry on server errors, fail fast on client errors
        if isClientError(err) {
            return fmt.Errorf("client error: %w", err)
        }

        log.Printf("[Connection] Server unavailable (attempt %d/%d), retrying in %v: %v",
            attempt+1, a.retryConfig.MaxRetries, delay, err)

        time.Sleep(delay)
        delay = time.Duration(float64(delay) * a.retryConfig.BackoffFactor)
        if delay > a.retryConfig.MaxDelay {
            delay = a.retryConfig.MaxDelay
        }
    }

    return fmt.Errorf("server unavailable after %d attempts: %w", a.retryConfig.MaxRetries, lastErr)
}

2. Circuit Breaker Pattern

type CircuitBreaker struct {
    State         State // Closed, Open, HalfOpen
    Failures      int
    LastFailTime  time.Time
    Timeout       time.Duration
    Threshold     int
}

func (cb *CircuitBreaker) Call(fn func() error) error {
    if cb.State == Open {
        if time.Since(cb.LastFailTime) > cb.Timeout {
            cb.State = HalfOpen
        } else {
            return errors.New("circuit breaker is open")
        }
    }

    err := fn()
    if err != nil {
        cb.Failures++
        cb.LastFailTime = time.Now()
        if cb.Failures >= cb.Threshold {
            cb.State = Open
        }
        return err
    }

    // Success: reset breaker
    cb.Failures = 0
    cb.State = Closed
    return nil
}
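The State type referenced above (Closed, Open, HalfOpen) is not defined in this ticket; a minimal sketch using iota constants, with a String method for readable log output:

```go
package main

import "fmt"

// State models the three circuit breaker states.
type State int

const (
	Closed   State = iota // requests flow normally
	Open                  // requests are rejected immediately
	HalfOpen              // a single trial request is allowed through
)

// String makes states readable in log lines.
func (s State) String() string {
	switch s {
	case Closed:
		return "closed"
	case Open:
		return "open"
	case HalfOpen:
		return "half-open"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(Closed, Open, HalfOpen) // closed open half-open
}
```

Note the sketch above is not goroutine-safe; the real implementation should guard state transitions with a mutex if Call can be invoked concurrently.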

3. Connection Health Check

func (a *Agent) healthCheck() error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, a.serverURL+"/health", nil)
    if err != nil {
        return err
    }
    resp, err := a.httpClient.Do(req)
    if err != nil {
        return err
    }
    defer resp.Body.Close()

    if resp.StatusCode == http.StatusOK {
        return nil
    }
    return fmt.Errorf("health check failed: HTTP %d", resp.StatusCode)
}

Definition of Done

  • Agent implements exponential backoff retry for server connection failures
  • Circuit breaker pattern prevents cascading failures
  • Connection health checks detect server availability before operations
  • Agent recovers automatically after server comes back online
  • Detailed logging for troubleshooting connection issues
  • Retry configuration is tunable via agent config
  • Integration tests verify recovery scenarios

Implementation Details

File Locations

  • Primary: aggregator-agent/cmd/agent/main.go (check-in loop)
  • Supporting: aggregator-agent/internal/resilience/ (new package)
  • Config: aggregator-agent/internal/config/config.go

Configuration Options

{
  "resilience": {
    "max_retries": 5,
    "initial_delay_seconds": 5,
    "max_delay_minutes": 5,
    "backoff_factor": 2.0,
    "circuit_breaker_threshold": 3,
    "circuit_breaker_timeout_minutes": 5
  }
}

Error Classification

func isClientError(err error) bool {
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        return httpErr.StatusCode >= 400 && httpErr.StatusCode < 500
    }
    return false
}

func isServerError(err error) bool {
    var httpErr *HTTPError
    if errors.As(err, &httpErr) {
        return httpErr.StatusCode >= 500
    }
    // Transport-level failures (no HTTP response at all) count as server-side.
    return strings.Contains(err.Error(), "connection refused")
}
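The HTTPError type these helpers match on is not defined in this ticket; a minimal sketch of what it might look like (the real agent type may differ):

```go
package main

import "fmt"

// HTTPError carries the HTTP status of a failed request so callers
// can classify it as a client (4xx) or server (5xx) failure.
type HTTPError struct {
	StatusCode int
	Status     string
}

// Error satisfies the error interface.
func (e *HTTPError) Error() string {
	return fmt.Sprintf("server returned HTTP %d %s", e.StatusCode, e.Status)
}

func main() {
	err := &HTTPError{StatusCode: 502, Status: "Bad Gateway"}
	fmt.Println(err) // server returned HTTP 502 Bad Gateway
}
```

The check-in code should wrap this with %w so classification via errors.As still works through wrapped errors.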

Testing Strategy

Unit Tests

  • Retry logic with exponential backoff
  • Circuit breaker state transitions
  • Health check timeout handling
  • Error classification accuracy

Integration Tests

  • Server restart recovery scenarios
  • Network partition simulation
  • Long-running stability tests
  • Configuration validation

Manual Test Scenarios

  1. Server Restart Test:

    # Start with server running
    docker-compose up -d
    
    # Verify agent checking in
    journalctl -u redflag-agent -f
    
    # Restart server
    docker-compose restart server
    
    # Verify agent recovers without manual intervention
    
  2. Extended Downtime Test:

    # Stop server for 10 minutes
    docker-compose stop server
    sleep 600
    
    # Start server
    docker-compose start server
    
    # Verify agent resumes check-ins
    
  3. Network Partition Test:

    # Block network access temporarily (requires root)
    sudo iptables -A OUTPUT -d localhost -p tcp --dport 8080 -j DROP
    sleep 300
    
    # Remove block
    sudo iptables -D OUTPUT -d localhost -p tcp --dport 8080 -j DROP
    
    # Verify agent recovers
    

Prerequisites

  • Circuit breaker pattern implementation exists (aggregator-agent/internal/circuitbreaker/)
  • HTTP client configuration supports timeouts
  • Logging infrastructure supports structured output

Effort Estimate

Complexity: Medium-High
Effort: 2-3 days

  • Day 1: Retry logic implementation and basic testing
  • Day 2: Circuit breaker integration and configuration
  • Day 3: Integration testing and error handling refinement

Success Metrics

  • Agent uptime >99.9% during server maintenance windows
  • Zero manual interventions required for server restarts
  • Recovery time <30 seconds after server becomes available
  • Clear error logs for troubleshooting
  • No memory leaks in retry logic