P0-003: Agent No Retry Logic
Priority: P0 (Critical)
Source Reference: needsfixingbeforepush.md, line 147
Date Identified: 2025-11-12
Problem Description
Agent permanently stops checking in after encountering a server connection failure (502 Bad Gateway or connection refused). No retry logic, exponential backoff, or circuit breaker pattern is implemented, requiring manual agent restart to recover.
Reproduction Steps
- Start agent and server normally
- Trigger server failure/rebuild:

```shell
docker-compose restart server  # OR rebuild server, causing temporary 502 responses
```

- Agent receives connection error during check-in:

```text
Post "http://localhost:8080/api/v1/agents/.../commands": dial tcp 127.0.0.1:8080: connect: connection refused
```

- BUG: Agent gives up permanently and stops all future check-ins
- Agent process continues running but never recovers
- Manual intervention required:

```shell
sudo systemctl restart redflag-agent
```
Root Cause Analysis
The agent's check-in loop lacks resilience patterns for handling temporary server failures:
- No Retry Logic: Single failure causes permanent stop
- No Exponential Backoff: No progressive delay between retry attempts
- No Circuit Breaker: No pattern for handling repeated failures
- No Connection Health Checks: No pre-request connectivity validation
- No Recovery Logging: No visibility into recovery attempts
Current Vulnerable Code Pattern
```go
// Current vulnerable implementation (hypothetical)
func (a *Agent) checkIn() {
	for {
		// Make request to server
		resp, err := http.Post(serverURL+"/commands", ...)
		if err != nil {
			log.Printf("Check-in failed: %v", err)
			return // ❌ Gives up immediately
		}
		processResponse(resp)
		time.Sleep(5 * time.Minute)
	}
}
```
Proposed Solution
Implement comprehensive resilience patterns:
1. Exponential Backoff Retry
```go
type RetryConfig struct {
	InitialDelay      time.Duration
	MaxDelay          time.Duration
	MaxRetries        int
	BackoffMultiplier float64
}

// withRetry runs op, retrying with exponential backoff until it
// succeeds or MaxRetries attempts have failed.
func (a *Agent) withRetry(op func() error, cfg RetryConfig) error {
	delay := cfg.InitialDelay
	var err error
	for attempt := 1; attempt <= cfg.MaxRetries; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		log.Printf("[RETRY] Check-in failed, retry %d/%d in %s: %v",
			attempt, cfg.MaxRetries, delay, err)
		time.Sleep(delay)
		delay = time.Duration(float64(delay) * cfg.BackoffMultiplier)
		if delay > cfg.MaxDelay {
			delay = cfg.MaxDelay
		}
	}
	return err
}

func (a *Agent) checkInWithRetry() {
	cfg := RetryConfig{
		InitialDelay:      5 * time.Second,
		MaxDelay:          5 * time.Minute,
		MaxRetries:        10,
		BackoffMultiplier: 2.0,
	}
	for {
		// The backoff delay resets to InitialDelay on each new cycle,
		// so a successful check-in clears any accumulated backoff.
		if err := a.withRetry(a.performCheckIn, cfg); err != nil {
			log.Printf("[RETRY] Check-in still failing after %d attempts: %v",
				cfg.MaxRetries, err)
		}
		time.Sleep(5 * time.Minute) // Normal check-in interval
	}
}
```
2. Circuit Breaker Pattern
```go
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

var ErrCircuitBreakerOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
	State            State // Closed, Open, HalfOpen
	Failures         int
	FailureThreshold int
	Timeout          time.Duration
	LastFailureTime  time.Time
}

func (cb *CircuitBreaker) Call(operation func() error) error {
	if cb.State == Open {
		if time.Since(cb.LastFailureTime) > cb.Timeout {
			cb.State = HalfOpen
		} else {
			return ErrCircuitBreakerOpen
		}
	}
	err := operation()
	if err != nil {
		cb.Failures++
		if cb.Failures >= cb.FailureThreshold {
			cb.State = Open
			cb.LastFailureTime = time.Now()
		}
		return err
	}
	// Success: reset the breaker
	cb.Failures = 0
	cb.State = Closed
	return nil
}
```
3. Connection Health Check
```go
func (a *Agent) healthCheck() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, a.serverURL+"/health", nil)
	if err != nil {
		return fmt.Errorf("building health check request: %w", err)
	}
	resp, err := a.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("health check failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health check returned: %d", resp.StatusCode)
	}
	return nil
}
```
Definition of Done
- Agent automatically retries failed check-ins with exponential backoff
- Circuit breaker prevents overwhelming struggling server
- Connection health checks validate server availability before operations
- Recovery attempts are logged for debugging
- Agent resumes normal operation when server comes back online
- Configurable retry parameters for different environments
Test Plan
- Basic Recovery Test:

```shell
# Start agent and monitor logs
sudo journalctl -u redflag-agent -f
# In another terminal, restart server
docker-compose restart server
# Expected: Agent logs show retry attempts with backoff
# Expected: Agent resumes check-ins when server recovers
# Expected: No manual intervention required
```

- Extended Failure Test:

```shell
# Stop server for extended period
docker-compose stop server
sleep 10  # Agent should try multiple times
# Start server
docker-compose start server
# Expected: Agent detects server recovery and resumes
# Expected: No manual systemctl restart needed
```

- Circuit Breaker Test:

```shell
# Simulate repeated failures
for i in {1..20}; do
  docker-compose restart server
  sleep 2
done
# Expected: Circuit breaker opens after threshold
# Expected: Agent stops trying for configured timeout period
# Expected: Circuit breaker enters half-open state after timeout
```

- Configuration Test:

```shell
# Test with different retry configurations
# Verify configurable parameters work correctly
# Test edge cases (max retries = 0, very long delays, etc.)
```
Files to Modify
- aggregator-agent/cmd/agent/main.go (check-in loop logic)
- aggregator-agent/internal/resilience/ (new package for retry/circuit breaker)
- aggregator-agent/internal/health/ (new package for health checks)
- Agent configuration files for retry parameters
Impact
- Production Reliability: Enables automatic recovery from server maintenance
- Operational Efficiency: Eliminates need for manual agent restarts
- User Experience: Transparent handling of server issues
- Scalability: Supports large deployments with automatic recovery
- Monitoring: Provides visibility into recovery attempts
Configuration Options
```yaml
# Agent config additions
resilience:
  retry:
    initial_delay: 5s
    max_delay: 5m
    max_retries: 10
    backoff_multiplier: 2.0
  circuit_breaker:
    failure_threshold: 5
    timeout: 30s
    half_open_max_calls: 3
  health_check:
    enabled: true
    interval: 30s
    timeout: 5s
```
Monitoring and Observability
Metrics to Track
- Retry attempt counts
- Circuit breaker state changes
- Connection failure rates
- Recovery time statistics
- Health check success/failure rates
Log Examples
```text
2025/11/12 14:30:15 [RETRY] Check-in failed, retry 1/10 in 5s: connection refused
2025/11/12 14:30:20 [RETRY] Check-in failed, retry 2/10 in 10s: connection refused
2025/11/12 14:30:35 [CIRCUIT_BREAKER] Opening circuit after 5 consecutive failures
2025/11/12 14:31:05 [CIRCUIT_BREAKER] Entering half-open state
2025/11/12 14:31:05 [RECOVERY] Health check passed, resuming normal operations
2025/11/12 14:31:05 [CHECKIN] Successfully checked in after server recovery
```
Verification Commands
After implementation:
```shell
# Monitor agent during server restart
sudo journalctl -u redflag-agent -f | grep -E "(RETRY|CIRCUIT|RECOVERY|HEALTH)"

# Test recovery without manual intervention
docker-compose stop server
sleep 15
docker-compose start server
# Agent should recover automatically
```