# P0-003: Agent No Retry Logic

**Priority:** P0 (Critical)
**Source Reference:** From needsfixingbeforepush.md line 147
**Date Identified:** 2025-11-12

## Problem Description

The agent permanently stops checking in after encountering a server connection failure (502 Bad Gateway or connection refused). No retry logic, exponential backoff, or circuit breaker pattern is implemented, so a manual agent restart is required to recover.

## Reproduction Steps

1. Start the agent and server normally.
2. Trigger a server failure/rebuild:
   ```bash
   docker-compose restart server
   # OR rebuild the server, causing temporary 502 responses
   ```
3. The agent receives a connection error during check-in:
   ```
   Post "http://localhost:8080/api/v1/agents/.../commands": dial tcp 127.0.0.1:8080: connect: connection refused
   ```
4. **BUG:** The agent gives up permanently and stops all future check-ins.
5. The agent process continues running but never recovers.
6. Manual intervention is required:
   ```bash
   sudo systemctl restart redflag-agent
   ```

## Root Cause Analysis

The agent's check-in loop lacks resilience patterns for handling temporary server failures:

1. **No Retry Logic:** A single failure causes a permanent stop.
2. **No Exponential Backoff:** No progressive delay between retry attempts.
3. **No Circuit Breaker:** No pattern for handling repeated failures.
4. **No Connection Health Checks:** No pre-request connectivity validation.
5. **No Recovery Logging:** No visibility into recovery attempts.

## Current Vulnerable Code Pattern

```go
// Current vulnerable implementation (hypothetical)
func (a *Agent) checkIn() {
	for {
		// Make request to server
		resp, err := http.Post(serverURL+"/commands", ...)
		if err != nil {
			log.Printf("Check-in failed: %v", err)
			return // ❌ Gives up immediately
		}
		processResponse(resp)
		time.Sleep(5 * time.Minute)
	}
}
```

## Proposed Solution

Implement comprehensive resilience patterns:
### 1. Exponential Backoff Retry

```go
type RetryConfig struct {
	InitialDelay      time.Duration
	MaxDelay          time.Duration
	MaxRetries        int
	BackoffMultiplier float64
}

func (a *Agent) checkInWithRetry() {
	retryConfig := RetryConfig{
		InitialDelay:      5 * time.Second,
		MaxDelay:          5 * time.Minute,
		MaxRetries:        10,
		BackoffMultiplier: 2.0,
	}

	for {
		// withRetry applies the backoff internally; each call starts
		// again from InitialDelay, so no per-loop reset is needed.
		err := a.withRetry(func() error {
			return a.performCheckIn()
		}, retryConfig)
		if err != nil {
			log.Printf("Check-in failed after %d attempts: %v", retryConfig.MaxRetries, err)
		}

		time.Sleep(5 * time.Minute) // Normal check-in interval
	}
}
```

### 2. Circuit Breaker Pattern

```go
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

var ErrCircuitBreakerOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
	State            State // Closed, Open, HalfOpen
	Failures         int
	FailureThreshold int
	Timeout          time.Duration
	LastFailureTime  time.Time
}

func (cb *CircuitBreaker) Call(operation func() error) error {
	if cb.State == Open {
		if time.Since(cb.LastFailureTime) > cb.Timeout {
			cb.State = HalfOpen
		} else {
			return ErrCircuitBreakerOpen
		}
	}

	err := operation()
	if err != nil {
		cb.Failures++
		if cb.Failures >= cb.FailureThreshold {
			cb.State = Open
			cb.LastFailureTime = time.Now()
		}
		return err
	}

	// Success: reset the failure count and close the circuit
	cb.Failures = 0
	cb.State = Closed
	return nil
}
```
### 3. Connection Health Check

```go
func (a *Agent) healthCheck() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, "GET", a.serverURL+"/health", nil)
	if err != nil {
		return fmt.Errorf("building health check request: %w", err)
	}

	resp, err := a.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("health check failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health check returned: %d", resp.StatusCode)
	}
	return nil
}
```

## Definition of Done

- [ ] Agent automatically retries failed check-ins with exponential backoff
- [ ] Circuit breaker prevents overwhelming a struggling server
- [ ] Connection health checks validate server availability before operations
- [ ] Recovery attempts are logged for debugging
- [ ] Agent resumes normal operation when the server comes back online
- [ ] Configurable retry parameters for different environments

## Test Plan

1. **Basic Recovery Test:**
   ```bash
   # Start agent and monitor logs
   sudo journalctl -u redflag-agent -f

   # In another terminal, restart server
   docker-compose restart server

   # Expected: Agent logs show retry attempts with backoff
   # Expected: Agent resumes check-ins when server recovers
   # Expected: No manual intervention required
   ```
2. **Extended Failure Test:**
   ```bash
   # Stop server for extended period
   docker-compose stop server
   sleep 10  # Agent should try multiple times

   # Start server
   docker-compose start server

   # Expected: Agent detects server recovery and resumes
   # Expected: No manual systemctl restart needed
   ```
3. **Circuit Breaker Test:**
   ```bash
   # Simulate repeated failures
   for i in {1..20}; do
     docker-compose restart server
     sleep 2
   done

   # Expected: Circuit breaker opens after threshold
   # Expected: Agent stops trying for configured timeout period
   # Expected: Circuit breaker enters half-open state after timeout
   ```
4. **Configuration Test:**
   ```bash
   # Test with different retry configurations
   # Verify configurable parameters work correctly
   # Test edge cases (max retries = 0, very long delays, etc.)
   ```

## Files to Modify

- `aggregator-agent/cmd/agent/main.go` (check-in loop logic)
- `aggregator-agent/internal/resilience/` (new package for retry/circuit breaker)
- `aggregator-agent/internal/health/` (new package for health checks)
- Agent configuration files for retry parameters

## Impact

- **Production Reliability:** Enables automatic recovery from server maintenance
- **Operational Efficiency:** Eliminates the need for manual agent restarts
- **User Experience:** Transparent handling of server issues
- **Scalability:** Supports large deployments with automatic recovery
- **Monitoring:** Provides visibility into recovery attempts

## Configuration Options

```yaml
# Agent config additions
resilience:
  retry:
    initial_delay: 5s
    max_delay: 5m
    max_retries: 10
    backoff_multiplier: 2.0
  circuit_breaker:
    failure_threshold: 5
    timeout: 30s
    half_open_max_calls: 3
  health_check:
    enabled: true
    interval: 30s
    timeout: 5s
```

## Monitoring and Observability

### Metrics to Track

- Retry attempt counts
- Circuit breaker state changes
- Connection failure rates
- Recovery time statistics
- Health check success/failure rates

### Log Examples

```
2025/11/12 14:30:15 [RETRY] Check-in failed, retry 1/10 in 5s: connection refused
2025/11/12 14:30:20 [RETRY] Check-in failed, retry 2/10 in 10s: connection refused
2025/11/12 14:30:35 [CIRCUIT_BREAKER] Opening circuit after 5 consecutive failures
2025/11/12 14:31:05 [CIRCUIT_BREAKER] Entering half-open state
2025/11/12 14:31:05 [RECOVERY] Health check passed, resuming normal operations
2025/11/12 14:31:05 [CHECKIN] Successfully checked in after server recovery
```

## Verification Commands

After implementation:

```bash
# Monitor agent during server restart
sudo journalctl -u redflag-agent -f | grep -E "(RETRY|CIRCUIT|RECOVERY|HEALTH)"

# Test recovery without manual intervention
docker-compose stop server
sleep 15
docker-compose start server
# Agent should recover automatically
```