Add docs and project files - force for Culurien
docs/3_BACKLOG/P0-003_Agent-No-Retry-Logic.md (new file, 278 lines)
# P0-003: Agent No Retry Logic

**Priority:** P0 (Critical)
**Source Reference:** From needsfixingbeforepush.md line 147
**Date Identified:** 2025-11-12

## Problem Description

Agent permanently stops checking in after encountering a server connection failure (502 Bad Gateway or connection refused). No retry logic, exponential backoff, or circuit breaker pattern is implemented, requiring manual agent restart to recover.
## Reproduction Steps

1. Start agent and server normally
2. Trigger server failure/rebuild:
   ```bash
   docker-compose restart server
   # OR rebuild server causing temporary 502 responses
   ```
3. Agent receives connection error during check-in:
   ```
   Post "http://localhost:8080/api/v1/agents/.../commands": dial tcp 127.0.0.1:8080: connect: connection refused
   ```
4. **BUG:** Agent gives up permanently and stops all future check-ins
5. Agent process continues running but never recovers
6. Manual intervention required:
   ```bash
   sudo systemctl restart redflag-agent
   ```
## Root Cause Analysis

The agent's check-in loop lacks resilience patterns for handling temporary server failures:

1. **No Retry Logic:** Single failure causes permanent stop
2. **No Exponential Backoff:** No progressive delay between retry attempts
3. **No Circuit Breaker:** No pattern for handling repeated failures
4. **No Connection Health Checks:** No pre-request connectivity validation
5. **No Recovery Logging:** No visibility into recovery attempts
## Current Vulnerable Code Pattern

```go
// Current vulnerable implementation (hypothetical)
func (a *Agent) checkIn() {
	for {
		// Make request to server
		resp, err := http.Post(serverURL+"/commands", ...)
		if err != nil {
			log.Printf("Check-in failed: %v", err)
			return // ❌ Gives up immediately
		}
		processResponse(resp)
		time.Sleep(5 * time.Minute)
	}
}
```
## Proposed Solution

Implement comprehensive resilience patterns:

### 1. Exponential Backoff Retry

```go
type RetryConfig struct {
	InitialDelay      time.Duration
	MaxDelay          time.Duration
	MaxRetries        int
	BackoffMultiplier float64
}

func (a *Agent) checkInWithRetry() {
	retryConfig := RetryConfig{
		InitialDelay:      5 * time.Second,
		MaxDelay:          5 * time.Minute,
		MaxRetries:        10,
		BackoffMultiplier: 2.0,
	}

	for {
		// withRetry starts from InitialDelay on each call, so no
		// per-iteration reset of the delay is needed here.
		err := a.withRetry(func() error {
			return a.performCheckIn()
		}, retryConfig)
		if err != nil {
			log.Printf("Check-in failed after %d retries: %v", retryConfig.MaxRetries, err)
		}

		time.Sleep(5 * time.Minute) // Normal check-in interval
	}
}
```
### 2. Circuit Breaker Pattern

```go
type CircuitBreaker struct {
	State            State // Closed, Open, HalfOpen
	Failures         int
	FailureThreshold int
	Timeout          time.Duration
	LastFailureTime  time.Time
}

func (cb *CircuitBreaker) Call(operation func() error) error {
	if cb.State == Open {
		if time.Since(cb.LastFailureTime) > cb.Timeout {
			cb.State = HalfOpen
		} else {
			return ErrCircuitBreakerOpen
		}
	}

	err := operation()
	if err != nil {
		cb.Failures++
		if cb.Failures >= cb.FailureThreshold {
			cb.State = Open
			cb.LastFailureTime = time.Now()
		}
		return err
	}

	// Success
	cb.Failures = 0
	cb.State = Closed
	return nil
}
```
### 3. Connection Health Check

```go
func (a *Agent) healthCheck() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, "GET", a.serverURL+"/health", nil)
	if err != nil {
		return fmt.Errorf("building health check request: %w", err)
	}
	resp, err := a.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("health check failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health check returned: %d", resp.StatusCode)
	}

	return nil
}
```
## Definition of Done

- [ ] Agent automatically retries failed check-ins with exponential backoff
- [ ] Circuit breaker prevents overwhelming a struggling server
- [ ] Connection health checks validate server availability before operations
- [ ] Recovery attempts are logged for debugging
- [ ] Agent resumes normal operation when server comes back online
- [ ] Configurable retry parameters for different environments
## Test Plan

1. **Basic Recovery Test:**
   ```bash
   # Start agent and monitor logs
   sudo journalctl -u redflag-agent -f

   # In another terminal, restart server
   docker-compose restart server

   # Expected: Agent logs show retry attempts with backoff
   # Expected: Agent resumes check-ins when server recovers
   # Expected: No manual intervention required
   ```

2. **Extended Failure Test:**
   ```bash
   # Stop server for extended period
   docker-compose stop server
   sleep 10  # Agent should try multiple times

   # Start server
   docker-compose start server

   # Expected: Agent detects server recovery and resumes
   # Expected: No manual systemctl restart needed
   ```

3. **Circuit Breaker Test:**
   ```bash
   # Simulate repeated failures
   for i in {1..20}; do
     docker-compose restart server
     sleep 2
   done

   # Expected: Circuit breaker opens after threshold
   # Expected: Agent stops trying for configured timeout period
   # Expected: Circuit breaker enters half-open state after timeout
   ```

4. **Configuration Test:**
   ```bash
   # Test with different retry configurations
   # Verify configurable parameters work correctly
   # Test edge cases (max retries = 0, very long delays, etc.)
   ```
## Files to Modify

- `aggregator-agent/cmd/agent/main.go` (check-in loop logic)
- `aggregator-agent/internal/resilience/` (new package for retry/circuit breaker)
- `aggregator-agent/internal/health/` (new package for health checks)
- Agent configuration files for retry parameters
## Impact

- **Production Reliability:** Enables automatic recovery from server maintenance
- **Operational Efficiency:** Eliminates need for manual agent restarts
- **User Experience:** Transparent handling of server issues
- **Scalability:** Supports large deployments with automatic recovery
- **Monitoring:** Provides visibility into recovery attempts
## Configuration Options

```yaml
# Agent config additions
resilience:
  retry:
    initial_delay: 5s
    max_delay: 5m
    max_retries: 10
    backoff_multiplier: 2.0

  circuit_breaker:
    failure_threshold: 5
    timeout: 30s
    half_open_max_calls: 3

  health_check:
    enabled: true
    interval: 30s
    timeout: 5s
```
## Monitoring and Observability

### Metrics to Track

- Retry attempt counts
- Circuit breaker state changes
- Connection failure rates
- Recovery time statistics
- Health check success/failure rates
### Log Examples

```
2025/11/12 14:30:15 [RETRY] Check-in failed, retry 1/10 in 5s: connection refused
2025/11/12 14:30:20 [RETRY] Check-in failed, retry 2/10 in 10s: connection refused
2025/11/12 14:30:35 [CIRCUIT_BREAKER] Opening circuit after 5 consecutive failures
2025/11/12 14:31:05 [CIRCUIT_BREAKER] Entering half-open state
2025/11/12 14:31:05 [RECOVERY] Health check passed, resuming normal operations
2025/11/12 14:31:05 [CHECKIN] Successfully checked in after server recovery
```
## Verification Commands

After implementation:

```bash
# Monitor agent during server restart
sudo journalctl -u redflag-agent -f | grep -E "(RETRY|CIRCUIT|RECOVERY|HEALTH)"

# Test recovery without manual intervention
docker-compose stop server
sleep 15
docker-compose start server

# Agent should recover automatically
```