Add docs and project files - force for Culurien

This commit is contained in:
Fimeg
2026-03-28 20:46:24 -04:00
parent dc61797423
commit 484a7f77ce
343 changed files with 119530 additions and 0 deletions

# P0-003: Agent No Retry Logic
**Priority:** P0 (Critical)
**Source Reference:** From needsfixingbeforepush.md line 147
**Date Identified:** 2025-11-12
## Problem Description
The agent permanently stops checking in after a single server connection failure (502 Bad Gateway or connection refused). No retry logic, exponential backoff, or circuit breaker pattern is implemented, so recovery requires a manual agent restart.
## Reproduction Steps
1. Start agent and server normally
2. Trigger server failure/rebuild:
```bash
docker-compose restart server
# OR rebuild server causing temporary 502 responses
```
3. Agent receives connection error during check-in:
```
Post "http://localhost:8080/api/v1/agents/.../commands": dial tcp 127.0.0.1:8080: connect: connection refused
```
4. **BUG:** Agent gives up permanently and stops all future check-ins
5. Agent process continues running but never recovers
6. Manual intervention required:
```bash
sudo systemctl restart redflag-agent
```
## Root Cause Analysis
The agent's check-in loop lacks resilience patterns for handling temporary server failures:
1. **No Retry Logic:** Single failure causes permanent stop
2. **No Exponential Backoff:** No progressive delay between retry attempts
3. **No Circuit Breaker:** No pattern for handling repeated failures
4. **No Connection Health Checks:** No pre-request connectivity validation
5. **No Recovery Logging:** No visibility into recovery attempts
## Current Vulnerable Code Pattern
```go
// Current vulnerable implementation (hypothetical)
func (a *Agent) checkIn() {
	for {
		// Make request to server
		resp, err := http.Post(serverURL+"/commands", ...)
		if err != nil {
			log.Printf("Check-in failed: %v", err)
			return // ❌ Gives up immediately
		}
		processResponse(resp)
		time.Sleep(5 * time.Minute)
	}
}
```
## Proposed Solution
Implement comprehensive resilience patterns:
### 1. Exponential Backoff Retry
```go
type RetryConfig struct {
	InitialDelay      time.Duration
	MaxDelay          time.Duration
	MaxRetries        int
	BackoffMultiplier float64
}

func (a *Agent) checkInWithRetry() {
	retryConfig := RetryConfig{
		InitialDelay:      5 * time.Second,
		MaxDelay:          5 * time.Minute,
		MaxRetries:        10,
		BackoffMultiplier: 2.0,
	}
	for {
		// withRetry manages the per-attempt delay internally, so the
		// backoff resets automatically on each successful check-in.
		err := a.withRetry(func() error {
			return a.performCheckIn()
		}, retryConfig)
		if err != nil {
			log.Printf("check-in failed after %d retries: %v", retryConfig.MaxRetries, err)
		}
		time.Sleep(5 * time.Minute) // Normal check-in interval
	}
}
```
### 2. Circuit Breaker Pattern
```go
// State models the classic circuit breaker states.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

var ErrCircuitBreakerOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
	State            State
	Failures         int
	FailureThreshold int
	Timeout          time.Duration
	LastFailureTime  time.Time
}

func (cb *CircuitBreaker) Call(operation func() error) error {
	if cb.State == Open {
		if time.Since(cb.LastFailureTime) > cb.Timeout {
			cb.State = HalfOpen
		} else {
			return ErrCircuitBreakerOpen
		}
	}
	err := operation()
	if err != nil {
		cb.Failures++
		if cb.Failures >= cb.FailureThreshold {
			cb.State = Open
			cb.LastFailureTime = time.Now()
		}
		return err
	}
	// Success: reset the failure count and close the circuit
	cb.Failures = 0
	cb.State = Closed
	return nil
}
```
### 3. Connection Health Check
```go
func (a *Agent) healthCheck() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, a.serverURL+"/health", nil)
	if err != nil {
		return fmt.Errorf("building health check request: %w", err)
	}
	resp, err := a.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("health check failed: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health check returned: %d", resp.StatusCode)
	}
	return nil
}
```
## Definition of Done
- [ ] Agent automatically retries failed check-ins with exponential backoff
- [ ] Circuit breaker prevents overwhelming struggling server
- [ ] Connection health checks validate server availability before operations
- [ ] Recovery attempts are logged for debugging
- [ ] Agent resumes normal operation when server comes back online
- [ ] Configurable retry parameters for different environments
## Test Plan
1. **Basic Recovery Test:**
```bash
# Start agent and monitor logs
sudo journalctl -u redflag-agent -f
# In another terminal, restart server
docker-compose restart server
# Expected: Agent logs show retry attempts with backoff
# Expected: Agent resumes check-ins when server recovers
# Expected: No manual intervention required
```
2. **Extended Failure Test:**
```bash
# Stop server for extended period
docker-compose stop server
sleep 10 # Agent should try multiple times
# Start server
docker-compose start server
# Expected: Agent detects server recovery and resumes
# Expected: No manual systemctl restart needed
```
3. **Circuit Breaker Test:**
```bash
# Simulate repeated failures
for i in {1..20}; do
docker-compose restart server
sleep 2
done
# Expected: Circuit breaker opens after threshold
# Expected: Agent stops trying for configured timeout period
# Expected: Circuit breaker enters half-open state after timeout
```
4. **Configuration Test:**
```bash
# Test with different retry configurations
# Verify configurable parameters work correctly
# Test edge cases (max retries = 0, very long delays, etc.)
```
## Files to Modify
- `aggregator-agent/cmd/agent/main.go` (check-in loop logic)
- `aggregator-agent/internal/resilience/` (new package for retry/circuit breaker)
- `aggregator-agent/internal/health/` (new package for health checks)
- Agent configuration files for retry parameters
## Impact
- **Production Reliability:** Enables automatic recovery from server maintenance
- **Operational Efficiency:** Eliminates need for manual agent restarts
- **User Experience:** Transparent handling of server issues
- **Scalability:** Supports large deployments with automatic recovery
- **Monitoring:** Provides visibility into recovery attempts
## Configuration Options
```yaml
# Agent config additions
resilience:
  retry:
    initial_delay: 5s
    max_delay: 5m
    max_retries: 10
    backoff_multiplier: 2.0
  circuit_breaker:
    failure_threshold: 5
    timeout: 30s
    half_open_max_calls: 3
  health_check:
    enabled: true
    interval: 30s
    timeout: 5s
```
## Monitoring and Observability
### Metrics to Track
- Retry attempt counts
- Circuit breaker state changes
- Connection failure rates
- Recovery time statistics
- Health check success/failure rates
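Counters like these could be exposed with Go's standard-library `expvar` package, which publishes registered variables on `/debug/vars` when an HTTP server is running. A minimal sketch (the metric names are assumptions, not existing instrumentation):

```go
package main

import "expvar"

// Hypothetical resilience counters; names are illustrative.
var (
	retryAttempts    = expvar.NewInt("agent_retry_attempts_total")
	breakerOpens     = expvar.NewInt("agent_circuit_breaker_opens_total")
	healthCheckFails = expvar.NewInt("agent_health_check_failures_total")
)

// recordRetry bumps the retry counter; call sites would live inside
// the retry loop proposed earlier.
func recordRetry() {
	retryAttempts.Add(1)
}
```

Using `expvar` avoids adding a metrics dependency to the agent, though a Prometheus client would be the natural swap-in for larger deployments.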
### Log Examples
```
2025/11/12 14:30:15 [RETRY] Check-in failed, retry 1/10 in 5s: connection refused
2025/11/12 14:30:20 [RETRY] Check-in failed, retry 2/10 in 10s: connection refused
2025/11/12 14:30:35 [CIRCUIT_BREAKER] Opening circuit after 5 consecutive failures
2025/11/12 14:31:05 [CIRCUIT_BREAKER] Entering half-open state
2025/11/12 14:31:05 [RECOVERY] Health check passed, resuming normal operations
2025/11/12 14:31:05 [CHECKIN] Successfully checked in after server recovery
```
## Verification Commands
After implementation:
```bash
# Monitor agent during server restart
sudo journalctl -u redflag-agent -f | grep -E "(RETRY|CIRCUIT|RECOVERY|HEALTH)"
# Test recovery without manual intervention
docker-compose stop server
sleep 15
docker-compose start server
# Agent should recover automatically
```