Add docs and project files - force for Culurien
docs/3_BACKLOG/P0-003_Agent-No-Retry-Logic.md (new file, 278 lines)
# P0-003: Agent No Retry Logic

**Priority:** P0 (Critical)
**Source Reference:** From needsfixingbeforepush.md line 147
**Date Identified:** 2025-11-12

## Problem Description

Agent permanently stops checking in after encountering a server connection failure (502 Bad Gateway or connection refused). No retry logic, exponential backoff, or circuit breaker pattern is implemented, requiring manual agent restart to recover.
## Reproduction Steps

1. Start agent and server normally
2. Trigger server failure/rebuild:
   ```bash
   docker-compose restart server
   # OR rebuild server causing temporary 502 responses
   ```
3. Agent receives connection error during check-in:
   ```
   Post "http://localhost:8080/api/v1/agents/.../commands": dial tcp 127.0.0.1:8080: connect: connection refused
   ```
4. **BUG:** Agent gives up permanently and stops all future check-ins
5. Agent process continues running but never recovers
6. Manual intervention required:
   ```bash
   sudo systemctl restart redflag-agent
   ```
## Root Cause Analysis

The agent's check-in loop lacks resilience patterns for handling temporary server failures:

1. **No Retry Logic:** Single failure causes permanent stop
2. **No Exponential Backoff:** No progressive delay between retry attempts
3. **No Circuit Breaker:** No pattern for handling repeated failures
4. **No Connection Health Checks:** No pre-request connectivity validation
5. **No Recovery Logging:** No visibility into recovery attempts
## Current Vulnerable Code Pattern

```go
// Current vulnerable implementation (hypothetical)
func (a *Agent) checkIn() {
	for {
		// Make request to server
		resp, err := http.Post(serverURL+"/commands", ...)
		if err != nil {
			log.Printf("Check-in failed: %v", err)
			return // ❌ Gives up immediately
		}
		processResponse(resp)
		time.Sleep(5 * time.Minute)
	}
}
```
## Proposed Solution

Implement comprehensive resilience patterns:

### 1. Exponential Backoff Retry

```go
type RetryConfig struct {
	InitialDelay      time.Duration
	MaxDelay          time.Duration
	MaxRetries        int
	BackoffMultiplier float64
}

func (a *Agent) checkInWithRetry() {
	retryConfig := RetryConfig{
		InitialDelay:      5 * time.Second,
		MaxDelay:          5 * time.Minute,
		MaxRetries:        10,
		BackoffMultiplier: 2.0,
	}

	for {
		// withRetry starts from InitialDelay on each call, so no
		// per-iteration reset of the delay is needed here.
		err := a.withRetry(func() error {
			return a.performCheckIn()
		}, retryConfig)
		if err != nil {
			log.Printf("Check-in failed after %d retries: %v", retryConfig.MaxRetries, err)
		}

		time.Sleep(5 * time.Minute) // Normal check-in interval
	}
}
```
### 2. Circuit Breaker Pattern

```go
type CircuitBreaker struct {
	State            State // Closed, Open, HalfOpen
	Failures         int
	FailureThreshold int
	Timeout          time.Duration
	LastFailureTime  time.Time
}

func (cb *CircuitBreaker) Call(operation func() error) error {
	if cb.State == Open {
		if time.Since(cb.LastFailureTime) > cb.Timeout {
			cb.State = HalfOpen
		} else {
			return ErrCircuitBreakerOpen
		}
	}

	err := operation()
	if err != nil {
		cb.Failures++
		if cb.Failures >= cb.FailureThreshold {
			cb.State = Open
			cb.LastFailureTime = time.Now()
		}
		return err
	}

	// Success
	cb.Failures = 0
	cb.State = Closed
	return nil
}
```
### 3. Connection Health Check

```go
func (a *Agent) healthCheck() error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, "GET", a.serverURL+"/health", nil)
	if err != nil {
		return fmt.Errorf("building health check request: %w", err)
	}
	resp, err := a.httpClient.Do(req)
	if err != nil {
		return fmt.Errorf("health check failed: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("health check returned: %d", resp.StatusCode)
	}

	return nil
}
```
## Definition of Done

- [ ] Agent automatically retries failed check-ins with exponential backoff
- [ ] Circuit breaker prevents overwhelming a struggling server
- [ ] Connection health checks validate server availability before operations
- [ ] Recovery attempts are logged for debugging
- [ ] Agent resumes normal operation when server comes back online
- [ ] Configurable retry parameters for different environments
## Test Plan

1. **Basic Recovery Test:**
   ```bash
   # Start agent and monitor logs
   sudo journalctl -u redflag-agent -f

   # In another terminal, restart server
   docker-compose restart server

   # Expected: Agent logs show retry attempts with backoff
   # Expected: Agent resumes check-ins when server recovers
   # Expected: No manual intervention required
   ```

2. **Extended Failure Test:**
   ```bash
   # Stop server for extended period
   docker-compose stop server
   sleep 10  # Agent should try multiple times

   # Start server
   docker-compose start server

   # Expected: Agent detects server recovery and resumes
   # Expected: No manual systemctl restart needed
   ```

3. **Circuit Breaker Test:**
   ```bash
   # Simulate repeated failures
   for i in {1..20}; do
     docker-compose restart server
     sleep 2
   done

   # Expected: Circuit breaker opens after threshold
   # Expected: Agent stops trying for configured timeout period
   # Expected: Circuit breaker enters half-open state after timeout
   ```

4. **Configuration Test:**
   ```bash
   # Test with different retry configurations
   # Verify configurable parameters work correctly
   # Test edge cases (max retries = 0, very long delays, etc.)
   ```
## Files to Modify

- `aggregator-agent/cmd/agent/main.go` (check-in loop logic)
- `aggregator-agent/internal/resilience/` (new package for retry/circuit breaker)
- `aggregator-agent/internal/health/` (new package for health checks)
- Agent configuration files for retry parameters
## Impact

- **Production Reliability:** Enables automatic recovery from server maintenance
- **Operational Efficiency:** Eliminates need for manual agent restarts
- **User Experience:** Transparent handling of server issues
- **Scalability:** Supports large deployments with automatic recovery
- **Monitoring:** Provides visibility into recovery attempts
## Configuration Options

```yaml
# Agent config additions
resilience:
  retry:
    initial_delay: 5s
    max_delay: 5m
    max_retries: 10
    backoff_multiplier: 2.0

  circuit_breaker:
    failure_threshold: 5
    timeout: 30s
    half_open_max_calls: 3

  health_check:
    enabled: true
    interval: 30s
    timeout: 5s
```
## Monitoring and Observability

### Metrics to Track

- Retry attempt counts
- Circuit breaker state changes
- Connection failure rates
- Recovery time statistics
- Health check success/failure rates
### Log Examples

```
2025/11/12 14:30:15 [RETRY] Check-in failed, retry 1/10 in 5s: connection refused
2025/11/12 14:30:20 [RETRY] Check-in failed, retry 2/10 in 10s: connection refused
2025/11/12 14:30:35 [CIRCUIT_BREAKER] Opening circuit after 5 consecutive failures
2025/11/12 14:31:05 [CIRCUIT_BREAKER] Entering half-open state
2025/11/12 14:31:05 [RECOVERY] Health check passed, resuming normal operations
2025/11/12 14:31:05 [CHECKIN] Successfully checked in after server recovery
```
## Verification Commands

After implementation:

```bash
# Monitor agent during server restart
sudo journalctl -u redflag-agent -f | grep -E "(RETRY|CIRCUIT|RECOVERY|HEALTH)"

# Test recovery without manual intervention
docker-compose stop server
sleep 15
docker-compose start server

# Agent should recover automatically
```