# P0-003: Agent No Retry Logic

**Priority:** P0 (Critical)

**Source Reference:** From needsfixingbeforepush.md line 147

**Date Identified:** 2025-11-12

## Problem Description

Agent permanently stops checking in after encountering a server connection failure (502 Bad Gateway or connection refused). No retry logic, exponential backoff, or circuit breaker pattern is implemented, requiring manual agent restart to recover.

## Reproduction Steps

1. Start agent and server normally
2. Trigger server failure/rebuild:

   ```bash
   docker-compose restart server
   # OR rebuild server causing temporary 502 responses
   ```

3. Agent receives connection error during check-in:

   ```
   Post "http://localhost:8080/api/v1/agents/.../commands": dial tcp 127.0.0.1:8080: connect: connection refused
   ```

4. **BUG:** Agent gives up permanently and stops all future check-ins
5. Agent process continues running but never recovers
6. Manual intervention required:

   ```bash
   sudo systemctl restart redflag-agent
   ```

## Root Cause Analysis

The agent's check-in loop lacks resilience patterns for handling temporary server failures:

1. **No Retry Logic:** A single failure causes a permanent stop
2. **No Exponential Backoff:** No progressive delay between retry attempts
3. **No Circuit Breaker:** No pattern for handling repeated failures
4. **No Connection Health Checks:** No pre-request connectivity validation
5. **No Recovery Logging:** No visibility into recovery attempts

## Current Vulnerable Code Pattern

```go
// Current vulnerable implementation (hypothetical)
func (a *Agent) checkIn() {
    for {
        // Make request to server
        resp, err := http.Post(serverURL+"/commands", ...)
        if err != nil {
            log.Printf("Check-in failed: %v", err)
            return // ❌ Gives up immediately
        }
        processResponse(resp)
        time.Sleep(5 * time.Minute)
    }
}
```

## Proposed Solution

Implement comprehensive resilience patterns:

### 1. Exponential Backoff Retry

```go
type RetryConfig struct {
    InitialDelay      time.Duration
    MaxDelay          time.Duration
    MaxRetries        int
    BackoffMultiplier float64
}

func (a *Agent) checkInWithRetry() {
    retryConfig := RetryConfig{
        InitialDelay:      5 * time.Second,
        MaxDelay:          5 * time.Minute,
        MaxRetries:        10,
        BackoffMultiplier: 2.0,
    }

    for {
        err := a.withRetry(func() error {
            return a.performCheckIn()
        }, retryConfig)
        if err != nil {
            // Retries exhausted: log and wait for the next cycle
            // instead of exiting the loop.
            log.Printf("Check-in failed after %d retries: %v", retryConfig.MaxRetries, err)
        }
        // withRetry starts from InitialDelay on every call, so a
        // successful check-in implicitly resets the backoff.

        time.Sleep(5 * time.Minute) // Normal check-in interval
    }
}
```

### 2. Circuit Breaker Pattern

```go
// Circuit breaker states
type State int

const (
    Closed State = iota
    Open
    HalfOpen
)

var ErrCircuitBreakerOpen = errors.New("circuit breaker is open")

type CircuitBreaker struct {
    State            State // Closed, Open, HalfOpen
    Failures         int
    FailureThreshold int
    Timeout          time.Duration
    LastFailureTime  time.Time
}

func (cb *CircuitBreaker) Call(operation func() error) error {
    if cb.State == Open {
        if time.Since(cb.LastFailureTime) > cb.Timeout {
            cb.State = HalfOpen
        } else {
            return ErrCircuitBreakerOpen
        }
    }

    err := operation()
    if err != nil {
        cb.Failures++
        if cb.Failures >= cb.FailureThreshold {
            cb.State = Open
            cb.LastFailureTime = time.Now()
        }
        return err
    }

    // Success: reset the failure count and close the circuit
    cb.Failures = 0
    cb.State = Closed
    return nil
}
```

### 3. Connection Health Check

```go
func (a *Agent) healthCheck() error {
    ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
    defer cancel()

    req, err := http.NewRequestWithContext(ctx, http.MethodGet, a.serverURL+"/health", nil)
    if err != nil {
        return fmt.Errorf("building health check request: %w", err)
    }

    resp, err := a.httpClient.Do(req)
    if err != nil {
        return fmt.Errorf("health check failed: %w", err)
    }
    defer resp.Body.Close()

    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("health check returned: %d", resp.StatusCode)
    }

    return nil
}
```

## Definition of Done

- [ ] Agent automatically retries failed check-ins with exponential backoff
- [ ] Circuit breaker prevents overwhelming a struggling server
- [ ] Connection health checks validate server availability before operations
- [ ] Recovery attempts are logged for debugging
- [ ] Agent resumes normal operation when the server comes back online
- [ ] Retry parameters are configurable for different environments

## Test Plan

1. **Basic Recovery Test:**

   ```bash
   # Start agent and monitor logs
   sudo journalctl -u redflag-agent -f

   # In another terminal, restart server
   docker-compose restart server

   # Expected: Agent logs show retry attempts with backoff
   # Expected: Agent resumes check-ins when server recovers
   # Expected: No manual intervention required
   ```

2. **Extended Failure Test:**

   ```bash
   # Stop server for extended period
   docker-compose stop server
   sleep 10  # Agent should try multiple times

   # Start server
   docker-compose start server

   # Expected: Agent detects server recovery and resumes
   # Expected: No manual systemctl restart needed
   ```

3. **Circuit Breaker Test:**

   ```bash
   # Simulate repeated failures
   for i in {1..20}; do
     docker-compose restart server
     sleep 2
   done

   # Expected: Circuit breaker opens after threshold
   # Expected: Agent stops trying for configured timeout period
   # Expected: Circuit breaker enters half-open state after timeout
   ```

4. **Configuration Test:**

   ```bash
   # Test with different retry configurations
   # Verify configurable parameters work correctly
   # Test edge cases (max retries = 0, very long delays, etc.)
   ```

## Files to Modify

- `aggregator-agent/cmd/agent/main.go` (check-in loop logic)
- `aggregator-agent/internal/resilience/` (new package for retry/circuit breaker)
- `aggregator-agent/internal/health/` (new package for health checks)
- Agent configuration files for retry parameters

## Impact

- **Production Reliability:** Enables automatic recovery from server maintenance
- **Operational Efficiency:** Eliminates need for manual agent restarts
- **User Experience:** Transparent handling of server issues
- **Scalability:** Supports large deployments with automatic recovery
- **Monitoring:** Provides visibility into recovery attempts

## Configuration Options

```yaml
# Agent config additions
resilience:
  retry:
    initial_delay: 5s
    max_delay: 5m
    max_retries: 10
    backoff_multiplier: 2.0

  circuit_breaker:
    failure_threshold: 5
    timeout: 30s
    half_open_max_calls: 3

  health_check:
    enabled: true
    interval: 30s
    timeout: 5s
```
## Monitoring and Observability

### Metrics to Track

- Retry attempt counts
- Circuit breaker state changes
- Connection failure rates
- Recovery time statistics
- Health check success/failure rates

### Log Examples

```
2025/11/12 14:30:15 [RETRY] Check-in failed, retry 1/10 in 5s: connection refused
2025/11/12 14:30:20 [RETRY] Check-in failed, retry 2/10 in 10s: connection refused
2025/11/12 14:30:35 [CIRCUIT_BREAKER] Opening circuit after 5 consecutive failures
2025/11/12 14:31:05 [CIRCUIT_BREAKER] Entering half-open state
2025/11/12 14:31:05 [RECOVERY] Health check passed, resuming normal operations
2025/11/12 14:31:05 [CHECKIN] Successfully checked in after server recovery
```
## Verification Commands

After implementation:

```bash
# Monitor agent during server restart
sudo journalctl -u redflag-agent -f | grep -E "(RETRY|CIRCUIT|RECOVERY|HEALTH)"

# Test recovery without manual intervention
docker-compose stop server
sleep 15
docker-compose start server

# Agent should recover automatically
```