Files
Redflag/docs/3_BACKLOG/P1-002_Agent-Timeout-Handling.md

307 lines
9.0 KiB
Markdown

# P1-002: Agent Timeout Handling Too Aggressive
**Priority:** P1 (Major)
**Source Reference:** From needsfixingbeforepush.md line 226
**Date Identified:** 2025-11-12
## Problem Description
Agent uses timeout as catch-all for all scanner operations, but many operations already capture and return proper errors. Timeouts mask real error conditions and prevent proper error handling, causing false timeout errors when operations are actually working but just taking longer than expected.
## Current Behavior
- DNF scanner timeout: 45 seconds (far too short for bulk operations)
- Scanner timeout triggers even when scanner already reported proper error
- Timeout kills scanner process mid-operation
- No distinction between slow operation vs actual hang
- Generic "scan timeout after 45s" errors hide real issues
### Example Error
```
2025/11/02 19:54:27 [dnf] Scan failed: scan timeout after 45s
```
**Reality:** DNF was still working, just takes >45s for large update lists. Real DNF errors (network, permissions, etc.) were already captured but masked by timeout.
## Root Cause Analysis
1. **Uniform Timeout Value:** All scanners use same 45-second timeout regardless of operation complexity
2. **Timeout Overrides Real Errors:** Scanner-specific errors are replaced with generic timeout message
3. **No Progress Detection:** No way to distinguish between "working but slow" vs "actually hung"
4. **No User Configuration:** Timeouts are hardcoded, not tunable per environment
5. **Kill vs Graceful:** Timeout kills process instead of allowing graceful completion
## Affected Components
All scanner subsystems in `aggregator-agent/internal/scanner/`:
- DNF scanner (most affected - large package lists)
- APT scanner
- Docker scanner
- Windows Update scanner
- Winget scanner
- Storage/Disk scanner
## Proposed Solution
### 1. Scanner-Specific Timeout Configuration
```go
type ScannerConfig struct {
Timeout time.Duration
SlowThreshold time.Duration // Time before considering operation "slow"
ProgressCheck time.Duration // Interval to check for progress
}
var DefaultScannerConfigs = map[string]ScannerConfig{
"dnf": {
Timeout: 5 * time.Minute, // DNF can be slow with large repos
SlowThreshold: 2 * time.Minute, // Warn if taking >2min
ProgressCheck: 30 * time.Second, // Check progress every 30s
},
"docker": {
Timeout: 2 * time.Minute, // Registry queries
SlowThreshold: 45 * time.Second,
ProgressCheck: 15 * time.Second,
},
"apt": {
Timeout: 3 * time.Minute, // Package index updates
SlowThreshold: 1 * time.Minute,
ProgressCheck: 20 * time.Second,
},
"storage": {
Timeout: 1 * time.Minute, // Filesystem scans
SlowThreshold: 20 * time.Second,
ProgressCheck: 10 * time.Second,
},
}
```
### 2. Progress-Based Timeout Detection
```go
type ProgressTracker struct {
LastProgress time.Time
LastOutput string
Counter int64
}
func (pt *ProgressTracker) Update() {
pt.LastProgress = time.Now()
pt.Counter++
}
func (pt *ProgressTracker) IsStalled(timeout time.Duration) bool {
return time.Since(pt.LastProgress) > timeout
}
```
### 3. Enhanced Error Handling
```go
func (s *Scanner) executeWithProgressTracking(cmd *exec.Cmd, config ScannerConfig) (*ScanResult, error) {
progress := &ProgressTracker{}
var stdout, stderr bytes.Buffer
var result *ScanResult
cmd.Stdout = &stdout
cmd.Stderr = &stderr
// Start progress monitor
progressCtx, progressCancel := context.WithCancel(context.Background())
go func() {
ticker := time.NewTicker(config.ProgressCheck)
defer ticker.Stop()
for {
select {
case <-ticker.C:
progress.Update()
if progress.IsStalled(config.Timeout) {
cmd.Process.Kill() // Force kill if truly stuck
progressCancel()
return
}
case <-progressCtx.Done():
return
}
}
}()
defer progressCancel()
// Execute command
err := cmd.Run()
progressCancel() // Stop progress monitor
// Check for real errors first
if err != nil {
if exitError, ok := err.(*exec.ExitError); ok {
return nil, fmt.Errorf("scanner failed with exit code %d: %s",
exitError.ExitCode(), stderr.String())
}
return nil, fmt.Errorf("scanner execution failed: %w", err)
}
// Parse real results
result, parseErr := s.parseOutput(stdout.String(), stderr.String())
if parseErr != nil {
return nil, fmt.Errorf("failed to parse scanner output: %w", parseErr)
}
return result, nil
}
```
### 4. User-Configurable Timeouts
```yaml
# /etc/aggregator/config.json additions
{
"scanner_timeouts": {
"dnf": "5m",
"apt": "3m",
"docker": "2m",
"storage": "1m",
"default": "2m"
}
}
```
## Definition of Done
- [ ] Scanner-specific timeout values appropriate for each subsystem
- [ ] Progress tracking distinguishes between slow vs stuck operations
- [ ] Real scanner errors are preserved, not masked by timeouts
- [ ] User-configurable timeout values per scanner backend
- [ ] Proper error messages reflect actual scanner failures
- [ ] Configurable slow-operation warnings for monitoring
## Test Plan
### 1. Large Package List Test
```bash
# Test DNF with many packages (Fedora/CentOS)
sudo dnf update --downloadonly --downloaddir=/tmp/test
# Measure actual time, should be >45s but <5min
# Agent should complete successfully, not timeout after 45s
```
### 2. Network Error Test
```bash
# Simulate network connectivity issue
sudo iptables -A OUTPUT -p tcp --dport 443 -j DROP
# Run scanner - should return network error, not timeout
# Expected: "failed to resolve host" or similar
# NOT: "scan timeout after 45s"
sudo iptables -D OUTPUT -p tcp --dport 443 -j DROP
```
### 3. Permission Error Test
```bash
# Run scanner as non-root user
su - nobody -s /bin/bash -c "redflag-agent --scan-only storage"
# Should return permission error, not timeout
# Expected: "permission denied" or similar
# NOT: "scan timeout after 45s"
```
### 4. Configuration Test
```bash
# Test custom timeout configuration
echo '{"scanner_timeouts":{"dnf":"10m"}}' | sudo tee /etc/aggregator/custom-timeouts.json
# Agent should use custom 10-minute timeout for DNF
```
### 5. Progress Detection Test
```bash
# Monitor scanner logs during slow operation
sudo journalctl -u redflag-agent -f | grep -E "(progress|slow|timeout)"
# Should see progress logs, not immediate timeout
```
## Files to Modify
- `aggregator-agent/internal/scanner/*.go` (all scanner implementations)
- `aggregator-agent/cmd/agent/main.go` (scanner execution logic)
- `aggregator-agent/internal/config/` (timeout configuration)
- Add progress tracking utilities
- Update error handling patterns
## Impact
- **False Error Reduction:** Eliminates fake timeout errors when operations are working
- **Better Debugging:** Real error messages enable proper troubleshooting
- **User Experience:** Scans complete successfully even on large systems
- **Monitoring:** Clear distinction between slow vs broken operations
- **Flexibility:** Users can tune timeouts for their environment
## Configuration Examples
### Production Environment
```json
{
"scanner_timeouts": {
"dnf": "10m", // Large enterprise repos
"apt": "8m", // Slow network environments
"docker": "5m", // Remote registries
"storage": "2m", // Large filesystems
"default": "5m"
}
}
```
### Development Environment
```json
{
"scanner_timeouts": {
"dnf": "2m",
"apt": "1m30s",
"docker": "1m",
"storage": "30s",
"default": "2m"
}
}
```
## Monitoring and Logging
### Enhanced Log Messages
```
2025/11/12 14:30:15 [dnf] Starting scan...
2025/11/12 14:31:15 [dnf] [PROGRESS] Scan in progress, 45 packages processed
2025/11/12 14:32:15 [dnf] [SLOW] Scan taking longer than expected (60s elapsed)
2025/11/12 14:33:00 [dnf] [PROGRESS] Scan continuing, 89 packages processed
2025/11/12 14:34:30 [dnf] Scan completed: found 124 updates (2m15s elapsed)
```
### Error Message Examples
```
# Instead of: "scan timeout after 45s"
# Real network error:
[dnf] Scan failed: Failed to download metadata for repo 'updates': Cannot download repomd.xml
# Real permission error:
[storage] Scan failed: Permission denied accessing /var/log/audit/audit.log
# Real dependency error:
[apt] Scan failed: Unable to locate package dependency-resolver
```
## Verification Commands
After implementation:
```bash
# Test that large operations complete
time redflag-agent --scan-only updates
# Should complete successfully, not timeout at 45s
# Test that real errors are preserved
sudo -u nobody redflag-agent --scan-only storage
# Should show permission error, not timeout
# Monitor progress tracking
sudo journalctl -u redflag-agent -f | grep PROGRESS
# Should show progress updates during long operations
```