# P1-002: Agent Timeout Handling Too Aggressive **Priority:** P1 (Major) **Source Reference:** From needsfixingbeforepush.md line 226 **Date Identified:** 2025-11-12 ## Problem Description Agent uses timeout as catch-all for all scanner operations, but many operations already capture and return proper errors. Timeouts mask real error conditions and prevent proper error handling, causing false timeout errors when operations are actually working but just taking longer than expected. ## Current Behavior - DNF scanner timeout: 45 seconds (far too short for bulk operations) - Scanner timeout triggers even when scanner already reported proper error - Timeout kills scanner process mid-operation - No distinction between slow operation vs actual hang - Generic "scan timeout after 45s" errors hide real issues ### Example Error ``` 2025/11/02 19:54:27 [dnf] Scan failed: scan timeout after 45s ``` **Reality:** DNF was still working, just takes >45s for large update lists. Real DNF errors (network, permissions, etc.) were already captured but masked by timeout. ## Root Cause Analysis 1. **Uniform Timeout Value:** All scanners use same 45-second timeout regardless of operation complexity 2. **Timeout Overrides Real Errors:** Scanner-specific errors are replaced with generic timeout message 3. **No Progress Detection:** No way to distinguish between "working but slow" vs "actually hung" 4. **No User Configuration:** Timeouts are hardcoded, not tunable per environment 5. **Kill vs Graceful:** Timeout kills process instead of allowing graceful completion ## Affected Components All scanner subsystems in `aggregator-agent/internal/scanner/`: - DNF scanner (most affected - large package lists) - APT scanner - Docker scanner - Windows Update scanner - Winget scanner - Storage/Disk scanner ## Proposed Solution ### 1. Scanner-Specific Timeout Configuration ```go type ScannerConfig struct { Timeout time.Duration SlowThreshold time.Duration // Time before considering operation "slow" ProgressCheck time.Duration // Interval to check for progress } var DefaultScannerConfigs = map[string]ScannerConfig{ "dnf": { Timeout: 5 * time.Minute, // DNF can be slow with large repos SlowThreshold: 2 * time.Minute, // Warn if taking >2min ProgressCheck: 30 * time.Second, // Check progress every 30s }, "docker": { Timeout: 2 * time.Minute, // Registry queries SlowThreshold: 45 * time.Second, ProgressCheck: 15 * time.Second, }, "apt": { Timeout: 3 * time.Minute, // Package index updates SlowThreshold: 1 * time.Minute, ProgressCheck: 20 * time.Second, }, "storage": { Timeout: 1 * time.Minute, // Filesystem scans SlowThreshold: 20 * time.Second, ProgressCheck: 10 * time.Second, }, } ``` ### 2. Progress-Based Timeout Detection ```go type ProgressTracker struct { LastProgress time.Time LastOutput string Counter int64 } func (pt *ProgressTracker) Update() { pt.LastProgress = time.Now() pt.Counter++ } func (pt *ProgressTracker) IsStalled(timeout time.Duration) bool { return time.Since(pt.LastProgress) > timeout } ``` ### 3. Enhanced Error Handling ```go func (s *Scanner) executeWithProgressTracking(cmd *exec.Cmd, config ScannerConfig) (*ScanResult, error) { progress := &ProgressTracker{} var stdout, stderr bytes.Buffer var result *ScanResult cmd.Stdout = &stdout cmd.Stderr = &stderr // Start progress monitor progressCtx, progressCancel := context.WithCancel(context.Background()) go func() { ticker := time.NewTicker(config.ProgressCheck) defer ticker.Stop() for { select { case <-ticker.C: progress.Update() if progress.IsStalled(config.Timeout) { cmd.Process.Kill() // Force kill if truly stuck progressCancel() return } case <-progressCtx.Done(): return } } }() defer progressCancel() // Execute command err := cmd.Run() progressCancel() // Stop progress monitor // Check for real errors first if err != nil { if exitError, ok := err.(*exec.ExitError); ok { return nil, fmt.Errorf("scanner failed with exit code %d: %s", exitError.ExitCode(), stderr.String()) } return nil, fmt.Errorf("scanner execution failed: %w", err) } // Parse real results result, parseErr := s.parseOutput(stdout.String(), stderr.String()) if parseErr != nil { return nil, fmt.Errorf("failed to parse scanner output: %w", parseErr) } return result, nil } ``` ### 4. User-Configurable Timeouts ```yaml # /etc/aggregator/config.json additions { "scanner_timeouts": { "dnf": "5m", "apt": "3m", "docker": "2m", "storage": "1m", "default": "2m" } } ``` ## Definition of Done - [ ] Scanner-specific timeout values appropriate for each subsystem - [ ] Progress tracking distinguishes between slow vs stuck operations - [ ] Real scanner errors are preserved, not masked by timeouts - [ ] User-configurable timeout values per scanner backend - [ ] Proper error messages reflect actual scanner failures - [ ] Configurable slow-operation warnings for monitoring ## Test Plan ### 1. Large Package List Test ```bash # Test DNF with many packages (Fedora/CentOS) sudo dnf update --downloadonly --downloaddir=/tmp/test # Measure actual time, should be >45s but <5min # Agent should complete successfully, not timeout after 45s ``` ### 2. Network Error Test ```bash # Simulate network connectivity issue sudo iptables -A OUTPUT -p tcp --dport 443 -j DROP # Run scanner - should return network error, not timeout # Expected: "failed to resolve host" or similar # NOT: "scan timeout after 45s" sudo iptables -D OUTPUT -p tcp --dport 443 -j DROP ``` ### 3. Permission Error Test ```bash # Run scanner as non-root user su - nobody -s /bin/bash -c "redflag-agent --scan-only storage" # Should return permission error, not timeout # Expected: "permission denied" or similar # NOT: "scan timeout after 45s" ``` ### 4. Configuration Test ```bash # Test custom timeout configuration echo '{"scanner_timeouts":{"dnf":"10m"}}' | sudo tee /etc/aggregator/custom-timeouts.json # Agent should use custom 10-minute timeout for DNF ``` ### 5. Progress Detection Test ```bash # Monitor scanner logs during slow operation sudo journalctl -u redflag-agent -f | grep -E "(progress|slow|timeout)" # Should see progress logs, not immediate timeout ``` ## Files to Modify - `aggregator-agent/internal/scanner/*.go` (all scanner implementations) - `aggregator-agent/cmd/agent/main.go` (scanner execution logic) - `aggregator-agent/internal/config/` (timeout configuration) - Add progress tracking utilities - Update error handling patterns ## Impact - **False Error Reduction:** Eliminates fake timeout errors when operations are working - **Better Debugging:** Real error messages enable proper troubleshooting - **User Experience:** Scans complete successfully even on large systems - **Monitoring:** Clear distinction between slow vs broken operations - **Flexibility:** Users can tune timeouts for their environment ## Configuration Examples ### Production Environment ```json { "scanner_timeouts": { "dnf": "10m", // Large enterprise repos "apt": "8m", // Slow network environments "docker": "5m", // Remote registries "storage": "2m", // Large filesystems "default": "5m" } } ``` ### Development Environment ```json { "scanner_timeouts": { "dnf": "2m", "apt": "1m30s", "docker": "1m", "storage": "30s", "default": "2m" } } ``` ## Monitoring and Logging ### Enhanced Log Messages ``` 2025/11/12 14:30:15 [dnf] Starting scan... 2025/11/12 14:31:15 [dnf] [PROGRESS] Scan in progress, 45 packages processed 2025/11/12 14:32:15 [dnf] [SLOW] Scan taking longer than expected (60s elapsed) 2025/11/12 14:33:00 [dnf] [PROGRESS] Scan continuing, 89 packages processed 2025/11/12 14:34:30 [dnf] Scan completed: found 124 updates (2m15s elapsed) ``` ### Error Message Examples ``` # Instead of: "scan timeout after 45s" # Real network error: [dnf] Scan failed: Failed to download metadata for repo 'updates': Cannot download repomd.xml # Real permission error: [storage] Scan failed: Permission denied accessing /var/log/audit/audit.log # Real dependency error: [apt] Scan failed: Unable to locate package dependency-resolver ``` ## Verification Commands After implementation: ```bash # Test that large operations complete time redflag-agent --scan-only updates # Should complete successfully, not timeout at 45s # Test that real errors are preserved sudo -u nobody redflag-agent --scan-only storage # Should show permission error, not timeout # Monitor progress tracking sudo journalctl -u redflag-agent -f | grep PROGRESS # Should show progress updates during long operations ```