# P4-002: Scanner Timeout Optimization and Error Handling

**Priority:** P4 (Technical Debt)
**Source Reference:** From analysis of needsfixingbeforepush.md lines 226-270
**Date Identified:** 2025-11-12

## Problem Description

The agent uses a universal 45-second timeout for all scanner operations, which masks real error conditions and prevents proper error handling. Many scanner operations already capture and return proper errors, but the timeout kills scanners mid-operation, so meaningful error messages never reach users.

## Impact

- **False Timeouts:** Legitimate slow operations fail unnecessarily
- **Error Masking:** Real scanner errors are replaced with generic "timeout" messages
- **Troubleshooting Difficulty:** Logs don't reflect actual problems
- **User Experience:** Users can't distinguish between slow operations and actual hangs
- **Resource Waste:** Operations are killed when they could succeed given more time

## Current Behavior

- DNF scanner timeout: 45 seconds (far too short for bulk operations)
- Universal timeout applied to all scanners regardless of operation type
- Timeout kills the scanner process even when the scanner has reported a proper error
- No distinction between a "no progress" hang and a "slow but working" operation

## Specific Examples

```
2025/11/02 19:54:27 [dnf] Scan failed: scan timeout after 45s
```

- DNF was actively working, just taking >45s for large update lists
- Real DNF errors (network, permissions, etc.) are already captured by the scanner
- The timeout prevents proper error propagation to the user

## Proposed Solution

Implement scanner-specific timeout strategies and better error handling:

### 1. Per-Scanner Timeout Configuration

```go
type ScannerTimeoutConfig struct {
	DNF     time.Duration `yaml:"dnf"`
	APT     time.Duration `yaml:"apt"`
	Docker  time.Duration `yaml:"docker"`
	Windows time.Duration `yaml:"windows"`
	Winget  time.Duration `yaml:"winget"`
	Storage time.Duration `yaml:"storage"`
}

var DefaultTimeouts = ScannerTimeoutConfig{
	DNF:     5 * time.Minute,  // Large package lists
	APT:     3 * time.Minute,  // Generally faster
	Docker:  2 * time.Minute,  // Registry queries
	Windows: 10 * time.Minute, // Windows Update can be slow
	Winget:  3 * time.Minute,  // Package manager queries
	Storage: 1 * time.Minute,  // Filesystem operations
}
```

### 2. Progress-Based Timeout Detection

```go
import (
	"fmt"
	"sync"
	"time"
)

type ProgressTracker struct {
	mu            sync.Mutex // guards LastProgress
	LastProgress  time.Time
	CheckInterval time.Duration
	MaxStaleTime  time.Duration
}

// MarkProgress records that the scanner just did useful work.
func (pt *ProgressTracker) MarkProgress() {
	pt.mu.Lock()
	defer pt.mu.Unlock()
	pt.LastProgress = time.Now()
}

// CheckProgress reports whether progress was recorded within MaxStaleTime.
func (pt *ProgressTracker) CheckProgress() bool {
	pt.mu.Lock()
	defer pt.mu.Unlock()
	return time.Since(pt.LastProgress) <= pt.MaxStaleTime
}

// Scanner implementation updates progress
func (s *DNFScanner) scanWithProgress() ([]UpdateReportItem, error) {
	pt := &ProgressTracker{
		LastProgress:  time.Now(), // start the clock now; a zero time would look stale at once
		CheckInterval: 30 * time.Second,
		MaxStaleTime:  2 * time.Minute,
	}

	result := make(chan []UpdateReportItem, 1)
	errCh := make(chan error, 1) // not named "errors", to avoid shadowing the errors package

	go func() {
		// Assumption: performDNFScan takes the tracker and calls pt.MarkProgress() as it works.
		updates, err := s.performDNFScan(pt)
		if err != nil {
			errCh <- err
			return
		}
		result <- updates
	}()

	// Monitor for progress or completion
	ticker := time.NewTicker(pt.CheckInterval)
	defer ticker.Stop()

	// Create the overall deadline once; time.After inside the loop would restart on every tick.
	deadline := time.NewTimer(s.timeout)
	defer deadline.Stop()

	for {
		select {
		case updates := <-result:
			return updates, nil
		case err := <-errCh:
			return nil, err
		case <-ticker.C:
			if !pt.CheckProgress() {
				return nil, fmt.Errorf("scanner appears stuck - no progress for %v", pt.MaxStaleTime)
			}
		case <-deadline.C:
			return nil, fmt.Errorf("scanner timeout after %v", s.timeout)
		}
	}
}
```

### 3. Smart Error Preservation

```go
import (
	"context"
	"fmt"
	"time"
)

func (s *ScannerWrapper) ExecuteWithTimeout(scanner Scanner, timeout time.Duration) ([]UpdateReportItem, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan struct{})
	var result []UpdateReportItem
	var scanErr error

	// ScanForUpdates takes no context, so on timeout this goroutine keeps
	// running in the background until the scan returns on its own.
	go func() {
		defer close(done)
		result, scanErr = scanner.ScanForUpdates()
	}()

	select {
	case <-done:
		// Scanner completed - return its actual error
		return result, scanErr
	case <-ctx.Done():
		// Timeout - check if scanner provided progress info
		if progressInfo := scanner.GetLastProgress(); progressInfo != "" {
			return nil, fmt.Errorf("scanner timeout after %v (last progress: %s)", timeout, progressInfo)
		}
		return nil, fmt.Errorf("scanner timeout after %v (no progress detected)", timeout)
	}
}
```

### 4. User-Configurable Timeouts

```json
{
  "scanners": {
    "timeouts": {
      "dnf": "5m",
      "apt": "3m",
      "docker": "2m",
      "windows": "10m",
      "winget": "3m",
      "storage": "1m"
    },
    "progress_detection": {
      "enabled": true,
      "check_interval": "30s",
      "max_stale_time": "2m"
    }
  }
}
```

## Definition of Done

- [ ] Scanner-specific timeouts implemented and configurable
- [ ] Progress-based timeout detection differentiates hangs from slow operations
- [ ] Scanner's actual error messages preserved when available
- [ ] Users can tune timeouts per scanner backend in settings
- [ ] Clear distinction between "no progress" and "operation in progress"
- [ ] Backward compatibility with existing configuration
- [ ] Enhanced logging shows scanner progress and timeout reasons

## Implementation Details

### File Locations

- **Primary:** `aggregator-agent/internal/orchestrator/scanner_wrappers.go`
- **Config:** `aggregator-agent/internal/config/config.go`
- **Scanners:** `aggregator-agent/internal/scanner/*.go` (add progress tracking)

### Configuration Integration

```go
type AgentConfig struct {
	// ... existing fields ...
	// Note: encoding/json decodes time.Duration from integer nanoseconds, so
	// string values such as "5m" need a custom type or time.ParseDuration.
	ScannerTimeouts ScannerTimeoutConfig `json:"scanner_timeouts"`
}

// GetTimeout returns the configured timeout for the given scanner type.
func (c *AgentConfig) GetTimeout(scannerType string) time.Duration {
	switch scannerType {
	case "dnf":
		return c.ScannerTimeouts.DNF
	case "apt":
		return c.ScannerTimeouts.APT
	// ... other cases
	default:
		return 2 * time.Minute // sensible default
	}
}
```

### Scanner Interface Enhancement

```go
type Scanner interface {
	ScanForUpdates() ([]UpdateReportItem, error)
	GetLastProgress() string // New: return human-readable progress info
	IsMakingProgress() bool  // New: quick check if scanner is active
}
```

### Enhanced Error Reporting

```go
import (
	"fmt"
	"time"
)

type ScannerError struct {
	Type      string        `json:"type"`    // "timeout", "permission", "network", etc.
	Scanner   string        `json:"scanner"` // "dnf", "apt", etc.
	Message   string        `json:"message"` // Human-readable error
	Details   string        `json:"details"` // Technical details
	Timestamp time.Time     `json:"timestamp"`
	Duration  time.Duration `json:"duration"`
}

// Error implements the error interface.
func (e ScannerError) Error() string {
	return fmt.Sprintf("[%s] %s: %s", e.Scanner, e.Type, e.Message)
}
```

## Testing Strategy

### Unit Tests

- Per-scanner timeout configuration
- Progress tracking accuracy
- Error preservation logic
- Configuration validation

### Integration Tests

- Large package list handling (simulated DNF bulk operations)
- Slow network conditions
- Permission error scenarios
- Scanner progress detection

### Manual Test Scenarios

1. **Large Update Lists:**
   - Configure test system with many available updates
   - Verify DNF scanner completes within 5-minute window
   - Check that timeout messages include progress info

2. **Network Issues:**
   - Block package manager network access
   - Verify scanner returns network error, not timeout
   - Confirm meaningful error messages

3. **Configuration Testing:**
   - Test with custom timeout values
   - Verify configuration changes take effect
   - Test invalid configuration handling

## Prerequisites

- Scanner wrapper architecture exists
- Configuration system supports nested structures
- Logging infrastructure supports structured output
- Context cancellation pattern available

## Effort Estimate

**Complexity:** Medium
**Effort:** 2-3 days

- Day 1: Timeout configuration and basic implementation
- Day 2: Progress tracking and error preservation
- Day 3: Scanner interface enhancements and testing

## Success Metrics

- Reduction in false timeout errors by >80%
- Users receive meaningful error messages for scanner failures
- Large update lists complete successfully without timeout
- Configuration changes take effect without restart
- Scanner progress visible in logs
- No regression in scanner reliability

## Monitoring

Track these metrics after implementation:

- Scanner timeout rate (by scanner type)
- Average scanner duration (by scanner type)
- Error message clarity score (user feedback)
- User configuration changes to timeouts
- Scanner success rate improvement