8.8 KiB
8.8 KiB
P4-002: Scanner Timeout Optimization and Error Handling
Priority: P4 (Technical Debt) Source Reference: From analysis of needsfixingbeforepush.md lines 226-270 Date Identified: 2025-11-12
Problem Description
Agent uses a universal 45-second timeout for all scanner operations, which masks real error conditions and prevents proper error handling. Many scanner operations already capture and return proper errors, but timeouts kill scanners mid-operation, preventing meaningful error messages from reaching users.
Impact
- False Timeouts: Legitimate slow operations fail unnecessarily
- Error Masking: Real scanner errors are replaced with generic "timeout" messages
- Troubleshooting Difficulty: Logs don't reflect actual problems
- User Experience: Users can't distinguish between slow operations vs actual hangs
- Resource Waste: Operations are killed when they could succeed given more time
Current Behavior
- DNF scanner timeout: 45 seconds (far too short for bulk operations)
- Universal timeout applied to all scanners regardless of operation type
- Timeout kills scanner process even when scanner reported proper error
- No distinction between "no progress" hang vs "slow but working"
Specific Examples
2025/11/02 19:54:27 [dnf] Scan failed: scan timeout after 45s
- DNF was actively working, just taking >45s for large update lists
- Real DNF errors (network, permissions, etc.) already captured by scanner
- Timeout prevents proper error propagation to user
Proposed Solution
Implement scanner-specific timeout strategies and better error handling:
1. Per-Scanner Timeout Configuration
type ScannerTimeoutConfig struct {
DNF time.Duration `yaml:"dnf"`
APT time.Duration `yaml:"apt"`
Docker time.Duration `yaml:"docker"`
Windows time.Duration `yaml:"windows"`
Winget time.Duration `yaml:"winget"`
Storage time.Duration `yaml:"storage"`
}
var DefaultTimeouts = ScannerTimeoutConfig{
DNF: 5 * time.Minute, // Large package lists
APT: 3 * time.Minute, // Generally faster
Docker: 2 * time.Minute, // Registry queries
Windows: 10 * time.Minute, // Windows Update can be slow
Winget: 3 * time.Minute, // Package manager queries
Storage: 1 * time.Minute, // Filesystem operations
}
2. Progress-Based Timeout Detection
type ProgressTracker struct {
LastProgress time.Time
CheckInterval time.Duration
MaxStaleTime time.Duration
}
func (pt *ProgressTracker) CheckProgress() bool {
now := time.Now()
if now.Sub(pt.LastProgress) > pt.MaxStaleTime {
return false // No progress for too long
}
return true
}
// Scanner implementation updates progress
func (s *DNFScanner) scanWithProgress() ([]UpdateReportItem, error) {
pt := &ProgressTracker{
CheckInterval: 30 * time.Second,
MaxStaleTime: 2 * time.Minute,
}
result := make(chan []UpdateReportItem, 1)
errors := make(chan error, 1)
go func() {
updates, err := s.performDNFScan()
if err != nil {
errors <- err
return
}
result <- updates
}()
// Monitor for progress or completion
ticker := time.NewTicker(pt.CheckInterval)
defer ticker.Stop()
for {
select {
case updates := <-result:
return updates, nil
case err := <-errors:
return nil, err
case <-ticker.C:
if !pt.CheckProgress() {
return nil, fmt.Errorf("scanner appears stuck - no progress for %v", pt.MaxStaleTime)
}
case <-time.After(s.timeout):
return nil, fmt.Errorf("scanner timeout after %v", s.timeout)
}
}
}
3. Smart Error Preservation
func (s *ScannerWrapper) ExecuteWithTimeout(scanner Scanner, timeout time.Duration) ([]UpdateReportItem, error) {
ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()
done := make(chan struct{})
var result []UpdateReportItem
var scanErr error
go func() {
defer close(done)
result, scanErr = scanner.ScanForUpdates()
}()
select {
case <-done:
// Scanner completed - return its actual error
return result, scanErr
case <-ctx.Done():
// Timeout - check if scanner provided progress info
if progressInfo := scanner.GetLastProgress(); progressInfo != "" {
return nil, fmt.Errorf("scanner timeout after %v (last progress: %s)", timeout, progressInfo)
}
return nil, fmt.Errorf("scanner timeout after %v (no progress detected)", timeout)
}
}
4. User-Configurable Timeouts
{
"scanners": {
"timeouts": {
"dnf": "5m",
"apt": "3m",
"docker": "2m",
"windows": "10m",
"winget": "3m",
"storage": "1m"
},
"progress_detection": {
"enabled": true,
"check_interval": "30s",
"max_stale_time": "2m"
}
}
}
Definition of Done
- Scanner-specific timeouts implemented and configurable
- Progress-based timeout detection differentiates hangs from slow operations
- Scanner's actual error messages preserved when available
- Users can tune timeouts per scanner backend in settings
- Clear distinction between "no progress" vs "operation in progress"
- Backward compatibility with existing configuration
- Enhanced logging shows scanner progress and timeout reasons
Implementation Details
File Locations
- Primary:
aggregator-agent/internal/orchestrator/scanner_wrappers.go - Config:
aggregator-agent/internal/config/config.go - Scanners:
aggregator-agent/internal/scanner/*.go(add progress tracking)
Configuration Integration
type AgentConfig struct {
// ... existing fields ...
ScannerTimeouts ScannerTimeoutConfig `json:"scanner_timeouts"`
}
func (c *AgentConfig) GetTimeout(scannerType string) time.Duration {
switch scannerType {
case "dnf":
return c.ScannerTimeouts.DNF
case "apt":
return c.ScannerTimeouts.APT
// ... other cases
default:
return 2 * time.Minute // sensible default
}
}
Scanner Interface Enhancement
type Scanner interface {
ScanForUpdates() ([]UpdateReportItem, error)
GetLastProgress() string // New: return human-readable progress info
IsMakingProgress() bool // New: quick check if scanner is active
}
Enhanced Error Reporting
type ScannerError struct {
Type string `json:"type"` // "timeout", "permission", "network", etc.
Scanner string `json:"scanner"` // "dnf", "apt", etc.
Message string `json:"message"` // Human-readable error
Details string `json:"details"` // Technical details
Timestamp time.Time `json:"timestamp"`
Duration time.Duration `json:"duration"`
}
func (e ScannerError) Error() string {
return fmt.Sprintf("[%s] %s: %s", e.Scanner, e.Type, e.Message)
}
Testing Strategy
Unit Tests
- Per-scanner timeout configuration
- Progress tracking accuracy
- Error preservation logic
- Configuration validation
Integration Tests
- Large package list handling (simulated DNF bulk operations)
- Slow network conditions
- Permission error scenarios
- Scanner progress detection
Manual Test Scenarios
-
Large Update Lists:
- Configure test system with many available updates
- Verify DNF scanner completes within 5-minute window
- Check that timeout messages include progress info
-
Network Issues:
- Block package manager network access
- Verify scanner returns network error, not timeout
- Confirm meaningful error messages
-
Configuration Testing:
- Test with custom timeout values
- Verify configuration changes take effect
- Test invalid configuration handling
Prerequisites
- Scanner wrapper architecture exists
- Configuration system supports nested structures
- Logging infrastructure supports structured output
- Context cancellation pattern available
Effort Estimate
Complexity: Medium Effort: 2-3 days
- Day 1: Timeout configuration and basic implementation
- Day 2: Progress tracking and error preservation
- Day 3: Scanner interface enhancements and testing
Success Metrics
- Reduction in false timeout errors by >80%
- Users receive meaningful error messages for scanner failures
- Large update lists complete successfully without timeout
- Configuration changes take effect without restart
- Scanner progress visible in logs
- No regression in scanner reliability
Monitoring
Track these metrics after implementation:
- Scanner timeout rate (by scanner type)
- Average scanner duration (by scanner type)
- Error message clarity score (user feedback)
- User configuration changes to timeouts
- Scanner success rate improvement