Files
Redflag/docs/3_BACKLOG/P4-002_Scanner-Timeout-Optimization.md

8.8 KiB

P4-002: Scanner Timeout Optimization and Error Handling

Priority: P4 (Technical Debt) Source Reference: From analysis of needsfixingbeforepush.md lines 226-270 Date Identified: 2025-11-12

Problem Description

Agent uses a universal 45-second timeout for all scanner operations, which masks real error conditions and prevents proper error handling. Many scanner operations already capture and return proper errors, but timeouts kill scanners mid-operation, preventing meaningful error messages from reaching users.

Impact

  • False Timeouts: Legitimate slow operations fail unnecessarily
  • Error Masking: Real scanner errors are replaced with generic "timeout" messages
  • Troubleshooting Difficulty: Logs don't reflect actual problems
  • User Experience: Users can't distinguish between slow operations vs actual hangs
  • Resource Waste: Operations are killed when they could succeed given more time

Current Behavior

  • DNF scanner timeout: 45 seconds (far too short for bulk operations)
  • Universal timeout applied to all scanners regardless of operation type
  • Timeout kills scanner process even when scanner reported proper error
  • No distinction between "no progress" hang vs "slow but working"

Specific Examples

2025/11/02 19:54:27 [dnf] Scan failed: scan timeout after 45s
  • DNF was actively working, just taking >45s for large update lists
  • Real DNF errors (network, permissions, etc.) already captured by scanner
  • Timeout prevents proper error propagation to user

Proposed Solution

Implement scanner-specific timeout strategies and better error handling:

1. Per-Scanner Timeout Configuration

type ScannerTimeoutConfig struct {
    DNF      time.Duration `yaml:"dnf"`
    APT      time.Duration `yaml:"apt"`
    Docker   time.Duration `yaml:"docker"`
    Windows  time.Duration `yaml:"windows"`
    Winget   time.Duration `yaml:"winget"`
    Storage  time.Duration `yaml:"storage"`
}

var DefaultTimeouts = ScannerTimeoutConfig{
    DNF:     5 * time.Minute,      // Large package lists
    APT:     3 * time.Minute,      // Generally faster
    Docker:  2 * time.Minute,      // Registry queries
    Windows: 10 * time.Minute,     // Windows Update can be slow
    Winget:  3 * time.Minute,      // Package manager queries
    Storage: 1 * time.Minute,      // Filesystem operations
}

2. Progress-Based Timeout Detection

type ProgressTracker struct {
    LastProgress time.Time
    CheckInterval time.Duration
    MaxStaleTime  time.Duration
}

func (pt *ProgressTracker) CheckProgress() bool {
    now := time.Now()
    if now.Sub(pt.LastProgress) > pt.MaxStaleTime {
        return false // No progress for too long
    }
    return true
}

// Scanner implementation updates progress
func (s *DNFScanner) scanWithProgress() ([]UpdateReportItem, error) {
    pt := &ProgressTracker{
        CheckInterval: 30 * time.Second,
        MaxStaleTime:  2 * time.Minute,
    }

    result := make(chan []UpdateReportItem, 1)
    errors := make(chan error, 1)

    go func() {
        updates, err := s.performDNFScan()
        if err != nil {
            errors <- err
            return
        }
        result <- updates
    }()

    // Monitor for progress or completion
    ticker := time.NewTicker(pt.CheckInterval)
    defer ticker.Stop()

    for {
        select {
        case updates := <-result:
            return updates, nil
        case err := <-errors:
            return nil, err
        case <-ticker.C:
            if !pt.CheckProgress() {
                return nil, fmt.Errorf("scanner appears stuck - no progress for %v", pt.MaxStaleTime)
            }
        case <-time.After(s.timeout):
            return nil, fmt.Errorf("scanner timeout after %v", s.timeout)
        }
    }
}

3. Smart Error Preservation

func (s *ScannerWrapper) ExecuteWithTimeout(scanner Scanner, timeout time.Duration) ([]UpdateReportItem, error) {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    done := make(chan struct{})
    var result []UpdateReportItem
    var scanErr error

    go func() {
        defer close(done)
        result, scanErr = scanner.ScanForUpdates()
    }()

    select {
    case <-done:
        // Scanner completed - return its actual error
        return result, scanErr
    case <-ctx.Done():
        // Timeout - check if scanner provided progress info
        if progressInfo := scanner.GetLastProgress(); progressInfo != "" {
            return nil, fmt.Errorf("scanner timeout after %v (last progress: %s)", timeout, progressInfo)
        }
        return nil, fmt.Errorf("scanner timeout after %v (no progress detected)", timeout)
    }
}

4. User-Configurable Timeouts

{
  "scanners": {
    "timeouts": {
      "dnf": "5m",
      "apt": "3m",
      "docker": "2m",
      "windows": "10m",
      "winget": "3m",
      "storage": "1m"
    },
    "progress_detection": {
      "enabled": true,
      "check_interval": "30s",
      "max_stale_time": "2m"
    }
  }
}

Definition of Done

  • Scanner-specific timeouts implemented and configurable
  • Progress-based timeout detection differentiates hangs from slow operations
  • Scanner's actual error messages preserved when available
  • Users can tune timeouts per scanner backend in settings
  • Clear distinction between "no progress" vs "operation in progress"
  • Backward compatibility with existing configuration
  • Enhanced logging shows scanner progress and timeout reasons

Implementation Details

File Locations

  • Primary: aggregator-agent/internal/orchestrator/scanner_wrappers.go
  • Config: aggregator-agent/internal/config/config.go
  • Scanners: aggregator-agent/internal/scanner/*.go (add progress tracking)

Configuration Integration

type AgentConfig struct {
    // ... existing fields ...
    ScannerTimeouts ScannerTimeoutConfig `json:"scanner_timeouts"`
}

func (c *AgentConfig) GetTimeout(scannerType string) time.Duration {
    switch scannerType {
    case "dnf":
        return c.ScannerTimeouts.DNF
    case "apt":
        return c.ScannerTimeouts.APT
    // ... other cases
    default:
        return 2 * time.Minute // sensible default
    }
}

Scanner Interface Enhancement

type Scanner interface {
    ScanForUpdates() ([]UpdateReportItem, error)
    GetLastProgress() string  // New: return human-readable progress info
    IsMakingProgress() bool   // New: quick check if scanner is active
}

Enhanced Error Reporting

type ScannerError struct {
    Type       string    `json:"type"`       // "timeout", "permission", "network", etc.
    Scanner    string    `json:"scanner"`    // "dnf", "apt", etc.
    Message    string    `json:"message"`    // Human-readable error
    Details    string    `json:"details"`    // Technical details
    Timestamp  time.Time `json:"timestamp"`
    Duration   time.Duration `json:"duration"`
}

func (e ScannerError) Error() string {
    return fmt.Sprintf("[%s] %s: %s", e.Scanner, e.Type, e.Message)
}

Testing Strategy

Unit Tests

  • Per-scanner timeout configuration
  • Progress tracking accuracy
  • Error preservation logic
  • Configuration validation

Integration Tests

  • Large package list handling (simulated DNF bulk operations)
  • Slow network conditions
  • Permission error scenarios
  • Scanner progress detection

Manual Test Scenarios

  1. Large Update Lists:

    • Configure test system with many available updates
    • Verify DNF scanner completes within 5-minute window
    • Check that timeout messages include progress info
  2. Network Issues:

    • Block package manager network access
    • Verify scanner returns network error, not timeout
    • Confirm meaningful error messages
  3. Configuration Testing:

    • Test with custom timeout values
    • Verify configuration changes take effect
    • Test invalid configuration handling

Prerequisites

  • Scanner wrapper architecture exists
  • Configuration system supports nested structures
  • Logging infrastructure supports structured output
  • Context cancellation pattern available

Effort Estimate

Complexity: Medium Effort: 2-3 days

  • Day 1: Timeout configuration and basic implementation
  • Day 2: Progress tracking and error preservation
  • Day 3: Scanner interface enhancements and testing

Success Metrics

  • Reduction in false timeout errors by >80%
  • Users receive meaningful error messages for scanner failures
  • Large update lists complete successfully without timeout
  • Configuration changes take effect without restart
  • Scanner progress visible in logs
  • No regression in scanner reliability

Monitoring

Track these metrics after implementation:

  • Scanner timeout rate (by scanner type)
  • Average scanner duration (by scanner type)
  • Error message clarity score (user feedback)
  • User configuration changes to timeouts
  • Scanner success rate improvement