Files

Fimeg 484a7f77ce Add docs and project files - force for Culurien

2026-03-28 20:46:24 -04:00

8.8 KiB

Raw Permalink Blame History

P4-002: Scanner Timeout Optimization and Error Handling

Priority: P4 (Technical Debt) Source Reference: From analysis of needsfixingbeforepush.md lines 226-270 Date Identified: 2025-11-12

Problem Description

Agent uses a universal 45-second timeout for all scanner operations, which masks real error conditions and prevents proper error handling. Many scanner operations already capture and return proper errors, but timeouts kill scanners mid-operation, preventing meaningful error messages from reaching users.

Impact

False Timeouts: Legitimate slow operations fail unnecessarily
Error Masking: Real scanner errors are replaced with generic "timeout" messages
Troubleshooting Difficulty: Logs don't reflect actual problems
User Experience: Users can't distinguish between slow operations vs actual hangs
Resource Waste: Operations are killed when they could succeed given more time

Current Behavior

DNF scanner timeout: 45 seconds (far too short for bulk operations)
Universal timeout applied to all scanners regardless of operation type
Timeout kills scanner process even when scanner reported proper error
No distinction between "no progress" hang vs "slow but working"

Specific Examples

2025/11/02 19:54:27 [dnf] Scan failed: scan timeout after 45s

DNF was actively working, just taking >45s for large update lists
Real DNF errors (network, permissions, etc.) already captured by scanner
Timeout prevents proper error propagation to user

Proposed Solution

Implement scanner-specific timeout strategies and better error handling:

1. Per-Scanner Timeout Configuration

type ScannerTimeoutConfig struct {
    DNF      time.Duration `yaml:"dnf"`
    APT      time.Duration `yaml:"apt"`
    Docker   time.Duration `yaml:"docker"`
    Windows  time.Duration `yaml:"windows"`
    Winget   time.Duration `yaml:"winget"`
    Storage  time.Duration `yaml:"storage"`
}

var DefaultTimeouts = ScannerTimeoutConfig{
    DNF:     5 * time.Minute,      // Large package lists
    APT:     3 * time.Minute,      // Generally faster
    Docker:  2 * time.Minute,      // Registry queries
    Windows: 10 * time.Minute,     // Windows Update can be slow
    Winget:  3 * time.Minute,      // Package manager queries
    Storage: 1 * time.Minute,      // Filesystem operations
}

2. Progress-Based Timeout Detection

type ProgressTracker struct {
    LastProgress time.Time
    CheckInterval time.Duration
    MaxStaleTime  time.Duration
}

func (pt *ProgressTracker) CheckProgress() bool {
    now := time.Now()
    if now.Sub(pt.LastProgress) > pt.MaxStaleTime {
        return false // No progress for too long
    }
    return true
}

// Scanner implementation updates progress
func (s *DNFScanner) scanWithProgress() ([]UpdateReportItem, error) {
    pt := &ProgressTracker{
        CheckInterval: 30 * time.Second,
        MaxStaleTime:  2 * time.Minute,
    }

    result := make(chan []UpdateReportItem, 1)
    errors := make(chan error, 1)

    go func() {
        updates, err := s.performDNFScan()
        if err != nil {
            errors <- err
            return
        }
        result <- updates
    }()

    // Monitor for progress or completion
    ticker := time.NewTicker(pt.CheckInterval)
    defer ticker.Stop()

    for {
        select {
        case updates := <-result:
            return updates, nil
        case err := <-errors:
            return nil, err
        case <-ticker.C:
            if !pt.CheckProgress() {
                return nil, fmt.Errorf("scanner appears stuck - no progress for %v", pt.MaxStaleTime)
            }
        case <-time.After(s.timeout):
            return nil, fmt.Errorf("scanner timeout after %v", s.timeout)
        }
    }
}

3. Smart Error Preservation

func (s *ScannerWrapper) ExecuteWithTimeout(scanner Scanner, timeout time.Duration) ([]UpdateReportItem, error) {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    done := make(chan struct{})
    var result []UpdateReportItem
    var scanErr error

    go func() {
        defer close(done)
        result, scanErr = scanner.ScanForUpdates()
    }()

    select {
    case <-done:
        // Scanner completed - return its actual error
        return result, scanErr
    case <-ctx.Done():
        // Timeout - check if scanner provided progress info
        if progressInfo := scanner.GetLastProgress(); progressInfo != "" {
            return nil, fmt.Errorf("scanner timeout after %v (last progress: %s)", timeout, progressInfo)
        }
        return nil, fmt.Errorf("scanner timeout after %v (no progress detected)", timeout)
    }
}

4. User-Configurable Timeouts

{
  "scanners": {
    "timeouts": {
      "dnf": "5m",
      "apt": "3m",
      "docker": "2m",
      "windows": "10m",
      "winget": "3m",
      "storage": "1m"
    },
    "progress_detection": {
      "enabled": true,
      "check_interval": "30s",
      "max_stale_time": "2m"
    }
  }
}

Definition of Done

Scanner-specific timeouts implemented and configurable
Progress-based timeout detection differentiates hangs from slow operations
Scanner's actual error messages preserved when available
Users can tune timeouts per scanner backend in settings
Clear distinction between "no progress" vs "operation in progress"
Backward compatibility with existing configuration
Enhanced logging shows scanner progress and timeout reasons

Implementation Details

File Locations

Primary: aggregator-agent/internal/orchestrator/scanner_wrappers.go
Config: aggregator-agent/internal/config/config.go
Scanners: aggregator-agent/internal/scanner/*.go (add progress tracking)

Configuration Integration

type AgentConfig struct {
    // ... existing fields ...
    ScannerTimeouts ScannerTimeoutConfig `json:"scanner_timeouts"`
}

func (c *AgentConfig) GetTimeout(scannerType string) time.Duration {
    switch scannerType {
    case "dnf":
        return c.ScannerTimeouts.DNF
    case "apt":
        return c.ScannerTimeouts.APT
    // ... other cases
    default:
        return 2 * time.Minute // sensible default
    }
}

Scanner Interface Enhancement

type Scanner interface {
    ScanForUpdates() ([]UpdateReportItem, error)
    GetLastProgress() string  // New: return human-readable progress info
    IsMakingProgress() bool   // New: quick check if scanner is active
}

Enhanced Error Reporting

type ScannerError struct {
    Type       string    `json:"type"`       // "timeout", "permission", "network", etc.
    Scanner    string    `json:"scanner"`    // "dnf", "apt", etc.
    Message    string    `json:"message"`    // Human-readable error
    Details    string    `json:"details"`    // Technical details
    Timestamp  time.Time `json:"timestamp"`
    Duration   time.Duration `json:"duration"`
}

func (e ScannerError) Error() string {
    return fmt.Sprintf("[%s] %s: %s", e.Scanner, e.Type, e.Message)
}

Testing Strategy

Unit Tests

Per-scanner timeout configuration
Progress tracking accuracy
Error preservation logic
Configuration validation

Integration Tests

Large package list handling (simulated DNF bulk operations)
Slow network conditions
Permission error scenarios
Scanner progress detection

Manual Test Scenarios

Large Update Lists:
- Configure test system with many available updates
- Verify DNF scanner completes within 5-minute window
- Check that timeout messages include progress info
Network Issues:
- Block package manager network access
- Verify scanner returns network error, not timeout
- Confirm meaningful error messages
Configuration Testing:
- Test with custom timeout values
- Verify configuration changes take effect
- Test invalid configuration handling

Prerequisites

Scanner wrapper architecture exists
Configuration system supports nested structures
Logging infrastructure supports structured output
Context cancellation pattern available

Effort Estimate

Complexity: Medium Effort: 2-3 days

Day 1: Timeout configuration and basic implementation
Day 2: Progress tracking and error preservation
Day 3: Scanner interface enhancements and testing

Success Metrics

Reduction in false timeout errors by >80%
Users receive meaningful error messages for scanner failures
Large update lists complete successfully without timeout
Configuration changes take effect without restart
Scanner progress visible in logs
No regression in scanner reliability

Monitoring

Track these metrics after implementation:

Scanner timeout rate (by scanner type)
Average scanner duration (by scanner type)
Error message clarity score (user feedback)
User configuration changes to timeouts
Scanner success rate improvement

8.8 KiB Raw Permalink Blame History