Redflag/docs/3_BACKLOG/P4-002_Scanner-Timeout-Optimization.md

# P4-002: Scanner Timeout Optimization and Error Handling

**Priority:** P4 (Technical Debt)
**Source Reference:** From analysis of needsfixingbeforepush.md lines 226-270
**Date Identified:** 2025-11-12

## Problem Description

Agent uses a universal 45-second timeout for all scanner operations, which masks real error conditions and prevents proper error handling. Many scanner operations already capture and return proper errors, but timeouts kill scanners mid-operation, preventing meaningful error messages from reaching users.

## Impact

- **False Timeouts:** Legitimate slow operations fail unnecessarily
- **Error Masking:** Real scanner errors are replaced with generic "timeout" messages
- **Troubleshooting Difficulty:** Logs don't reflect actual problems
- **User Experience:** Users can't distinguish between slow operations vs actual hangs
- **Resource Waste:** Operations are killed when they could succeed given more time

## Current Behavior

- DNF scanner timeout: 45 seconds (far too short for bulk operations)
- Universal timeout applied to all scanners regardless of operation type
- Timeout kills scanner process even when scanner reported proper error
- No distinction between "no progress" hang vs "slow but working"

## Specific Examples

```
2025/11/02 19:54:27 [dnf] Scan failed: scan timeout after 45s
```

- DNF was actively working, just taking >45s for large update lists
- Real DNF errors (network, permissions, etc.) already captured by scanner
- Timeout prevents proper error propagation to user

## Proposed Solution

Implement scanner-specific timeout strategies and better error handling:

### 1. Per-Scanner Timeout Configuration
```go
type ScannerTimeoutConfig struct {
    DNF      time.Duration `yaml:"dnf"`
    APT      time.Duration `yaml:"apt"`
    Docker   time.Duration `yaml:"docker"`
    Windows  time.Duration `yaml:"windows"`
    Winget   time.Duration `yaml:"winget"`
    Storage  time.Duration `yaml:"storage"`
}

var DefaultTimeouts = ScannerTimeoutConfig{
    DNF:     5 * time.Minute,      // Large package lists
    APT:     3 * time.Minute,      // Generally faster
    Docker:  2 * time.Minute,      // Registry queries
    Windows: 10 * time.Minute,     // Windows Update can be slow
    Winget:  3 * time.Minute,      // Package manager queries
    Storage: 1 * time.Minute,      // Filesystem operations
}
```

### 2. Progress-Based Timeout Detection
```go
type ProgressTracker struct {
    LastProgress time.Time
    CheckInterval time.Duration
    MaxStaleTime  time.Duration
}

func (pt *ProgressTracker) CheckProgress() bool {
    now := time.Now()
    if now.Sub(pt.LastProgress) > pt.MaxStaleTime {
        return false // No progress for too long
    }
    return true
}

// Scanner implementation updates progress
func (s *DNFScanner) scanWithProgress() ([]UpdateReportItem, error) {
    pt := &ProgressTracker{
        CheckInterval: 30 * time.Second,
        MaxStaleTime:  2 * time.Minute,
    }

    result := make(chan []UpdateReportItem, 1)
    errors := make(chan error, 1)

    go func() {
        updates, err := s.performDNFScan()
        if err != nil {
            errors <- err
            return
        }
        result <- updates
    }()

    // Monitor for progress or completion
    ticker := time.NewTicker(pt.CheckInterval)
    defer ticker.Stop()

    for {
        select {
        case updates := <-result:
            return updates, nil
        case err := <-errors:
            return nil, err
        case <-ticker.C:
            if !pt.CheckProgress() {
                return nil, fmt.Errorf("scanner appears stuck - no progress for %v", pt.MaxStaleTime)
            }
        case <-time.After(s.timeout):
            return nil, fmt.Errorf("scanner timeout after %v", s.timeout)
        }
    }
}
```

### 3. Smart Error Preservation
```go
func (s *ScannerWrapper) ExecuteWithTimeout(scanner Scanner, timeout time.Duration) ([]UpdateReportItem, error) {
    ctx, cancel := context.WithTimeout(context.Background(), timeout)
    defer cancel()

    done := make(chan struct{})
    var result []UpdateReportItem
    var scanErr error

    go func() {
        defer close(done)
        result, scanErr = scanner.ScanForUpdates()
    }()

    select {
    case <-done:
        // Scanner completed - return its actual error
        return result, scanErr
    case <-ctx.Done():
        // Timeout - check if scanner provided progress info
        if progressInfo := scanner.GetLastProgress(); progressInfo != "" {
            return nil, fmt.Errorf("scanner timeout after %v (last progress: %s)", timeout, progressInfo)
        }
        return nil, fmt.Errorf("scanner timeout after %v (no progress detected)", timeout)
    }
}
```

### 4. User-Configurable Timeouts
```json
{
  "scanners": {
    "timeouts": {
      "dnf": "5m",
      "apt": "3m",
      "docker": "2m",
      "windows": "10m",
      "winget": "3m",
      "storage": "1m"
    },
    "progress_detection": {
      "enabled": true,
      "check_interval": "30s",
      "max_stale_time": "2m"
    }
  }
}
```

## Definition of Done

- [ ] Scanner-specific timeouts implemented and configurable
- [ ] Progress-based timeout detection differentiates hangs from slow operations
- [ ] Scanner's actual error messages preserved when available
- [ ] Users can tune timeouts per scanner backend in settings
- [ ] Clear distinction between "no progress" vs "operation in progress"
- [ ] Backward compatibility with existing configuration
- [ ] Enhanced logging shows scanner progress and timeout reasons

## Implementation Details

### File Locations
- **Primary:** `aggregator-agent/internal/orchestrator/scanner_wrappers.go`
- **Config:** `aggregator-agent/internal/config/config.go`
- **Scanners:** `aggregator-agent/internal/scanner/*.go` (add progress tracking)

### Configuration Integration
```go
type AgentConfig struct {
    // ... existing fields ...
    ScannerTimeouts ScannerTimeoutConfig `json:"scanner_timeouts"`
}

func (c *AgentConfig) GetTimeout(scannerType string) time.Duration {
    switch scannerType {
    case "dnf":
        return c.ScannerTimeouts.DNF
    case "apt":
        return c.ScannerTimeouts.APT
    // ... other cases
    default:
        return 2 * time.Minute // sensible default
    }
}
```

### Scanner Interface Enhancement
```go
type Scanner interface {
    ScanForUpdates() ([]UpdateReportItem, error)
    GetLastProgress() string  // New: return human-readable progress info
    IsMakingProgress() bool   // New: quick check if scanner is active
}
```

### Enhanced Error Reporting
```go
type ScannerError struct {
    Type       string    `json:"type"`       // "timeout", "permission", "network", etc.
    Scanner    string    `json:"scanner"`    // "dnf", "apt", etc.
    Message    string    `json:"message"`    // Human-readable error
    Details    string    `json:"details"`    // Technical details
    Timestamp  time.Time `json:"timestamp"`
    Duration   time.Duration `json:"duration"`
}

func (e ScannerError) Error() string {
    return fmt.Sprintf("[%s] %s: %s", e.Scanner, e.Type, e.Message)
}
```

## Testing Strategy

### Unit Tests
- Per-scanner timeout configuration
- Progress tracking accuracy
- Error preservation logic
- Configuration validation

### Integration Tests
- Large package list handling (simulated DNF bulk operations)
- Slow network conditions
- Permission error scenarios
- Scanner progress detection

### Manual Test Scenarios
1. **Large Update Lists:**
   - Configure test system with many available updates
   - Verify DNF scanner completes within 5-minute window
   - Check that timeout messages include progress info

2. **Network Issues:**
   - Block package manager network access
   - Verify scanner returns network error, not timeout
   - Confirm meaningful error messages

3. **Configuration Testing:**
   - Test with custom timeout values
   - Verify configuration changes take effect
   - Test invalid configuration handling

## Prerequisites

- Scanner wrapper architecture exists
- Configuration system supports nested structures
- Logging infrastructure supports structured output
- Context cancellation pattern available

## Effort Estimate

**Complexity:** Medium
**Effort:** 2-3 days
- Day 1: Timeout configuration and basic implementation
- Day 2: Progress tracking and error preservation
- Day 3: Scanner interface enhancements and testing

## Success Metrics

- Reduction in false timeout errors by >80%
- Users receive meaningful error messages for scanner failures
- Large update lists complete successfully without timeout
- Configuration changes take effect without restart
- Scanner progress visible in logs
- No regression in scanner reliability

## Monitoring

Track these metrics after implementation:
- Scanner timeout rate (by scanner type)
- Average scanner duration (by scanner type)
- Error message clarity score (user feedback)
- User configuration changes to timeouts
- Scanner success rate improvement