290 lines
8.8 KiB
Markdown
290 lines
8.8 KiB
Markdown
# P4-002: Scanner Timeout Optimization and Error Handling
|
|
|
|
**Priority:** P4 (Technical Debt)
|
|
**Source Reference:** From analysis of needsfixingbeforepush.md lines 226-270
|
|
**Date Identified:** 2025-11-12
|
|
|
|
## Problem Description
|
|
|
|
Agent uses a universal 45-second timeout for all scanner operations, which masks real error conditions and prevents proper error handling. Many scanner operations already capture and return proper errors, but timeouts kill scanners mid-operation, preventing meaningful error messages from reaching users.
|
|
|
|
## Impact
|
|
|
|
- **False Timeouts:** Legitimate slow operations fail unnecessarily
|
|
- **Error Masking:** Real scanner errors are replaced with generic "timeout" messages
|
|
- **Troubleshooting Difficulty:** Logs don't reflect actual problems
|
|
- **User Experience:** Users can't distinguish between slow operations vs actual hangs
|
|
- **Resource Waste:** Operations are killed when they could succeed given more time
|
|
|
|
## Current Behavior
|
|
|
|
- DNF scanner timeout: 45 seconds (far too short for bulk operations)
|
|
- Universal timeout applied to all scanners regardless of operation type
|
|
- Timeout kills scanner process even when scanner reported proper error
|
|
- No distinction between "no progress" hang vs "slow but working"
|
|
|
|
## Specific Examples
|
|
|
|
```
|
|
2025/11/02 19:54:27 [dnf] Scan failed: scan timeout after 45s
|
|
```
|
|
|
|
- DNF was actively working, just taking >45s for large update lists
|
|
- Real DNF errors (network, permissions, etc.) already captured by scanner
|
|
- Timeout prevents proper error propagation to user
|
|
|
|
## Proposed Solution
|
|
|
|
Implement scanner-specific timeout strategies and better error handling:
|
|
|
|
### 1. Per-Scanner Timeout Configuration
|
|
```go
|
|
type ScannerTimeoutConfig struct {
|
|
DNF time.Duration `yaml:"dnf"`
|
|
APT time.Duration `yaml:"apt"`
|
|
Docker time.Duration `yaml:"docker"`
|
|
Windows time.Duration `yaml:"windows"`
|
|
Winget time.Duration `yaml:"winget"`
|
|
Storage time.Duration `yaml:"storage"`
|
|
}
|
|
|
|
var DefaultTimeouts = ScannerTimeoutConfig{
|
|
DNF: 5 * time.Minute, // Large package lists
|
|
APT: 3 * time.Minute, // Generally faster
|
|
Docker: 2 * time.Minute, // Registry queries
|
|
Windows: 10 * time.Minute, // Windows Update can be slow
|
|
Winget: 3 * time.Minute, // Package manager queries
|
|
Storage: 1 * time.Minute, // Filesystem operations
|
|
}
|
|
```
|
|
|
|
### 2. Progress-Based Timeout Detection
|
|
```go
|
|
type ProgressTracker struct {
|
|
LastProgress time.Time
|
|
CheckInterval time.Duration
|
|
MaxStaleTime time.Duration
|
|
}
|
|
|
|
func (pt *ProgressTracker) CheckProgress() bool {
|
|
now := time.Now()
|
|
if now.Sub(pt.LastProgress) > pt.MaxStaleTime {
|
|
return false // No progress for too long
|
|
}
|
|
return true
|
|
}
|
|
|
|
// Scanner implementation updates progress
|
|
func (s *DNFScanner) scanWithProgress() ([]UpdateReportItem, error) {
|
|
pt := &ProgressTracker{
|
|
CheckInterval: 30 * time.Second,
|
|
MaxStaleTime: 2 * time.Minute,
|
|
}
|
|
|
|
result := make(chan []UpdateReportItem, 1)
|
|
errors := make(chan error, 1)
|
|
|
|
go func() {
|
|
updates, err := s.performDNFScan()
|
|
if err != nil {
|
|
errors <- err
|
|
return
|
|
}
|
|
result <- updates
|
|
}()
|
|
|
|
// Monitor for progress or completion
|
|
ticker := time.NewTicker(pt.CheckInterval)
|
|
defer ticker.Stop()
|
|
|
|
for {
|
|
select {
|
|
case updates := <-result:
|
|
return updates, nil
|
|
case err := <-errors:
|
|
return nil, err
|
|
case <-ticker.C:
|
|
if !pt.CheckProgress() {
|
|
return nil, fmt.Errorf("scanner appears stuck - no progress for %v", pt.MaxStaleTime)
|
|
}
|
|
case <-time.After(s.timeout):
|
|
return nil, fmt.Errorf("scanner timeout after %v", s.timeout)
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
### 3. Smart Error Preservation
|
|
```go
|
|
func (s *ScannerWrapper) ExecuteWithTimeout(scanner Scanner, timeout time.Duration) ([]UpdateReportItem, error) {
|
|
ctx, cancel := context.WithTimeout(context.Background(), timeout)
|
|
defer cancel()
|
|
|
|
done := make(chan struct{})
|
|
var result []UpdateReportItem
|
|
var scanErr error
|
|
|
|
go func() {
|
|
defer close(done)
|
|
result, scanErr = scanner.ScanForUpdates()
|
|
}()
|
|
|
|
select {
|
|
case <-done:
|
|
// Scanner completed - return its actual error
|
|
return result, scanErr
|
|
case <-ctx.Done():
|
|
// Timeout - check if scanner provided progress info
|
|
if progressInfo := scanner.GetLastProgress(); progressInfo != "" {
|
|
return nil, fmt.Errorf("scanner timeout after %v (last progress: %s)", timeout, progressInfo)
|
|
}
|
|
return nil, fmt.Errorf("scanner timeout after %v (no progress detected)", timeout)
|
|
}
|
|
}
|
|
```
|
|
|
|
### 4. User-Configurable Timeouts
|
|
```json
|
|
{
|
|
"scanners": {
|
|
"timeouts": {
|
|
"dnf": "5m",
|
|
"apt": "3m",
|
|
"docker": "2m",
|
|
"windows": "10m",
|
|
"winget": "3m",
|
|
"storage": "1m"
|
|
},
|
|
"progress_detection": {
|
|
"enabled": true,
|
|
"check_interval": "30s",
|
|
"max_stale_time": "2m"
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
## Definition of Done
|
|
|
|
- [ ] Scanner-specific timeouts implemented and configurable
|
|
- [ ] Progress-based timeout detection differentiates hangs from slow operations
|
|
- [ ] Scanner's actual error messages preserved when available
|
|
- [ ] Users can tune timeouts per scanner backend in settings
|
|
- [ ] Clear distinction between "no progress" vs "operation in progress"
|
|
- [ ] Backward compatibility with existing configuration
|
|
- [ ] Enhanced logging shows scanner progress and timeout reasons
|
|
|
|
## Implementation Details
|
|
|
|
### File Locations
|
|
- **Primary:** `aggregator-agent/internal/orchestrator/scanner_wrappers.go`
|
|
- **Config:** `aggregator-agent/internal/config/config.go`
|
|
- **Scanners:** `aggregator-agent/internal/scanner/*.go` (add progress tracking)
|
|
|
|
### Configuration Integration
|
|
```go
|
|
type AgentConfig struct {
|
|
// ... existing fields ...
|
|
ScannerTimeouts ScannerTimeoutConfig `json:"scanner_timeouts"`
|
|
}
|
|
|
|
func (c *AgentConfig) GetTimeout(scannerType string) time.Duration {
|
|
switch scannerType {
|
|
case "dnf":
|
|
return c.ScannerTimeouts.DNF
|
|
case "apt":
|
|
return c.ScannerTimeouts.APT
|
|
// ... other cases
|
|
default:
|
|
return 2 * time.Minute // sensible default
|
|
}
|
|
}
|
|
```
|
|
|
|
### Scanner Interface Enhancement
|
|
```go
|
|
type Scanner interface {
|
|
ScanForUpdates() ([]UpdateReportItem, error)
|
|
GetLastProgress() string // New: return human-readable progress info
|
|
IsMakingProgress() bool // New: quick check if scanner is active
|
|
}
|
|
```
|
|
|
|
### Enhanced Error Reporting
|
|
```go
|
|
type ScannerError struct {
|
|
Type string `json:"type"` // "timeout", "permission", "network", etc.
|
|
Scanner string `json:"scanner"` // "dnf", "apt", etc.
|
|
Message string `json:"message"` // Human-readable error
|
|
Details string `json:"details"` // Technical details
|
|
Timestamp time.Time `json:"timestamp"`
|
|
Duration time.Duration `json:"duration"`
|
|
}
|
|
|
|
func (e ScannerError) Error() string {
|
|
return fmt.Sprintf("[%s] %s: %s", e.Scanner, e.Type, e.Message)
|
|
}
|
|
```
|
|
|
|
## Testing Strategy
|
|
|
|
### Unit Tests
|
|
- Per-scanner timeout configuration
|
|
- Progress tracking accuracy
|
|
- Error preservation logic
|
|
- Configuration validation
|
|
|
|
### Integration Tests
|
|
- Large package list handling (simulated DNF bulk operations)
|
|
- Slow network conditions
|
|
- Permission error scenarios
|
|
- Scanner progress detection
|
|
|
|
### Manual Test Scenarios
|
|
1. **Large Update Lists:**
|
|
- Configure test system with many available updates
|
|
- Verify DNF scanner completes within 5-minute window
|
|
- Check that timeout messages include progress info
|
|
|
|
2. **Network Issues:**
|
|
- Block package manager network access
|
|
- Verify scanner returns network error, not timeout
|
|
- Confirm meaningful error messages
|
|
|
|
3. **Configuration Testing:**
|
|
- Test with custom timeout values
|
|
- Verify configuration changes take effect
|
|
- Test invalid configuration handling
|
|
|
|
## Prerequisites
|
|
|
|
- Scanner wrapper architecture exists
|
|
- Configuration system supports nested structures
|
|
- Logging infrastructure supports structured output
|
|
- Context cancellation pattern available
|
|
|
|
## Effort Estimate
|
|
|
|
**Complexity:** Medium
|
|
**Effort:** 2-3 days
|
|
- Day 1: Timeout configuration and basic implementation
|
|
- Day 2: Progress tracking and error preservation
|
|
- Day 3: Scanner interface enhancements and testing
|
|
|
|
## Success Metrics
|
|
|
|
- Reduction in false timeout errors by >80%
|
|
- Users receive meaningful error messages for scanner failures
|
|
- Large update lists complete successfully without timeout
|
|
- Configuration changes take effect without restart
|
|
- Scanner progress visible in logs
|
|
- No regression in scanner reliability
|
|
|
|
## Monitoring
|
|
|
|
Track these metrics after implementation:
|
|
- Scanner timeout rate (by scanner type)
|
|
- Average scanner duration (by scanner type)
|
|
- Error message clarity score (user feedback)
|
|
- User configuration changes to timeouts
|
|
- Scanner success rate improvement |