Agent Timeout Architecture & Scanner Timeouts
Problem Statement
Date: 2025-11-03
Status: Planning phase - Important for operation reliability
Current Issues
- Aggressive Timeouts: DNF scanner timeout of 45s is too short for bulk operations
- One-Size-Fits-All: Same timeout for all operations regardless of complexity
- No Progress Detection: Timeouts kill operations that are actually making progress
- Poor Error Reporting: Generic "timeout" masks real underlying issues
- No User Control: No way to configure timeouts for different environments
Current Behavior (Problematic)
// Current scanner timeout handling
func (s *DNFScanner) Scan() (*ScanResult, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
    defer cancel()

    result := make(chan *ScanResult, 1)
    err := make(chan error, 1)
    go func() {
        r, e := s.performDNFScan()
        result <- r
        err <- e
    }()

    select {
    case r := <-result:
        return r, <-err
    case <-ctx.Done():
        return nil, fmt.Errorf("scan timeout after 45s") // ❌ Kills working operations
    }
}
Real-World Timeout Scenarios
- Large Package Lists: DNF with 3,000+ packages can take 2-5 minutes
- Network Issues: Slow package repository responses
- Disk I/O: Filesystem scanning on slow storage
- System Load: High CPU usage slowing operations
- Container Overhead: Docker operations in resource-constrained environments
Proposed Architecture: Intelligent Timeout Management
Core Principles
- Per-Operation Timeouts: Different timeouts for different operation types
- Progress-Based Timeouts: Monitor progress rather than absolute time
- Configurable Timeouts: User-adjustable timeout values
- Graceful Degradation: Handle timeouts without losing progress
- Smart Detection: Distinguish between "slow but working" and "actually hung"
Timeout Management Components
1. Timeout Profiles
type TimeoutProfile struct {
    Name             string        `yaml:"name" json:"name"`
    DefaultTimeout   time.Duration `yaml:"default_timeout" json:"default_timeout"`
    MinTimeout       time.Duration `yaml:"min_timeout" json:"min_timeout"`
    MaxTimeout       time.Duration `yaml:"max_timeout" json:"max_timeout"`
    ProgressCheck    time.Duration `yaml:"progress_check" json:"progress_check"`
    GracefulShutdown time.Duration `yaml:"graceful_shutdown" json:"graceful_shutdown"`
}

type TimeoutProfiles struct {
    Profiles map[string]TimeoutProfile `yaml:"profiles" json:"profiles"`
    Default  string                    `yaml:"default" json:"default"`
}

func DefaultTimeoutProfiles() *TimeoutProfiles {
    return &TimeoutProfiles{
        Default: "balanced",
        Profiles: map[string]TimeoutProfile{
            "fast": {
                Name:             "fast",
                DefaultTimeout:   30 * time.Second,
                MinTimeout:       10 * time.Second,
                MaxTimeout:       60 * time.Second,
                ProgressCheck:    5 * time.Second,
                GracefulShutdown: 5 * time.Second,
            },
            "balanced": {
                Name:             "balanced",
                DefaultTimeout:   2 * time.Minute,
                MinTimeout:       30 * time.Second,
                MaxTimeout:       10 * time.Minute,
                ProgressCheck:    15 * time.Second,
                GracefulShutdown: 15 * time.Second,
            },
            "thorough": {
                Name:             "thorough",
                DefaultTimeout:   10 * time.Minute,
                MinTimeout:       2 * time.Minute,
                MaxTimeout:       30 * time.Minute,
                ProgressCheck:    30 * time.Second,
                GracefulShutdown: 30 * time.Second,
            },
            "dnf": {
                Name:             "dnf",
                DefaultTimeout:   5 * time.Minute,
                MinTimeout:       1 * time.Minute,
                MaxTimeout:       15 * time.Minute,
                ProgressCheck:    30 * time.Second,
                GracefulShutdown: 30 * time.Second,
            },
            "apt": {
                Name:             "apt",
                DefaultTimeout:   3 * time.Minute,
                MinTimeout:       30 * time.Second,
                MaxTimeout:       10 * time.Minute,
                ProgressCheck:    15 * time.Second,
                GracefulShutdown: 15 * time.Second,
            },
            "docker": {
                Name:             "docker",
                DefaultTimeout:   2 * time.Minute,
                MinTimeout:       30 * time.Second,
                MaxTimeout:       5 * time.Minute,
                ProgressCheck:    10 * time.Second,
                GracefulShutdown: 10 * time.Second,
            },
        },
    }
}
2. Progress Monitor
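type ProgressMonitor struct {
    mu                    sync.Mutex // guards total, completed, lastUpdate
    total                 int64
    completed             int64
    lastUpdate            time.Time
    progressChan          chan ProgressUpdate
    timeout               time.Duration
    gracePeriod           time.Duration
    progressCheckInterval time.Duration   // how often monitor() checks for a stall
    ctx                   context.Context // cancelling it stops monitor()
    onProgress            func(completed, total int64)
}

type ProgressUpdate struct {
    Completed int64     `json:"completed"`
    Total     int64     `json:"total"`
    Message   string    `json:"message,omitempty"`
    Timestamp time.Time `json:"timestamp"`
}

func (pm *ProgressMonitor) Start() {
    // Fill in fallbacks so monitor() cannot panic on zero-value fields
    // (time.NewTicker panics on a non-positive interval).
    if pm.ctx == nil {
        pm.ctx = context.Background()
    }
    if pm.progressCheckInterval <= 0 {
        pm.progressCheckInterval = time.Second // arbitrary fallback
    }
    pm.mu.Lock()
    if pm.lastUpdate.IsZero() {
        pm.lastUpdate = time.Now() // the stall clock starts with monitoring
    }
    pm.mu.Unlock()
    go pm.monitor()
}

// Update records progress. It is called from the operation's goroutine,
// so the shared fields are written under the mutex.
func (pm *ProgressMonitor) Update(completed, total int64, message string) {
    now := time.Now()
    pm.mu.Lock()
    pm.total = total
    pm.completed = completed
    pm.lastUpdate = now
    pm.mu.Unlock()

    select {
    case pm.progressChan <- ProgressUpdate{
        Completed: completed,
        Total:     total,
        Message:   message,
        Timestamp: now,
    }:
    default:
        // Channel full: drop the notification rather than block the operation
    }
}

func (pm *ProgressMonitor) monitor() {
    ticker := time.NewTicker(pm.progressCheckInterval)
    defer ticker.Stop()
    for {
        select {
        case update := <-pm.progressChan:
            // Update() already recorded the new state; only notify the listener
            if pm.onProgress != nil {
                pm.onProgress(update.Completed, update.Total)
            }
        case <-ticker.C:
            if pm.isStuck() {
                log.Printf("No progress for %s, operation may be stuck", pm.gracePeriod)
                if pm.shouldTimeout() {
                    return // No progress for the full timeout: stop monitoring
                }
            }
        case <-pm.ctx.Done():
            return
        }
    }
}

func (pm *ProgressMonitor) sinceLastUpdate() time.Duration {
    pm.mu.Lock()
    defer pm.mu.Unlock()
    return time.Since(pm.lastUpdate)
}

func (pm *ProgressMonitor) isStuck() bool {
    // Check if no progress for grace period
    return pm.sinceLastUpdate() > pm.gracePeriod
}

func (pm *ProgressMonitor) shouldTimeout() bool {
    // Check if no progress for full timeout period
    return pm.sinceLastUpdate() > pm.timeout
}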
3. Smart Timeout Manager
type SmartTimeoutManager struct {
    profiles       *TimeoutProfiles
    currentProfile string
    monitor        *ProgressMonitor
    ctx            context.Context
    cancel         context.CancelFunc
}

func (stm *SmartTimeoutManager) ExecuteOperation(
    operation func(*ProgressMonitor) error,
    profileName string,
) error {
    // Fall back to the default profile when the requested one is not defined
    profile, ok := stm.profiles.Profiles[profileName]
    if !ok {
        profile = stm.profiles.Profiles[stm.profiles.Default]
    }

    ctx, cancel := context.WithTimeout(stm.ctx, profile.DefaultTimeout)
    defer cancel()

    // The monitor gets its own context so it keeps running during an extension
    monitorCtx, stopMonitor := context.WithCancel(stm.ctx)
    defer stopMonitor()

    // Create progress monitor
    pm := &ProgressMonitor{
        timeout:               profile.DefaultTimeout,
        gracePeriod:           profile.GracefulShutdown,
        progressCheckInterval: profile.ProgressCheck,
        ctx:                   monitorCtx,
        lastUpdate:            time.Now(),
        progressChan:          make(chan ProgressUpdate, 10),
    }
    pm.Start()

    // Start operation; the buffer lets the goroutine finish even if nobody reads the result
    resultChan := make(chan error, 1)
    go func() {
        resultChan <- operation(pm)
    }()

    // Wait for completion or timeout
    select {
    case err := <-resultChan:
        return err
    case <-ctx.Done():
        // Handle timeout gracefully
        return stm.handleTimeout(pm, profile, resultChan)
    }
}

func (stm *SmartTimeoutManager) handleTimeout(
    pm *ProgressMonitor,
    profile TimeoutProfile,
    resultChan <-chan error,
) error {
    // Check if operation is actually making progress
    if !pm.isStuck() {
        log.Printf("Operation timed out but is still making progress, extending timeout...")
        // Extend timeout by grace period
        extendedCtx, cancel := context.WithTimeout(stm.ctx, profile.GracefulShutdown)
        defer cancel()
        select {
        case err := <-resultChan:
            return err // Operation completed during extension
        case <-extendedCtx.Done():
            return fmt.Errorf("operation timed out after a %s extension", profile.GracefulShutdown)
        }
    }

    // Operation is genuinely stuck
    log.Printf("Operation stuck, attempting graceful shutdown...")
    // Give operation chance to clean up
    cleanupCtx, cancel := context.WithTimeout(stm.ctx, profile.GracefulShutdown)
    defer cancel()
    select {
    case err := <-resultChan:
        return err // Operation completed during cleanup
    case <-cleanupCtx.Done():
        return fmt.Errorf("operation failed to shut down gracefully within %s", profile.GracefulShutdown)
    }
}
Enhanced Scanner Implementations
1. DNF Scanner with Progress
type DNFScanner struct {
    config     *ScannerConfig
    timeoutMgr *SmartTimeoutManager
}

func (ds *DNFScanner) Scan() (*ScanResult, error) {
    var result *ScanResult
    operation := func(pm *ProgressMonitor) error {
        packages, err := ds.performDNFScanWithProgress(pm)
        if err != nil {
            return err
        }
        // Assumes ScanResult carries the parsed packages in a Packages field
        result = &ScanResult{Packages: packages}
        return nil
    }
    if err := ds.timeoutMgr.ExecuteOperation(operation, "dnf"); err != nil {
        return nil, err
    }
    return result, nil
}

func (ds *DNFScanner) performDNFScanWithProgress(pm *ProgressMonitor) ([]PackageInfo, error) {
    // Start DNF process with progress monitoring
    cmd := exec.Command("dnf", "check-update")
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return nil, fmt.Errorf("failed to create stdout pipe: %w", err)
    }
    if err := cmd.Start(); err != nil {
        return nil, fmt.Errorf("failed to start dnf command: %w", err)
    }

    // Parse output line by line for progress
    scanner := bufio.NewScanner(stdout)
    var packages []PackageInfo
    for scanner.Scan() {
        line := scanner.Text()
        // Skip status lines instead of stopping: the metadata notice precedes
        // the package list, and the pipe must be drained before cmd.Wait()
        if strings.Contains(line, "Last metadata expiration check") ||
            strings.Contains(line, "No updates available") {
            continue
        }
        // Parse package info (parsePackageLine is assumed to return *PackageInfo)
        if pkg := ds.parsePackageLine(line); pkg != nil {
            packages = append(packages, *pkg)
            // Update progress every 10 packages
            if len(packages)%10 == 0 {
                pm.Update(int64(len(packages)), int64(len(packages)),
                    fmt.Sprintf("Scanning package %d", len(packages)))
            }
        }
    }
    if err := scanner.Err(); err != nil {
        return nil, fmt.Errorf("failed to read dnf output: %w", err)
    }

    // Wait for command to complete. `dnf check-update` exits with status 100
    // when updates are available, so only other non-zero codes are failures.
    if err := cmd.Wait(); err != nil {
        var exitErr *exec.ExitError
        if !errors.As(err, &exitErr) || exitErr.ExitCode() != 100 {
            return nil, fmt.Errorf("dnf command failed: %w", err)
        }
    }

    // Final progress update
    pm.Update(int64(len(packages)), int64(len(packages)),
        fmt.Sprintf("Scan completed with %d packages", len(packages)))
    return packages, nil
}
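The scanner relies on a `parsePackageLine` method that is not shown. Package lines in `dnf check-update` output have three whitespace-separated columns (`name.arch`, version, repository), so a parser could look roughly like the sketch below. The `PackageInfo` fields and the function name `parseDNFLine` are assumptions made for illustration; the real struct is not defined in this document.

```go
package main

import (
	"fmt"
	"strings"
)

// PackageInfo is an assumed shape; the agent's real struct may differ.
type PackageInfo struct {
	Name, Arch, Version, Repo string
}

// parseDNFLine parses one "name.arch  version  repo" line and returns nil
// for anything else (blank lines, status messages, section headers).
func parseDNFLine(line string) *PackageInfo {
	fields := strings.Fields(line)
	if len(fields) != 3 {
		return nil
	}
	// The architecture follows the last dot; names may contain dots themselves.
	dot := strings.LastIndex(fields[0], ".")
	if dot <= 0 || dot == len(fields[0])-1 {
		return nil
	}
	return &PackageInfo{
		Name:    fields[0][:dot],
		Arch:    fields[0][dot+1:],
		Version: fields[1],
		Repo:    fields[2],
	}
}

func main() {
	if pkg := parseDNFLine("bash.x86_64    5.1.8-9.el9    baseos"); pkg != nil {
		fmt.Printf("%+v\n", *pkg)
	}
}
```

Returning nil for non-matching lines lets the scan loop treat every line uniformly, which is the contract the loop above already assumes.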
2. APT Scanner with Progress
type APTScanner struct {
    config     *ScannerConfig
    timeoutMgr *SmartTimeoutManager
}

func (as *APTScanner) Scan() (*ScanResult, error) {
    var result *ScanResult
    operation := func(pm *ProgressMonitor) error {
        packages, err := as.performAPTScanWithProgress(pm)
        if err != nil {
            return err
        }
        // Assumes ScanResult carries the parsed packages in a Packages field
        result = &ScanResult{Packages: packages}
        return nil
    }
    if err := as.timeoutMgr.ExecuteOperation(operation, "apt"); err != nil {
        return nil, err
    }
    return result, nil
}

func (as *APTScanner) performAPTScanWithProgress(pm *ProgressMonitor) ([]PackageInfo, error) {
    // Use `apt list --upgradable` with progress
    cmd := exec.Command("apt", "list", "--upgradable")
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return nil, fmt.Errorf("failed to create stdout pipe: %w", err)
    }
    if err := cmd.Start(); err != nil {
        return nil, fmt.Errorf("failed to start apt command: %w", err)
    }

    scanner := bufio.NewScanner(stdout)
    var packages []PackageInfo
    for scanner.Scan() {
        line := scanner.Text()
        // Package lines look like "name/suite version arch [upgradable from: old]"
        if strings.Contains(line, "/") && strings.Contains(line, "upgradable") {
            // parsePackageLine is assumed to return *PackageInfo
            if pkg := as.parsePackageLine(line); pkg != nil {
                packages = append(packages, *pkg)
                // Update progress every 5 packages
                if len(packages)%5 == 0 {
                    pm.Update(int64(len(packages)), int64(len(packages)),
                        fmt.Sprintf("Found %d upgradable packages", len(packages)))
                }
            }
        }
    }
    if err := scanner.Err(); err != nil {
        return nil, fmt.Errorf("failed to read apt output: %w", err)
    }

    if err := cmd.Wait(); err != nil {
        return nil, fmt.Errorf("apt command failed: %w", err)
    }
    return packages, nil
}
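As with DNF, the APT scanner depends on an unshown `parsePackageLine`. Lines from `apt list --upgradable` follow the pattern `name/suite version arch [upgradable from: old-version]`, which a parser could handle as sketched below. The type `UpgradablePackage` and the function `parseAPTLine` are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// UpgradablePackage is a hypothetical result type for this sketch.
type UpgradablePackage struct {
	Name, Suite, Candidate, Arch, Installed string
}

// parseAPTLine parses one line of `apt list --upgradable` output and
// returns nil for lines that are not package entries (e.g. "Listing...").
func parseAPTLine(line string) *UpgradablePackage {
	fields := strings.Fields(line)
	// Expected: name/suite, version, arch, "[upgradable", "from:", "old]"
	if len(fields) < 6 || fields[3] != "[upgradable" {
		return nil
	}
	name, suite, ok := strings.Cut(fields[0], "/")
	if !ok {
		return nil
	}
	return &UpgradablePackage{
		Name:      name,
		Suite:     suite,
		Candidate: fields[1],
		Arch:      fields[2],
		Installed: strings.TrimSuffix(fields[5], "]"),
	}
}

func main() {
	line := "bash/jammy-updates 5.1-6ubuntu1.1 amd64 [upgradable from: 5.1-6ubuntu1]"
	if pkg := parseAPTLine(line); pkg != nil {
		fmt.Printf("%+v\n", *pkg)
	}
}
```

Note that `apt` warns that its command-line output is not a stable interface, so a production scanner may prefer a more script-friendly source; that choice is outside the scope of this sketch.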
Configuration Management
Timeout Configuration
# scanner-timeouts.yml
timeouts:
  default_profile: "balanced"
  profiles:
    fast:
      name: "Fast Scanning"
      default_timeout: 30s
      min_timeout: 10s
      max_timeout: 60s
      progress_check: 5s
      graceful_shutdown: 5s
    balanced:
      name: "Balanced Performance"
      default_timeout: 2m
      min_timeout: 30s
      max_timeout: 10m
      progress_check: 15s
      graceful_shutdown: 15s
    thorough:
      name: "Thorough Scanning"
      default_timeout: 10m
      min_timeout: 2m
      max_timeout: 30m
      progress_check: 30s
      graceful_shutdown: 30s
  scanner_specific:
    dnf:
      profile: "dnf"
      custom_timeout: 5m
    apt:
      profile: "apt"
      custom_timeout: 3m
    docker:
      profile: "docker"
      custom_timeout: 2m
  environment_overrides:
    development:
      default_profile: "fast"
    production:
      default_profile: "balanced"
    resource_constrained:
      default_profile: "fast"
      scanner_specific:
        dnf:
          custom_timeout: 2m
Implementation Strategy
Phase 1: Foundation (1-2 weeks)
- Timeout Profile System: Define and load timeout configurations
- Progress Monitor: Basic progress tracking infrastructure
- Smart Timeout Manager: Core timeout logic with extensions
Phase 2: Scanner Updates (2-3 weeks)
- DNF Scanner: Add progress monitoring and configurable timeouts
- APT Scanner: Progress monitoring with package parsing
- Docker Scanner: Container operation progress tracking
- Generic Scanner Framework: Common progress patterns
Phase 3: Advanced Features (1-2 weeks)
- Adaptive Timeouts: Learn from historical performance
- Dynamic Profiles: Adjust timeouts based on system load
- User Interface: Allow timeout configuration via dashboard
Testing Strategy
Unit Tests
- Timeout profile loading and validation
- Progress monitoring accuracy
- Extension logic for slow-but-progressing operations, and shutdown of stuck ones
- Graceful shutdown procedures
Integration Tests
- Real DNF operations with large package lists
- Network latency simulation
- Resource constraint scenarios
- Progress reporting accuracy
Performance Tests
- Timeout overhead measurement
- Progress monitoring performance impact
- Scanner execution time variations
- Resource usage during long operations
Success Criteria
- Reliability: No more false timeout errors for working operations
- Configurability: Users can adjust timeouts for their environment
- Observability: Clear visibility into operation progress and timeouts
- Performance: Minimal overhead from progress monitoring
- User Experience: Clear feedback about operation status
Tags: timeout, scanner, performance, reliability, progress
Priority: High - Improves operation reliability significantly
Complexity: Medium - Well-defined scope with clear implementation
Estimated Effort: 4-6 weeks across multiple phases