Files
Redflag/docs/4_LOG/November_2025/Agent-Architecture/Agent_timeout_architecture.md

16 KiB

Agent Timeout Architecture & Scanner Timeouts

Problem Statement

Date: 2025-11-03 Status: Planning phase - Important for operation reliability

Current Issues

  1. Aggressive Timeouts: DNF scanner timeout of 45s is too short for bulk operations
  2. One-Size-Fits-All: Same timeout for all operations regardless of complexity
  3. No Progress Detection: Timeouts kill operations that are actually making progress
  4. Poor Error Reporting: Generic "timeout" masks real underlying issues
  5. No User Control: No way to configure timeouts for different environments

Current Behavior (Problematic)

// Current scanner timeout handling
func (s *DNFScanner) Scan() (*ScanResult, error) {
    ctx, cancel := context.WithTimeout(context.Background(), 45*time.Second)
    defer cancel()

    result := make(chan *ScanResult, 1)
    err := make(chan error, 1)

    go func() {
        r, e := s.performDNFScan()
        result <- r
        err <- e
    }()

    select {
    case r := <-result:
        return r, <-err
    case <-ctx.Done():
        return nil, fmt.Errorf("scan timeout after 45s") // ❌ Kills working operations
    }
}

Real-World Timeout Scenarios

  1. Large Package Lists: DNF with 3,000+ packages can take 2-5 minutes
  2. Network Issues: Slow package repository responses
  3. Disk I/O: Filesystem scanning on slow storage
  4. System Load: High CPU usage slowing operations
  5. Container Overhead: Docker operations in resource-constrained environments

Proposed Architecture: Intelligent Timeout Management

Core Principles

  1. Per-Operation Timeouts: Different timeouts for different operation types
  2. Progress-Based Timeouts: Monitor progress rather than absolute time
  3. Configurable Timeouts: User-adjustable timeout values
  4. Graceful Degradation: Handle timeouts without losing progress
  5. Smart Detection: Distinguish between "slow but working" vs "actually hung"

Timeout Management Components

1. Timeout Profiles

type TimeoutProfile struct {
    Name              string            `yaml:"name" json:"name"`
    DefaultTimeout    time.Duration     `yaml:"default_timeout" json:"default_timeout"`
    MinTimeout        time.Duration     `yaml:"min_timeout" json:"min_timeout"`
    MaxTimeout        time.Duration     `yaml:"max_timeout" json:"max_timeout"`
    ProgressCheck     time.Duration     `yaml:"progress_check" json:"progress_check"`
    GracefulShutdown  time.Duration     `yaml:"graceful_shutdown" json:"graceful_shutdown"`
}

type TimeoutProfiles struct {
    Profiles map[string]TimeoutProfile `yaml:"profiles" json:"profiles"`
    Default  string                    `yaml:"default" json:"default"`
}

func DefaultTimeoutProfiles() *TimeoutProfiles {
    return &TimeoutProfiles{
        Default: "balanced",
        Profiles: map[string]TimeoutProfile{
            "fast": {
                Name:             "fast",
                DefaultTimeout:   30 * time.Second,
                MinTimeout:       10 * time.Second,
                MaxTimeout:       60 * time.Second,
                ProgressCheck:    5 * time.Second,
                GracefulShutdown: 5 * time.Second,
            },
            "balanced": {
                Name:             "balanced",
                DefaultTimeout:   2 * time.Minute,
                MinTimeout:       30 * time.Second,
                MaxTimeout:       10 * time.Minute,
                ProgressCheck:    15 * time.Second,
                GracefulShutdown: 15 * time.Second,
            },
            "thorough": {
                Name:             "thorough",
                DefaultTimeout:   10 * time.Minute,
                MinTimeout:       2 * time.Minute,
                MaxTimeout:       30 * time.Minute,
                ProgressCheck:    30 * time.Second,
                GracefulShutdown: 30 * time.Second,
            },
            "dnf": {
                Name:             "dnf",
                DefaultTimeout:   5 * time.Minute,
                MinTimeout:       1 * time.Minute,
                MaxTimeout:       15 * time.Minute,
                ProgressCheck:    30 * time.Second,
                GracefulShutdown: 30 * time.Second,
            },
            "apt": {
                Name:             "apt",
                DefaultTimeout:   3 * time.Minute,
                MinTimeout:       30 * time.Second,
                MaxTimeout:       10 * time.Minute,
                ProgressCheck:    15 * time.Second,
                GracefulShutdown: 15 * time.Second,
            },
            "docker": {
                Name:             "docker",
                DefaultTimeout:   2 * time.Minute,
                MinTimeout:       30 * time.Second,
                MaxTimeout:       5 * time.Minute,
                ProgressCheck:    10 * time.Second,
                GracefulShutdown: 10 * time.Second,
            },
        },
    }
}

2. Progress Monitor

type ProgressMonitor struct {
    total          int64
    completed      int64
    lastUpdate     time.Time
    progressChan   chan ProgressUpdate
    timeout        time.Duration
    gracePeriod    time.Duration
    onProgress     func(completed, total int64)
}

type ProgressUpdate struct {
    Completed int64     `json:"completed"`
    Total     int64     `json:"total"`
    Message   string    `json:"message,omitempty"`
    Timestamp time.Time `json:"timestamp"`
}

func (pm *ProgressMonitor) Start() {
    go pm.monitor()
}

func (pm *ProgressMonitor) Update(completed, total int64, message string) {
    pm.total = total
    pm.completed = completed
    pm.lastUpdate = time.Now()

    select {
    case pm.progressChan <- ProgressUpdate{
        Completed: completed,
        Total:     total,
        Message:   message,
        Timestamp: time.Now(),
    }:
    default:
        // Non-blocking send
    }
}

func (pm *ProgressMonitor) monitor() {
    ticker := time.NewTicker(pm.progressCheckInterval)
    defer ticker.Stop()

    for {
        select {
        case update := <-pm.progressChan:
            pm.handleProgress(update)
            if pm.onProgress != nil {
                pm.onProgress(update.Completed, update.Total)
            }

        case <-ticker.C:
            if pm.isStuck() {
                log.Printf("Operation appears to be stuck, checking...")
                if pm.shouldTimeout() {
                    return // Signal timeout
                }
            }

        case <-pm.ctx.Done():
            return
        }
    }
}

func (pm *ProgressMonitor) isStuck() bool {
    // Check if no progress for grace period
    return time.Since(pm.lastUpdate) > pm.gracePeriod
}

func (pm *ProgressMonitor) shouldTimeout() bool {
    // Check if no progress for full timeout period
    return time.Since(pm.lastUpdate) > pm.timeout
}

3. Smart Timeout Manager

type SmartTimeoutManager struct {
    profiles     *TimeoutProfiles
    currentProfile string
    monitor      *ProgressMonitor
    ctx          context.Context
    cancel       context.CancelFunc
}

func (stm *SmartTimeoutManager) ExecuteOperation(
    operation func(*ProgressMonitor) error,
    profileName string,
) error {
    profile := stm.profiles.Profiles[profileName]
    if profile.Name == "" {
        profile = stm.profiles.Profiles[stm.profiles.Default]
    }

    ctx, cancel := context.WithTimeout(stm.ctx, profile.DefaultTimeout)
    defer cancel()

    // Create progress monitor
    pm := &ProgressMonitor{
        timeout:        profile.DefaultTimeout,
        gracePeriod:    profile.GracefulShutdown,
        progressChan:   make(chan ProgressUpdate, 10),
    }
    pm.Start()

    // Start operation
    resultChan := make(chan error, 1)
    go func() {
        resultChan <- operation(pm)
    }()

    // Wait for completion or timeout
    select {
    case err := <-resultChan:
        return err

    case <-ctx.Done():
        // Handle timeout gracefully
        return stm.handleTimeout(pm, profile)
    }
}

func (stm *SmartTimeoutManager) handleTimeout(pm *ProgressMonitor, profile TimeoutProfile) error {
    // Check if operation is actually making progress
    if !pm.isStuck() {
        log.Printf("Operation timeout but still making progress, extending timeout...")

        // Extend timeout by grace period
        extendedCtx, cancel := context.WithTimeout(stm.ctx, profile.GracefulShutdown)
        defer cancel()

        select {
        case <-extendedCtx.Done():
            return fmt.Errorf("operation truly timed out after extension")
        case <-pm.Done():
            return nil // Operation completed during extension
        }
    }

    // Operation is genuinely stuck
    log.Printf("Operation stuck, attempting graceful shutdown...")

    // Give operation chance to clean up
    cleanupCtx, cancel := context.WithTimeout(stm.ctx, profile.GracefulShutdown)
    defer cancel()

    select {
    case <-cleanupCtx.Done():
        return fmt.Errorf("operation failed to shutdown gracefully")
    case <-pm.Done():
        return nil // Operation completed during cleanup
    }
}

Enhanced Scanner Implementations

1. DNF Scanner with Progress

type DNFScanner struct {
    config       *ScannerConfig
    timeoutMgr   *SmartTimeoutManager
}

func (ds *DNFScanner) Scan() (*ScanResult, error) {
    var result *ScanResult
    var err error

    operation := func(pm *ProgressMonitor) error {
        return ds.performDNFScanWithProgress(pm)
    }

    err = ds.timeoutMgr.ExecuteOperation(operation, "dnf")
    if err != nil {
        return nil, err
    }

    return result, nil
}

func (ds *DNFScanner) performDNFScanWithProgress(pm *ProgressMonitor) error {
    // Start DNF process with progress monitoring
    cmd := exec.Command("dnf", "check-update")
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return fmt.Errorf("failed to create stdout pipe: %w", err)
    }

    if err := cmd.Start(); err != nil {
        return fmt.Errorf("failed to start dnf command: %w", err)
    }

    // Parse output line by line for progress
    scanner := bufio.NewScanner(stdout)
    lineCount := 0
    var packages []PackageInfo

    for scanner.Scan() {
        line := scanner.Text()
        lineCount++

        // Parse package info
        if pkg := ds.parsePackageLine(line); pkg != nil {
            packages = append(packages, pkg)

            // Update progress every 10 packages
            if len(packages)%10 == 0 {
                pm.Update(int64(len(packages)), int64(len(packages)),
                    fmt.Sprintf("Scanning package %d", len(packages)))
            }
        }

        // Check for completion patterns
        if strings.Contains(line, "Last metadata expiration check") ||
           strings.Contains(line, "No updates available") {
            break
        }
    }

    // Wait for command to complete
    if err := cmd.Wait(); err != nil {
        return fmt.Errorf("dnf command failed: %w", err)
    }

    // Final progress update
    pm.Update(int64(len(packages)), int64(len(packages)),
        fmt.Sprintf("Scan completed with %d packages", len(packages)))

    return nil
}

2. APT Scanner with Progress

type APTScanner struct {
    config       *ScannerConfig
    timeoutMgr   *SmartTimeoutManager
}

func (as *APTScanner) Scan() (*ScanResult, error) {
    operation := func(pm *ProgressMonitor) error {
        return as.performAPTScanWithProgress(pm)
    }

    err := as.timeoutMgr.ExecuteOperation(operation, "apt")
    if err != nil {
        return nil, err
    }

    return result, nil
}

func (as *APTScanner) performAPTScanWithProgress(pm *ProgressMonitor) error {
    // Use apt-list-updates with progress
    cmd := exec.Command("apt-list", "--upgradable")
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return fmt.Errorf("failed to create stdout pipe: %w", err)
    }

    if err := cmd.Start(); err != nil {
        return fmt.Errorf("failed to start apt command: %w", err)
    }

    scanner := bufio.NewScanner(stdout)
    var packages []PackageInfo
    lineCount := 0

    for scanner.Scan() {
        line := scanner.Text()
        lineCount++

        if strings.Contains(line, "/") && strings.Contains(line, "upgradable") {
            if pkg := as.parsePackageLine(line); pkg != nil {
                packages = append(packages, pkg)

                // Update progress
                if len(packages)%5 == 0 {
                    pm.Update(int64(len(packages)), int64(len(packages)),
                        fmt.Sprintf("Found %d upgradable packages", len(packages)))
                }
            }
        }
    }

    if err := cmd.Wait(); err != nil {
        return fmt.Errorf("apt command failed: %w", err)
    }

    return nil
}

Configuration Management

Timeout Configuration

# scanner-timeouts.yml
timeouts:
  default_profile: "balanced"

  profiles:
    fast:
      name: "Fast Scanning"
      default_timeout: 30s
      min_timeout: 10s
      max_timeout: 60s
      progress_check: 5s
      graceful_shutdown: 5s

    balanced:
      name: "Balanced Performance"
      default_timeout: 2m
      min_timeout: 30s
      max_timeout: 10m
      progress_check: 15s
      graceful_shutdown: 15s

    thorough:
      name: "Thorough Scanning"
      default_timeout: 10m
      min_timeout: 2m
      max_timeout: 30m
      progress_check: 30s
      graceful_shutdown: 30s

  scanner_specific:
    dnf:
      profile: "dnf"
      custom_timeout: 5m

    apt:
      profile: "apt"
      custom_timeout: 3m

    docker:
      profile: "docker"
      custom_timeout: 2m

  environment_overrides:
    development:
      default_profile: "fast"

    production:
      default_profile: "balanced"

    resource_constrained:
      default_profile: "fast"
      scanner_specific:
        dnf:
          custom_timeout: 2m

Implementation Strategy

Phase 1: Foundation (1-2 weeks)

  1. Timeout Profile System: Define and load timeout configurations
  2. Progress Monitor: Basic progress tracking infrastructure
  3. Smart Timeout Manager: Core timeout logic with extensions

Phase 2: Scanner Updates (2-3 weeks)

  1. DNF Scanner: Add progress monitoring and configurable timeouts
  2. APT Scanner: Progress monitoring with package parsing
  3. Docker Scanner: Container operation progress tracking
  4. Generic Scanner Framework: Common progress patterns

Phase 3: Advanced Features (1-2 weeks)

  1. Adaptive Timeouts: Learn from historical performance
  2. Dynamic Profiles: Adjust timeouts based on system load
  3. User Interface: Allow timeout configuration via dashboard

Testing Strategy

Unit Tests

  • Timeout profile loading and validation
  • Progress monitoring accuracy
  • Extension logic for stuck operations
  • Graceful shutdown procedures

Integration Tests

  • Real DNF operations with large package lists
  • Network latency simulation
  • Resource constraint scenarios
  • Progress reporting accuracy

Performance Tests

  • Timeout overhead measurement
  • Progress monitoring performance impact
  • Scanner execution time variations
  • Resource usage during long operations

Success Criteria

  1. Reliability: No more false timeout errors for working operations
  2. Configurability: Users can adjust timeouts for their environment
  3. Observability: Clear visibility into operation progress and timeouts
  4. Performance: Minimal overhead from progress monitoring
  5. User Experience: Clear feedback about operation status

Tags: timeout, scanner, performance, reliability, progress Priority: High - Improves operation reliability significantly Complexity: Medium - Well-defined scope with clear implementation Estimated Effort: 4-6 weeks across multiple phases