ISSUE #3: Scan Trigger Flow - Proper Implementation Plan

Date: 2025-12-18 (Planning for tomorrow)
Status: Planning Phase (Ready for implementation tomorrow)
Severity: High (Scan buttons currently error)
New Scope: Beyond Issues #1 and #2 (completed)


Issue Summary

The individual "Scan" buttons for each subsystem (docker, storage, system, updates) all return the same error:

"Failed to trigger scan: Failed to create command"

Why: Command acknowledgment and history logging flows are not properly integrated for subsystem-specific scans.

What Needs to Happen: Full ETHOS-compliant flow from UI click → API → Agent → Results → History


Current State Analysis

UI Layer (AgentHealth.tsx): WORKING

  • Per-subsystem scan buttons exist
  • handleTriggerScan(subsystem.subsystem) passes subsystem name
  • triggerScanMutation makes API call to: /api/v1/agents/:id/subsystems/:subsystem/trigger

Backend API (subsystems.go): MOSTLY WORKING

  • TriggerSubsystem handler receives subsystem parameter
  • Creates distinct command type: commandType := "scan_" + subsystem
  • Creates AgentCommand with unique command_type
  • FAILING: signAndCreateCommand call fails

Agent (main.go): MOSTLY WORKING

  • case "scan_updates": handles update scans
  • case "scan_storage": handles storage scans
  • ISSUE: Command acknowledgment flow needs review

History/Reconciliation: NOT INTEGRATED

  • Missing: Subsystem context in history logging
  • Broken: Command acknowledgment for scan commands
  • Inconsistent: Some logs go to history, some don't

Proper Implementation Requirements (ETHOS)

Core Principles to Follow

  1. Errors are History, Not /dev/null (MUST HAVE)

    • Scan failures → history table with context
    • Button click errors → history table
    • Command creation errors → history table
    • Agent handler errors → history table
  2. Security is Non-Negotiable (MUST HAVE)

    • All scan triggers → authenticated endpoints (already done)
    • Command signing → Ed25519 nonces (already done)
    • Circuit breaker integration (already exists)
  3. Assume Failure; Build for Resilience (MUST HAVE)

    • Scan failures → retry logic (if appropriate)
    • Command creation failures → clear error context
    • Agent unreachable → proper error to UI
    • Partial failures → handled gracefully
  4. Idempotency (MUST HAVE)

    • Scan operations repeatable (safe to trigger multiple times)
    • No duplicate history entries for same scan
    • Results properly timestamped for tracking
  5. No Marketing Fluff (MUST HAVE)

    • Clear action names in history: "scan_docker", "scan_storage", "scan_system", "scan_updates"
    • Subsystem icons in history display (not just text)
    • Accurate, honest logging throughout

Full Flow Design (From Click to History)

Phase 1: User Clicks Scan Button

UI Event: handleTriggerScan(subsystem.subsystem)

User clicks: [Scan] button on Docker row
   → handleTriggerScan("docker")
   → triggerScanMutation.mutate("docker")
   → POST /api/v1/agents/:id/subsystems/docker/trigger

Ethos Requirements:

  • Button disabled during pending state
  • Loading indicator
  • Success/error toast (already doing this)

Phase 2: Backend Receives Trigger POST

Handler: subsystems.go:TriggerSubsystem

URL: POST /api/v1/agents/:id/subsystems/:subsystem/trigger
   → Authenticate (already done)
   → Validate agent exists
   → Validate subsystem is enabled
   → Get current config
   → Generate command_id

Command Creation:

command := &models.AgentCommand{
  AgentID:     agentID,
  CommandType: "scan_" + subsystem,  // "scan_docker", "scan_storage", etc.
  Status:      "pending",
  Source:      "web_ui",
  // ADD: Subsystem field for filtering/querying
  Subsystem:   subsystem,
}

// Add [HISTORY] logging
log.Printf("[HISTORY] [server] [scan] command_created agent_id=%s subsystem=%s command_id=%s timestamp=%s",
  agentID, subsystem, command.ID, time.Now().Format(time.RFC3339))

err = h.signAndCreateCommand(command)

Ethos Requirements:

  • All errors logged before returning
  • History entry created for command creation attempts
  • Subsystem context preserved in logs
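
One way the Phase 2 requirements could fit together is sketched below. It is not the existing handler: `createScanCommand`, `knownSubsystems`, `errUnknownSubsystem`, the `sign` callback (a stand-in for `signAndCreateCommand`), and the `command_rejected` / `command_create_failed` event names are all assumptions made for illustration. The point is only that every exit path writes a [HISTORY] line with subsystem context before returning.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// AgentCommand mirrors only the fields shown in this plan; the real model
// lives in internal/models and may differ.
type AgentCommand struct {
	ID          string
	AgentID     string
	CommandType string
	Status      string
	Source      string
	Subsystem   string
}

// knownSubsystems is an assumed allowlist matching the four scan buttons.
var knownSubsystems = map[string]bool{
	"docker": true, "storage": true, "system": true, "updates": true,
}

// errUnknownSubsystem is returned before any command is built.
var errUnknownSubsystem = errors.New("unknown subsystem")

// createScanCommand builds a scan command, passes it to sign, and logs a
// [HISTORY] line for the rejected, failed, and created outcomes.
func createScanCommand(agentID, subsystem, commandID string, sign func(*AgentCommand) error) (*AgentCommand, error) {
	ts := time.Now().Format(time.RFC3339)
	if !knownSubsystems[subsystem] {
		log.Printf("[HISTORY] [server] [scan] command_rejected agent_id=%s subsystem=%q timestamp=%s",
			agentID, subsystem, ts)
		return nil, fmt.Errorf("%w: %q", errUnknownSubsystem, subsystem)
	}
	cmd := &AgentCommand{
		ID: commandID, AgentID: agentID, CommandType: "scan_" + subsystem,
		Status: "pending", Source: "web_ui", Subsystem: subsystem,
	}
	if err := sign(cmd); err != nil {
		log.Printf("[HISTORY] [server] [scan] command_create_failed agent_id=%s subsystem=%s command_id=%s error=%q timestamp=%s",
			agentID, subsystem, cmd.ID, err.Error(), ts)
		return nil, fmt.Errorf("create %s command: %w", cmd.CommandType, err)
	}
	log.Printf("[HISTORY] [server] [scan] command_created agent_id=%s subsystem=%s command_id=%s timestamp=%s",
		agentID, subsystem, cmd.ID, ts)
	return cmd, nil
}
```

Wrapping the signing error with `%w` keeps the root cause available to the caller, which should help replace the opaque "Failed to create command" message with something actionable.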

Phase 3: Command Acknowledgment System

The scan command must flow through the standard acknowledgment system:

// Already exists: pending_acks.json tracking
ackTracker.Create(command.ID, time.Now())
   → Agent checks in: receives command
   → Agent starts scan: reports status?
   → Agent completes: reports results
   → Server updates history
   → Acknowledgment removed

Current Missing Pieces:

  • Command results not being saved properly
  • Subsystem context not flowing through ack system
  • Scan results not creating history entries
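
To illustrate how subsystem context might flow through the ack system, here is a minimal in-memory sketch. It is not the existing tracker: `PendingAck`, `AckTracker`, and the `Create`/`Complete`/`Pending` signatures are hypothetical, and the real implementation persists to pending_acks.json rather than a map.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// PendingAck is a hypothetical record shape that carries the subsystem
// alongside the command ID.
type PendingAck struct {
	CommandID string
	Subsystem string
	CreatedAt time.Time
}

// AckTracker holds acknowledgments that are still outstanding.
type AckTracker struct {
	mu      sync.Mutex
	pending map[string]PendingAck
}

// NewAckTracker returns an empty tracker.
func NewAckTracker() *AckTracker {
	return &AckTracker{pending: make(map[string]PendingAck)}
}

// Create registers a pending ack when the command is issued.
func (t *AckTracker) Create(commandID, subsystem string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.pending[commandID] = PendingAck{CommandID: commandID, Subsystem: subsystem, CreatedAt: time.Now()}
}

// Complete removes the pending ack and returns it, so the caller can write
// the history entry with the original subsystem.
func (t *AckTracker) Complete(commandID string) (PendingAck, error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	ack, ok := t.pending[commandID]
	if !ok {
		return PendingAck{}, fmt.Errorf("no pending ack for command %q", commandID)
	}
	delete(t.pending, commandID)
	return ack, nil
}

// Pending reports how many acks are outstanding.
func (t *AckTracker) Pending() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.pending)
}
```

Returning the stored record from `Complete` means the server does not have to trust the agent's result payload for the subsystem name.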

Phase 4: Agent Receives Scan Command

Agent Handling: main.go:handleCommand

case "scan_docker":
  log.Printf("[HISTORY] [agent] [scan_docker] command_received agent_id=%s command_id=%s timestamp=%s",
    cfg.AgentID, cmd.ID, time.Now().Format(time.RFC3339))

  results, err := handleScanDocker(apiClient, cfg, ackTracker, scanOrchestrator, cmd.ID)

  if err != nil {
    ts := time.Now().Format(time.RFC3339)
    log.Printf("[ERROR] [agent] [scan_docker] scan_failed error=%v timestamp=%s", err, ts)
    // %q quotes the error text so spaces do not break key=value parsing
    log.Printf("[HISTORY] [agent] [scan_docker] scan_failed error=%q timestamp=%s", err, ts)
    // Update command status: failed
    // Report back via API
    return err // assumes handleCommand returns an error
  }

  ts := time.Now().Format(time.RFC3339)
  // len(results) assumes handleScanDocker returns a slice of found items
  log.Printf("[SUCCESS] [agent] [scan_docker] scan_completed items=%d timestamp=%s", len(results), ts)
  log.Printf("[HISTORY] [agent] [scan_docker] scan_completed items=%d timestamp=%s", len(results), ts)
  // Update command status: success
  // Report results via API

Existing Handlers:

  • handleScanUpdatesV2 - needs review
  • handleScanStorage - needs review
  • handleScanSystem - needs review
  • handleScanDocker - needs review
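
Since all four handlers need the same log layout, a shared formatter may help keep them consistent. The sketch below is an assumption, not existing agent code: `historyLine` and `logHistory` are hypothetical helpers that produce the `[HISTORY] [component] [action] event key=value ... timestamp=...` layout used in this plan.

```go
package main

import (
	"log"
	"strings"
	"time"
)

// historyLine formats one [HISTORY] entry. kv holds alternating keys and
// values; a trailing key without a value is ignored.
func historyLine(component, action, event, timestamp string, kv ...string) string {
	var b strings.Builder
	b.WriteString("[HISTORY] [" + component + "] [" + action + "] " + event)
	for i := 0; i+1 < len(kv); i += 2 {
		b.WriteString(" " + kv[i] + "=" + kv[i+1])
	}
	b.WriteString(" timestamp=" + timestamp)
	return b.String()
}

// logHistory stamps the entry with the current time and writes it to the
// standard logger.
func logHistory(component, action, event string, kv ...string) {
	log.Print(historyLine(component, action, event, time.Now().Format(time.RFC3339), kv...))
}
```

A handler would then call, for example, `logHistory("agent", "scan_docker", "command_received", "agent_id", cfg.AgentID, "command_id", cmd.ID)` at start, on completion, and on error.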

Phase 5: Results Reported Back

API Endpoint: Agent reports scan results

// POST /api/v1/agents/:id/commands/:command_id/result
{
  "command_id": "...",
  "result": "success",
  "items_found": 4,
  "stdout": "...",
  "subsystem": "docker"
}

Server Handler: Updates history table

-- Insert into history table
INSERT INTO history (agent_id, command_id, action, result, subsystem, stdout, stderr, executed_at)
VALUES (?, ?, 'scan_docker', ?, 'docker', ?, ?, NOW());

// Add [HISTORY] logging (Go; agentID and commandID come from the request)
log.Printf("[HISTORY] [server] [scan_docker] result_logged agent_id=%s command_id=%s timestamp=%s",
  agentID, commandID, time.Now().Format(time.RFC3339))
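
On the server side, the result payload could be decoded into a typed struct before the history insert. This is a sketch under assumptions: `ScanResult`, `parseScanResult`, and `Action` are hypothetical names, and the fields are taken only from the example payload above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ScanResult matches the example payload in this plan; the real result
// model may carry more fields (stderr, for instance).
type ScanResult struct {
	CommandID  string `json:"command_id"`
	Result     string `json:"result"`
	ItemsFound int    `json:"items_found"`
	Stdout     string `json:"stdout"`
	Subsystem  string `json:"subsystem"`
}

// parseScanResult decodes the body and rejects payloads that would produce
// a history row without command or subsystem context.
func parseScanResult(body []byte) (ScanResult, error) {
	var r ScanResult
	if err := json.Unmarshal(body, &r); err != nil {
		return ScanResult{}, fmt.Errorf("decode scan result: %w", err)
	}
	if r.CommandID == "" || r.Subsystem == "" {
		return ScanResult{}, errors.New("scan result missing command_id or subsystem")
	}
	return r, nil
}

// Action returns the history action name, for example "scan_docker".
func (r ScanResult) Action() string {
	return "scan_" + r.Subsystem
}
```

Deriving the action from the subsystem in one place avoids hard-coding 'scan_docker' in each insert.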

Phase 6: History Display

UI Component: HistoryTimeline.tsx

// Retrieve history entries
GET /api/v1/history?agent_id=...&subsystem=docker

// Display with subsystem context
<span className="capitalize flex items-center">
  {getActionIcon(entry.action, entry.subsystem)}
  <span>{getSubsystemDisplayName(entry.subsystem)} Scan</span>
</span>

// Icons based on subsystem
getActionIcon("scan_docker", "docker") → Docker icon
getActionIcon("scan_storage", "storage") → Storage icon
getActionIcon("scan_system", "system") → System icon

Database Changes Required

Table: history (or logs)

Add column:

ALTER TABLE history ADD COLUMN subsystem VARCHAR(50);
CREATE INDEX idx_history_agent_action_subsystem ON history(agent_id, action, subsystem);

Populate for existing scan entries:

  • Parse stdout for clues to determine subsystem
  • Or set to NULL for existing entries
  • UI must handle NULL (display as "Unknown Scan")
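
The NULL rule can be expressed as a small pure function. The real display logic lives in the TypeScript UI; the Go sketch below only illustrates the same rule as it might appear in a server-side response mapper. `scanLabel` and `subsystemNames` are hypothetical, and a nil pointer stands in for a NULL subsystem column.

```go
package main

// subsystemNames maps stored subsystem keys to display names.
var subsystemNames = map[string]string{
	"docker":  "Docker",
	"storage": "Storage",
	"system":  "System",
	"updates": "Package Updates",
}

// scanLabel returns the history label for an entry. A nil or empty
// subsystem (pre-migration rows) becomes "Unknown Scan"; an unmapped key
// falls back to the raw key.
func scanLabel(subsystem *string) string {
	if subsystem == nil || *subsystem == "" {
		return "Unknown Scan"
	}
	if name, ok := subsystemNames[*subsystem]; ok {
		return name + " Scan"
	}
	return *subsystem + " Scan"
}
```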

Code Changes Required

Backend (aggregator-server)

Files to Modify:

  1. internal/models/command.go - Add Subsystem field
  2. internal/database/queries/commands.go - Update for subsystem
  3. internal/api/handlers/subsystems.go - Update TriggerSubsystem logging
  4. internal/api/handlers/commands.go - Update command result handler
  5. internal/database/migrations/ - Add subsystem column migration

New Queries Needed:

-- Insert history with subsystem
INSERT INTO history (...) VALUES (..., subsystem)

-- Query history by subsystem
SELECT * FROM history WHERE agent_id = ? AND subsystem = ?
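
A query builder along these lines could keep the subsystem filter optional, so the existing unfiltered history view keeps working. `historyQuery` is a hypothetical helper; the `?` placeholders follow the queries above and would need adjusting for drivers that use `$1`-style parameters.

```go
package main

// historyQuery returns the SQL text and its arguments. An empty subsystem
// means "all subsystems", so no filter clause is added.
func historyQuery(agentID, subsystem string) (string, []any) {
	query := "SELECT * FROM history WHERE agent_id = ?"
	args := []any{agentID}
	if subsystem != "" {
		query += " AND subsystem = ?"
		args = append(args, subsystem)
	}
	return query + " ORDER BY executed_at DESC", args
}
```

Passing values as arguments rather than concatenating them into the SQL keeps the subsystem path parameter from becoming an injection vector.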

Agent (aggregator-agent)

Files to Modify:

  1. cmd/agent/main.go - Update all handleScan* functions with [HISTORY] logging
  2. internal/orchestrator/scanner.go - Ensure wrappers pass subsystem context
  3. internal/scanner/ - Add subsystem identification to results

Add to all scan handlers:

// Each handleScan* function needs:
// 1. [HISTORY] log when starting
// 2. [HISTORY] log on completion
// 3. [HISTORY] log on error
// 4. Subsystem context in all log messages

Frontend (aggregator-web)

Files to Modify:

  1. src/types/index.ts - Add subsystem to HistoryEntry interface
  2. src/components/HistoryTimeline.tsx - Update display logic
  3. src/lib/api.ts - Update API call to include subsystem parameter
  4. src/components/AgentHealth.tsx - Add subsystem icons map

Display Logic:

const subsystemIcon = {
  docker: <Container className="h-4 w-4" />,
  storage: <HardDrive className="h-4 w-4" />,
  system: <Cpu className="h-4 w-4" />,
  updates: <Package className="h-4 w-4" />,
  dnf: <Box className="h-4 w-4" />,
  winget: <Windows className="h-4 w-4" />,
  apt: <Linux className="h-4 w-4" />,
};

const displayName = {
  docker: 'Docker',
  storage: 'Storage',
  system: 'System',
  updates: 'Package Updates',
  // ... etc
};

Testing Requirements

Unit Tests

// Test command creation with subsystem
TestCreateCommand_WithSubsystem()
TestCreateCommand_WithoutSubsystem()

// Test history insertion with subsystem
TestCreateHistory_WithSubsystem()
TestQueryHistory_BySubsystem()

// Test agent scan handlers
TestHandleScanDocker_LogsHistory()
TestHandleScanDocker_Failure() // Error logs to history

Integration Tests

// Test full flow
TestScanTrigger_FullFlow_Docker()
TestScanTrigger_FullFlow_Storage()
TestScanTrigger_FullFlow_System()
TestScanTrigger_FullFlow_Updates()

// Verify each step:
// 1. UI trigger → 2. Command created → 3. Agent receives → 4. Scan runs → 
// 5. Results reported → 6. History logged → 7. History UI displays correctly

Manual Testing Checklist

  • Click each subsystem scan button
  • Verify scan runs and results appear
  • Verify history entry created for each
  • Verify history shows subsystem-specific icons and names
  • Verify failed scans create history entries
  • Verify command ack system tracks scan commands
  • Verify circuit breakers show scan activity

ETHOS Compliance Checklist

Errors are History, Not /dev/null

  • All scan errors → history table
  • All scan completions → history table
  • Button click failures → history table
  • Command creation failures → history table
  • Agent unreachable errors → history table
  • Subsystem context in all history entries

Security is Non-Negotiable

  • All scan endpoints → AuthMiddleware() (already done)
  • Command signing → Ed25519 nonces (already done)
  • No scan credentials in logs

Assume Failure; Build for Resilience

  • Agent unavailable → clear error to UI
  • Scan timeout → properly handled
  • Partial failures → reported to history
  • Retry logic considered (not automatic for manual scans)

Idempotency

  • Safe to click scan multiple times
  • Each scan creates distinct history entry
  • No duplicate state from repeated scans
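
The idempotency items can be read as "one history entry per command ID". The in-memory sketch below illustrates that rule; `HistoryEntry` and `HistoryStore` are hypothetical, and in the real database the same guarantee would more likely come from a uniqueness constraint on command_id, which this plan does not yet specify.

```go
package main

import "sync"

// HistoryEntry is a reduced, hypothetical view of a history row.
type HistoryEntry struct {
	CommandID string
	Action    string
	Result    string
}

// HistoryStore keeps at most one entry per command ID.
type HistoryStore struct {
	mu        sync.Mutex
	byCommand map[string]HistoryEntry
}

// NewHistoryStore returns an empty store.
func NewHistoryStore() *HistoryStore {
	return &HistoryStore{byCommand: make(map[string]HistoryEntry)}
}

// Record inserts e unless an entry for the same command already exists.
// It reports whether the entry was inserted.
func (s *HistoryStore) Record(e HistoryEntry) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.byCommand[e.CommandID]; exists {
		return false
	}
	s.byCommand[e.CommandID] = e
	return true
}

// Len returns the number of stored entries.
func (s *HistoryStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.byCommand)
}
```

Two clicks on the same scan button produce two commands and therefore two entries, while a retried result report for the same command is dropped.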

No Marketing Fluff

  • Action names: "scan_docker", "scan_storage", "scan_system", "scan_updates"
  • History display: "Docker Scan", "Storage Scan" etc.
  • Subsystem-specific icons (not generic play button)
  • Clear, honest logging throughout

Implementation Phases

Phase 1: Database Migration (30 min)

  • Add subsystem column to history table
  • Run migration
  • Update ORM models/queries

Phase 2: Backend API Updates (1 hour)

  • Update TriggerSubsystem to log with subsystem context
  • Update command result handler to include subsystem
  • Update queries to handle subsystem filtering

Phase 3: Agent Updates (1 hour)

  • Add [HISTORY] logging to all scan handlers
  • Ensure subsystem context flows through
  • Verify error handling logs to history

Phase 4: Frontend Updates (1 hour)

  • Add subsystem to HistoryEntry type
  • Add subsystem icons map
  • Update display logic to show subsystem context
  • Add subsystem filtering to history UI

Phase 5: Testing (1 hour)

  • Unit tests for backend changes
  • Integration tests for full flow
  • Manual testing of each subsystem scan

Total Estimated Time: 4.5 hours


Risks and Considerations

Risk 1: Database migration on production data

  • Mitigation: Test migration on backup
  • Plan: Run during low-activity window

Risk 2: Performance impact of additional column

  • Likelihood: Low (indexed, small varchar)
  • Mitigation: Add index during migration

Risk 3: UI breaks for old entries without subsystem

  • Mitigation: Handle NULL gracefully ("Unknown Scan")

Planning Documents Status

This is NEW Issue #3 - separate from completed Issues #1 and #2.

New Planning Documents Created:

  • ISSUE_003_SCAN_TRIGGER_FIX.md - This file
  • UX_ISSUE_ANALYSIS_scan_history.md - Related UX issue (documented already)

Update Existing:

  • STATE_PRESERVATION.md - Add Issue #3 tracking
  • session_2025-12-18-completion.md - Add note about Issue #3 discovered

Next Steps for Tomorrow

  1. Start of Day: Review this plan
  2. Database: Run migration
  3. Backend: Update handlers and queries
  4. Agent: Add [HISTORY] logging
  5. Frontend: Update UI components
  6. Testing: Verify all scan flows work
  7. Documentation: Update completion status

Sign-off

Planning By: Ani Tunturi (for Casey)
Review Status: Ready for implementation
Complexity: Medium-High (touching multiple layers)
Confidence: High (follows patterns established in Issues #1-2)

Blood, Sweat, and Tears Commitment: Yes - proper implementation only