# ISSUE #3: Scan Trigger Flow - Proper Implementation Plan

**Date**: 2025-12-18 (Planning for tomorrow)
**Status**: Planning Phase (Ready for implementation tomorrow)
**Severity**: High (Scan buttons currently return an error)
**New Scope**: Beyond Issues #1 and #2 (completed)

---

## Issue Summary

Individual "Scan" buttons for each subsystem (docker, storage, system, updates) all return the same error:

> "Failed to trigger scan: Failed to create command"

**Why**: Command acknowledgment and history logging flows are not properly integrated for subsystem-specific scans.

**What Needs to Happen**: Full ETHOS-compliant flow from UI click → API → Agent → Results → History

---

## Current State Analysis

### UI Layer (AgentHealth.tsx) ✅ WORKING

- ✅ Per-subsystem scan buttons exist
- ✅ `handleTriggerScan(subsystem.subsystem)` passes subsystem name
- `triggerScanMutation` makes API call to: `/api/v1/agents/:id/subsystems/:subsystem/trigger`

### Backend API (subsystems.go) ✅ MOSTLY WORKING

- ✅ `TriggerSubsystem` handler receives subsystem parameter
- ✅ Creates distinct command type: `commandType := "scan_" + subsystem`
- ✅ Creates AgentCommand with unique command_type
- **❌ FAILING**: `signAndCreateCommand` call fails

### Agent (main.go) ✅ MOSTLY WORKING

- ✅ `case "scan_updates":` handles update scans
- ✅ `case "scan_storage":` handles storage scans
- **❌ ISSUE**: Command acknowledgment flow needs review

### History/Reconciliation ❌ NOT INTEGRATED

- **Missing**: Subsystem context in history logging
- **Broken**: Command acknowledgment for scan commands
- **Inconsistent**: Some logs go to history, some don't

---

## Proper Implementation Requirements (ETHOS)

### Core Principles to Follow

1. **Errors are History, Not /dev/null** ✅ MUST HAVE
   - Scan failures → history table with context
   - Button click errors → history table
   - Command creation errors → history table
   - Agent handler errors → history table

2. **Security is Non-Negotiable** ✅ MUST HAVE
   - All scan triggers → authenticated endpoints (already done)
   - Command signing → Ed25519 nonces (already done)
   - Circuit breaker integration (already exists)

3. **Assume Failure; Build for Resilience** ✅ MUST HAVE
   - Scan failures → retry logic (if appropriate)
   - Command creation failures → clear error context
   - Agent unreachable → proper error to UI
   - Partial failures → handled gracefully

4. **Idempotency** ✅ MUST HAVE
   - Scan operations repeatable (safe to trigger multiple times)
   - No duplicate history entries for same scan
   - Results properly timestamped for tracking

5. **No Marketing Fluff** ✅ MUST HAVE
   - Clear action names in history: "scan_docker", "scan_storage", "scan_system"
   - Subsystem icons in history display (not just text)
   - Accurate, honest logging throughout

---

## Full Flow Design (From Click to History)

### Phase 1: User Clicks Scan Button

**UI Event**: `handleTriggerScan(subsystem.subsystem)`

```text
User clicks: [Scan] button on Docker row
  → handleTriggerScan("docker")
  → triggerScanMutation.mutate("docker")
  → POST /api/v1/agents/:id/subsystems/docker/trigger
```

**ETHOS Requirements**:
- Button disabled during pending state
- Loading indicator
- Success/error toast (already doing this)

### Phase 2: Backend Receives Trigger POST

**Handler**: `subsystems.go:TriggerSubsystem`

```text
URL: POST /api/v1/agents/:id/subsystems/:subsystem/trigger
  → Authenticate (already done)
  → Validate agent exists
  → Validate subsystem is enabled
  → Get current config
  → Generate command_id
```

**Command Creation**:

```go
command := &models.AgentCommand{
    AgentID:     agentID,
    CommandType: "scan_" + subsystem, // "scan_docker", "scan_storage", etc.
    Status:      "pending",
    Source:      "web_ui",
    // ADD: Subsystem field for filtering/querying
    Subsystem: subsystem,
}

// Add [HISTORY] logging
log.Printf("[HISTORY] [server] [scan] command_created agent_id=%s subsystem=%s command_id=%s timestamp=%s",
    agentID, subsystem, command.ID, time.Now().Format(time.RFC3339))

err = h.signAndCreateCommand(command)
```

**ETHOS Requirements**:
- ✅ All errors logged before returning
- ✅ History entry created for command creation attempts
- ✅ Subsystem context preserved in logs

### Phase 3: Command Acknowledgment System

The scan command must flow through the standard acknowledgment system:

```go
// Already exists: pending_acks.json tracking
ackTracker.Create(command.ID, time.Now())

// → Agent checks in: receives command
// → Agent starts scan: reports status?
// → Agent completes: reports results
// → Server updates history
// → Acknowledgment removed
```

**Current Missing Pieces**:
- Command results not being saved properly
- Subsystem context not flowing through ack system
- Scan results not creating history entries

### Phase 4: Agent Receives Scan Command

**Agent Handling**: `main.go:handleCommand`

```go
case "scan_docker":
    log.Printf("[HISTORY] [agent] [scan_docker] command_received agent_id=%s command_id=%s timestamp=%s",
        cfg.AgentID, cmd.ID, time.Now().Format(time.RFC3339))

    results, err := handleScanDocker(apiClient, cfg, ackTracker, scanOrchestrator, cmd.ID)
    if err != nil {
        log.Printf("[ERROR] [agent] [scan_docker] scan_failed error=%v timestamp=%s",
            err, time.Now().Format(time.RFC3339))
        // %q quotes the error text so it stays one parseable field
        log.Printf("[HISTORY] [agent] [scan_docker] scan_failed error=%q timestamp=%s",
            err, time.Now().Format(time.RFC3339))
        // Update command status: failed
        // Report back via API
        // Return error
    }

    // len(results) assumes results is a slice of found items
    log.Printf("[SUCCESS] [agent] [scan_docker] scan_completed items=%d timestamp=%s",
        len(results), time.Now().Format(time.RFC3339))
    log.Printf("[HISTORY] [agent] [scan_docker] scan_completed items=%d timestamp=%s",
        len(results), time.Now().Format(time.RFC3339))
    // Update command status: success
    // Report results via API
```

**Existing Handlers**:
- `handleScanUpdatesV2` - needs review
- `handleScanStorage` - needs review
- `handleScanSystem` - needs review
- `handleScanDocker` - needs review

### Phase 5: Results Reported Back

**API Endpoint**: Agent reports scan results via `POST /api/v1/agents/:id/commands/:command_id/result`

```json
{
  "command_id": "...",
  "result": "success",
  "items_found": 4,
  "stdout": "...",
  "subsystem": "docker"
}
```

**Server Handler**: Updates history table

```go
// Insert into history table (constant and variable names here are illustrative)
const insertHistorySQL = `
INSERT INTO history (agent_id, command_id, action, result, subsystem, stdout, stderr, executed_at)
VALUES (?, ?, 'scan_docker', ?, 'docker', ?, ?, NOW())`

// Add [HISTORY] logging
log.Printf("[HISTORY] [server] [scan_docker] result_logged agent_id=%s command_id=%s timestamp=%s",
    agentID, commandID, time.Now().Format(time.RFC3339))
```

### Phase 6: History Display

**UI Component**: `HistoryTimeline.tsx`

```typescript
// Retrieve history entries:
//   GET /api/v1/history?agent_id=...&subsystem=docker

// Display with subsystem context (JSX fragment)
{getActionIcon(entry.action, entry.subsystem)}
{getSubsystemDisplayName(entry.subsystem)} Scan

// Icons based on subsystem
getActionIcon("scan", "docker")   // → Docker icon
getActionIcon("scan", "storage")  // → Storage icon
getActionIcon("scan", "system")   // → System icon
```

---

## Database Changes Required

### Table: `history` (or logs)

**Add column**:

```sql
ALTER TABLE history ADD COLUMN subsystem VARCHAR(50);
CREATE INDEX idx_history_agent_action_subsystem ON history(agent_id, action, subsystem);
```

**Populate for existing scan entries**:
- Parse stdout for clues to determine subsystem
- Or set to NULL for existing entries
- UI must handle NULL (display as "Unknown Scan")

---

## Code Changes Required

### Backend (aggregator-server)

**Files to Modify**:
1. `internal/models/command.go` - Add Subsystem field
2. `internal/database/queries/commands.go` - Update for subsystem
3. `internal/api/handlers/subsystems.go` - Update TriggerSubsystem logging
4. `internal/api/handlers/commands.go` - Update command result handler
5. `internal/database/migrations/` - Add subsystem column migration

**New Queries Needed**:

```sql
-- Insert history with subsystem
INSERT INTO history (...) VALUES (..., subsystem)

-- Query history by subsystem
SELECT * FROM history WHERE agent_id = ? AND subsystem = ?
```

### Agent (aggregator-agent)

**Files to Modify**:
1. `cmd/agent/main.go` - Update all `handleScan*` functions with [HISTORY] logging
2. `internal/orchestrator/scanner.go` - Ensure wrappers pass subsystem context
3. `internal/scanner/` - Add subsystem identification to results

**Add to all scan handlers**:

```go
// Each handleScan* function needs:
// 1. [HISTORY] log when starting
// 2. [HISTORY] log on completion
// 3. [HISTORY] log on error
// 4. Subsystem context in all log messages
```

### Frontend (aggregator-web)

**Files to Modify**:
1. `src/types/index.ts` - Add subsystem to HistoryEntry interface
2. `src/components/HistoryTimeline.tsx` - Update display logic
3. `src/lib/api.ts` - Update API call to include subsystem parameter
4. `src/components/AgentHealth.tsx` - Add subsystem icons map

**Display Logic**:

```typescript
// Icon values are placeholders; supply the actual icon element for each subsystem.
const subsystemIcon = {
  docker: null,
  storage: null,
  system: null,
  updates: null,
  dnf: null,
  winget: null,
  apt: null,
};

const displayName = {
  docker: 'Docker',
  storage: 'Storage',
  system: 'System',
  updates: 'Package Updates',
  // ... etc
};
```

---

## Testing Requirements

### Unit Tests

```go
// Test command creation with subsystem
TestCreateCommand_WithSubsystem()
TestCreateCommand_WithoutSubsystem()

// Test history insertion with subsystem
TestCreateHistory_WithSubsystem()
TestQueryHistory_BySubsystem()

// Test agent scan handlers
TestHandleScanDocker_LogsHistory()
TestHandleScanDocker_Failure() // Error logs to history
```

### Integration Tests

```go
// Test full flow
TestScanTrigger_FullFlow_Docker()
TestScanTrigger_FullFlow_Storage()
TestScanTrigger_FullFlow_System()
TestScanTrigger_FullFlow_Updates()

// Verify each step:
// 1. UI trigger → 2. Command created → 3. Agent receives → 4. Scan runs →
// 5. Results reported → 6. History logged → 7. History UI displays correctly
```

### Manual Testing Checklist

- [ ] Click each subsystem scan button
- [ ] Verify scan runs and results appear
- [ ] Verify history entry created for each
- [ ] Verify history shows subsystem-specific icons and names
- [ ] Verify failed scans create history entries
- [ ] Verify command ack system tracks scan commands
- [ ] Verify circuit breakers show scan activity

---

## ETHOS Compliance Checklist

### Errors are History, Not /dev/null

- [ ] All scan errors → history table
- [ ] All scan completions → history table
- [ ] Button click failures → history table
- [ ] Command creation failures → history table
- [ ] Agent unreachable errors → history table
- [ ] Subsystem context in all history entries

### Security is Non-Negotiable

- [ ] All scan endpoints → AuthMiddleware() (already done)
- [ ] Command signing → Ed25519 nonces (already done)
- [ ] No scan credentials in logs

### Assume Failure; Build for Resilience

- [ ] Agent unavailable → clear error to UI
- [ ] Scan timeout → properly handled
- [ ] Partial failures → reported to history
- [ ] Retry logic considered (not automatic for manual scans)

### Idempotency

- [ ] Safe to click scan multiple times
- [ ] Each scan creates distinct history entry
- [ ] No duplicate state from repeated scans

### No Marketing Fluff

- [ ] Action names: "scan_docker", "scan_storage", "scan_system"
- [ ] History display: "Docker Scan", "Storage Scan" etc.
- [ ] Subsystem-specific icons (not generic play button)
- [ ] Clear, honest logging throughout

---

## Implementation Phases

### Phase 1: Database Migration (30 min)

- Add `subsystem` column to history table
- Run migration
- Update ORM models/queries

### Phase 2: Backend API Updates (1 hour)

- Update TriggerSubsystem to log with subsystem context
- Update command result handler to include subsystem
- Update queries to handle subsystem filtering

### Phase 3: Agent Updates (1 hour)

- Add [HISTORY] logging to all scan handlers
- Ensure subsystem context flows through
- Verify error handling logs to history

### Phase 4: Frontend Updates (1 hour)

- Add subsystem to HistoryEntry type
- Add subsystem icons map
- Update display logic to show subsystem context
- Add subsystem filtering to history UI

### Phase 5: Testing (1 hour)

- Unit tests for backend changes
- Integration tests for full flow
- Manual testing of each subsystem scan

**Total Estimated Time**: 4.5 hours

---

## Risks and Considerations

**Risk 1**: Database migration on production data
- Mitigation: Test migration on backup
- Plan: Run during low-activity window

**Risk 2**: Performance impact of additional column
- Likelihood: Low (indexed, small varchar)
- Mitigation: Add index during migration

**Risk 3**: UI breaks for old entries without subsystem
- Mitigation: Handle NULL gracefully ("Unknown Scan")

---

## Planning Documents Status

This is **NEW** Issue #3, separate from completed Issues #1 and #2.

**New Planning Documents Created**:
- `ISSUE_003_SCAN_TRIGGER_FIX.md` - This file
- `UX_ISSUE_ANALYSIS_scan_history.md` - Related UX issue (documented already)

**Update Existing**:
- `STATE_PRESERVATION.md` - Add Issue #3 tracking
- `session_2025-12-18-completion.md` - Add note about the discovery of Issue #3

---

## Next Steps for Tomorrow

1. **Start of Day**: Review this plan
2. **Database**: Run migration
3. **Backend**: Update handlers and queries
4. **Agent**: Add [HISTORY] logging
5. **Frontend**: Update UI components
6. **Testing**: Verify all scan flows work
7. **Documentation**: Update completion status

---

## Sign-off

**Planning By**: Ani Tunturi (for Casey)
**Review Status**: Ready for implementation
**Complexity**: Medium-High (touching multiple layers)
**Confidence**: High (follows patterns established in Issues #1-2)
**Blood, Sweat, and Tears Commitment**: Yes - proper implementation only
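
---

## Appendix: [HISTORY] Log Line Sketch

The plan above repeats the same `[HISTORY] [component] [action] event key=value ... timestamp=...` layout in both the server and the agent, and derives every command type as `"scan_" + subsystem`. A small shared helper might keep those two conventions consistent. The following is a minimal sketch only: `scanCommandType`, `historyEvent`, and `knownSubsystems` are hypothetical names that do not exist in the codebase, and the fixed subsystem set is an assumption (the real server may read enabled subsystems from agent config instead).

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// knownSubsystems lists the four subsystems named in this plan.
// Assumption: the real server may look these up per agent instead.
var knownSubsystems = map[string]bool{
	"docker": true, "storage": true, "system": true, "updates": true,
}

// scanCommandType returns "scan_<subsystem>", or an error for unknown names,
// so a bad subsystem fails with context before any command is created.
func scanCommandType(subsystem string) (string, error) {
	if !knownSubsystems[subsystem] {
		return "", fmt.Errorf("unknown subsystem %q", subsystem)
	}
	return "scan_" + subsystem, nil
}

// historyEvent is a hypothetical value type for one [HISTORY] log line.
type historyEvent struct {
	Component string      // "server" or "agent"
	Action    string      // e.g. "scan_docker"
	Event     string      // e.g. "command_created", "scan_failed"
	Fields    [][2]string // ordered key/value pairs
	Timestamp time.Time
}

// String renders the event in the layout used by the log lines in this plan.
// Values containing whitespace or quotes are quoted so each stays one field.
func (e historyEvent) String() string {
	var b strings.Builder
	fmt.Fprintf(&b, "[HISTORY] [%s] [%s] %s", e.Component, e.Action, e.Event)
	for _, kv := range e.Fields {
		if strings.ContainsAny(kv[1], " \t\"") {
			fmt.Fprintf(&b, " %s=%q", kv[0], kv[1])
		} else {
			fmt.Fprintf(&b, " %s=%s", kv[0], kv[1])
		}
	}
	fmt.Fprintf(&b, " timestamp=%s", e.Timestamp.UTC().Format(time.RFC3339))
	return b.String()
}

func main() {
	cmdType, err := scanCommandType("docker")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(historyEvent{
		Component: "server",
		Action:    cmdType,
		Event:     "command_created",
		Fields:    [][2]string{{"agent_id", "agent-1"}, {"subsystem", "docker"}},
		Timestamp: time.Now(),
	})
}
```

Fields are kept as an ordered slice rather than a map so the rendered line is deterministic, which should make the `TestHandleScanDocker_LogsHistory` style of assertion easier to write.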