ISSUE #3: Scan Trigger Flow - Proper Implementation Plan
Date: 2025-12-18 (Planning for tomorrow)
Status: Planning Phase (Ready for implementation tomorrow)
Severity: High (Scan buttons currently error)
New Scope: Beyond Issues #1 and #2 (completed)
Issue Summary
The individual "Scan" buttons for each subsystem (docker, storage, system, updates) all return the same error:
"Failed to trigger scan: Failed to create command"
Why: Command acknowledgment and history logging flows are not properly integrated for subsystem-specific scans.
What Needs to Happen: Full ETHOS-compliant flow from UI click → API → Agent → Results → History
Current State Analysis
UI Layer (AgentHealth.tsx) ✅ WORKING
- ✅ Per-subsystem scan buttons exist
- ✅ handleTriggerScan(subsystem.subsystem) passes subsystem name
- ✅ triggerScanMutation makes API call to: /api/v1/agents/:id/subsystems/:subsystem/trigger
Backend API (subsystems.go) ✅ MOSTLY WORKING
- ✅ TriggerSubsystem handler receives subsystem parameter
- ✅ Creates distinct command type: commandType := "scan_" + subsystem
- ✅ Creates AgentCommand with unique command_type
- ❌ FAILING: signAndCreateCommand call fails
Agent (main.go) ✅ MOSTLY WORKING
- ✅ case "scan_updates": handles update scans
- ✅ case "scan_storage": handles storage scans
- ❌ ISSUE: Command acknowledgment flow needs review
History/Reconciliation ❌ NOT INTEGRATED
- Missing: Subsystem context in history logging
- Broken: Command acknowledgment for scan commands
- Inconsistent: Some logs go to history, some don't
Proper Implementation Requirements (ETHOS)
Core Principles to Follow
- Errors are History, Not /dev/null ✅ MUST HAVE
  - Scan failures → history table with context
  - Button click errors → history table
  - Command creation errors → history table
  - Agent handler errors → history table
- Security is Non-Negotiable ✅ MUST HAVE
  - All scan triggers → authenticated endpoints (already done)
  - Command signing → Ed25519 nonces (already done)
  - Circuit breaker integration (already exists)
- Assume Failure; Build for Resilience ✅ MUST HAVE
  - Scan failures → retry logic (if appropriate)
  - Command creation failures → clear error context
  - Agent unreachable → proper error to UI
  - Partial failures → handled gracefully
- Idempotency ✅ MUST HAVE
  - Scan operations repeatable (safe to trigger multiple times)
  - No duplicate history entries for same scan
  - Results properly timestamped for tracking
- No Marketing Fluff ✅ MUST HAVE
  - Clear action names in history: "scan_docker", "scan_storage", "scan_system"
  - Subsystem icons in history display (not just text)
  - Accurate, honest logging throughout
Full Flow Design (From Click to History)
Phase 1: User Clicks Scan Button
UI Event: handleTriggerScan(subsystem.subsystem)
User clicks: [Scan] button on Docker row
→ handleTriggerScan("docker")
→ triggerScanMutation.mutate("docker")
→ POST /api/v1/agents/:id/subsystems/docker/trigger
Ethos Requirements:
- Button disable during pending state
- Loading indicator
- Success/error toast (already doing this)
Phase 2: Backend Receives Trigger POST
Handler: subsystems.go:TriggerSubsystem
URL: POST /api/v1/agents/:id/subsystems/:subsystem/trigger
→ Authenticate (already done)
→ Validate agent exists
→ Validate subsystem is enabled
→ Get current config
→ Generate command_id
Command Creation:
command := &models.AgentCommand{
    AgentID:     agentID,
    CommandType: "scan_" + subsystem, // "scan_docker", "scan_storage", etc.
    Status:      "pending",
    Source:      "web_ui",
    // ADD: Subsystem field for filtering/querying
    Subsystem:   subsystem,
}
// Add [HISTORY] logging
log.Printf("[HISTORY] [server] [scan] command_created agent_id=%s subsystem=%s command_id=%s timestamp=%s",
    agentID, subsystem, command.ID, time.Now().Format(time.RFC3339))
err = h.signAndCreateCommand(command)
Ethos Requirements:
- ✅ All errors logged before returning
- ✅ History entry created for command creation attempts
- ✅ Subsystem context preserved in logs
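The requirements above could be sketched roughly as follows. This is a minimal illustration, not the actual handler: the `AgentCommand` struct is a simplified stand-in for `models.AgentCommand`, the `persist` callback is a hypothetical placeholder for `signAndCreateCommand`, and the `command_create_failed` event name and `knownSubsystems` set are assumptions. Unlike the snippet above, it logs after the create attempt so that the failure path also produces a [HISTORY] line.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// AgentCommand is a simplified stand-in for models.AgentCommand.
type AgentCommand struct {
	ID          string
	AgentID     string
	CommandType string
	Status      string
	Source      string
	Subsystem   string
}

// knownSubsystems lists the subsystems named in this plan (assumed set).
var knownSubsystems = map[string]bool{
	"docker": true, "storage": true, "system": true, "updates": true,
}

// ErrUnknownSubsystem is returned when the URL parameter is not a known subsystem.
var ErrUnknownSubsystem = errors.New("unknown subsystem")

// newScanCommand validates the subsystem and builds a pending scan command.
func newScanCommand(agentID, subsystem string) (*AgentCommand, error) {
	if !knownSubsystems[subsystem] {
		return nil, fmt.Errorf("%w: %q", ErrUnknownSubsystem, subsystem)
	}
	return &AgentCommand{
		AgentID:     agentID,
		CommandType: "scan_" + subsystem,
		Status:      "pending",
		Source:      "web_ui",
		Subsystem:   subsystem,
	}, nil
}

// createScanCommand builds and persists a command, writing a [HISTORY] line
// for both the success and the failure outcome before returning.
// persist is a hypothetical stand-in for signAndCreateCommand.
func createScanCommand(agentID, subsystem string, persist func(*AgentCommand) error) (*AgentCommand, error) {
	cmd, err := newScanCommand(agentID, subsystem)
	if err == nil {
		err = persist(cmd)
	}
	ts := time.Now().Format(time.RFC3339)
	if err != nil {
		log.Printf("[HISTORY] [server] [scan] command_create_failed agent_id=%s subsystem=%s error=%q timestamp=%s",
			agentID, subsystem, err, ts)
		return nil, err
	}
	log.Printf("[HISTORY] [server] [scan] command_created agent_id=%s subsystem=%s command_id=%s timestamp=%s",
		agentID, subsystem, cmd.ID, ts)
	return cmd, nil
}

func main() {
	// A fake persist step that assigns an ID, as the real database layer presumably would.
	persist := func(c *AgentCommand) error { c.ID = "cmd-1"; return nil }
	cmd, err := createScanCommand("agent-1", "docker", persist)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(cmd.CommandType, cmd.Subsystem)
}
```

Logging after the create attempt means `command_id` is populated in the success line, which may or may not match how the real database layer assigns IDs.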
Phase 3: Command Acknowledgment System
The scan command must flow through the standard acknowledgment system:
// Already exists: pending_acks.json tracking
ackTracker.Create(command.ID, time.Now())
→ Agent checks in: receives command
→ Agent starts scan: reports status?
→ Agent completes: reports results
→ Server updates history
→ Acknowledgment removed
Current Missing Pieces:
- Command results not being saved properly
- Subsystem context not flowing through ack system
- Scan results not creating history entries
Phase 4: Agent Receives Scan Command
Agent Handling: main.go:handleCommand
case "scan_docker":
    log.Printf("[HISTORY] [agent] [scan_docker] command_received agent_id=%s command_id=%s timestamp=%s",
        cfg.AgentID, cmd.ID, time.Now().Format(time.RFC3339))
    results, err := handleScanDocker(apiClient, cfg, ackTracker, scanOrchestrator, cmd.ID)
    if err != nil {
        log.Printf("[ERROR] [agent] [scan_docker] scan_failed error=%v timestamp=%s",
            err, time.Now().Format(time.RFC3339))
        // %q quotes and escapes the error text so the history line stays parseable
        log.Printf("[HISTORY] [agent] [scan_docker] scan_failed error=%q timestamp=%s",
            err, time.Now().Format(time.RFC3339))
        // Update command status: failed
        // Report back via API
        // Return error
    }
    // len(results) assumes handleScanDocker returns a slice of found items
    log.Printf("[SUCCESS] [agent] [scan_docker] scan_completed items=%d timestamp=%s",
        len(results), time.Now().Format(time.RFC3339))
    log.Printf("[HISTORY] [agent] [scan_docker] scan_completed items=%d timestamp=%s",
        len(results), time.Now().Format(time.RFC3339))
    // Update command status: success
    // Report results via API
Existing Handlers:
- handleScanUpdatesV2 - needs review
- handleScanStorage - needs review
- handleScanSystem - needs review
- handleScanDocker - needs review
Phase 5: Results Reported Back
API Endpoint: Agent reports scan results
// POST /api/v1/agents/:id/commands/:command_id/result
{
  "command_id": "...",
  "result": "success",
  "items_found": 4,
  "stdout": "...",
  "subsystem": "docker"
}
Server Handler: Updates history table
// Insert into history table
INSERT INTO history (agent_id, command_id, action, result, subsystem, stdout, stderr, executed_at)
VALUES (?, ?, 'scan_docker', ?, 'docker', ?, ?, NOW())
// Add [HISTORY] logging
log.Printf("[HISTORY] [server] [scan_docker] result_logged agent_id=%s command_id=%s timestamp=%s",
    agentID, commandID, time.Now().Format(time.RFC3339))
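On the server side, the result payload might be decoded and checked along these lines before anything is written to history. The `ScanResult` type, `parseScanResult`, and the `Action` helper are hypothetical names; the JSON field names are taken from the example payload above, and the rule that `command_id` and `subsystem` are mandatory is an assumption.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ScanResult mirrors the example payload the agent reports back.
type ScanResult struct {
	CommandID  string `json:"command_id"`
	Result     string `json:"result"`
	ItemsFound int    `json:"items_found"`
	Stdout     string `json:"stdout"`
	Subsystem  string `json:"subsystem"`
}

// Action derives the history action name, e.g. "scan_docker".
func (r ScanResult) Action() string {
	return "scan_" + r.Subsystem
}

// parseScanResult decodes the request body and rejects payloads that would
// produce a history row without command or subsystem context.
func parseScanResult(body []byte) (ScanResult, error) {
	var r ScanResult
	if err := json.Unmarshal(body, &r); err != nil {
		return ScanResult{}, fmt.Errorf("decode scan result: %w", err)
	}
	if r.CommandID == "" || r.Subsystem == "" {
		return ScanResult{}, errors.New("scan result missing command_id or subsystem")
	}
	return r, nil
}

func main() {
	body := []byte(`{"command_id":"cmd-1","result":"success","items_found":4,"stdout":"ok","subsystem":"docker"}`)
	r, err := parseScanResult(body)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(r.Action(), r.ItemsFound)
}
```

Rejecting a payload should itself be logged to history, in line with the "Errors are History" principle.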
Phase 6: History Display
UI Component: HistoryTimeline.tsx
// Retrieve history entries
GET /api/v1/history?agent_id=...&subsystem=docker
// Display with subsystem context
<span className="capitalize flex items-center">
{getActionIcon(entry.action, entry.subsystem)}
<span>{getSubsystemDisplayName(entry.subsystem)} Scan</span>
</span>
// Icons based on subsystem
getActionIcon("scan_docker", "docker") → Docker icon
getActionIcon("scan_storage", "storage") → Storage icon
getActionIcon("scan_system", "system") → System icon
Database Changes Required
Table: history (or logs)
Add column:
ALTER TABLE history ADD COLUMN subsystem VARCHAR(50);
CREATE INDEX idx_history_agent_action_subsystem ON history(agent_id, action, subsystem);
Populate for existing scan entries:
- Parse stdout for clues to determine subsystem
- Or set to NULL for existing entries
- UI must handle NULL (display as "Unknown Scan")
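If the server builds the display label, the NULL case could be handled as in the sketch below. It assumes the nullable `subsystem` column is read into a `*string` (Go's `sql.NullString` would work equally well); `scanLabel` and the `displayNames` map are hypothetical names, with the labels copied from the frontend display map in this plan.

```go
package main

import "fmt"

// displayNames maps subsystem identifiers to human-readable names.
var displayNames = map[string]string{
	"docker":  "Docker",
	"storage": "Storage",
	"system":  "System",
	"updates": "Package Updates",
}

// scanLabel returns the history label for a possibly-NULL subsystem value.
// A nil pointer represents NULL, as left on rows created before the migration.
func scanLabel(subsystem *string) string {
	if subsystem == nil || *subsystem == "" {
		return "Unknown Scan"
	}
	if name, ok := displayNames[*subsystem]; ok {
		return name + " Scan"
	}
	// Unrecognized but non-empty: show the raw identifier rather than hide it.
	return *subsystem + " Scan"
}

func main() {
	docker := "docker"
	fmt.Println(scanLabel(&docker))
	fmt.Println(scanLabel(nil))
}
```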
Code Changes Required
Backend (aggregator-server)
Files to Modify:
- internal/models/command.go - Add Subsystem field
- internal/database/queries/commands.go - Update for subsystem
- internal/api/handlers/subsystems.go - Update TriggerSubsystem logging
- internal/api/handlers/commands.go - Update command result handler
- internal/database/migrations/ - Add subsystem column migration
New Queries Needed:
-- Insert history with subsystem
INSERT INTO history (...) VALUES (..., subsystem)
-- Query history by subsystem
SELECT * FROM history WHERE agent_id = ? AND subsystem = ?
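Since the history endpoint takes `subsystem` as an optional filter, the query could be assembled conditionally, as in this sketch. `historyQuery` is a hypothetical helper; it keeps the `?` placeholder style used above (the actual driver may expect a different one) and assumes ordering by the `executed_at` column from the insert statement.

```go
package main

import "fmt"

// historyQuery builds the history SELECT and its arguments.
// An empty subsystem means "no filter", so pre-migration rows are still returned.
func historyQuery(agentID, subsystem string) (string, []any) {
	query := "SELECT * FROM history WHERE agent_id = ?"
	args := []any{agentID}
	if subsystem != "" {
		query += " AND subsystem = ?"
		args = append(args, subsystem)
	}
	return query + " ORDER BY executed_at DESC", args
}

func main() {
	query, args := historyQuery("agent-1", "docker")
	fmt.Println(query)
	fmt.Println(args...)
}
```

Values are always passed as arguments, never concatenated into the SQL text, so the subsystem URL parameter cannot alter the statement.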
Agent (aggregator-agent)
Files to Modify:
- cmd/agent/main.go - Update all handleScan* functions with [HISTORY] logging
- internal/orchestrator/scanner.go - Ensure wrappers pass subsystem context
- internal/scanner/ - Add subsystem identification to results
Add to all scan handlers:
// Each handleScan* function needs:
// 1. [HISTORY] log when starting
// 2. [HISTORY] log on completion
// 3. [HISTORY] log on error
// 4. Subsystem context in all log messages
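Rather than repeating the four points above in every handler, a shared wrapper might apply them uniformly. The following is a sketch under assumptions: `runScanWithHistory`, the `scanFunc` signature (returning an item count), and the `scan_started` event name are hypothetical and do not reflect the existing `handleScan*` signatures.

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// logFunc matches log.Printf so the real logger or a test double can be passed.
type logFunc func(format string, args ...any)

// scanFunc is a hypothetical scan signature that returns the number of items found.
type scanFunc func() (int, error)

// errDiskUnreadable is an example failure used in main.
var errDiskUnreadable = errors.New("disk unreadable")

// runScanWithHistory runs scan and writes [HISTORY] lines on start,
// completion, and error, always including the subsystem-specific action.
func runScanWithHistory(logf logFunc, subsystem, commandID string, scan scanFunc) (int, error) {
	action := "scan_" + subsystem
	now := func() string { return time.Now().Format(time.RFC3339) }

	logf("[HISTORY] [agent] [%s] scan_started command_id=%s timestamp=%s", action, commandID, now())
	items, err := scan()
	if err != nil {
		logf("[HISTORY] [agent] [%s] scan_failed command_id=%s error=%q timestamp=%s",
			action, commandID, err, now())
		return 0, err
	}
	logf("[HISTORY] [agent] [%s] scan_completed command_id=%s items=%d timestamp=%s",
		action, commandID, items, now())
	return items, nil
}

func main() {
	items, err := runScanWithHistory(log.Printf, "docker", "cmd-1", func() (int, error) { return 4, nil })
	fmt.Println(items, err)

	_, err = runScanWithHistory(log.Printf, "storage", "cmd-2", func() (int, error) { return 0, errDiskUnreadable })
	fmt.Println(err)
}
```

Every scan then produces exactly two [HISTORY] lines, one at start and one for the outcome, regardless of which subsystem ran.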
Frontend (aggregator-web)
Files to Modify:
- src/types/index.ts - Add subsystem to HistoryEntry interface
- src/components/HistoryTimeline.tsx - Update display logic
- src/lib/api.ts - Update API call to include subsystem parameter
- src/components/AgentHealth.tsx - Add subsystem icons map
Display Logic:
const subsystemIcon = {
docker: <Container className="h-4 w-4" />,
storage: <HardDrive className="h-4 w-4" />,
system: <Cpu className="h-4 w-4" />,
updates: <Package className="h-4 w-4" />,
dnf: <Box className="h-4 w-4" />,
winget: <Windows className="h-4 w-4" />,
apt: <Linux className="h-4 w-4" />,
};
const displayName = {
docker: 'Docker',
storage: 'Storage',
system: 'System',
updates: 'Package Updates',
// ... etc
};
Testing Requirements
Unit Tests
// Test command creation with subsystem
TestCreateCommand_WithSubsystem()
TestCreateCommand_WithoutSubsystem()
// Test history insertion with subsystem
TestCreateHistory_WithSubsystem()
TestQueryHistory_BySubsystem()
// Test agent scan handlers
TestHandleScanDocker_LogsHistory()
TestHandleScanDocker_Failure() // Error logs to history
Integration Tests
// Test full flow
TestScanTrigger_FullFlow_Docker()
TestScanTrigger_FullFlow_Storage()
TestScanTrigger_FullFlow_System()
TestScanTrigger_FullFlow_Updates()
// Verify each step:
// 1. UI trigger → 2. Command created → 3. Agent receives → 4. Scan runs →
// 5. Results reported → 6. History logged → 7. History UI displays correctly
Manual Testing Checklist
- Click each subsystem scan button
- Verify scan runs and results appear
- Verify history entry created for each
- Verify history shows subsystem-specific icons and names
- Verify failed scans create history entries
- Verify command ack system tracks scan commands
- Verify circuit breakers show scan activity
ETHOS Compliance Checklist
Errors are History, Not /dev/null
- All scan errors → history table
- All scan completions → history table
- Button click failures → history table
- Command creation failures → history table
- Agent unreachable errors → history table
- Subsystem context in all history entries
Security is Non-Negotiable
- All scan endpoints → AuthMiddleware() (already done)
- Command signing → Ed25519 nonces (already done)
- No scan credentials in logs
Assume Failure; Build for Resilience
- Agent unavailable → clear error to UI
- Scan timeout → properly handled
- Partial failures → reported to history
- Retry logic considered (not automatic for manual scans)
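For the scan-timeout item, one possible shape on the agent side is to bound each scan with a context deadline and map expiry to a dedicated error that can be written to history. `runWithTimeout` and `ErrScanTimeout` are hypothetical names, and passing the cancellation channel to the scan function is a simplification chosen for this sketch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrScanTimeout is returned when a scan does not finish before the deadline.
var ErrScanTimeout = errors.New("scan timed out")

// runWithTimeout runs scan in a goroutine and waits for either its result or
// the deadline. The scan receives a channel that is closed on timeout so it
// can stop early.
func runWithTimeout(timeout time.Duration, scan func(cancelled <-chan struct{}) (int, error)) (int, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	type outcome struct {
		items int
		err   error
	}
	// Buffered so the goroutine can still send and exit after a timeout.
	done := make(chan outcome, 1)
	go func() {
		items, err := scan(ctx.Done())
		done <- outcome{items, err}
	}()

	select {
	case o := <-done:
		return o.items, o.err
	case <-ctx.Done():
		return 0, ErrScanTimeout
	}
}

func main() {
	items, err := runWithTimeout(50*time.Millisecond, func(cancelled <-chan struct{}) (int, error) {
		select {
		case <-time.After(time.Second): // simulated slow scan
			return 4, nil
		case <-cancelled:
			return 0, errors.New("scan cancelled")
		}
	})
	fmt.Println(items, err)
}
```

A timeout reported this way is a distinct, named failure, which keeps it separate from "agent unreachable" in the history table.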
Idempotency
- Safe to click scan multiple times
- Each scan creates distinct history entry
- No duplicate state from repeated scans
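The idempotency items can be read as: each scan (each command ID) gets its own history entry, but re-reporting the same command never adds a second one. A minimal in-memory sketch of that rule follows; `HistoryStore` and `Record` are hypothetical, and in the real database a unique constraint on `command_id` would be one way to get the same effect.

```go
package main

import (
	"fmt"
	"sync"
)

// HistoryEntry is a simplified history row.
type HistoryEntry struct {
	CommandID string
	Action    string
	Result    string
}

// HistoryStore keeps at most one entry per command ID.
type HistoryStore struct {
	mu        sync.Mutex
	byCommand map[string]HistoryEntry
}

// NewHistoryStore returns an empty store.
func NewHistoryStore() *HistoryStore {
	return &HistoryStore{byCommand: make(map[string]HistoryEntry)}
}

// Record stores the entry unless one already exists for the same command ID.
// It returns true when the entry was inserted and false for a duplicate report.
func (s *HistoryStore) Record(e HistoryEntry) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.byCommand[e.CommandID]; exists {
		return false
	}
	s.byCommand[e.CommandID] = e
	return true
}

// Len reports the number of stored entries.
func (s *HistoryStore) Len() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.byCommand)
}

func main() {
	store := NewHistoryStore()
	first := store.Record(HistoryEntry{CommandID: "cmd-1", Action: "scan_docker", Result: "success"})
	retry := store.Record(HistoryEntry{CommandID: "cmd-1", Action: "scan_docker", Result: "success"})
	second := store.Record(HistoryEntry{CommandID: "cmd-2", Action: "scan_docker", Result: "success"})
	fmt.Println(first, retry, second, store.Len())
}
```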
No Marketing Fluff
- Action names: "scan_docker", "scan_storage", "scan_system"
- History display: "Docker Scan", "Storage Scan" etc.
- Subsystem-specific icons (not generic play button)
- Clear, honest logging throughout
Implementation Phases
Phase 1: Database Migration (30 min)
- Add subsystem column to history table
- Run migration
- Update ORM models/queries
Phase 2: Backend API Updates (1 hour)
- Update TriggerSubsystem to log with subsystem context
- Update command result handler to include subsystem
- Update queries to handle subsystem filtering
Phase 3: Agent Updates (1 hour)
- Add [HISTORY] logging to all scan handlers
- Ensure subsystem context flows through
- Verify error handling logs to history
Phase 4: Frontend Updates (1 hour)
- Add subsystem to HistoryEntry type
- Add subsystem icons map
- Update display logic to show subsystem context
- Add subsystem filtering to history UI
Phase 5: Testing (1 hour)
- Unit tests for backend changes
- Integration tests for full flow
- Manual testing of each subsystem scan
Total Estimated Time: 4.5 hours
Risks and Considerations
Risk 1: Database migration on production data
- Mitigation: Test migration on backup
- Plan: Run during low-activity window
Risk 2: Performance impact of additional column
- Likelihood: Low (indexed, small varchar)
- Mitigation: Add index during migration
Risk 3: UI breaks for old entries without subsystem
- Mitigation: Handle NULL gracefully ("Unknown Scan")
Planning Documents Status
This is NEW Issue #3 - separate from completed Issues #1 and #2.
New Planning Documents Created:
- ISSUE_003_SCAN_TRIGGER_FIX.md - This file
- UX_ISSUE_ANALYSIS_scan_history.md - Related UX issue (documented already)
Update Existing:
- STATE_PRESERVATION.md - Add Issue #3 tracking
- session_2025-12-18-completion.md - Add note about Issue #3 discovered
Next Steps for Tomorrow
- Start of Day: Review this plan
- Database: Run migration
- Backend: Update handlers and queries
- Agent: Add [HISTORY] logging
- Frontend: Update UI components
- Testing: Verify all scan flows work
- Documentation: Update completion status
Sign-off
Planning By: Ani Tunturi (for Casey)
Review Status: Ready for implementation
Complexity: Medium-High (touching multiple layers)
Confidence: High (follows patterns established in Issues #1-2)
Blood, Sweat, and Tears Commitment: Yes - proper implementation only