Fimeg 484a7f77ce Add docs and project files - force for Culurien

2026-03-28 20:46:24 -04:00

ISSUE #3: Scan Trigger Flow - Proper Implementation Plan

Date: 2025-12-18 (Planning for tomorrow)
Status: Planning Phase (Ready for implementation tomorrow)
Severity: High (Scan buttons currently error)
New Scope: Beyond Issues #1 and #2 (completed)

Issue Summary

The individual "Scan" buttons for each subsystem (docker, storage, system, updates) all return the same error:

"Failed to trigger scan: Failed to create command"

Why: Command acknowledgment and history logging flows are not properly integrated for subsystem-specific scans.

What Needs to Happen: Full ETHOS-compliant flow from UI click → API → Agent → Results → History

Current State Analysis

UI Layer (AgentHealth.tsx) ✅ WORKING

✅ Per-subsystem scan buttons exist
✅ handleTriggerScan(subsystem.subsystem) passes subsystem name
✅ triggerScanMutation makes an API call to: POST /api/v1/agents/:id/subsystems/:subsystem/trigger

Backend API (subsystems.go) ✅ MOSTLY WORKING

✅ TriggerSubsystem handler receives subsystem parameter
✅ Creates distinct command type: commandType := "scan_" + subsystem
✅ Creates AgentCommand with unique command_type
❌ FAILING: signAndCreateCommand call fails

Agent (main.go) ✅ MOSTLY WORKING

✅ case "scan_updates": handles update scans
✅ case "scan_storage": handles storage scans
❌ ISSUE: Command acknowledgment flow needs review

History/Reconciliation ❌ NOT INTEGRATED

Missing: Subsystem context in history logging
Broken: Command acknowledgment for scan commands
Inconsistent: Some logs go to history, some don't

Proper Implementation Requirements (ETHOS)

Core Principles to Follow

1. Errors are History, Not /dev/null ✅ MUST HAVE
   - Scan failures → history table with context
   - Button click errors → history table
   - Command creation errors → history table
   - Agent handler errors → history table
2. Security is Non-Negotiable ✅ MUST HAVE
   - All scan triggers → authenticated endpoints (already done)
   - Command signing → Ed25519 nonces (already done)
   - Circuit breaker integration (already exists)
3. Assume Failure; Build for Resilience ✅ MUST HAVE
   - Scan failures → retry logic (if appropriate)
   - Command creation failures → clear error context
   - Agent unreachable → proper error to UI
   - Partial failures → handled gracefully
4. Idempotency ✅ MUST HAVE
   - Scan operations repeatable (safe to trigger multiple times)
   - No duplicate history entries for the same scan
   - Results properly timestamped for tracking
5. No Marketing Fluff ✅ MUST HAVE
   - Clear action names in history: "scan_docker", "scan_storage", "scan_system"
   - Subsystem icons in history display (not just text)
   - Accurate, honest logging throughout

Full Flow Design (From Click to History)

Phase 1: User Clicks Scan Button

UI Event: handleTriggerScan(subsystem.subsystem)

User clicks: [Scan] button on Docker row
  → handleTriggerScan("docker")
  → triggerScanMutation.mutate("docker")
  → POST /api/v1/agents/:id/subsystems/docker/trigger

ETHOS Requirements:

Button disabled while the request is pending
Loading indicator
Success/error toast (already doing this)

Phase 2: Backend Receives Trigger POST

Handler: subsystems.go:TriggerSubsystem

URL: POST /api/v1/agents/:id/subsystems/:subsystem/trigger
  → Authenticate (already done)
  → Validate agent exists
  → Validate subsystem is enabled
  → Get current config
  → Generate command_id

Command Creation:

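command := &models.AgentCommand{
  AgentID:     agentID,
  CommandType: "scan_" + subsystem,  // "scan_docker", "scan_storage", etc.
  Status:      "pending",
  Source:      "web_ui",
  // ADD: Subsystem field for filtering/querying
  Subsystem:   subsystem,
}

// Sign and store the command; log a failure before returning (errors are history)
if err := h.signAndCreateCommand(command); err != nil {
  log.Printf("[HISTORY] [server] [scan] command_creation_failed agent_id=%s subsystem=%s error=%q timestamp=%s",
    agentID, subsystem, err.Error(), time.Now().Format(time.RFC3339))
  // Return the error to the UI with context
  return
}

// Add [HISTORY] logging once the command exists, so command.ID is populated
log.Printf("[HISTORY] [server] [scan] command_created agent_id=%s subsystem=%s command_id=%s timestamp=%s",
  agentID, subsystem, command.ID, time.Now().Format(time.RFC3339))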

ETHOS Requirements:

✅ All errors logged before returning
✅ History entry created for command creation attempts
✅ Subsystem context preserved in logs
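
One way to keep the [HISTORY] lines consistent across server and agent is a single formatter. The sketch below is an assumption, not existing code: historyLine is a hypothetical helper, and it takes Unix seconds rather than reading the clock so its output is deterministic.

```go
package main

import (
	"fmt"
	"time"
)

// historyLine formats one log line in the shape used in this plan:
//   [HISTORY] [component] [scope] event key=value ... timestamp=RFC3339
// fields is a flat list of key, value pairs; an unpaired trailing key is dropped.
func historyLine(unixSeconds int64, component, scope, event string, fields ...string) string {
	line := fmt.Sprintf("[HISTORY] [%s] [%s] %s", component, scope, event)
	for i := 0; i+1 < len(fields); i += 2 {
		line += fmt.Sprintf(" %s=%s", fields[i], fields[i+1])
	}
	return line + " timestamp=" + time.Unix(unixSeconds, 0).UTC().Format(time.RFC3339)
}
```

A caller would pass the result to log.Print, for example historyLine(time.Now().Unix(), "server", "scan", "command_created", "subsystem", "docker").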

Phase 3: Command Acknowledgment System

The scan command must flow through the standard acknowledgment system:

// Already exists: pending_acks.json tracking
ackTracker.Create(command.ID, time.Now())
  → Agent checks in: receives command
  → Agent starts scan: reports status? 
  → Agent completes: reports results
  → Server updates history
  → Acknowledgment removed

Current Missing Pieces:

Command results not being saved properly
Subsystem context not flowing through ack system
Scan results not creating history entries
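
To illustrate how subsystem context could travel through the acknowledgment system, here is a minimal in-memory sketch. The type scanAckTracker and its record shape are hypothetical; the real pending_acks.json schema is not shown in this plan, and this version is not safe for concurrent use.

```go
package main

import "fmt"

// pendingAck is a hypothetical record for one outstanding command.
type pendingAck struct {
	CommandID string
	Subsystem string
}

// scanAckTracker keeps outstanding commands keyed by command ID.
type scanAckTracker struct {
	pending map[string]pendingAck
}

func newScanAckTracker() *scanAckTracker {
	return &scanAckTracker{pending: make(map[string]pendingAck)}
}

// Create registers a command as awaiting a result from the agent.
func (t *scanAckTracker) Create(commandID, subsystem string) {
	t.pending[commandID] = pendingAck{CommandID: commandID, Subsystem: subsystem}
}

// Complete removes the pending entry and returns it, so the caller can
// write a history row with the subsystem attached.
func (t *scanAckTracker) Complete(commandID string) (pendingAck, error) {
	ack, ok := t.pending[commandID]
	if !ok {
		return pendingAck{}, fmt.Errorf("no pending ack for command %q", commandID)
	}
	delete(t.pending, commandID)
	return ack, nil
}
```

Because Complete returns the stored record, the result handler does not have to trust the agent to echo the subsystem back correctly.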

Phase 4: Agent Receives Scan Command

Agent Handling: main.go:handleCommand

case "scan_docker":
  log.Printf("[HISTORY] [agent] [scan_docker] command_received agent_id=%s command_id=%s timestamp=%s",
    cfg.AgentID, cmd.ID, time.Now().Format(time.RFC3339))

  results, err := handleScanDocker(apiClient, cfg, ackTracker, scanOrchestrator, cmd.ID)

  if err != nil {
    failedAt := time.Now().Format(time.RFC3339)
    log.Printf("[ERROR] [agent] [scan_docker] scan_failed error=%v timestamp=%s", err, failedAt)
    log.Printf("[HISTORY] [agent] [scan_docker] scan_failed error=%q timestamp=%s", err.Error(), failedAt)
    // Update command status: failed
    // Report back via API
    return err
  }

  // len(results) assumes results is a slice of found items; adjust to the real return type
  completedAt := time.Now().Format(time.RFC3339)
  log.Printf("[SUCCESS] [agent] [scan_docker] scan_completed items=%d timestamp=%s", len(results), completedAt)
  log.Printf("[HISTORY] [agent] [scan_docker] scan_completed items=%d timestamp=%s", len(results), completedAt)
  // Update command status: success
  // Report results via API

Existing Handlers:

handleScanUpdatesV2 - needs review
handleScanStorage - needs review
handleScanSystem - needs review
handleScanDocker - needs review
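
While reviewing the handlers, it may help to route all "scan_*" commands through one dispatch point that derives the subsystem from the command type. The sketch below assumes a uniform handler signature (scanFunc), which the existing handlers do not have; they take apiClient, cfg, and other arguments, so a real version would need adapters.

```go
package main

import (
	"fmt"
	"strings"
)

// scanFunc is a hypothetical uniform handler signature that returns the
// number of items the scan found.
type scanFunc func() (int, error)

// dispatchScan routes a "scan_<subsystem>" command to the matching handler
// and returns the subsystem name together with the handler's result.
func dispatchScan(commandType string, handlers map[string]scanFunc) (string, int, error) {
	if !strings.HasPrefix(commandType, "scan_") {
		return "", 0, fmt.Errorf("not a scan command: %q", commandType)
	}
	subsystem := strings.TrimPrefix(commandType, "scan_")
	handler, ok := handlers[subsystem]
	if !ok {
		return subsystem, 0, fmt.Errorf("no handler registered for subsystem %q", subsystem)
	}
	items, err := handler()
	return subsystem, items, err
}
```

Returning the subsystem even on failure means the caller can still attach it to the error's history entry.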

Phase 5: Results Reported Back

API Endpoint: Agent reports scan results

// POST /api/v1/agents/:id/commands/:command_id/result
{
  "command_id": "...",
  "result": "success",
  "items_found": 4,
  "stdout": "...",
  "subsystem": "docker"
}
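
On the server side, the payload above could be decoded and checked before anything is written to history. This is a sketch under assumptions: the struct name scanResult, the JSON tags, and the rule that command_id and subsystem are required are all illustrative, not taken from the existing handler.

```go
package main

import (
	"encoding/json"
	"errors"
)

// scanResult mirrors the payload fields listed above. The JSON tags and
// the validation rules below are assumptions.
type scanResult struct {
	CommandID  string `json:"command_id"`
	Result     string `json:"result"`
	ItemsFound int    `json:"items_found"`
	Stdout     string `json:"stdout"`
	Subsystem  string `json:"subsystem"`
}

// parseScanResult decodes a result body and rejects payloads that lack
// the fields needed to write a history entry with subsystem context.
func parseScanResult(body []byte) (scanResult, error) {
	var r scanResult
	if err := json.Unmarshal(body, &r); err != nil {
		return scanResult{}, err
	}
	if r.CommandID == "" {
		return scanResult{}, errors.New("command_id is required")
	}
	if r.Subsystem == "" {
		return scanResult{}, errors.New("subsystem is required")
	}
	return r, nil
}
```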

Server Handler: Updates history table

-- Insert into history table
INSERT INTO history (agent_id, command_id, action, result, subsystem, stdout, stderr, executed_at)
VALUES (?, ?, 'scan_docker', ?, 'docker', ?, ?, NOW())

// Add [HISTORY] logging
log.Printf("[HISTORY] [server] [scan_docker] result_logged agent_id=%s command_id=%s timestamp=%s",
  agentID, commandID, time.Now().Format(time.RFC3339))

Phase 6: History Display

UI Component: HistoryTimeline.tsx

// Retrieve history entries
GET /api/v1/history?agent_id=...&subsystem=docker

// Display with subsystem context
<span className="capitalize flex items-center">
  {getActionIcon(entry.action, entry.subsystem)}
  <span>{getSubsystemDisplayName(entry.subsystem)} Scan</span>
</span>

// Icons based on subsystem (action values match the stored names, e.g. "scan_docker")
getActionIcon("scan_docker", "docker") → Docker icon
getActionIcon("scan_storage", "storage") → Storage icon
getActionIcon("scan_system", "system") → System icon

Database Changes Required

Table: history (or logs)

Add column:

ALTER TABLE history ADD COLUMN subsystem VARCHAR(50);
CREATE INDEX idx_history_agent_action_subsystem ON history(agent_id, action, subsystem);

Populate for existing scan entries:

Parse stdout for clues to determine subsystem
Or set to NULL for existing entries
UI must handle NULL (display as "Unknown Scan")
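
The NULL fallback can also be applied where history rows are turned into labels. The sketch below is illustrative only: scanLabel is a hypothetical helper, a nil pointer stands in for SQL NULL, and the display names are the ones listed in this plan.

```go
package main

// subsystemDisplayNames uses the display names from this plan.
var subsystemDisplayNames = map[string]string{
	"docker":  "Docker",
	"storage": "Storage",
	"system":  "System",
	"updates": "Package Updates",
}

// scanLabel returns the history label for a row's subsystem column.
// A nil pointer represents SQL NULL, which older rows will have.
func scanLabel(subsystem *string) string {
	if subsystem == nil || *subsystem == "" {
		return "Unknown Scan"
	}
	if name, ok := subsystemDisplayNames[*subsystem]; ok {
		return name + " Scan"
	}
	// Unrecognized but non-empty values are shown as stored.
	return *subsystem + " Scan"
}
```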

Code Changes Required

Backend (aggregator-server)

Files to Modify:

internal/models/command.go - Add Subsystem field
internal/database/queries/commands.go - Update for subsystem
internal/api/handlers/subsystems.go - Update TriggerSubsystem logging
internal/api/handlers/commands.go - Update command result handler
internal/database/migrations/ - Add subsystem column migration

New Queries Needed:

-- Insert history with subsystem
INSERT INTO history (...) VALUES (..., subsystem)

-- Query history by subsystem
SELECT * FROM history WHERE agent_id = ? AND subsystem = ?
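
Since the history endpoint takes subsystem as an optional filter, the query can be built with the extra condition only when a subsystem is given. This is a minimal sketch: historyQuery is a hypothetical helper, the ORDER BY on executed_at is an assumption, and the ? placeholders follow the queries above (some drivers, such as PostgreSQL's, expect $1, $2 instead).

```go
package main

// historyQuery builds the SELECT for one agent's history and adds the
// subsystem filter only when a subsystem is supplied. It returns the SQL
// text and the positional arguments in matching order.
func historyQuery(agentID, subsystem string) (string, []interface{}) {
	query := "SELECT * FROM history WHERE agent_id = ?"
	args := []interface{}{agentID}
	if subsystem != "" {
		query += " AND subsystem = ?"
		args = append(args, subsystem)
	}
	return query + " ORDER BY executed_at DESC", args
}
```

Keeping values in the argument list rather than concatenating them into the SQL text avoids injection through the subsystem query parameter.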

Agent (aggregator-agent)

Files to Modify:

cmd/agent/main.go - Update all handleScan* functions with [HISTORY] logging
internal/orchestrator/scanner.go - Ensure wrappers pass subsystem context
internal/scanner/ - Add subsystem identification to results

Add to all scan handlers:

// Each handleScan* function needs:
// 1. [HISTORY] log when starting
// 2. [HISTORY] log on completion
// 3. [HISTORY] log on error
// 4. Subsystem context in all log messages

Frontend (aggregator-web)

Files to Modify:

src/types/index.ts - Add subsystem to HistoryEntry interface
src/components/HistoryTimeline.tsx - Update display logic
src/lib/api.ts - Update API call to include subsystem parameter
src/components/AgentHealth.tsx - Add subsystem icons map

Display Logic:

const subsystemIcon = {
  docker: <Container className="h-4 w-4" />,
  storage: <HardDrive className="h-4 w-4" />,
  system: <Cpu className="h-4 w-4" />,
  updates: <Package className="h-4 w-4" />,
  dnf: <Box className="h-4 w-4" />,
  winget: <Windows className="h-4 w-4" />,
  apt: <Linux className="h-4 w-4" />,
};

const displayName = {
  docker: 'Docker',
  storage: 'Storage',
  system: 'System',
  updates: 'Package Updates',
  // ... etc
};

Testing Requirements

Unit Tests

// Test command creation with subsystem
TestCreateCommand_WithSubsystem()
TestCreateCommand_WithoutSubsystem()

// Test history insertion with subsystem
TestCreateHistory_WithSubsystem()
TestQueryHistory_BySubsystem()

// Test agent scan handlers
TestHandleScanDocker_LogsHistory()
TestHandleScanDocker_Failure() // Error logs to history

Integration Tests

// Test full flow
TestScanTrigger_FullFlow_Docker()
TestScanTrigger_FullFlow_Storage()
TestScanTrigger_FullFlow_System()
TestScanTrigger_FullFlow_Updates()

// Verify each step:
// 1. UI trigger → 2. Command created → 3. Agent receives → 4. Scan runs → 
// 5. Results reported → 6. History logged → 7. History UI displays correctly

Manual Testing Checklist

Click each subsystem scan button
Verify scan runs and results appear
Verify history entry created for each
Verify history shows subsystem-specific icons and names
Verify failed scans create history entries
Verify command ack system tracks scan commands
Verify circuit breakers show scan activity

ETHOS Compliance Checklist

Errors are History, Not /dev/null

All scan errors → history table
All scan completions → history table
Button click failures → history table
Command creation failures → history table
Agent unreachable errors → history table
Subsystem context in all history entries

Security is Non-Negotiable

All scan endpoints → AuthMiddleware() (already done)
Command signing → Ed25519 nonces (already done)
No scan credentials in logs

Assume Failure; Build for Resilience

Agent unavailable → clear error to UI
Scan timeout → properly handled
Partial failures → reported to history
Retry logic considered (not automatic for manual scans)

Idempotency

Safe to click scan multiple times
Each scan creates distinct history entry
No duplicate state from repeated scans
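
These three points can be reconciled by keying history on the command ID: each click creates a new command and therefore a distinct entry, while a result reported twice for the same command adds nothing. The in-memory sketch below only illustrates that rule; historyStore is hypothetical, and in the database the equivalent would likely be a unique constraint on command_id.

```go
package main

// historyEntry holds only the columns this sketch needs.
type historyEntry struct {
	CommandID string
	Action    string
	Subsystem string
	Result    string
}

// historyStore is an in-memory stand-in for the history table, keyed by
// command ID so a repeated result report cannot add a second row.
type historyStore struct {
	byCommand map[string]historyEntry
}

func newHistoryStore() *historyStore {
	return &historyStore{byCommand: make(map[string]historyEntry)}
}

// Record stores the entry unless one already exists for the same command
// ID. It reports whether a new entry was written.
func (s *historyStore) Record(e historyEntry) bool {
	if _, exists := s.byCommand[e.CommandID]; exists {
		return false
	}
	s.byCommand[e.CommandID] = e
	return true
}

// Len returns the number of stored entries.
func (s *historyStore) Len() int { return len(s.byCommand) }
```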

No Marketing Fluff

Action names: "scan_docker", "scan_storage", "scan_system"
History display: "Docker Scan", "Storage Scan" etc.
Subsystem-specific icons (not generic play button)
Clear, honest logging throughout

Implementation Phases

Phase 1: Database Migration (30 min)

Add subsystem column to history table
Run migration
Update ORM models/queries

Phase 2: Backend API Updates (1 hour)

Update TriggerSubsystem to log with subsystem context
Update command result handler to include subsystem
Update queries to handle subsystem filtering

Phase 3: Agent Updates (1 hour)

Add [HISTORY] logging to all scan handlers
Ensure subsystem context flows through
Verify error handling logs to history

Phase 4: Frontend Updates (1 hour)

Add subsystem to HistoryEntry type
Add subsystem icons map
Update display logic to show subsystem context
Add subsystem filtering to history UI

Phase 5: Testing (1 hour)

Unit tests for backend changes
Integration tests for full flow
Manual testing of each subsystem scan

Total Estimated Time: 4.5 hours

Risks and Considerations

Risk 1: Database migration on production data

Mitigation: Test migration on backup
Plan: Run during low-activity window

Risk 2: Performance impact of additional column

Likelihood: Low (indexed, small varchar)
Mitigation: Add index during migration

Risk 3: UI breaks for old entries without subsystem

Mitigation: Handle NULL gracefully ("Unknown Scan")

Planning Documents Status

This is NEW Issue #3 - separate from completed Issues #1 and #2.

New Planning Documents Created:

ISSUE_003_SCAN_TRIGGER_FIX.md - This file
UX_ISSUE_ANALYSIS_scan_history.md - Related UX issue (documented already)

Update Existing:

STATE_PRESERVATION.md - Add Issue #3 tracking
session_2025-12-18-completion.md - Add note about Issue #3 discovered

Next Steps for Tomorrow

Start of Day: Review this plan
Database: Run migration
Backend: Update handlers and queries
Agent: Add [HISTORY] logging
Frontend: Update UI components
Testing: Verify all scan flows work
Documentation: Update completion status

Sign-off

Planning By: Ani Tunturi (for Casey)
Review Status: Ready for implementation
Complexity: Medium-High (touching multiple layers)
Confidence: High (follows patterns established in Issues #1-2)

Blood, Sweat, and Tears Commitment: Yes - proper implementation only
