Fimeg 484a7f77ce Add docs and project files - force for Culurien

2026-03-28 20:46:24 -04:00

RedFlag Subsystem Scanning Refactor Plan

🎯 Executive Summary

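This document outlines a refactor of RedFlag's subsystem scanning architecture. It has three goals: stop storage, system, and Docker scan data from being stored as package updates, give Live Operations visibility into individual subsystem status, and move Live Operations to an agent-centric layout that reuses the existing Agent page infrastructure.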

🚨 Critical Issues Identified

Issue #1: Stuck scan_updates Operations

Problem: scan_updates operations stuck in "sent" status for 96+ minutes
Root Cause: Monolithic command execution with single log entry
Impact: Poor UX, no visibility into individual subsystem status

Issue #2: Incorrect Data Classification

Problem: Storage and system metrics stored as "Updates" in database
Root Cause: All subsystems call ReportUpdates() endpoint
Impact: Updates page shows "STORAGE 44% used" as if it's a package update

Evidence:

📋 STORAGE 44.0% used → 0 GB available (showing as Update)
📋 SYSTEM 8 cores, 8 threads → Intel(R) Core(TM) i5 (showing as Update)
📋 DOCKER_IMAGE sha256:2875f → 029660641a0c (showing as Update)
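
The misclassification above comes down to routing on the item's type. As a minimal sketch, the rule this plan aims for could look like the following. The type names (`apt`, `dnf`, `winget`, `windows_update`, `storage`, `disk`, `system`, `cpu`, `memory`, `docker_image`) are the ones used elsewhere in this document; the `classifyPackageType` helper and the `Destination` type are hypothetical and only illustrate the idea.

```typescript
// Hypothetical helper: decide where a scan item belongs based on its package_type.
// The destination names mirror the tables proposed in this plan.
type Destination = 'updates' | 'metrics' | 'docker_images' | 'unknown';

const PACKAGE_TYPES = new Set<string>(['apt', 'dnf', 'winget', 'windows_update']);
const METRIC_TYPES = new Set<string>(['storage', 'disk', 'system', 'cpu', 'memory']);

function classifyPackageType(packageType: string): Destination {
  // Normalize so that "STORAGE" (as displayed in the UI) and "storage" match.
  const t = packageType.trim().toLowerCase();
  if (PACKAGE_TYPES.has(t)) return 'updates';
  if (METRIC_TYPES.has(t)) return 'metrics';
  if (t === 'docker_image') return 'docker_images';
  // Unrecognized types are surfaced instead of silently stored as updates.
  return 'unknown';
}
```

With a rule like this, the three evidence rows would route to `metrics`, `metrics`, and `docker_images` respectively, and none would reach the updates table.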

Issue #3: UI/UX Inconsistencies

Problem: Live Operations lists every operation individually instead of grouping by agent
Problem: Live Operations and the Agent pages duplicate functionality
Problem: Frosted glass styling is applied inconsistently across pages

🏗️ Existing Agent Page Infrastructure (Reuse Required)

The Agent page already has extensive infrastructure that should be reused rather than duplicated:

📊 Existing Agent Page Components

Tabs: Overview, Storage & Disks, Updates & Packages, Agent Health, History
Status System: getStatusColor(), isOnline() functions
Heartbeat Infrastructure: Color-coded indicators, duration controls
Real-time Updates: Polling, live status indicators
Component Library: AgentStorage, AgentUpdates, AgentScanners, etc.

🎨 Existing UI/UX Patterns to Reuse

Status Color System (`utils.ts`)

// Already implemented status colors
getStatusColor('online')      // → 'text-success-600 bg-success-100'
getStatusColor('offline')     // → 'text-danger-600 bg-danger-100'
getStatusColor('pending')     // → 'text-warning-600 bg-warning-100'
getStatusColor('installing')  // → 'text-indigo-600 bg-indigo-100'
getStatusColor('failed')      // → 'text-danger-600 bg-danger-100'

Heartbeat Infrastructure

// Already implemented heartbeat system
const { data: heartbeatStatus } = useHeartbeatStatus(agentId);
const isRapidPolling = heartbeatStatus?.enabled && heartbeatStatus?.active;
const isSystemHeartbeat = heartbeatSource === 'system';
const isManualHeartbeat = heartbeatSource === 'manual';

// Color coding already implemented:
// - System heartbeat: blue with animate-pulse
// - Manual heartbeat: pink with animate-pulse
// - Normal mode: green
// - Loading state: disabled

Online Status Detection

// Already implemented online detection
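const isOnline = (lastCheckin: string): boolean => {
  // Assumed: diffMs is the time elapsed since lastCheckin (its definition is omitted in this excerpt)
  const diffMs = Date.now() - new Date(lastCheckin).getTime();
  const diffMins = Math.floor(diffMs / 60000);
  return diffMins < 15; // 15 minute threshold
};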

📋 Live Operations Should Use Existing Infrastructure

Agent Selection & Status Display

// Reuse existing agent selection logic from Agents.tsx
const { data: agents } = useAgents();
const selectedAgent = agents?.find(a => a.id === agentId);

// Reuse existing status display (rendered once selectedAgent has loaded)
{selectedAgent && (
  <div className={cn('badge', getStatusColor(isOnline(selectedAgent.last_seen) ? 'online' : 'offline'))}>
    {isOnline(selectedAgent.last_seen) ? 'Online' : 'Offline'}
  </div>
)}

// Reuse existing heartbeat indicator
<Activity className={cn(
  heartbeatLoading ? 'animate-spin' :
  isRapidPolling && isSystemHeartbeat ? 'text-blue-600 animate-pulse' :
  isRapidPolling && isManualHeartbeat ? 'text-pink-600 animate-pulse' :
  'text-green-600'
)} />

Command Status Tracking

// Reuse existing command tracking logic
// Default to [] so the filters below are safe while the query is still loading
const { data: agentCommands = [] } = useActiveCommands();
const heartbeatCommands = agentCommands.filter(cmd =>
  cmd.command_type === 'enable_heartbeat' || cmd.command_type === 'disable_heartbeat'
);
const otherCommands = agentCommands.filter(cmd =>
  cmd.command_type !== 'enable_heartbeat' && cmd.command_type !== 'disable_heartbeat'
);

🏗️ Solution Architecture

Phase 1: Data Classification Fix (High Priority)

1.1 Agent-Side Changes

Current (BROKEN):

// subsystem_handlers.go:124-136 - handleScanStorage
if len(result.Updates) > 0 {
    report := client.UpdateReport{
        CommandID: commandID,
        Timestamp: time.Now(),
        Updates:   result.Updates, // ❌ Storage data sent as "updates"
    }
    if err := apiClient.ReportUpdates(cfg.AgentID, report); err != nil {
        return fmt.Errorf("failed to report storage metrics: %w", err)
    }
}

Fixed:

// handleScanStorage - FIXED
if len(result.Updates) > 0 {
    report := client.MetricsReport{
        CommandID: commandID,
        Timestamp: time.Now(),
        Metrics:   result.Updates, // ✅ Storage data sent as metrics
    }
    if err := apiClient.ReportMetrics(cfg.AgentID, report); err != nil {
        return fmt.Errorf("failed to report storage metrics: %w", err)
    }
}

// handleScanSystem - FIXED
if len(result.Updates) > 0 {
    report := client.MetricsReport{
        CommandID: commandID,
        Timestamp: time.Now(),
        Metrics:   result.Updates, // ✅ System data sent as metrics
    }
    if err := apiClient.ReportMetrics(cfg.AgentID, report); err != nil {
        return fmt.Errorf("failed to report system metrics: %w", err)
    }
}

// handleScanDocker - FIXED
if len(result.Updates) > 0 {
    report := client.DockerReport{
        CommandID: commandID,
        Timestamp: time.Now(),
        Images:    result.Updates, // ✅ Docker data sent separately
    }
    if err := apiClient.ReportDockerImages(cfg.AgentID, report); err != nil {
        return fmt.Errorf("failed to report docker images: %w", err)
    }
}

1.2 Server-Side New Endpoints

// NEW: Separate endpoints for different data types

// 1. ReportMetrics - for system/storage metrics
func (h *MetricsHandler) ReportMetrics(c *gin.Context) {
    agentID := c.MustGet("agent_id").(uuid.UUID)

    // ✅ Full security validation (nonce, command validation)
    if err := validateNonce(c); err != nil {
        c.JSON(http.StatusUnauthorized, gin.H{"error": "invalid nonce"})
        return
    }

    var req models.MetricsReportRequest
    if err := c.ShouldBindJSON(&req); err != nil {
        c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
        return
    }

    // ✅ Verify command exists and belongs to agent
    command, err := h.commandQueries.GetCommandByID(req.CommandID)
    if err != nil || command.AgentID != agentID {
        c.JSON(http.StatusForbidden, gin.H{"error": "unauthorized command"})
        return
    }

    // Store in metrics table, NOT updates table
    if err := h.metricsQueries.CreateMetrics(agentID, req.Metrics); err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to store metrics"})
        return
    }

    c.JSON(http.StatusOK, gin.H{"message": "metrics recorded", "count": len(req.Metrics)})
}

// 2. ReportDockerImages - for Docker image information
func (h *DockerHandler) ReportDockerImages(c *gin.Context) {
    // Similar security pattern, stores in docker_images table
}

// 3. ReportUpdates - ONLY for actual package updates (RESTRICTED)
func (h *UpdateHandler) ReportUpdates(c *gin.Context) {
    // Existing endpoint, but add validation to only accept package types:
    // - apt, dnf, winget, windows_update
    // - Reject: storage, system, docker_image types
}

1.3 New Data Models

// models/metrics.go
type MetricsReportRequest struct {
    CommandID string    `json:"command_id"`
    Timestamp time.Time `json:"timestamp"`
    Metrics   []Metric  `json:"metrics"`
}

type Metric struct {
    PackageType      string            `json:"package_type"`      // "storage", "system", "cpu", "memory"
    PackageName      string            `json:"package_name"`      // mount point, metric name
    CurrentVersion   string            `json:"current_version"`   // current usage, value
    AvailableVersion string            `json:"available_version"` // available space, threshold
    Severity         string            `json:"severity"`          // "low", "moderate", "high"
    RepositorySource string            `json:"repository_source"`
    Metadata         map[string]string `json:"metadata"`
}

// models/docker.go
type DockerReportRequest struct {
    CommandID string       `json:"command_id"`
    Timestamp time.Time    `json:"timestamp"`
    Images    []DockerImage `json:"images"`
}

type DockerImage struct {
    PackageType      string            `json:"package_type"`      // "docker_image"
    PackageName      string            `json:"package_name"`      // image name:tag
    CurrentVersion   string            `json:"current_version"`   // current image ID
    AvailableVersion string            `json:"available_version"` // latest image ID
    Severity         string            `json:"severity"`          // update severity
    RepositorySource string            `json:"repository_source"` // registry
    Metadata         map[string]string `json:"metadata"`
}

Phase 2: Live Operations Refactor

2.1 Agent-Centric Design

Live Operations After Refactor:

Live Operations

🖥️ Agent 001 (fedora-server)
    Status: 🟢 scan_updates • 45s • 3/5 subsystems complete
    Last Action: APT scanning (12s)
    ▼ Quick Details
        └── 🔄 APT: Scanning | ✅ Docker: Complete | 🔄 System: Scanning

🖥️ Agent 002 (ubuntu-workstation)
    Status: 🟢 Heartbeat • 2m active
    Last Action: System check (2m ago)
    ▼ Quick Details
        └── 💓 Heartbeat monitoring active

🖥️ Agent 007 (docker-host)
    Status: 🟢 Self-upgrade • 1m 30s
    Last Action: Downloading v0.1.23 (1m ago)
    ▼ Quick Details
        └── ⬇️ Downloading: 75% complete

Key Changes:

✅ Only show active agents (no idle ones)
✅ Agent-centric view, not operation-centric
✅ Group operations by agent
✅ Quick expandable details per agent
✅ Frosted glass UI consistency
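
The "3/5 subsystems complete" status line in the mockup implies some per-subsystem aggregation. A minimal sketch of that summary follows, assuming each subsystem reports one of four states. The `SubsystemStatus` shape, the state names, and `summarizeSubsystems` are hypothetical and not part of the existing codebase.

```typescript
// Hypothetical per-subsystem state for a single scan_updates command.
type SubsystemState = 'pending' | 'scanning' | 'complete' | 'failed';

interface SubsystemStatus {
  name: string; // e.g. "APT", "Docker", "System"
  state: SubsystemState;
}

// Build the short status text shown on an agent's Live Operations row.
function summarizeSubsystems(subsystems: SubsystemStatus[]): string {
  const done = subsystems.filter(s => s.state === 'complete').length;
  const failed = subsystems.filter(s => s.state === 'failed').length;
  const summary = `${done}/${subsystems.length} subsystems complete`;
  // Failures are called out so they are not hidden behind the progress count.
  return failed > 0 ? `${summary}, ${failed} failed` : summary;
}
```

Reporting each subsystem separately is also what addresses Issue #1: a slow or stuck subsystem shows up as its own entry instead of holding the whole command in "sent".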

2.2 Live Operations UI Component (Reuse Existing Infrastructure)

const LiveOperations: React.FC = () => {
  // Reuse existing agent hooks from Agents.tsx
  const { data: agents } = useAgents();
  const { data: agentCommands } = useActiveCommands();
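  // Assumption: a variant of useHeartbeatStatus that returns a map keyed by agent ID;
  // the existing hook shown earlier takes a single agentId.
  const { data: heartbeatStatus } = useHeartbeatStatus();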

  // Filter for active agents only (reuse existing logic)
  const activeAgents = agents?.filter(agent => {
    const hasActiveCommands = agentCommands?.some(cmd => cmd.agent_id === agent.id);
    const hasActiveHeartbeat = heartbeatStatus?.[agent.id]?.enabled && heartbeatStatus?.[agent.id]?.active;
    return hasActiveCommands || hasActiveHeartbeat;
  }) || [];

  return (
    <div className="space-y-3">
      {activeAgents.map(agent => {
        // Reuse existing heartbeat status logic
        const agentHeartbeat = heartbeatStatus?.[agent.id];
        const isRapidPolling = agentHeartbeat?.enabled && agentHeartbeat?.active;
        const heartbeatSource = agentHeartbeat?.source;
        const isSystemHeartbeat = heartbeatSource === 'system';
        const isManualHeartbeat = heartbeatSource === 'manual';

        // Reuse existing command filtering logic
        const agentCommandsList = agentCommands?.filter(cmd => cmd.agent_id === agent.id) || [];
        const currentAction = agentCommandsList[0]?.command_type || 'heartbeat';
        const operationDuration = agentCommandsList[0]?.duration || 0;

        return (
          <div key={agent.id} className="frosted-card p-4 border border-white/10">
            <div className="flex items-center justify-between">
              <div className="flex items-center gap-3">
                {/* Reuse existing status indicator */}
                <div className={cn(
                  'w-3 h-3 rounded-full',
                  isOnline(agent.last_seen) ? 'bg-green-500' : 'bg-gray-400'
                )} />

                <Computer className="h-5 w-5 text-blue-400" />
                <span className="font-medium text-white">{agent.name}</span>
                <span className="text-sm text-gray-400">{agent.hostname}</span>

                {/* Reuse existing status badge */}
                <span className={cn('badge', getStatusColor(isOnline(agent.last_seen) ? 'online' : 'offline'))}>
                  {isOnline(agent.last_seen) ? 'Online' : 'Offline'}
                </span>
              </div>

              <div className="flex items-center gap-4">
                {/* Reuse existing heartbeat indicator with colors */}
                <Activity className={cn(
                  'h-4 w-4',
                  isRapidPolling && isSystemHeartbeat ? 'text-blue-600 animate-pulse' :
                  isRapidPolling && isManualHeartbeat ? 'text-pink-600 animate-pulse' :
                  'text-green-600'
                )} />

                <span className="px-3 py-1 bg-blue-500/20 rounded-full text-sm text-blue-300">
                  {currentAction}
                </span>

                <span className="text-sm text-gray-400">
                  {formatDuration(operationDuration)}
                </span>
              </div>
            </div>

            {/* Reuse existing heartbeat status info */}
            {isRapidPolling && (
              <div className="mt-2 text-sm text-gray-400">
                {isSystemHeartbeat ? 'System ' : 'Manual '}heartbeat active
                • Last seen: {formatRelativeTime(agent.last_seen)}
              </div>
            )}

            {/* Show active command details */}
            {agentCommandsList.map(cmd => (
              <div key={cmd.id} className="mt-2 text-sm text-gray-400">
                {cmd.command_type} • {formatRelativeTime(cmd.created_at)}
              </div>
            ))}
          </div>
        );
      })}
    </div>
  );
};
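
The component above calls `formatDuration`, which is not defined in this document. A possible sketch is shown below, assuming the command's `duration` is given in seconds and that the output should match the mockup's style ("45s", "1m 30s", "2m"). Both the unit and the exact format are assumptions.

```typescript
// Hypothetical formatter: turns a duration in seconds into a compact label.
function formatDuration(totalSeconds: number): string {
  // Treat negative or non-finite input as zero rather than rendering "NaNs".
  const s = Number.isFinite(totalSeconds) ? Math.max(0, Math.floor(totalSeconds)) : 0;
  const hours = Math.floor(s / 3600);
  const minutes = Math.floor((s % 3600) / 60);
  const seconds = s % 60;
  // Show at most two units, dropping a trailing zero unit.
  if (hours > 0) return minutes > 0 ? `${hours}h ${minutes}m` : `${hours}h`;
  if (minutes > 0) return seconds > 0 ? `${minutes}m ${seconds}s` : `${minutes}m`;
  return `${seconds}s`;
}
```

For example, `formatDuration(90)` yields `"1m 30s"` and `formatDuration(120)` yields `"2m"`.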

Phase 3: Agent Pages Integration

3.1 Data Flow to Existing Pages

scan_docker → Updates & Packages tab (shows Docker images properly)
scan_storage → Storage & Disks tab (live disk usage updates)
scan_system → Overview tab (live CPU, memory, uptime updates)
scan_updates → Updates & Packages tab (only actual package updates)
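
The mapping above can be kept in one place so that the UI knows which tab to refresh when a scan finishes. A minimal sketch follows; the command and tab names come from this document, while `SCAN_TARGET_TAB` and `tabForCommand` are hypothetical names.

```typescript
type ScanCommand = 'scan_docker' | 'scan_storage' | 'scan_system' | 'scan_updates';
type AgentTab = 'Overview' | 'Storage & Disks' | 'Updates & Packages';

// One entry per scan command, matching the data flow listed above.
const SCAN_TARGET_TAB: Record<ScanCommand, AgentTab> = {
  scan_docker: 'Updates & Packages',
  scan_storage: 'Storage & Disks',
  scan_system: 'Overview',
  scan_updates: 'Updates & Packages',
};

// Returns the tab a command feeds, or undefined for non-scan commands
// such as enable_heartbeat.
function tabForCommand(commandType: string): AgentTab | undefined {
  // hasOwnProperty avoids matching inherited keys like "toString".
  return Object.prototype.hasOwnProperty.call(SCAN_TARGET_TAB, commandType)
    ? SCAN_TARGET_TAB[commandType as ScanCommand]
    : undefined;
}
```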

3.2 Storage & Disks Tab Enhancement

// NEW: Live storage data integration
const StorageDisksTab: React.FC<{agentId: string}> = ({ agentId }) => {
  const { data: storageData } = useQuery({
    queryKey: ['agent-storage', agentId],
    queryFn: () => agentApi.getStorageMetrics(agentId),
    refetchInterval: 30000, // Update every 30s during live operations
  });

  return (
    <div className="space-y-6">
      <StorageChart data={storageData?.storage} />
      <DiskUsageTable mounts={storageData?.storage} />

      {/* Live indicator */}
      {storageData?.isLive && (
        <div className="flex items-center gap-2 text-sm text-green-400">
          <div className="w-2 h-2 bg-green-400 rounded-full animate-pulse" />
          Live data from recent scan
        </div>
      )}
    </div>
  );
};

3.3 Updates & Packages Tab Fix

// FIXED: Only show actual package updates
const UpdatesPackagesTab: React.FC<{agentId: string}> = ({ agentId }) => {
  const { data: packageUpdates } = useQuery({
    queryKey: ['agent-updates', agentId],
    queryFn: () => updatesApi.getPackageUpdates(agentId), // NEW: filters only packages
  });

  return (
    <div>
      {/* Shows ONLY: APT: 2 updates, DNF: 1 update */}
      {/* NO MORE: STORAGE 44% used, SYSTEM 8 cores */}
      <PackageUpdatesList updates={packageUpdates?.updates} />
    </div>
  );
};

3.4 Overview Tab Enhancement

// NEW: Live system metrics integration
const OverviewTab: React.FC<{agentId: string}> = ({ agentId }) => {
  const { data: systemMetrics } = useQuery({
    queryKey: ['agent-metrics', agentId],
    queryFn: () => agentApi.getSystemMetrics(agentId),
    refetchInterval: 30000,
  });

  return (
    <div className="grid grid-cols-1 md:grid-cols-2 gap-6">
      <SystemCard metrics={systemMetrics?.system} />
      <PerformanceCard data={systemMetrics?.performance} />

      {systemMetrics?.isLive && (
        <div className="col-span-full flex items-center gap-2 text-sm text-green-400">
          <Activity className="h-4 w-4" />
          Live system metrics from recent scan
        </div>
      )}
    </div>
  );
};

Phase 4: API Endpoints for Agent Pages

4.1 New Agent-Specific Endpoints

// GET /api/v1/agents/{agentId}/storage - for Storage & Disks tab
func (h *AgentHandler) GetStorageMetrics(c *gin.Context) {
    agentID := c.Param("agentId")
    // Return latest storage scan data from metrics table
    // Filter by package_type IN ('storage', 'disk')
}

// GET /api/v1/agents/{agentId}/system - for Overview tab
func (h *AgentHandler) GetSystemMetrics(c *gin.Context) {
    agentID := c.Param("agentId")
    // Return latest system scan data from metrics table
    // Filter by package_type IN ('system', 'cpu', 'memory')
}

// GET /api/v1/agents/{agentId}/packages - for Updates tab (filtered)
func (h *AgentHandler) GetPackageUpdates(c *gin.Context) {
    agentID := c.Param("agentId")
    // Return ONLY package updates, filter out storage/system metrics
    // Filter by package_type IN ('apt', 'dnf', 'winget', 'windows_update')
}

// GET /api/v1/agents/{agentId}/docker - for Docker updates
func (h *AgentHandler) GetDockerImages(c *gin.Context) {
    agentID := c.Param("agentId")
    // Return Docker image updates from docker_images table
}

// GET /api/v1/agents/active - for Live Operations page
func (h *AgentHandler) GetActiveAgents(c *gin.Context) {
    // Return only agents with:
    // - Active commands (status != 'completed')
    // - Recent heartbeat (< 5 minutes)
    // - Self-upgrade in progress
}
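
The selection rule in the `GetActiveAgents` stub can be sketched as a predicate. It is written here in TypeScript to match the frontend examples, although the handler itself would be Go. The `AgentInfo` and `CommandInfo` shapes and the `last_heartbeat` field are hypothetical, and a self-upgrade is assumed to be represented as a command that has not yet completed.

```typescript
interface CommandInfo {
  agent_id: string;
  command_type: string;
  status: string; // e.g. "sent", "completed"
}

interface AgentInfo {
  id: string;
  last_heartbeat?: string; // ISO timestamp; hypothetical field name
}

const HEARTBEAT_WINDOW_MS = 5 * 60 * 1000; // "< 5 minutes" from the stub above

function isActiveAgent(agent: AgentInfo, commands: CommandInfo[], now: Date = new Date()): boolean {
  // Any command for this agent that is not completed counts as active,
  // which also covers an in-progress self-upgrade under the assumption above.
  const hasOpenCommand = commands.some(
    c => c.agent_id === agent.id && c.status !== 'completed',
  );
  const heartbeatMs = agent.last_heartbeat ? new Date(agent.last_heartbeat).getTime() : NaN;
  const hasRecentHeartbeat =
    !Number.isNaN(heartbeatMs) && now.getTime() - heartbeatMs < HEARTBEAT_WINDOW_MS;
  return hasOpenCommand || hasRecentHeartbeat;
}
```

The plan only names `completed` as a finished status. A real implementation would probably also need to exclude other terminal statuses, such as failed commands, so that they do not keep an agent listed indefinitely.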

Phase 5: UI/UX Consistency

5.1 Frosted Glass Design System

/* Frosted glass component library */
.frosted-card {
  background: rgba(255, 255, 255, 0.05);
  backdrop-filter: blur(12px);
  border: 1px solid rgba(255, 255, 255, 0.1);
  border-radius: 12px;
  transition: all 0.3s ease;
}

.frosted-card:hover {
  background: rgba(255, 255, 255, 0.08);
  transform: translateY(-1px);
  box-shadow: 0 8px 32px rgba(0, 0, 0, 0.2);
}

.live-indicator {
  animation: pulse 2s infinite;
}

@keyframes pulse {
  0%, 100% { opacity: 1; }
  50% { opacity: 0.5; }
}

5.2 Agent Health UI Rework (Future)

Agent Health Tab (Planned Enhancements):

┌─ System Health ────────────────────────┐
│ 🟢 CPU: Normal (15% usage)              │
│ 🟢 Memory: Normal (51% usage)          │
│ 🟢 Disk: Normal (44% used)              │
│ 🟢 Network: Connected (100ms latency)  │
│ 🟢 Uptime: 4 days, 12 hours            │
└─────────────────────────────────────────┘

┌─ Agent Health ────────────────────────┐
│ 🟢 Version: v0.1.22                    │
│ 🟢 Last Check-in: 2 minutes ago        │
│ 🟢 Commands: 1 active                  │
│ 🟢 Success Rate: 98.4% (247/251)       │
│ 🟢 Errors: None in last 24h            │
└─────────────────────────────────────────┘

┌─ Recent Activity Timeline ─────────────┐
│ ✅ scan_updates completed • 2m ago     │
│ ✅ package install: 7zip • 1h ago      │
│ ❌ scan_docker failed • 2h ago          │
│ ✅ heartbeat received • 2m ago         │
└─────────────────────────────────────────┘

🚀 Implementation Priority

Priority 1: Critical Data Classification Fix

✅ Create new API endpoints: ReportMetrics(), ReportDockerImages()
✅ Fix agent subsystem handlers to use correct endpoints
✅ Update UpdateReportRequest model to add validation
✅ Create separate database tables: metrics, docker_images

Priority 2: Live Operations Refactor

✅ Implement agent-centric view (active agents only)
✅ Create GetActiveAgents() endpoint
✅ Apply frosted glass UI consistency
✅ Add subsystem status aggregation

Priority 3: Agent Pages Integration

✅ Create agent-specific endpoints for storage, system, packages, docker
✅ Update Storage & Disks tab to show live metrics
✅ Fix Updates & Packages tab to filter out non-packages
✅ Enhance Overview tab with live system metrics

Priority 4: UI Polish

✅ Apply frosted glass consistency across all pages
✅ Add live data indicators during active operations
✅ Refine Agent Health tab (future task)
✅ Add loading states and transitions

🎯 Success Criteria

Data Classification Fixed:

✅ Updates page shows only package updates (APT: 2, DNF: 1)
✅ No more "STORAGE 44% used" showing as updates
✅ Storage metrics appear in Storage & Disks tab only
✅ System metrics appear in Overview tab only

Live Operations Improved:

✅ Shows only active agents (no idle ones)
✅ Agent-centric view with operation grouping
✅ Frosted glass UI consistency
✅ Real-time status updates every 5 seconds

Agent Pages Enhanced:

✅ Storage & Disks shows live data during scans
✅ Overview shows live system metrics
✅ Updates shows only actual package updates
✅ Live data indicators during active operations

Security Maintained:

✅ All new endpoints use existing nonce validation
✅ Command validation enforced
✅ No WebSockets (maintains security profile)
✅ Agent authentication preserved

📋 Testing Checklist

Verify scan_storage data goes to Storage & Disks tab, not Updates
Verify scan_system data goes to Overview tab, not Updates
Verify scan_docker data appears correctly in Updates tab
Verify Live Operations shows only active agents
Verify stuck scan_updates operations are resolved
Verify frosted glass UI consistency across pages
Verify security validation on all new endpoints
Verify live data updates during active operations
Verify existing functionality remains intact

🔧 Migration Notes

Database Changes Required:
- New metrics table for storage/system data
- New docker_images table for Docker data
- Update existing update_events constraints to reject non-package types

Agent Deployment:
- Requires agent binary update (v0.1.23+)
- Backward compatibility maintained during transition
- Old agents will continue to work, but their storage, system, and Docker data will still be misclassified until they are upgraded

UI Deployment:
- Frontend changes independent of backend
- Can deploy gradually per page
- Live Operations changes first, then Agent pages

📈 Performance Impact

Reduced database load: Proper data classification reduces query complexity
Improved UI responsiveness: Active agent filtering reduces DOM elements
Better user experience: Agent-centric view scales to 100+ agents
Enhanced security: No WebSocket attack surface

Document created: 2025-11-03
Author: Claude Code Assistant
Version: 1.0
