RedFlag Subsystem Scanning Refactor Plan
🎯 Executive Summary
This document outlines the comprehensive refactor of RedFlag's subsystem scanning architecture to fix critical data classification issues, improve Live Operations UX, and implement agent-centric design patterns.
🚨 Critical Issues Identified
Issue #1: Stuck scan_updates Operations
- Problem: scan_updates operations stuck in "sent" status for 96+ minutes
- Root Cause: Monolithic command execution with single log entry
- Impact: Poor UX, no visibility into individual subsystem status
Issue #2: Incorrect Data Classification
- Problem: Storage and system metrics stored as "Updates" in database
- Root Cause: All subsystems call the ReportUpdates() endpoint
- Impact: Updates page shows "STORAGE 44% used" as if it's a package update
- Evidence:
📋 STORAGE 44.0% used → 0 GB available (showing as Update)
📋 SYSTEM 8 cores, 8 threads → Intel(R) Core(TM) i5 (showing as Update)
📋 DOCKER_IMAGE sha256:2875f → 029660641a0c (showing as Update)
Issue #3: UI/UX Inconsistencies
- Problem: Live Operations shows every operation, not agent-centric
- Problem: Duplicate functionality between Live Ops and Agent pages
- Problem: No frosted glass consistency across pages
🏗️ Existing Agent Page Infrastructure (Reuse Required)
The Agent page already has extensive infrastructure that should be reused rather than duplicated:
📊 Existing Agent Page Components
- Tabs: Overview, Storage & Disks, Updates & Packages, Agent Health, History
- Status System: getStatusColor(), isOnline() functions
- Heartbeat Infrastructure: Color-coded indicators, duration controls
- Real-time Updates: Polling, live status indicators
- Component Library: AgentStorage, AgentUpdates, AgentScanners, etc.
🎨 Existing UI/UX Patterns to Reuse
Status Color System (utils.ts)
// Already implemented status colors
getStatusColor('online') // → 'text-success-600 bg-success-100'
getStatusColor('offline') // → 'text-danger-600 bg-danger-100'
getStatusColor('pending') // → 'text-warning-600 bg-warning-100'
getStatusColor('installing') // → 'text-indigo-600 bg-indigo-100'
getStatusColor('failed') // → 'text-danger-600 bg-danger-100'
Heartbeat Infrastructure
// Already implemented heartbeat system
const { data: heartbeatStatus } = useHeartbeatStatus(agentId);
const isRapidPolling = heartbeatStatus?.enabled && heartbeatStatus?.active;
const isSystemHeartbeat = heartbeatSource === 'system';
const isManualHeartbeat = heartbeatSource === 'manual';
// Color coding already implemented:
// - System heartbeat: blue with animate-pulse
// - Manual heartbeat: pink with animate-pulse
// - Normal mode: green
// - Loading state: disabled
Online Status Detection
// Already implemented online detection
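const isOnline = (lastCheckin: string): boolean => {
// Elapsed time since the agent's last check-in, in milliseconds
const diffMs = Date.now() - new Date(lastCheckin).getTime();
const diffMins = Math.floor(diffMs / 60000);
return diffMins < 15; // 15 minute threshold
};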
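The 15-minute threshold is easier to verify when the reference time is passed in rather than read from the clock. The following is a minimal sketch under that assumption; `isOnlineAt` and `ONLINE_THRESHOLD_MINUTES` are hypothetical names that do not exist in utils.ts, and treating an unparseable timestamp as offline is an assumption of this sketch.

```typescript
// Threshold taken from the existing isOnline() helper: 15 minutes.
const ONLINE_THRESHOLD_MINUTES = 15;

// Hypothetical variant of isOnline() that takes the reference time as a
// parameter so the result is deterministic. Unparseable timestamps are
// treated as offline (an assumption of this sketch).
function isOnlineAt(lastCheckin: string, now: Date): boolean {
  const lastSeenMs = new Date(lastCheckin).getTime();
  if (isNaN(lastSeenMs)) {
    return false;
  }
  const diffMins = Math.floor((now.getTime() - lastSeenMs) / 60000);
  return diffMins < ONLINE_THRESHOLD_MINUTES;
}

const referenceTime = new Date("2025-11-03T12:00:00Z");
// Last check-in 14 minutes before the reference time: still online.
const seenRecently = isOnlineAt("2025-11-03T11:46:00Z", referenceTime);
// Last check-in exactly 15 minutes before: the strict "<" makes this offline.
const seenTooLongAgo = isOnlineAt("2025-11-03T11:45:00Z", referenceTime);
```

Because the comparison is strict, an agent that last checked in exactly 15 minutes ago is already shown as offline.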
📋 Live Operations Should Use Existing Infrastructure
Agent Selection & Status Display
// Reuse existing agent selection logic from Agents.tsx
const { data: agents } = useAgents();
const selectedAgent = agents?.find(a => a.id === agentId);
// Reuse existing status display
{selectedAgent && (
<div className={cn('badge', getStatusColor(isOnline(selectedAgent.last_seen) ? 'online' : 'offline'))}>
{isOnline(selectedAgent.last_seen) ? 'Online' : 'Offline'}
</div>
)}
// Reuse existing heartbeat indicator
<Activity className={cn(
heartbeatLoading ? 'animate-spin' :
isRapidPolling && isSystemHeartbeat ? 'text-blue-600 animate-pulse' :
isRapidPolling && isManualHeartbeat ? 'text-pink-600 animate-pulse' :
'text-green-600'
)} />
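If both Live Operations and the Agent page render this indicator, the class selection could live in one pure helper instead of being repeated inline. The sketch below mirrors the ternary chain above; `HeartbeatView` and `heartbeatIndicatorClass` are hypothetical names, not existing exports.

```typescript
type HeartbeatSource = "system" | "manual";

// Hypothetical view model combining the flags used by the indicator.
interface HeartbeatView {
  loading: boolean;
  enabled: boolean;
  active: boolean;
  source?: HeartbeatSource;
}

// Returns the CSS classes for the heartbeat icon, in the same precedence
// order as the inline ternary: loading, system, manual, then normal mode.
function heartbeatIndicatorClass(view: HeartbeatView): string {
  if (view.loading) {
    return "animate-spin";
  }
  const rapidPolling = view.enabled && view.active;
  if (rapidPolling && view.source === "system") {
    return "text-blue-600 animate-pulse";
  }
  if (rapidPolling && view.source === "manual") {
    return "text-pink-600 animate-pulse";
  }
  return "text-green-600";
}

const systemHeartbeatClass = heartbeatIndicatorClass({
  loading: false,
  enabled: true,
  active: true,
  source: "system",
});
```

A component would then pass the result to `cn(...)` alongside its size classes.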
Command Status Tracking
// Reuse existing command tracking logic
// Default to an empty list so the filters below are safe while the query is loading
const { data: agentCommands = [] } = useActiveCommands();
const heartbeatCommands = agentCommands.filter(cmd =>
cmd.command_type === 'enable_heartbeat' || cmd.command_type === 'disable_heartbeat'
);
const otherCommands = agentCommands.filter(cmd =>
cmd.command_type !== 'enable_heartbeat' && cmd.command_type !== 'disable_heartbeat'
);
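The two filters above can also be written as a single pass that tolerates a not-yet-loaded command list. This is a sketch only; `AgentCommand` lists just the fields this document references, and `splitHeartbeatCommands` is a hypothetical helper name.

```typescript
// Only the fields referenced in this document; the real type may have more.
interface AgentCommand {
  id: string;
  agent_id: string;
  command_type: string;
}

const HEARTBEAT_COMMAND_TYPES = ["enable_heartbeat", "disable_heartbeat"];

// Splits commands into heartbeat toggles and everything else in one pass.
// An undefined input (query still loading) yields two empty lists.
function splitHeartbeatCommands(
  commands: AgentCommand[] | undefined
): { heartbeat: AgentCommand[]; other: AgentCommand[] } {
  const heartbeat: AgentCommand[] = [];
  const other: AgentCommand[] = [];
  for (const cmd of commands || []) {
    if (HEARTBEAT_COMMAND_TYPES.indexOf(cmd.command_type) !== -1) {
      heartbeat.push(cmd);
    } else {
      other.push(cmd);
    }
  }
  return { heartbeat, other };
}

// Illustrative data; IDs are made up.
const splitExample = splitHeartbeatCommands([
  { id: "c1", agent_id: "a1", command_type: "enable_heartbeat" },
  { id: "c2", agent_id: "a1", command_type: "scan_updates" },
  { id: "c3", agent_id: "a2", command_type: "disable_heartbeat" },
]);
```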
🏗️ Solution Architecture
Phase 1: Data Classification Fix (High Priority)
1.1 Agent-Side Changes
Current (BROKEN):
// subsystem_handlers.go:124-136 - handleScanStorage
if len(result.Updates) > 0 {
report := client.UpdateReport{
CommandID: commandID,
Timestamp: time.Now(),
Updates: result.Updates, // ❌ Storage data sent as "updates"
}
if err := apiClient.ReportUpdates(cfg.AgentID, report); err != nil {
return fmt.Errorf("failed to report storage metrics: %w", err)
}
}
Fixed:
// handleScanStorage - FIXED
if len(result.Updates) > 0 {
report := client.MetricsReport{
CommandID: commandID,
Timestamp: time.Now(),
Metrics: result.Updates, // ✅ Storage data sent as metrics
}
if err := apiClient.ReportMetrics(cfg.AgentID, report); err != nil {
return fmt.Errorf("failed to report storage metrics: %w", err)
}
}
// handleScanSystem - FIXED
if len(result.Updates) > 0 {
report := client.MetricsReport{
CommandID: commandID,
Timestamp: time.Now(),
Metrics: result.Updates, // ✅ System data sent as metrics
}
if err := apiClient.ReportMetrics(cfg.AgentID, report); err != nil {
return fmt.Errorf("failed to report system metrics: %w", err)
}
}
// handleScanDocker - FIXED
if len(result.Updates) > 0 {
report := client.DockerReport{
CommandID: commandID,
Timestamp: time.Now(),
Images: result.Updates, // ✅ Docker data sent separately
}
if err := apiClient.ReportDockerImages(cfg.AgentID, report); err != nil {
return fmt.Errorf("failed to report docker images: %w", err)
}
}
1.2 Server-Side New Endpoints
// NEW: Separate endpoints for different data types
// 1. ReportMetrics - for system/storage metrics
func (h *MetricsHandler) ReportMetrics(c *gin.Context) {
agentID := c.MustGet("agent_id").(uuid.UUID)
// ✅ Full security validation (nonce, command validation)
if err := validateNonce(c); err != nil {
c.JSON(http.StatusUnauthorized, gin.H{"error": "invalid nonce"})
return
}
var req models.MetricsReportRequest
if err := c.ShouldBindJSON(&req); err != nil {
c.JSON(http.StatusBadRequest, gin.H{"error": err.Error()})
return
}
// ✅ Verify command exists and belongs to agent
command, err := h.commandQueries.GetCommandByID(req.CommandID)
if err != nil || command.AgentID != agentID {
c.JSON(http.StatusForbidden, gin.H{"error": "unauthorized command"})
return
}
// Store in metrics table, NOT updates table
if err := h.metricsQueries.CreateMetrics(agentID, req.Metrics); err != nil {
c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to store metrics"})
return
}
c.JSON(http.StatusOK, gin.H{"message": "metrics recorded", "count": len(req.Metrics)})
}
// 2. ReportDockerImages - for Docker image information
func (h *DockerHandler) ReportDockerImages(c *gin.Context) {
// Similar security pattern, stores in docker_images table
}
// 3. ReportUpdates - ONLY for actual package updates (RESTRICTED)
func (h *UpdateHandler) ReportUpdates(c *gin.Context) {
// Existing endpoint, but add validation to only accept package types:
// - apt, dnf, winget, windows_update
// - Reject: storage, system, docker_image types
}
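The allow-list described for ReportUpdates can be mirrored on the TypeScript side, for example as a guard before rendering update rows. The sketch below is not the server implementation; `isPackageType`, `partitionReport`, and the sample item names are hypothetical, while the four accepted types come from the comment above.

```typescript
// Package types the restricted ReportUpdates endpoint is meant to accept.
const PACKAGE_TYPES = ["apt", "dnf", "winget", "windows_update"] as const;
type PackageType = typeof PACKAGE_TYPES[number];

// Type guard: true only for values on the allow-list.
function isPackageType(value: string): value is PackageType {
  return (PACKAGE_TYPES as readonly string[]).indexOf(value) !== -1;
}

// Minimal shape for illustration; real report items carry more fields.
interface ReportedItem {
  package_type: string;
  package_name: string;
}

// Separates real package updates from items that belong elsewhere
// (storage, system, docker_image).
function partitionReport(
  items: ReportedItem[]
): { accepted: ReportedItem[]; rejected: ReportedItem[] } {
  const accepted: ReportedItem[] = [];
  const rejected: ReportedItem[] = [];
  for (const item of items) {
    (isPackageType(item.package_type) ? accepted : rejected).push(item);
  }
  return { accepted, rejected };
}

// Illustrative items; the package and mount names are made up.
const partitioned = partitionReport([
  { package_type: "apt", package_name: "example-package" },
  { package_type: "storage", package_name: "/" },
  { package_type: "docker_image", package_name: "example:latest" },
]);
```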
1.3 New Data Models
// models/metrics.go
type MetricsReportRequest struct {
CommandID string `json:"command_id"`
Timestamp time.Time `json:"timestamp"`
Metrics []Metric `json:"metrics"`
}
type Metric struct {
PackageType string `json:"package_type"` // "storage", "system", "cpu", "memory"
PackageName string `json:"package_name"` // mount point, metric name
CurrentVersion string `json:"current_version"` // current usage, value
AvailableVersion string `json:"available_version"` // available space, threshold
Severity string `json:"severity"` // "low", "moderate", "high"
RepositorySource string `json:"repository_source"`
Metadata map[string]string `json:"metadata"`
}
// models/docker.go
type DockerReportRequest struct {
CommandID string `json:"command_id"`
Timestamp time.Time `json:"timestamp"`
Images []DockerImage `json:"images"`
}
type DockerImage struct {
PackageType string `json:"package_type"` // "docker_image"
PackageName string `json:"package_name"` // image name:tag
CurrentVersion string `json:"current_version"` // current image ID
AvailableVersion string `json:"available_version"` // latest image ID
Severity string `json:"severity"` // update severity
RepositorySource string `json:"repository_source"` // registry
Metadata map[string]string `json:"metadata"`
}
Phase 2: Live Operations Refactor
2.1 Agent-Centric Design
Live Operations After Refactor:
Live Operations
🖥️ Agent 001 (fedora-server)
Status: 🟢 scan_updates • 45s • 3/5 subsystems complete
Last Action: APT scanning (12s)
▼ Quick Details
└── 🔄 APT: Scanning | ✅ Docker: Complete | 🔄 System: Scanning
🖥️ Agent 002 (ubuntu-workstation)
Status: 🟢 Heartbeat • 2m active
Last Action: System check (2m ago)
▼ Quick Details
└── 💓 Heartbeat monitoring active
🖥️ Agent 007 (docker-host)
Status: 🟢 Self-upgrade • 1m 30s
Last Action: Downloading v0.1.23 (1m ago)
▼ Quick Details
└── ⬇️ Downloading: 75% complete
Key Changes:
- ✅ Only show active agents (no idle ones)
- ✅ Agent-centric view, not operation-centric
- ✅ Group operations by agent
- ✅ Quick expandable details per agent
- ✅ Frosted glass UI consistency
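The grouping and the "3/5 subsystems complete" line in the mockup could be derived roughly as follows. This is a sketch under assumptions: the state names, the `AgentOperation` and `SubsystemStatus` shapes, and both helper names are hypothetical and not part of the current codebase.

```typescript
// Hypothetical per-subsystem states; the real status values may differ.
type SubsystemState = "pending" | "scanning" | "complete" | "failed";

interface SubsystemStatus {
  subsystem: string;
  state: SubsystemState;
}

interface AgentOperation {
  agent_id: string;
  command_type: string;
}

// Groups operations by agent so the view renders one card per agent.
function groupOperationsByAgent(
  ops: AgentOperation[]
): Record<string, AgentOperation[]> {
  const grouped: Record<string, AgentOperation[]> = {};
  for (const op of ops) {
    const bucket = grouped[op.agent_id] || [];
    bucket.push(op);
    grouped[op.agent_id] = bucket;
  }
  return grouped;
}

// Builds the summary text shown in the agent's status line.
function summarizeSubsystems(statuses: SubsystemStatus[]): string {
  const done = statuses.filter((s) => s.state === "complete").length;
  return `${done}/${statuses.length} subsystems complete`;
}

// Illustrative scan progress for one agent.
const scanSummary = summarizeSubsystems([
  { subsystem: "apt", state: "scanning" },
  { subsystem: "dnf", state: "complete" },
  { subsystem: "docker", state: "complete" },
  { subsystem: "storage", state: "complete" },
  { subsystem: "system", state: "scanning" },
]);
```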
2.2 Live Operations UI Component (Reuse Existing Infrastructure)
const LiveOperations: React.FC = () => {
// Reuse existing agent hooks from Agents.tsx
const { data: agents } = useAgents();
const { data: agentCommands } = useActiveCommands();
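// Assumes a fleet-wide form of useHeartbeatStatus() that returns a map keyed by agent ID;
// the existing hook shown earlier takes a single agentId
const { data: heartbeatStatus } = useHeartbeatStatus();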
// Filter for active agents only (reuse existing logic)
const activeAgents = agents?.filter(agent => {
const hasActiveCommands = agentCommands?.some(cmd => cmd.agent_id === agent.id);
const hasActiveHeartbeat = heartbeatStatus?.[agent.id]?.enabled && heartbeatStatus?.[agent.id]?.active;
return hasActiveCommands || hasActiveHeartbeat;
}) || [];
return (
<div className="space-y-3">
{activeAgents.map(agent => {
// Reuse existing heartbeat status logic
const agentHeartbeat = heartbeatStatus?.[agent.id];
const isRapidPolling = agentHeartbeat?.enabled && agentHeartbeat?.active;
const heartbeatSource = agentHeartbeat?.source;
const isSystemHeartbeat = heartbeatSource === 'system';
const isManualHeartbeat = heartbeatSource === 'manual';
// Reuse existing command filtering logic
const agentCommandsList = agentCommands?.filter(cmd => cmd.agent_id === agent.id) || [];
const currentAction = agentCommandsList[0]?.command_type || 'heartbeat';
const operationDuration = agentCommandsList[0]?.duration || 0;
return (
<div key={agent.id} className="frosted-card p-4 border border-white/10">
<div className="flex items-center justify-between">
<div className="flex items-center gap-3">
{/* Reuse existing status indicator */}
<div className={cn(
'w-3 h-3 rounded-full',
isOnline(agent.last_seen) ? 'bg-green-500' : 'bg-gray-400'
)} />
<Computer className="h-5 w-5 text-blue-400" />
<span className="font-medium text-white">{agent.name}</span>
<span className="text-sm text-gray-400">{agent.hostname}</span>
{/* Reuse existing status badge */}
<span className={cn('badge', getStatusColor(isOnline(agent.last_seen) ? 'online' : 'offline'))}>
{isOnline(agent.last_seen) ? 'Online' : 'Offline'}
</span>
</div>
<div className="flex items-center gap-4">
{/* Reuse existing heartbeat indicator with colors */}
<Activity className={cn(
'h-4 w-4',
isRapidPolling && isSystemHeartbeat ? 'text-blue-600 animate-pulse' :
isRapidPolling && isManualHeartbeat ? 'text-pink-600 animate-pulse' :
'text-green-600'
)} />
<span className="px-3 py-1 bg-blue-500/20 rounded-full text-sm text-blue-300">
{currentAction}
</span>
<span className="text-sm text-gray-400">
{formatDuration(operationDuration)}
</span>
</div>
</div>
{/* Reuse existing heartbeat status info */}
{isRapidPolling && (
<div className="mt-2 text-sm text-gray-400">
{isSystemHeartbeat ? 'System ' : 'Manual '}heartbeat active
• Last seen: {formatRelativeTime(agent.last_seen)}
</div>
)}
{/* Show active command details */}
{agentCommandsList.map(cmd => (
<div key={cmd.id} className="mt-2 text-sm text-gray-400">
{cmd.command_type} • {formatRelativeTime(cmd.created_at)}
</div>
))}
</div>
);
})}
</div>
);
};
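The component calls `formatDuration`, which this document does not define. A possible shape, matching the "45s", "1m 30s", and "2m" strings in the mockup, is sketched below; it assumes the duration is a number of seconds, and `formatDurationSeconds` is a hypothetical name.

```typescript
// Formats a duration given in seconds (an assumption) as e.g. "45s",
// "1m 30s", or "1h 2m 5s". Zero-valued units are omitted, except that a
// zero or invalid duration is shown as "0s".
function formatDurationSeconds(totalSeconds: number): string {
  if (!isFinite(totalSeconds) || totalSeconds <= 0) {
    return "0s";
  }
  const whole = Math.floor(totalSeconds);
  const hours = Math.floor(whole / 3600);
  const minutes = Math.floor((whole % 3600) / 60);
  const seconds = whole % 60;

  const parts: string[] = [];
  if (hours > 0) {
    parts.push(`${hours}h`);
  }
  if (minutes > 0) {
    parts.push(`${minutes}m`);
  }
  if (seconds > 0 || parts.length === 0) {
    parts.push(`${seconds}s`);
  }
  return parts.join(" ");
}

const scanElapsed = formatDurationSeconds(45);
const upgradeElapsed = formatDurationSeconds(90);
```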
Phase 3: Agent Pages Integration
3.1 Data Flow to Existing Pages
scan_docker → Updates & Packages tab (shows Docker images properly)
scan_storage → Storage & Disks tab (live disk usage updates)
scan_system → Overview tab (live CPU, memory, uptime updates)
scan_updates → Updates & Packages tab (only actual package updates)
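The mapping above can be kept as a lookup table, for instance to decide which tabs need a refetch after a set of scans completes. This is an illustrative sketch; `SCAN_DESTINATION`, `destinationTab`, and `tabsToRefresh` are hypothetical names.

```typescript
type ScanCommand = "scan_docker" | "scan_storage" | "scan_system" | "scan_updates";
type AgentTab = "Overview" | "Storage & Disks" | "Updates & Packages";

// Destination tab for each scan command, as listed in the data flow.
const SCAN_DESTINATION: Record<ScanCommand, AgentTab> = {
  scan_docker: "Updates & Packages",
  scan_storage: "Storage & Disks",
  scan_system: "Overview",
  scan_updates: "Updates & Packages",
};

// Returns the tab for a command type, or undefined for non-scan commands.
function destinationTab(commandType: string): AgentTab | undefined {
  return Object.prototype.hasOwnProperty.call(SCAN_DESTINATION, commandType)
    ? SCAN_DESTINATION[commandType as ScanCommand]
    : undefined;
}

// Returns each affected tab once, in order of first appearance.
function tabsToRefresh(completedCommands: string[]): AgentTab[] {
  const tabs: AgentTab[] = [];
  for (const commandType of completedCommands) {
    const tab = destinationTab(commandType);
    if (tab !== undefined && tabs.indexOf(tab) === -1) {
      tabs.push(tab);
    }
  }
  return tabs;
}

const affectedTabs = tabsToRefresh([
  "scan_docker",
  "scan_updates",
  "scan_storage",
  "enable_heartbeat",
]);
```

Both `scan_docker` and `scan_updates` land on the same tab, so it appears only once in the result, and the heartbeat command is ignored.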
3.2 Storage & Disks Tab Enhancement
// NEW: Live storage data integration
const StorageDisksTab: React.FC<{agentId: string}> = ({ agentId }) => {
const { data: storageData } = useQuery({
queryKey: ['agent-storage', agentId],
queryFn: () => agentApi.getStorageMetrics(agentId),
refetchInterval: 30000, // Update every 30s during live operations
});
return (
<div className="space-y-6">
<StorageChart data={storageData?.storage} />
<DiskUsageTable mounts={storageData?.storage} />
{/* Live indicator */}
{storageData?.isLive && (
<div className="flex items-center gap-2 text-sm text-green-400">
<div className="w-2 h-2 bg-green-400 rounded-full animate-pulse" />
Live data from recent scan
</div>
)}
</div>
);
};
3.3 Updates & Packages Tab Fix
// FIXED: Only show actual package updates
const UpdatesPackagesTab: React.FC<{agentId: string}> = ({ agentId }) => {
const { data: packageUpdates } = useQuery({
queryKey: ['agent-updates', agentId],
queryFn: () => updatesApi.getPackageUpdates(agentId), // NEW: filters only packages
});
return (
<div>
{/* Shows ONLY: APT: 2 updates, DNF: 1 update */}
{/* NO MORE: STORAGE 44% used, SYSTEM 8 cores */}
<PackageUpdatesList updates={packageUpdates?.updates} />
</div>
);
};
3.4 Overview Tab Enhancement
// NEW: Live system metrics integration
const OverviewTab: React.FC<{agentId: string}> = ({ agentId }) => {
const { data: systemMetrics } = useQuery({
queryKey: ['agent-metrics', agentId],
queryFn: () => agentApi.getSystemMetrics(agentId),
refetchInterval: 30000,
});
return (
<div className="grid grid-cols-1 md:grid-cols-2 gap-6">
<SystemCard metrics={systemMetrics?.system} />
<PerformanceCard data={systemMetrics?.performance} />
{systemMetrics?.isLive && (
<div className="col-span-full flex items-center gap-2 text-sm text-green-400">
<Activity className="h-4 w-4" />
Live system metrics from recent scan
</div>
)}
</div>
);
};
Phase 4: API Endpoints for Agent Pages
4.1 New Agent-Specific Endpoints
// GET /api/v1/agents/{agentId}/storage - for Storage & Disks tab
func (h *AgentHandler) GetStorageMetrics(c *gin.Context) {
agentID := c.Param("agentId")
// Return latest storage scan data from metrics table
// Filter by package_type IN ('storage', 'disk')
}
// GET /api/v1/agents/{agentId}/system - for Overview tab
func (h *AgentHandler) GetSystemMetrics(c *gin.Context) {
agentID := c.Param("agentId")
// Return latest system scan data from metrics table
// Filter by package_type IN ('system', 'cpu', 'memory')
}
// GET /api/v1/agents/{agentId}/packages - for Updates tab (filtered)
func (h *AgentHandler) GetPackageUpdates(c *gin.Context) {
agentID := c.Param("agentId")
// Return ONLY package updates, filter out storage/system metrics
// Filter by package_type IN ('apt', 'dnf', 'winget', 'windows_update')
}
// GET /api/v1/agents/{agentId}/docker - for Docker updates
func (h *AgentHandler) GetDockerImages(c *gin.Context) {
agentID := c.Param("agentId")
// Return Docker image updates from docker_images table
}
// GET /api/v1/agents/active - for Live Operations page
func (h *AgentHandler) GetActiveAgents(c *gin.Context) {
// Return only agents with:
// - Active commands (status != 'completed')
// - Recent heartbeat (< 5 minutes)
// - Self-upgrade in progress
}
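The three criteria for an active agent can be stated as one predicate. The sketch below expresses them in TypeScript for clarity and follows the comment literally, so any status other than 'completed' counts as active; `AgentActivity` and `isActiveAgent` are hypothetical, and the real filtering is expected to happen in the Go handler.

```typescript
// Hypothetical summary of one agent's activity; field names are assumptions.
interface AgentActivity {
  id: string;
  commandStatuses: string[]; // status of each command issued to this agent
  lastHeartbeat?: string; // ISO timestamp of the most recent heartbeat
  selfUpgradeInProgress: boolean;
}

// "Recent heartbeat" window from the endpoint description: 5 minutes.
const HEARTBEAT_WINDOW_MS = 5 * 60 * 1000;

// True when the agent meets at least one of the three listed criteria.
function isActiveAgent(agent: AgentActivity, now: Date): boolean {
  const hasActiveCommand = agent.commandStatuses.some(
    (commandStatus) => commandStatus !== "completed"
  );
  const heartbeatMs = agent.lastHeartbeat
    ? new Date(agent.lastHeartbeat).getTime()
    : NaN;
  const hasRecentHeartbeat =
    !isNaN(heartbeatMs) && now.getTime() - heartbeatMs < HEARTBEAT_WINDOW_MS;
  return hasActiveCommand || hasRecentHeartbeat || agent.selfUpgradeInProgress;
}

const checkTime = new Date("2025-11-03T12:00:00Z");
const idleAgent: AgentActivity = {
  id: "agent-idle",
  commandStatuses: ["completed"],
  lastHeartbeat: "2025-11-03T11:30:00Z",
  selfUpgradeInProgress: false,
};
const scanningAgent: AgentActivity = {
  id: "agent-scanning",
  commandStatuses: ["completed", "sent"],
  selfUpgradeInProgress: false,
};
```

Note that under this literal reading a failed command also keeps an agent in the list; whether that is desired is a decision for the endpoint.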
Phase 5: UI/UX Consistency
5.1 Frosted Glass Design System
/* Frosted glass component library */
.frosted-card {
background: rgba(255, 255, 255, 0.05);
backdrop-filter: blur(12px);
border: 1px solid rgba(255, 255, 255, 0.1);
border-radius: 12px;
transition: all 0.3s ease;
}
.frosted-card:hover {
background: rgba(255, 255, 255, 0.08);
transform: translateY(-1px);
box-shadow: 0 8px 32px rgba(0, 0, 0, 0.2);
}
.live-indicator {
animation: pulse 2s infinite;
}
@keyframes pulse {
0%, 100% { opacity: 1; }
50% { opacity: 0.5; }
}
5.2 Agent Health UI Rework (Future)
Agent Health Tab (Planned Enhancements):
┌─ System Health ────────────────────────┐
│ 🟢 CPU: Normal (15% usage) │
│ 🟢 Memory: Normal (51% usage) │
│ 🟢 Disk: Normal (44% used) │
│ 🟢 Network: Connected (100ms latency) │
│ 🟢 Uptime: 4 days, 12 hours │
└─────────────────────────────────────────┘
┌─ Agent Health ────────────────────────┐
│ 🟢 Version: v0.1.22 │
│ 🟢 Last Check-in: 2 minutes ago │
│ 🟢 Commands: 1 active │
│ 🟢 Success Rate: 98.4% (247/251) │
│ 🟢 Errors: None in last 24h │
└─────────────────────────────────────────┘
┌─ Recent Activity Timeline ─────────────┐
│ ✅ scan_updates completed • 2m ago │
│ ✅ package install: 7zip • 1h ago │
│ ❌ scan_docker failed • 2h ago │
│ ✅ heartbeat received • 2m ago │
└─────────────────────────────────────────┘
🚀 Implementation Priority
Priority 1: Critical Data Classification Fix
- ✅ Create new API endpoints: ReportMetrics(), ReportDockerImages()
- ✅ Fix agent subsystem handlers to use correct endpoints
- ✅ Update UpdateReportRequest model to add validation
- ✅ Create separate database tables: metrics, docker_images
Priority 2: Live Operations Refactor
- ✅ Implement agent-centric view (active agents only)
- ✅ Create GetActiveAgents() endpoint
- ✅ Apply frosted glass UI consistency
- ✅ Add subsystem status aggregation
Priority 3: Agent Pages Integration
- ✅ Create agent-specific endpoints for storage, system, packages, docker
- ✅ Update Storage & Disks tab to show live metrics
- ✅ Fix Updates & Packages tab to filter out non-packages
- ✅ Enhance Overview tab with live system metrics
Priority 4: UI Polish
- ✅ Apply frosted glass consistency across all pages
- ✅ Add live data indicators during active operations
- ✅ Refine Agent Health tab (future task)
- ✅ Add loading states and transitions
🎯 Success Criteria
Data Classification Fixed:
- ✅ Updates page shows only package updates (APT: 2, DNF: 1)
- ✅ No more "STORAGE 44% used" showing as updates
- ✅ Storage metrics appear in Storage & Disks tab only
- ✅ System metrics appear in Overview tab only
Live Operations Improved:
- ✅ Shows only active agents (no idle ones)
- ✅ Agent-centric view with operation grouping
- ✅ Frosted glass UI consistency
- ✅ Real-time status updates every 5 seconds
Agent Pages Enhanced:
- ✅ Storage & Disks shows live data during scans
- ✅ Overview shows live system metrics
- ✅ Updates shows only actual package updates
- ✅ Live data indicators during active operations
Security Maintained:
- ✅ All new endpoints use existing nonce validation
- ✅ Command validation enforced
- ✅ No WebSockets (maintains security profile)
- ✅ Agent authentication preserved
📋 Testing Checklist
- Verify scan_storage data goes to Storage & Disks tab, not Updates
- Verify scan_system data goes to Overview tab, not Updates
- Verify scan_docker data appears correctly in Updates tab
- Verify Live Operations shows only active agents
- Verify stuck scan_updates operations are resolved
- Verify frosted glass UI consistency across pages
- Verify security validation on all new endpoints
- Verify live data updates during active operations
- Verify existing functionality remains intact
🔧 Migration Notes
- Database Changes Required:
  - New metrics table for storage/system data
  - New docker_images table for Docker data
  - Update existing update_events constraints to reject non-package types
- Agent Deployment:
  - Requires agent binary update (v0.1.23+)
  - Backward compatibility maintained during transition
  - Old agents will continue to work but data classification issues persist
- UI Deployment:
  - Frontend changes independent of backend
  - Can deploy gradually per page
  - Live Operations changes first, then Agent pages
📈 Performance Impact
- Reduced database load: Proper data classification reduces query complexity
- Improved UI responsiveness: Active agent filtering reduces DOM elements
- Better user experience: Agent-centric view scales to 100+ agents
- Enhanced security: No WebSocket attack surface
Document created: 2025-11-03
Author: Claude Code Assistant
Version: 1.0