Files
Redflag/docs/historical/SOMEISSUES_v0.1.26.md

13 KiB

RedFlag v0.1.26.0 - Technical Issues and Technical Debt Audit

Document Version: 1.0 Date: 2025-12-19 Scope: Post-Issue#3 Implementation Audit Status: ACTIVE ISSUES requiring immediate resolution


Executive Summary

During the implementation of Issue #3 (subsystem tracking) and the command recovery fix, we identified critical architectural issues that violate ETHOS principles and create user-facing bugs. This document catalogs all issues, their root causes, and required fixes.

Issues by Severity:

  • 🔴 CRITICAL: 3 issues (user-facing bugs, data corruption risk)
  • 🟡 HIGH: 4 issues (technical debt, maintenance burden)
  • 🟢 MEDIUM: 2 issues (code quality, naming violations)

🔴 CRITICAL ISSUES (User-Facing)

1. Storage Scans Appearing as Package Updates

Severity: 🔴 CRITICAL User Impact: HIGH ETHOS Violations: #1 (Errors are History - data in wrong place), #5 (No BS - misleading UI)

Problem: Storage scan results (handleScanStorage) are appearing on the Updates page alongside package updates. Users see disk usage metrics (partition sizes, mount points) mixed with apt/dnf package updates.

Root Cause: handleScanStorage in aggregator-agent/cmd/agent/subsystem_handlers.go calls ReportLog() which stores entries in update_logs table, the same table used for package updates.

Location:

  • Agent: aggregator-agent/cmd/agent/subsystem_handlers.go:119-123
// Report the scan log (WRONG - this goes to update_logs table)
if err := reportLogWithAck(apiClient, cfg, ackTracker, logReport); err != nil {
    log.Printf("Failed to report scan log: %v\n", err)
}

Correct Behavior: Storage scans should ONLY report to /api/v1/agents/:id/storage-metrics endpoint, which stores in dedicated storage_metrics table.

Fix Required:

  1. Comment out/remove the ReportLog call in handleScanStorage (lines 119-123)
  2. Verify ReportStorageMetrics call (lines 162-164) is working
  3. Register missing route for GET /api/v1/agents/:id/storage-metrics if not already registered

Verification Steps:

  • Trigger storage scan from UI
  • Verify NO new entries appear on Updates page
  • Verify data appears on Storage page
  • Check storage_metrics table has new rows

2. System Scans Appearing as Package Updates

Severity: 🔴 CRITICAL User Impact: HIGH ETHOS Violations: #1, #5

Problem: System scan results (CPU, memory, processes, uptime) are appearing on Updates page as LOW severity package updates.

User Report: "On the Updates tab, the top 6-7 'updates' are system specs, not system packages. They are HD details or processes, or partition sizes."

Root Cause: handleScanSystem also calls ReportLog() storing in update_logs table.

Location:

  • Agent: aggregator-agent/cmd/agent/subsystem_handlers.go:207-211
// Report the scan log (WRONG - this goes to update_logs table)
if err := reportLogWithAck(apiClient, cfg, ackTracker, logReport); err != nil {
    log.Printf("Failed to report scan log: %v\n", err)
}

Correct Behavior: System scans should ONLY report to /api/v1/agents/:id/metrics endpoint.

Fix Required:

  1. Comment out/remove the ReportLog call in handleScanSystem (lines 207-211)
  2. Verify ReportMetrics call is working
  3. Register missing route for GET endpoint if needed

3. Duplicate "Scan All" Entries in History

Severity: 🔴 CRITICAL User Impact: MEDIUM ETHOS Violations: #1 (duplicate history entries), #4 (not idempotent)

Problem: When triggering a full system scan (handleScanUpdatesV2), users see TWO entries:

  • One generic "scan updates" collective entry
  • Plus individual entries for each subsystem

Root Cause: handleScanUpdatesV2 creates a collective log (lines 44-57) while orchestrator also logs individual scan results via individual handlers.

Location:

  • Agent: aggregator-agent/cmd/agent/subsystem_handlers.go:44-63
// Create scan log entry with subsystem metadata (COLLECTIVE)
logReport := client.LogReport{
    CommandID:       commandID,
    Action:          "scan_updates",
    Result:          map[bool]string{true: "success", false: "failure"}[exitCode == 0],
    // ...
}
// Report the scan log
if err := reportLogWithAck(apiClient, cfg, ackTracker, logReport); err != nil {
    log.Printf("Failed to report scan log: %v\n", err)
}

Fix Required:

  1. Comment out lines 44-63 (remove collective logging from handleScanUpdatesV2)
  2. Keep individual subsystem logging (lines 60, 121, 209, 291)

Verification: After fix, only individual subsystem entries should appear (scan_docker, scan_storage, scan_system, etc.)


🟡 HIGH PRIORITY ISSUES (Technical Debt)

4. Missing Route Registration for Storage Metrics Endpoint

Severity: 🟡 HIGH Impact: Storage page empty ETHOS Violations: #3 (Assume Failure), #4 (Idempotency - retry won't work without route)

Problem: Backend has handler functions but routes are not registered. Agent cannot report storage metrics.

Location:

  • Handler exists: aggregator-server/internal/api/handlers/storage_metrics.go:26,75
  • Missing: Route registration in router setup

Handlers Without Routes:

// Exists but not wired to HTTP routes:
func (h *StorageMetricsHandler) ReportStorageMetrics(c *gin.Context) // POST
func (h *StorageMetricsHandler) GetStorageMetrics(c *gin.Context)   // GET

Fix Required: Find route registration file (likely cmd/server/main.go or internal/api/server.go) and add:

agentGroup := router.Group("/api/v1/agents", middleware...)
agentGroup.POST("/:id/storage-metrics", storageMetricsHandler.ReportStorageMetrics)
agentGroup.GET("/:id/storage-metrics", storageMetricsHandler.GetStorageMetrics)

5. Route Registration for Metrics Endpoint

Severity: 🟡 HIGH Impact: System page potentially empty

Problem: Similar to #4, /api/v1/agents/:id/metrics endpoint may not be registered.

Location: Need to verify routes exist for system metrics reporting.


6. Database Migration Not Applied

Severity: 🟡 HIGH Impact: Subsystem column doesn't exist, subsystem queries will fail

Problem: Migration 022_add_subsystem_to_logs.up.sql created but not run. Server code references subsystem column which doesn't exist.

Files:

  • Created: aggregator-server/internal/database/migrations/022_add_subsystem_to_logs.up.sql
  • Referenced: aggregator-server/internal/models/update.go:61
  • Referenced: aggregator-server/internal/api/handlers/updates.go:226-230

Verification:

\d update_logs
-- Should show: subsystem | varchar(50) |

Fix Required:

cd aggregator-server
go run cmd/server/main.go -migrate

🟢 MEDIUM PRIORITY ISSUES (Code Quality)

7. Frontend File Duplication - Marketing Fluff Naming

Severity: 🟢 MEDIUM ETHOS Violations: #5 (No Marketing Fluff - "enhanced" is banned), Technical Debt

Problem: Duplicate files with marketing fluff naming.

Files:

  • aggregator-web/src/components/AgentUpdates.tsx (236 lines - old/simple version)
  • aggregator-web/src/components/AgentUpdatesEnhanced.tsx (567 lines - current version)
  • aggregator-web/src/components/AgentUpdate.tsx (Agent binary updater - legitimate)

ETHOS Violation: From ETHOS.md line 67: Banned Words: enhanced, enterprise-ready, seamless, robust, production-ready, revolutionary, etc.

Quote from ETHOS:

"We are building an 'honest' tool for technical users, not pitching a product. Fluff hides meaning and creates enterprise BS."

Fix Required:

# Remove old duplicate
cd aggregator-web/src/components
rm AgentUpdates.tsx

# Rename to remove marketing fluff
mv AgentUpdatesEnhanced.tsx AgentUpdates.tsx

# Search and replace all imports
grep -r "AgentUpdatesEnhanced" src/ --include="*.tsx" --include="*.ts"
# Replace with "AgentUpdates"

Verification: Application builds and runs with renamed component.


8. Backend V2 Naming Pattern - Poor Refactoring

Severity: 🟢 MEDIUM ETHOS Violations: #5 (No BS), Technical Debt

Problem: handleScanUpdatesV2 suggests V1 exists or poor refactoring.

Location: aggregator-agent/cmd/agent/subsystem_handlers.go:28

Historical Context: Likely created during orchestrator refactor. Old version should have been removed/replaced, not versioned.

Quote from ETHOS (line 59-60):

"Never use banned words or emojis in logs or code. We are building an 'honest' tool..."

Fix Required:

  1. Check if handleScanUpdates (V1) exists anywhere
  2. If V1 doesn't exist: rename handleScanUpdatesV2 to handleScanUpdates
  3. Update all references in command routing

Original Issues (Already Fixed)

Command Status Bug (Priority 1 - FIXED)

File: aggregator-server/internal/api/handlers/agents.go:446

Problem: MarkCommandSent() error was not checked. Commands returned to agent but stayed in 'pending' status, causing infinite re-delivery.

Fix Applied:

  1. Added GetStuckCommands() query to recover stuck commands
  2. Modified check-in handler to recover commands older than 5 minutes
  3. Added proper error handling with [HISTORY] logging
  4. Changed source from "web_ui" to "manual" to match DB constraint

Verification: Build successful, ready for testing


Issue #3 - Subsystem Tracking (Priority 2 - IMPLEMENTED)

Status: Backend implementation complete, pending database migration

Files Modified:

  1. Migration created: 022_add_subsystem_to_logs.up.sql
  2. Models updated: UpdateLog and UpdateLogRequest with Subsystem field
  3. Backend handlers updated to extract subsystem from action
  4. Agent client updated to send subsystem from metadata
  5. Query functions added: GetLogsByAgentAndSubsystem(), GetSubsystemStats()

Pending:

  1. Run database migration
  2. Verify frontend receives subsystem data
  3. Test all 7 subsystems independently

Complete Fix Sequence

Phase 1: Critical User-Facing Bugs (MUST DO NOW)

  1. Fix #1: Comment out ReportLog in handleScanStorage (lines 119-123)
  2. Fix #2: Comment out ReportLog in handleScanSystem (lines 207-211)
  3. Fix #3: Comment out collective logging in handleScanUpdatesV2 (lines 44-63)
  4. Fix #4: Register storage-metrics routes
  5. Fix #5: Register metrics routes

Phase 2: Database & Technical Debt

  1. Fix #6: Run migration 022_add_subsystem_to_logs
  2. Fix #7: Remove AgentUpdates.tsx, rename AgentUpdatesEnhanced.tsx
  3. Fix #8: Remove V2 suffix from handleScanUpdates (if no V1 exists)

Phase 3: Verification

  1. Test storage scan - should appear ONLY on Storage page
  2. Test system scan - should appear ONLY on System page
  3. Test full scan - should show individual subsystem entries only
  4. Verify history shows proper subsystem names

ETHOS Compliance Checklist

For each fix, we must verify:

  • ETHOS #1: All errors logged with context, no /dev/null
  • ETHOS #2: No new unauthenticated endpoints
  • ETHOS #3: Fallback paths exist (retry logic, circuit breakers)
  • ETHOS #4: Idempotency verified (run 3x safely)
  • ETHOS #5: No marketing fluff (no "enhanced", "robust", etc.)
  • Pre-Integration: History logging added, security review, tests

Files to Delete/Rename

Delete These Files:

  • aggregator-web/src/components/AgentUpdates.tsx (236 lines, old version)

Rename These Files:

  • aggregator-agent/cmd/agent/subsystem_handlers.go:28 - rename handleScanUpdatesV2handleScanUpdates
  • aggregator-web/src/components/AgentUpdatesEnhanced.tsxAgentUpdates.tsx

Lines to Comment Out:

  • aggregator-agent/cmd/agent/subsystem_handlers.go:44-63 (collective logging)
  • aggregator-agent/cmd/agent/subsystem_handlers.go:119-123 (ReportLog in storage)
  • aggregator-agent/cmd/agent/subsystem_handlers.go:207-211 (ReportLog in system)

Routes to Add:

  • POST /api/v1/agents/:id/storage-metrics
  • GET /api/v1/agents/:id/storage-metrics
  • Verify GET /api/v1/agents/:id/metrics exists

Session Documentation Requirements

As per ETHOS.md: Every session must identify and document:

  1. New Technical Debt:

    • Route registration missing (assumed but not implemented)
    • Duplicate frontend files (poor refactoring)
    • V2 naming pattern (poor version control)
  2. Deferred Features:

    • Frontend subsystem icons and display names
    • Comprehensive testing of all 7 subsystems
  3. Known Issues:

    • Database migration not applied in test environment
    • Storage/System pages empty due to missing routes
  4. Architecture Decisions:

    • Decision to keep both collective and individual scan patterns
    • Justification: Different user intents (full audit vs single check)

Conclusion

Total Issues: 8 (3 critical, 4 high, 1 medium) Fixes Required: 8 code changes, 3 deletions, 2 renames Estimated Time: 2-3 hours for all fixes and verification Status: Ready for implementation

Next Action: Implement Phase 1 fixes (critical user-facing bugs) immediately.


Document Maintained By: Development Team Last Updated: 2025-12-19 Session: Issue #3 Implementation & Command Recovery Fix