# RedFlag v0.1.26.0 - Technical Issues and Technical Debt Audit

**Document Version**: 1.0
**Date**: 2025-12-19
**Scope**: Post-Issue#3 Implementation Audit
**Status**: ACTIVE ISSUES requiring immediate resolution

---

## Executive Summary

During the implementation of Issue #3 (subsystem tracking) and the command recovery fix, we identified **critical architectural issues** that violate ETHOS principles and create user-facing bugs. This document catalogs all issues, their root causes, and required fixes.

**Issues by Severity**:
- 🔴 **CRITICAL**: 3 issues (user-facing bugs, data corruption risk)
- 🟡 **HIGH**: 3 issues (technical debt, maintenance burden)
- 🟢 **MEDIUM**: 2 issues (code quality, naming violations)

---

## 🔴 CRITICAL ISSUES (User-Facing)

### 1. Storage Scans Appearing as Package Updates

**Severity**: 🔴 CRITICAL
**User Impact**: HIGH
**ETHOS Violations**: #1 (Errors are History - data in wrong place), #5 (No BS - misleading UI)

**Problem**: Storage scan results (`handleScanStorage`) are appearing on the Updates page alongside package updates. Users see disk usage metrics (partition sizes, mount points) mixed with apt/dnf package updates.

**Root Cause**: `handleScanStorage` in `aggregator-agent/cmd/agent/subsystem_handlers.go` calls `ReportLog()`, which stores entries in the `update_logs` table, the same table used for package updates.

**Location**:
- Agent: `aggregator-agent/cmd/agent/subsystem_handlers.go:119-123`
```go
// Report the scan log (WRONG - this goes to update_logs table)
if err := reportLogWithAck(apiClient, cfg, ackTracker, logReport); err != nil {
	log.Printf("Failed to report scan log: %v\n", err)
}
```

**Correct Behavior**: Storage scans should ONLY report to the `/api/v1/agents/:id/storage-metrics` endpoint, which stores data in the dedicated `storage_metrics` table.

**Fix Required**:
1. Comment out/remove the `ReportLog` call in `handleScanStorage` (lines 119-123)
2. Verify `ReportStorageMetrics` call (lines 162-164) is working
3. Register missing route for GET `/api/v1/agents/:id/storage-metrics` if not already registered

**Verification Steps**:
- Trigger storage scan from UI
- Verify NO new entries appear on Updates page
- Verify data appears on Storage page
- Check `storage_metrics` table has new rows
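
The intended behavior can be sketched in isolation. The following is a minimal illustration, not the agent's actual code: `StorageMetric`, `MetricsReporter`, `reportStorageScan`, and `fakeReporter` are hypothetical names, and the signature of `ReportStorageMetrics` is an assumption. The point is only that the storage path makes a single call to the dedicated metrics endpoint and never writes a scan log.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// StorageMetric is a hypothetical, trimmed-down payload for one partition.
type StorageMetric struct {
	MountPoint string
	UsedBytes  uint64
	TotalBytes uint64
}

// MetricsReporter is an assumed interface over the agent's API client.
// Only the name ReportStorageMetrics comes from this audit; the signature is a guess.
type MetricsReporter interface {
	ReportStorageMetrics(agentID string, metrics []StorageMetric) error
}

// reportStorageScan sends scan results to the dedicated storage-metrics
// endpoint only. It makes no ReportLog call, so nothing lands in update_logs.
func reportStorageScan(r MetricsReporter, agentID string, metrics []StorageMetric) error {
	if len(metrics) == 0 {
		return errors.New("storage scan produced no metrics")
	}
	if err := r.ReportStorageMetrics(agentID, metrics); err != nil {
		return fmt.Errorf("report storage metrics for agent %s: %w", agentID, err)
	}
	return nil
}

// fakeReporter counts calls so the behavior can be checked without a server.
type fakeReporter struct {
	calls int
	fail  bool
}

func (f *fakeReporter) ReportStorageMetrics(agentID string, metrics []StorageMetric) error {
	f.calls++
	if f.fail {
		return errors.New("server unavailable")
	}
	return nil
}

func main() {
	r := &fakeReporter{}
	metrics := []StorageMetric{{MountPoint: "/", UsedBytes: 40 << 30, TotalBytes: 100 << 30}}
	if err := reportStorageScan(r, "agent-1", metrics); err != nil {
		log.Printf("storage scan report failed: %v", err)
	}
	fmt.Println("metrics calls:", r.calls)
}
```

A fake reporter like this could also back a unit test asserting that a storage scan produces exactly one metrics call.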

---

### 2. System Scans Appearing as Package Updates

**Severity**: 🔴 CRITICAL
**User Impact**: HIGH
**ETHOS Violations**: #1, #5

**Problem**: System scan results (CPU, memory, processes, uptime) are appearing on the Updates page as LOW severity package updates.

**User Report**: "On the Updates tab, the top 6-7 'updates' are system specs, not system packages. They are HD details or processes, or partition sizes."

**Root Cause**: `handleScanSystem` also calls `ReportLog()`, storing its results in the `update_logs` table.

**Location**:
- Agent: `aggregator-agent/cmd/agent/subsystem_handlers.go:207-211`
```go
// Report the scan log (WRONG - this goes to update_logs table)
if err := reportLogWithAck(apiClient, cfg, ackTracker, logReport); err != nil {
	log.Printf("Failed to report scan log: %v\n", err)
}
```

**Correct Behavior**: System scans should ONLY report to the `/api/v1/agents/:id/metrics` endpoint.

**Fix Required**:
1. Comment out/remove the `ReportLog` call in `handleScanSystem` (lines 207-211)
2. Verify `ReportMetrics` call is working
3. Register missing route for GET endpoint if needed

---

### 3. Duplicate "Scan All" Entries in History

**Severity**: 🔴 CRITICAL
**User Impact**: MEDIUM
**ETHOS Violations**: #1 (duplicate history entries), #4 (not idempotent)

**Problem**: When triggering a full system scan (`handleScanUpdatesV2`), users see TWO entries:
- One generic "scan updates" collective entry
- Plus individual entries for each subsystem

**Root Cause**: `handleScanUpdatesV2` creates a collective log (lines 44-57) while the orchestrator also logs individual scan results via the individual handlers.

**Location**:
- Agent: `aggregator-agent/cmd/agent/subsystem_handlers.go:44-63`
```go
// Create scan log entry with subsystem metadata (COLLECTIVE)
logReport := client.LogReport{
	CommandID: commandID,
	Action:    "scan_updates",
	Result:    map[bool]string{true: "success", false: "failure"}[exitCode == 0],
	// ...
}
// Report the scan log
if err := reportLogWithAck(apiClient, cfg, ackTracker, logReport); err != nil {
	log.Printf("Failed to report scan log: %v\n", err)
}
```

**Fix Required**:
1. Comment out lines 44-63 (remove collective logging from `handleScanUpdatesV2`)
2. Keep individual subsystem logging (lines 60, 121, 209, 291)

**Verification**: After fix, only individual subsystem entries should appear (scan_docker, scan_storage, scan_system, etc.)
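
As a rough sketch of the target behavior, the snippet below builds history entries for a full scan with one entry per subsystem and no collective `scan_updates` entry. `LogEntry`, `ScanResult`, and `buildScanLogs` are hypothetical stand-ins for the agent's real types (such as `client.LogReport`), and the `scan_<subsystem>` action format is inferred from the entry names listed above.

```go
package main

import "fmt"

// LogEntry is a hypothetical stand-in for the agent's client.LogReport.
type LogEntry struct {
	CommandID string
	Action    string
	Result    string
}

// ScanResult is an assumed per-subsystem outcome produced by the orchestrator.
type ScanResult struct {
	Subsystem string // e.g. "docker", "storage", "system"
	ExitCode  int
}

// buildScanLogs returns one history entry per subsystem and no collective
// "scan_updates" entry, so a full scan is recorded once per subsystem.
func buildScanLogs(commandID string, results []ScanResult) []LogEntry {
	entries := make([]LogEntry, 0, len(results))
	for _, r := range results {
		result := "success"
		if r.ExitCode != 0 {
			result = "failure"
		}
		entries = append(entries, LogEntry{
			CommandID: commandID,
			Action:    "scan_" + r.Subsystem,
			Result:    result,
		})
	}
	return entries
}

func main() {
	logs := buildScanLogs("cmd-123", []ScanResult{
		{Subsystem: "docker", ExitCode: 0},
		{Subsystem: "storage", ExitCode: 0},
		{Subsystem: "system", ExitCode: 1},
	})
	for _, l := range logs {
		fmt.Printf("%s %s %s\n", l.CommandID, l.Action, l.Result)
	}
}
```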

---

## 🟡 HIGH PRIORITY ISSUES (Technical Debt)

### 4. Missing Route Registration for Storage Metrics Endpoint

**Severity**: 🟡 HIGH
**Impact**: Storage page empty
**ETHOS Violations**: #3 (Assume Failure), #4 (Idempotency - retry won't work without route)

**Problem**: Backend has handler functions but routes are not registered. Agent cannot report storage metrics.

**Location**:
- Handler exists: `aggregator-server/internal/api/handlers/storage_metrics.go:26,75`
- **Missing**: Route registration in router setup

**Handlers Without Routes**:
```go
// Exists but not wired to HTTP routes:
func (h *StorageMetricsHandler) ReportStorageMetrics(c *gin.Context) // POST
func (h *StorageMetricsHandler) GetStorageMetrics(c *gin.Context)    // GET
```

**Fix Required**:
Find route registration file (likely `cmd/server/main.go` or `internal/api/server.go`) and add:
```go
agentGroup := router.Group("/api/v1/agents", middleware...)
agentGroup.POST("/:id/storage-metrics", storageMetricsHandler.ReportStorageMetrics)
agentGroup.GET("/:id/storage-metrics", storageMetricsHandler.GetStorageMetrics)
```
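
Because this class of bug (handler exists, route missing) is easy to reintroduce, a small check that compares registered routes against a required list may help. The sketch below is framework-independent and uses only hypothetical names (`Route`, `requiredAgentRoutes`, `missingRoutes`). In a gin-based server the `registered` slice could plausibly be filled from the engine's `Routes()` output, but that wiring is an assumption and is not shown.

```go
package main

import "fmt"

// Route identifies an HTTP route by method and path pattern.
type Route struct {
	Method string
	Path   string
}

// requiredAgentRoutes lists the storage-metrics routes this audit expects.
var requiredAgentRoutes = []Route{
	{"POST", "/api/v1/agents/:id/storage-metrics"},
	{"GET", "/api/v1/agents/:id/storage-metrics"},
}

// missingRoutes returns every required route that is absent from registered.
func missingRoutes(registered, required []Route) []Route {
	seen := make(map[Route]bool, len(registered))
	for _, r := range registered {
		seen[r] = true
	}
	var missing []Route
	for _, r := range required {
		if !seen[r] {
			missing = append(missing, r)
		}
	}
	return missing
}

func main() {
	// Hypothetical snapshot of the router before the fix: only an
	// unrelated, made-up route is registered.
	registered := []Route{{"GET", "/api/v1/agents/:id"}}
	for _, r := range missingRoutes(registered, requiredAgentRoutes) {
		fmt.Printf("missing route: %s %s\n", r.Method, r.Path)
	}
}
```

Run at server startup or in a test, a check of this shape would fail loudly instead of leaving the Storage page silently empty.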

---

### 5. Route Registration for Metrics Endpoint

**Severity**: 🟡 HIGH
**Impact**: System page potentially empty

**Problem**: Similar to #4, the `/api/v1/agents/:id/metrics` endpoint may not be registered.

**Location**: Need to verify routes exist for system metrics reporting.

---

### 6. Database Migration Not Applied

**Severity**: 🟡 HIGH
**Impact**: Subsystem column doesn't exist, subsystem queries will fail

**Problem**: Migration `022_add_subsystem_to_logs.up.sql` was created but not run. Server code references the `subsystem` column, which doesn't exist yet.

**Files**:
- Created: `aggregator-server/internal/database/migrations/022_add_subsystem_to_logs.up.sql`
- Referenced: `aggregator-server/internal/models/update.go:61`
- Referenced: `aggregator-server/internal/api/handlers/updates.go:226-230`

**Verification**:
```sql
\d update_logs
-- Should show: subsystem | varchar(50) |
```

**Fix Required**:
```bash
cd aggregator-server
go run cmd/server/main.go -migrate
```

---

## 🟢 MEDIUM PRIORITY ISSUES (Code Quality)

### 7. Frontend File Duplication - Marketing Fluff Naming

**Severity**: 🟢 MEDIUM
**ETHOS Violations**: #5 (No Marketing Fluff - "enhanced" is banned), Technical Debt

**Problem**: Duplicate files with marketing fluff naming.

**Files**:
- `aggregator-web/src/components/AgentUpdates.tsx` (236 lines - old/simple version)
- `aggregator-web/src/components/AgentUpdatesEnhanced.tsx` (567 lines - current version)
- `aggregator-web/src/components/AgentUpdate.tsx` (Agent binary updater - legitimate)

**ETHOS Violation**:
From ETHOS.md line 67: **Banned Words**: enhanced, enterprise-ready, seamless, robust, production-ready, revolutionary, etc.

**Quote from ETHOS**:
> "We are building an 'honest' tool for technical users, not pitching a product. Fluff hides meaning and creates enterprise BS."

**Fix Required**:
```bash
# Remove old duplicate
cd aggregator-web/src/components
rm AgentUpdates.tsx

# Rename to remove marketing fluff
mv AgentUpdatesEnhanced.tsx AgentUpdates.tsx

# Search and replace all imports
# (go back up to aggregator-web so that src/ resolves)
cd ../..
grep -r "AgentUpdatesEnhanced" src/ --include="*.tsx" --include="*.ts"
# Replace with "AgentUpdates"
```

**Verification**: Application builds and runs with renamed component.

---

### 8. Backend V2 Naming Pattern - Poor Refactoring

**Severity**: 🟢 MEDIUM
**ETHOS Violations**: #5 (No BS), Technical Debt

**Problem**: `handleScanUpdatesV2` suggests V1 exists or poor refactoring.

**Location**: `aggregator-agent/cmd/agent/subsystem_handlers.go:28`

**Historical Context**: Likely created during orchestrator refactor. Old version should have been removed/replaced, not versioned.

**Quote from ETHOS** (line 59-60):
> "Never use banned words or emojis in logs or code. We are building an 'honest' tool..."

**Fix Required**:
1. Check if `handleScanUpdates` (V1) exists anywhere
2. If V1 doesn't exist: rename `handleScanUpdatesV2` to `handleScanUpdates`
3. Update all references in command routing

---

## Original Issues (Already Fixed)

### ✅ Command Status Bug (Priority 1 - FIXED)

**File**: `aggregator-server/internal/api/handlers/agents.go:446`

**Problem**: `MarkCommandSent()` error was not checked. Commands returned to agent but stayed in 'pending' status, causing infinite re-delivery.

**Fix Applied**:
1. Added `GetStuckCommands()` query to recover stuck commands
2. Modified check-in handler to recover commands older than 5 minutes
3. Added proper error handling with [HISTORY] logging
4. Changed source from "web_ui" to "manual" to match DB constraint

**Verification**: Build successful, ready for testing

---

### ✅ Issue #3 - Subsystem Tracking (Priority 2 - IMPLEMENTED)

**Status**: Backend implementation complete, pending database migration

**Files Modified**:
1. Migration created: `022_add_subsystem_to_logs.up.sql`
2. Models updated: `UpdateLog` and `UpdateLogRequest` with `Subsystem` field
3. Backend handlers updated to extract subsystem from action
4. Agent client updated to send subsystem from metadata
5. Query functions added: `GetLogsByAgentAndSubsystem()`, `GetSubsystemStats()`

**Pending**:
1. Run database migration
2. Verify frontend receives subsystem data
3. Test all 7 subsystems independently
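
Extracting a subsystem from an action name can be sketched as follows. This is an assumption-laden illustration rather than the handler's actual logic: `subsystemFromAction` is a hypothetical helper, the `scan_` prefix convention is inferred from action names such as `scan_storage` in this document, and returning an empty string for non-scan actions is a choice made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// scanPrefix is the assumed naming convention for scan actions.
const scanPrefix = "scan_"

// subsystemFromAction derives a subsystem name from a log action such as
// "scan_storage". Actions without the prefix, or with nothing after it,
// yield an empty string (an assumed fallback).
func subsystemFromAction(action string) string {
	if !strings.HasPrefix(action, scanPrefix) {
		return ""
	}
	return strings.TrimPrefix(action, scanPrefix)
}

func main() {
	for _, action := range []string{"scan_docker", "scan_storage", "scan_system", "install"} {
		fmt.Printf("%q -> %q\n", action, subsystemFromAction(action))
	}
}
```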

---

## Complete Fix Sequence

### Phase 1: Critical User-Facing Bugs (MUST DO NOW)
1. ✅ Fix #1: Comment out ReportLog in handleScanStorage (lines 119-123)
2. ✅ Fix #2: Comment out ReportLog in handleScanSystem (lines 207-211)
3. ✅ Fix #3: Comment out collective logging in handleScanUpdatesV2 (lines 44-63)
4. ✅ Fix #4: Register storage-metrics routes
5. ✅ Fix #5: Register metrics routes

### Phase 2: Database & Technical Debt
6. ✅ Fix #6: Run migration 022_add_subsystem_to_logs
7. ✅ Fix #7: Remove AgentUpdates.tsx, rename AgentUpdatesEnhanced.tsx
8. ✅ Fix #8: Remove V2 suffix from handleScanUpdates (if no V1 exists)

### Phase 3: Verification
9. Test storage scan - should appear ONLY on Storage page
10. Test system scan - should appear ONLY on System page
11. Test full scan - should show individual subsystem entries only
12. Verify history shows proper subsystem names

---

## ETHOS Compliance Checklist

For each fix, we must verify:

- [ ] **ETHOS #1**: All errors logged with context, no `/dev/null`
- [ ] **ETHOS #2**: No new unauthenticated endpoints
- [ ] **ETHOS #3**: Fallback paths exist (retry logic, circuit breakers)
- [ ] **ETHOS #4**: Idempotency verified (run 3x safely)
- [ ] **ETHOS #5**: No marketing fluff (no "enhanced", "robust", etc.)
- [ ] **Pre-Integration**: History logging added, security review, tests

---

## Files to Delete/Rename

### Delete These Files:
- `aggregator-web/src/components/AgentUpdates.tsx` (236 lines, old version)

### Rename These Files:
- `aggregator-agent/cmd/agent/subsystem_handlers.go:28` - rename `handleScanUpdatesV2` → `handleScanUpdates`
- `aggregator-web/src/components/AgentUpdatesEnhanced.tsx` → `AgentUpdates.tsx`

### Lines to Comment Out:
- `aggregator-agent/cmd/agent/subsystem_handlers.go:44-63` (collective logging)
- `aggregator-agent/cmd/agent/subsystem_handlers.go:119-123` (ReportLog in storage)
- `aggregator-agent/cmd/agent/subsystem_handlers.go:207-211` (ReportLog in system)

### Routes to Add:
- POST `/api/v1/agents/:id/storage-metrics`
- GET `/api/v1/agents/:id/storage-metrics`
- Verify GET `/api/v1/agents/:id/metrics` exists

---

## Session Documentation Requirements

As per ETHOS.md: **Every session must identify and document**:

1. **New Technical Debt**:
   - Route registration missing (assumed but not implemented)
   - Duplicate frontend files (poor refactoring)
   - V2 naming pattern (poor version control)

2. **Deferred Features**:
   - Frontend subsystem icons and display names
   - Comprehensive testing of all 7 subsystems

3. **Known Issues**:
   - Database migration not applied in test environment
   - Storage/System pages empty due to missing routes

4. **Architecture Decisions**:
   - Decision to keep both collective and individual scan patterns
   - Justification: Different user intents (full audit vs single check)

---

## Conclusion

**Total Issues**: 8 (3 critical, 3 high, 2 medium)
**Fixes Required**: 8 code changes, 3 deletions, 2 renames
**Estimated Time**: 2-3 hours for all fixes and verification
**Status**: Ready for implementation

**Next Action**: Implement Phase 1 fixes (critical user-facing bugs) immediately.

---

**Document Maintained By**: Development Team
**Last Updated**: 2025-12-19
**Session**: Issue #3 Implementation & Command Recovery Fix