# RedFlag v0.1.26.0 - Technical Issues and Technical Debt Audit
**Document Version**: 1.0
**Date**: 2025-12-19
**Scope**: Post-Issue#3 Implementation Audit
**Status**: ACTIVE ISSUES requiring immediate resolution
---
## Executive Summary
During the implementation of Issue #3 (subsystem tracking) and the command recovery fix, we identified **critical architectural issues** that violate ETHOS principles and create user-facing bugs. This document catalogs all issues, their root causes, and required fixes.
**Issues by Severity**:
- 🔴 **CRITICAL**: 3 issues (user-facing bugs, data corruption risk)
- 🟡 **HIGH**: 3 issues (technical debt, maintenance burden)
- 🟢 **MEDIUM**: 2 issues (code quality, naming violations)
---
## 🔴 CRITICAL ISSUES (User-Facing)
### 1. Storage Scans Appearing as Package Updates
**Severity**: 🔴 CRITICAL
**User Impact**: HIGH
**ETHOS Violations**: #1 (Errors are History - data in wrong place), #5 (No BS - misleading UI)
**Problem**: Storage scan results (`handleScanStorage`) appear on the Updates page alongside package updates. Users see disk usage metrics (partition sizes, mount points) mixed with apt/dnf package updates.
**Root Cause**: `handleScanStorage` in `aggregator-agent/cmd/agent/subsystem_handlers.go` calls `ReportLog()`, which stores entries in the `update_logs` table, the same table used for package updates.
**Location**:
- Agent: `aggregator-agent/cmd/agent/subsystem_handlers.go:119-123`
```go
// Report the scan log (WRONG - this goes to update_logs table)
if err := reportLogWithAck(apiClient, cfg, ackTracker, logReport); err != nil {
	log.Printf("Failed to report scan log: %v\n", err)
}
```
**Correct Behavior**: Storage scans should report ONLY to the `/api/v1/agents/:id/storage-metrics` endpoint, which stores results in the dedicated `storage_metrics` table.
**Fix Required**:
1. Comment out/remove the `ReportLog` call in `handleScanStorage` (lines 119-123)
2. Verify `ReportStorageMetrics` call (lines 162-164) is working
3. Register missing route for GET `/api/v1/agents/:id/storage-metrics` if not already registered
**Verification Steps**:
- Trigger storage scan from UI
- Verify NO new entries appear on Updates page
- Verify data appears on Storage page
- Check `storage_metrics` table has new rows
---
### 2. System Scans Appearing as Package Updates
**Severity**: 🔴 CRITICAL
**User Impact**: HIGH
**ETHOS Violations**: #1, #5
**Problem**: System scan results (CPU, memory, processes, uptime) appear on the Updates page as LOW severity package updates.
**User Report**: "On the Updates tab, the top 6-7 'updates' are system specs, not system packages. They are HD details or processes, or partition sizes."
**Root Cause**: `handleScanSystem` also calls `ReportLog()`, which stores its results in the `update_logs` table.
**Location**:
- Agent: `aggregator-agent/cmd/agent/subsystem_handlers.go:207-211`
```go
// Report the scan log (WRONG - this goes to update_logs table)
if err := reportLogWithAck(apiClient, cfg, ackTracker, logReport); err != nil {
	log.Printf("Failed to report scan log: %v\n", err)
}
```
**Correct Behavior**: System scans should ONLY report to `/api/v1/agents/:id/metrics` endpoint.
**Fix Required**:
1. Comment out/remove the `ReportLog` call in `handleScanSystem` (lines 207-211)
2. Verify `ReportMetrics` call is working
3. Register missing route for GET endpoint if needed
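The routing rule behind Issues #1 and #2 can be sketched as a lookup from scan action to its one dedicated endpoint. This is a minimal sketch, not the agent's code: `metricsEndpoint` and the `scanSuffixes` table are hypothetical names, the action names `scan_storage` and `scan_system` are assumed to match the history entry names used elsewhere in this document, and the endpoint paths are the ones named in the two issues.

```go
package main

import (
	"errors"
	"fmt"
)

// scanSuffixes maps a scan action to the path suffix of its dedicated endpoint.
// Hypothetical table: only the two endpoints named in Issues #1 and #2 are listed.
var scanSuffixes = map[string]string{
	"scan_storage": "storage-metrics",
	"scan_system":  "metrics",
}

var errNoDedicatedEndpoint = errors.New("no dedicated metrics endpoint")

// metricsEndpoint returns the single endpoint a metrics-style scan should
// report to. An error means the action has no dedicated endpoint in this sketch.
func metricsEndpoint(action, agentID string) (string, error) {
	suffix, ok := scanSuffixes[action]
	if !ok {
		return "", fmt.Errorf("%w: %q", errNoDedicatedEndpoint, action)
	}
	return "/api/v1/agents/" + agentID + "/" + suffix, nil
}

func main() {
	for _, action := range []string{"scan_storage", "scan_system", "scan_updates"} {
		endpoint, err := metricsEndpoint(action, "agent-123")
		if err != nil {
			fmt.Println("skip:", err)
			continue
		}
		fmt.Println(action, "->", endpoint)
	}
}
```

The point of the single lookup is that a storage or system scan has exactly one destination, so there is no second code path that could write the same result into `update_logs`.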
---
### 3. Duplicate "Scan All" Entries in History
**Severity**: 🔴 CRITICAL
**User Impact**: MEDIUM
**ETHOS Violations**: #1 (duplicate history entries), #4 (not idempotent)
**Problem**: When triggering a full system scan (`handleScanUpdatesV2`), users see TWO entries:
- One generic "scan updates" collective entry
- Plus individual entries for each subsystem
**Root Cause**: `handleScanUpdatesV2` creates a collective log (lines 44-57) while orchestrator also logs individual scan results via individual handlers.
**Location**:
- Agent: `aggregator-agent/cmd/agent/subsystem_handlers.go:44-63`
```go
// Create scan log entry with subsystem metadata (COLLECTIVE)
logReport := client.LogReport{
	CommandID: commandID,
	Action:    "scan_updates",
	Result:    map[bool]string{true: "success", false: "failure"}[exitCode == 0],
	// ...
}
// Report the scan log
if err := reportLogWithAck(apiClient, cfg, ackTracker, logReport); err != nil {
	log.Printf("Failed to report scan log: %v\n", err)
}
```
**Fix Required**:
1. Comment out lines 44-63 (remove collective logging from handleScanUpdatesV2)
2. Keep individual subsystem logging (lines 60, 121, 209, 291)
**Verification**: After fix, only individual subsystem entries should appear (scan_docker, scan_storage, scan_system, etc.)
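The verification above could be automated with a small check over history entries. This is a rough sketch under stated assumptions: `LogEntry` is a hypothetical pared-down record, and it assumes the collective entry and the per-subsystem entries share the same command ID, which this document does not confirm.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// LogEntry is a hypothetical, pared-down history record.
type LogEntry struct {
	CommandID string
	Action    string
}

// duplicatedCommands returns the command IDs that have both a collective
// "scan_updates" entry and at least one per-subsystem "scan_*" entry.
func duplicatedCommands(entries []LogEntry) []string {
	collective := map[string]bool{}
	individual := map[string]bool{}
	for _, e := range entries {
		switch {
		case e.Action == "scan_updates":
			collective[e.CommandID] = true
		case strings.HasPrefix(e.Action, "scan_"):
			individual[e.CommandID] = true
		}
	}
	var ids []string
	for id := range collective {
		if individual[id] {
			ids = append(ids, id)
		}
	}
	sort.Strings(ids) // map iteration order is random, so sort for stable output
	return ids
}

func main() {
	history := []LogEntry{
		{CommandID: "c1", Action: "scan_updates"},
		{CommandID: "c1", Action: "scan_storage"},
		{CommandID: "c2", Action: "scan_docker"},
	}
	fmt.Println(duplicatedCommands(history)) // prints [c1]
}
```

After the fix, the function should return an empty slice for any freshly triggered full scan.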
---
## 🟡 HIGH PRIORITY ISSUES (Technical Debt)
### 4. Missing Route Registration for Storage Metrics Endpoint
**Severity**: 🟡 HIGH
**Impact**: Storage page empty
**ETHOS Violations**: #3 (Assume Failure), #4 (Idempotency - retry won't work without route)
**Problem**: Backend has handler functions but routes are not registered. Agent cannot report storage metrics.
**Location**:
- Handler exists: `aggregator-server/internal/api/handlers/storage_metrics.go:26,75`
- **Missing**: Route registration in router setup
**Handlers Without Routes**:
```go
// Exists but not wired to HTTP routes:
func (h *StorageMetricsHandler) ReportStorageMetrics(c *gin.Context) // POST
func (h *StorageMetricsHandler) GetStorageMetrics(c *gin.Context) // GET
```
**Fix Required**:
Find route registration file (likely `cmd/server/main.go` or `internal/api/server.go`) and add:
```go
agentGroup := router.Group("/api/v1/agents", middleware...)
agentGroup.POST("/:id/storage-metrics", storageMetricsHandler.ReportStorageMetrics)
agentGroup.GET("/:id/storage-metrics", storageMetricsHandler.GetStorageMetrics)
```
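A quick way to confirm the routes are wired is to probe the router and look for 404s. The sketch below uses only the standard library; `route`, `missingRoutes`, and `demoRouter` are hypothetical names, and the `http.ServeMux` stands in for the real router. Since `*gin.Engine` implements `http.Handler`, the same probe should work against the actual router, with the caveat that a handler which itself returns 404 (for example, for an unknown agent ID) would be reported as missing.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// route is a method/path pair that is expected to be registered.
type route struct {
	Method string
	Path   string
}

// missingRoutes sends an empty request to each route and returns those that
// answer 404 Not Found, which is how an unregistered path shows up.
func missingRoutes(h http.Handler, routes []route) []route {
	var missing []route
	for _, r := range routes {
		req := httptest.NewRequest(r.Method, r.Path, nil)
		rec := httptest.NewRecorder()
		h.ServeHTTP(rec, req)
		if rec.Code == http.StatusNotFound {
			missing = append(missing, r)
		}
	}
	return missing
}

// demoRouter is a stand-in router with only the metrics path registered.
func demoRouter() http.Handler {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/v1/agents/agent-123/metrics", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

func main() {
	expected := []route{
		{Method: http.MethodPost, Path: "/api/v1/agents/agent-123/storage-metrics"},
		{Method: http.MethodGet, Path: "/api/v1/agents/agent-123/storage-metrics"},
		{Method: http.MethodGet, Path: "/api/v1/agents/agent-123/metrics"},
	}
	for _, r := range missingRoutes(demoRouter(), expected) {
		fmt.Println("missing:", r.Method, r.Path)
	}
}
```

Responses such as 401 from auth middleware still count as registered here, so the probe does not need valid credentials.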
---
### 5. Route Registration for Metrics Endpoint
**Severity**: 🟡 HIGH
**Impact**: System page potentially empty
**Problem**: As with #4, the `/api/v1/agents/:id/metrics` endpoint may not be registered.
**Location**: Need to verify routes exist for system metrics reporting.
---
### 6. Database Migration Not Applied
**Severity**: 🟡 HIGH
**Impact**: Subsystem column doesn't exist, subsystem queries will fail
**Problem**: Migration `022_add_subsystem_to_logs.up.sql` was created but has not been run. Server code references the `subsystem` column, which does not exist yet.
**Files**:
- Created: `aggregator-server/internal/database/migrations/022_add_subsystem_to_logs.up.sql`
- Referenced: `aggregator-server/internal/models/update.go:61`
- Referenced: `aggregator-server/internal/api/handlers/updates.go:226-230`
**Verification**:
```sql
\d update_logs
-- Should show: subsystem | varchar(50) |
```
**Fix Required**:
```bash
cd aggregator-server
go run cmd/server/main.go -migrate
```
---
## 🟢 MEDIUM PRIORITY ISSUES (Code Quality)
### 7. Frontend File Duplication - Marketing Fluff Naming
**Severity**: 🟢 MEDIUM
**ETHOS Violations**: #5 (No Marketing Fluff - "enhanced" is banned), Technical Debt
**Problem**: Duplicate files with marketing fluff naming.
**Files**:
- `aggregator-web/src/components/AgentUpdates.tsx` (236 lines - old/simple version)
- `aggregator-web/src/components/AgentUpdatesEnhanced.tsx` (567 lines - current version)
- `aggregator-web/src/components/AgentUpdate.tsx` (Agent binary updater - legitimate)
**ETHOS Violation**:
From ETHOS.md line 67: **Banned Words**: enhanced, enterprise-ready, seamless, robust, production-ready, revolutionary, etc.
**Quote from ETHOS**:
> "We are building an 'honest' tool for technical users, not pitching a product. Fluff hides meaning and creates enterprise BS."
**Fix Required**:
```bash
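# Run from the web project root so the src/ paths below resolve
cd aggregator-web
# Remove old duplicate
rm src/components/AgentUpdates.tsx
# Rename to remove marketing fluff
mv src/components/AgentUpdatesEnhanced.tsx src/components/AgentUpdates.tsx
# List every remaining reference (imports and the component name itself)
grep -r "AgentUpdatesEnhanced" src/ --include="*.tsx" --include="*.ts"
# Replace each match with "AgentUpdates"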
```
**Verification**: Application builds and runs with renamed component.
---
### 8. Backend V2 Naming Pattern - Poor Refactoring
**Severity**: 🟢 MEDIUM
**ETHOS Violations**: #5 (No BS), Technical Debt
**Problem**: The `V2` suffix on `handleScanUpdatesV2` suggests either that a V1 still exists or that an earlier refactor left a versioned name behind.
**Location**: `aggregator-agent/cmd/agent/subsystem_handlers.go:28`
**Historical Context**: Likely created during orchestrator refactor. Old version should have been removed/replaced, not versioned.
**Quote from ETHOS** (lines 59-60):
> "Never use banned words or emojis in logs or code. We are building an 'honest' tool..."
**Fix Required**:
1. Check if `handleScanUpdates` (V1) exists anywhere
2. If V1 doesn't exist: rename `handleScanUpdatesV2` to `handleScanUpdates`
3. Update all references in command routing
---
## Original Issues (Already Fixed)
### ✅ Command Status Bug (Priority 1 - FIXED)
**File**: `aggregator-server/internal/api/handlers/agents.go:446`
**Problem**: `MarkCommandSent()` error was not checked. Commands returned to agent but stayed in 'pending' status, causing infinite re-delivery.
**Fix Applied**:
1. Added `GetStuckCommands()` query to recover stuck commands
2. Modified check-in handler to recover commands older than 5 minutes
3. Added proper error handling with [HISTORY] logging
4. Changed source from "web_ui" to "manual" to match DB constraint
**Verification**: Build successful, ready for testing
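The recovery step can be illustrated with an in-memory filter. This is a sketch only: the real `GetStuckCommands()` is a database query whose exact criteria are not shown in this document, so the `Command` struct, the choice of `pending` as the stuck status, and the helper names are all assumptions. Only the 5-minute cutoff comes from the fix description.

```go
package main

import (
	"fmt"
	"time"
)

// Command is a hypothetical, pared-down command record.
type Command struct {
	ID        string
	Status    string
	CreatedAt time.Time
}

// stuckCommands returns commands still in "pending" that are older than maxAge.
// Assumption: "pending" plus age is what marks a command as stuck.
func stuckCommands(cmds []Command, now time.Time, maxAge time.Duration) []Command {
	var stuck []Command
	for _, c := range cmds {
		if c.Status == "pending" && now.Sub(c.CreatedAt) > maxAge {
			stuck = append(stuck, c)
		}
	}
	return stuck
}

// demoStuckIDs runs the filter over fixed sample data and returns the stuck IDs.
func demoStuckIDs() []string {
	now := time.Date(2025, 12, 19, 12, 0, 0, 0, time.UTC)
	cmds := []Command{
		{ID: "cmd-old", Status: "pending", CreatedAt: now.Add(-10 * time.Minute)},
		{ID: "cmd-new", Status: "pending", CreatedAt: now.Add(-1 * time.Minute)},
		{ID: "cmd-done", Status: "completed", CreatedAt: now.Add(-30 * time.Minute)},
	}
	var ids []string
	for _, c := range stuckCommands(cmds, now, 5*time.Minute) {
		ids = append(ids, c.ID)
	}
	return ids
}

func main() {
	fmt.Println(demoStuckIDs()) // prints [cmd-old]
}
```

Passing `now` in as a parameter keeps the cutoff logic deterministic and easy to test.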
---
### ✅ Issue #3 - Subsystem Tracking (Priority 2 - IMPLEMENTED)
**Status**: Backend implementation complete, pending database migration
**Files Modified**:
1. Migration created: `022_add_subsystem_to_logs.up.sql`
2. Models updated: `UpdateLog` and `UpdateLogRequest` with `Subsystem` field
3. Backend handlers updated to extract subsystem from action
4. Agent client updated to send subsystem from metadata
5. Query functions added: `GetLogsByAgentAndSubsystem()`, `GetSubsystemStats()`
**Pending**:
1. Run database migration
2. Verify frontend receives subsystem data
3. Test all 7 subsystems independently
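Extracting the subsystem from the action (item 3 under Files Modified) might look like the following. This is a minimal sketch, assuming actions follow a `scan_<subsystem>` naming pattern as in `scan_docker`, `scan_storage`, and `scan_system`; `subsystemFromAction` is a hypothetical name, not the handler's actual function.

```go
package main

import (
	"fmt"
	"strings"
)

// subsystemFromAction derives the subsystem name from an action such as
// "scan_storage". It returns "" when the action does not follow the
// assumed "scan_<subsystem>" pattern.
func subsystemFromAction(action string) string {
	const prefix = "scan_"
	if !strings.HasPrefix(action, prefix) || len(action) == len(prefix) {
		return ""
	}
	return strings.TrimPrefix(action, prefix)
}

func main() {
	for _, action := range []string{"scan_storage", "scan_system", "scan_docker", "install"} {
		fmt.Printf("%q -> %q\n", action, subsystemFromAction(action))
	}
}
```

Returning an empty string for non-scan actions leaves it to the caller to decide whether the `subsystem` column stays NULL or gets a default.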
---
## Complete Fix Sequence
### Phase 1: Critical User-Facing Bugs (MUST DO NOW)
1. ✅ Fix #1: Comment out ReportLog in handleScanStorage (lines 119-123)
2. ✅ Fix #2: Comment out ReportLog in handleScanSystem (lines 207-211)
3. ✅ Fix #3: Comment out collective logging in handleScanUpdatesV2 (lines 44-63)
4. ✅ Fix #4: Register storage-metrics routes
5. ✅ Fix #5: Register metrics routes
### Phase 2: Database & Technical Debt
6. ✅ Fix #6: Run migration 022_add_subsystem_to_logs
7. ✅ Fix #7: Remove AgentUpdates.tsx, rename AgentUpdatesEnhanced.tsx
8. ✅ Fix #8: Remove V2 suffix from handleScanUpdatesV2 (if no V1 exists)
### Phase 3: Verification
9. Test storage scan - should appear ONLY on Storage page
10. Test system scan - should appear ONLY on System page
11. Test full scan - should show individual subsystem entries only
12. Verify history shows proper subsystem names
---
## ETHOS Compliance Checklist
For each fix, we must verify:
- [ ] **ETHOS #1**: All errors logged with context, no `/dev/null`
- [ ] **ETHOS #2**: No new unauthenticated endpoints
- [ ] **ETHOS #3**: Fallback paths exist (retry logic, circuit breakers)
- [ ] **ETHOS #4**: Idempotency verified (run 3x safely)
- [ ] **ETHOS #5**: No marketing fluff (no "enhanced", "robust", etc.)
- [ ] **Pre-Integration**: History logging added, security review, tests
---
## Files to Delete/Rename
### Delete These Files:
- `aggregator-web/src/components/AgentUpdates.tsx` (236 lines, old version)
### Rename These Files:
- `aggregator-agent/cmd/agent/subsystem_handlers.go:28` - rename `handleScanUpdatesV2` → `handleScanUpdates`
- `aggregator-web/src/components/AgentUpdatesEnhanced.tsx` → `AgentUpdates.tsx`
### Lines to Comment Out:
- `aggregator-agent/cmd/agent/subsystem_handlers.go:44-63` (collective logging)
- `aggregator-agent/cmd/agent/subsystem_handlers.go:119-123` (ReportLog in storage)
- `aggregator-agent/cmd/agent/subsystem_handlers.go:207-211` (ReportLog in system)
### Routes to Add:
- POST `/api/v1/agents/:id/storage-metrics`
- GET `/api/v1/agents/:id/storage-metrics`
- Verify GET `/api/v1/agents/:id/metrics` exists
---
## Session Documentation Requirements
As per ETHOS.md: **Every session must identify and document**:
1. **New Technical Debt**:
   - Route registration missing (assumed but not implemented)
   - Duplicate frontend files (poor refactoring)
   - V2 naming pattern (poor version control)
2. **Deferred Features**:
   - Frontend subsystem icons and display names
   - Comprehensive testing of all 7 subsystems
3. **Known Issues**:
   - Database migration not applied in test environment
   - Storage/System pages empty due to missing routes
4. **Architecture Decisions**:
   - Decision to keep both collective and individual scan patterns (the duplicate collective log entry is still removed per Issue #3)
   - Justification: Different user intents (full audit vs single check)
---
## Conclusion
**Total Issues**: 8 (3 critical, 3 high, 2 medium)
**Fixes Required**: 8 code changes, 3 deletions, 2 renames
**Estimated Time**: 2-3 hours for all fixes and verification
**Status**: Ready for implementation
**Next Action**: Implement Phase 1 fixes (critical user-facing bugs) immediately.
---
**Document Maintained By**: Development Team
**Last Updated**: 2025-12-19
**Session**: Issue #3 Implementation & Command Recovery Fix