RedFlag UI and Critical Fixes - Implementation Plan
Date: 2025-11-10
Version: v0.1.23.4 → v0.1.23.5
Status: Investigation Complete, Implementation Ready
Executive Summary
Based on investigation of the three critical issues identified, here's the complete breakdown of what's happening and what needs to be fixed.
Issue #1: Scan Updates Quirk - INVESTIGATION COMPLETE ✅
Symptoms
- Disk/boot metrics (44% used) appearing as "approve/reject" updates in UI
- Old monolithic logic intercepting new subsystem scanners
Investigation Results
Agent-Side: ✅ CORRECT
- Orchestrator scanners correctly call the right endpoints:
  - Storage Scanner → ReportMetrics() (✅ Correct)
  - System Scanner → ReportMetrics() (✅ Correct)
  - Update Scanners (APT, DNF, Docker, etc.) → ReportUpdates() (✅ Correct)
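The agent-side routing above can be summarized in a small sketch. The function `endpointFor` and the lowercase scanner names are hypothetical and exist only to illustrate the mapping; the real orchestrator may be structured differently.

```go
package main

import "fmt"

// endpointFor returns the name of the report call a scanner should use.
// Scanner names and the returned labels are illustrative assumptions,
// not identifiers taken from the RedFlag codebase.
func endpointFor(scanner string) (string, error) {
	switch scanner {
	case "storage", "system":
		// Metrics-style scanners never produce approve/reject items.
		return "ReportMetrics", nil
	case "apt", "dnf", "docker", "windows", "winget":
		// Package scanners produce update events.
		return "ReportUpdates", nil
	default:
		return "", fmt.Errorf("unknown scanner %q", scanner)
	}
}

func main() {
	for _, s := range []string{"storage", "system", "apt", "docker"} {
		ep, err := endpointFor(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %s()\n", s, ep)
	}
}
```

Returning an error for unknown scanners, rather than defaulting to one endpoint, is a design choice that would make a misrouted scanner visible instead of silently landing in the wrong table.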
Server-Side Handlers: ✅ CORRECT
- ReportUpdates handler (updates.go:67) stores in update_events table
- ReportMetrics handler (metrics.go:31) stores in metrics table
- Both handlers properly separated and functioning
Root Cause Identified:
The old monolithic handleScanUpdates function (main.go:985-1153) still exists in the codebase. While it's not currently registered in the command switch statement (which uses handleScanUpdatesV2 correctly), there are two possibilities:
- Old data in the database from before the subsystem refactor
- Windows service code (service/windows.go) uses old version constant (0.1.16) and may have different logic
Fix Required
Option A - Database Cleanup (Quick Fix):
-- Check for misclassified data
SELECT package_type, COUNT(*) as count
FROM update_events
WHERE package_type IN ('storage', 'system')
GROUP BY package_type;
-- If found, move to metrics table or delete old data
Option B - Code Cleanup (Recommended):
- Delete the old handleScanUpdates function (lines 985-1153 in main.go)
- Update Windows service version constant to match (0.1.23)
- Verify no other references to old function
Priority: Medium (data issue, not functional bug) Risk: Low (cleanup operation)
Issue #2: UI Version Display Missing
Current State
The WebUI shows only the three-part version (0.1.23), not the full four-part version including the final octet (0.1.23.4).
Implementation Needed
File: aggregator-web/src/pages/Dashboard.tsx
Agent Card View - Add version display:
// Add to agent card display
<AgentCard>
...
<div className="agent-version">
<span className="label">Version:</span>
<span className="value">{agent.current_version || 'Unknown'}</span>
</div>
</AgentCard>
Agent Details View - Add full version string:
// Add to details panel
<AgentDetails>
...
<DetailRow>
<Label>Agent Version</Label>
<Value>{agent.current_version || agent.config_version || 'Unknown'}</Value>
</DetailRow>
</AgentDetails>
API Data Available:
- The backend already populates the current_version field in the API response
- May need to ensure the full version string (with octet) is stored and returned
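A quick way to verify the "full version string" requirement on the backend is a format check like the one below. This is a minimal sketch: `isFullVersion` and `displayVersion` are hypothetical helpers, and the four-part numeric format is inferred from the 0.1.23.4 examples in this plan.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// isFullVersion reports whether v has exactly four numeric, dot-separated
// parts (for example "0.1.23.4"). The four-part rule is an assumption
// based on the version examples in this document.
func isFullVersion(v string) bool {
	parts := strings.Split(v, ".")
	if len(parts) != 4 {
		return false
	}
	for _, p := range parts {
		if _, err := strconv.ParseUint(p, 10, 32); err != nil {
			return false
		}
	}
	return true
}

// displayVersion mirrors the UI fallback chain:
// current_version, then config_version, then "Unknown".
func displayVersion(current, config string) string {
	if current != "" {
		return current
	}
	if config != "" {
		return config
	}
	return "Unknown"
}

func main() {
	for _, v := range []string{"0.1.23.4", "0.1.23", "0.1.23.x"} {
		fmt.Printf("%-10s full=%v\n", v, isFullVersion(v))
	}
	fmt.Println(displayVersion("", "0.1.23.4"))
}
```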
Tasks
- Verify backend returns full version string with octet
- Update Agent Card to display version
- Update Agent Details page to display version prominently
- Consider adding version to agent list table view
Priority: Low (cosmetic, but important for debugging) Risk: Very Low (UI only)
Issue #3: Same-Version Installation Logic
Current Logic
// In update handler (pseudo-code)
if version < current {
    return errors.New("downgrade not allowed")
}
// What about version == current? ❓
Use Cases
Scenario A: Agent Reinstall
- Agent needs to reinstall same version (config corruption, binary issues)
- Should allow: version == current
Scenario B: Accidental Update Click
- User clicks update but agent already on that version
- Should we allow, block, or warn?
Decision Options
Option A: Allow Same-Version (Recommended)
- Supports reinstall scenario
- No security risk (same version)
- Simple implementation: keep the downgrade check strict (version < current); changing it to version <= current would reject reinstalls
- Prevents unnecessary support tickets
Option B: Block Same-Version
- Prevents no-op updates
- May frustrate users trying to reinstall
- Requires workaround documentation
Option C: Warning + Allow
if version == current {
    log.Printf("Warning: Agent %s already on version %s, proceeding with reinstall", agentID, version)
}
if version < current {
    return errors.New("downgrade not allowed")
}
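The three options differ only in how the equal case is handled, which the following sketch makes explicit. It assumes the comparison result is already available as an integer (negative, zero, or positive); the `Policy` type, its constants, and `checkUpdate` are hypothetical names introduced for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Policy selects how a same-version update request is treated.
type Policy int

const (
	AllowSame Policy = iota // Option A: proceed silently
	BlockSame               // Option B: reject
	WarnSame                // Option C: proceed, but ask the caller to log a warning
)

var (
	ErrDowngrade   = errors.New("downgrade not allowed")
	ErrSameVersion = errors.New("agent already on requested version")
)

// checkUpdate takes cmp, the result of comparing the target version with
// the current one (<0 older, 0 same, >0 newer), and applies the policy.
// warn is true when the caller should log a same-version reinstall.
func checkUpdate(cmp int, p Policy) (warn bool, err error) {
	switch {
	case cmp < 0:
		return false, ErrDowngrade // blocked under every option
	case cmp == 0 && p == BlockSame:
		return false, ErrSameVersion
	case cmp == 0 && p == WarnSame:
		return true, nil
	default:
		return false, nil
	}
}

func main() {
	for _, p := range []Policy{AllowSame, BlockSame, WarnSame} {
		warn, err := checkUpdate(0, p)
		fmt.Printf("policy=%d warn=%v err=%v\n", p, warn, err)
	}
}
```

Downgrades are rejected before the policy is consulted, so switching between options cannot accidentally open a downgrade path.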
Implementation Location
Agent-Side:
File: aggregator-agent/cmd/agent/subsystem_handlers.go
Function: handleUpdateAgent() (lines 346-536)
Proposed version check (not present yet; needs to be added):
// Somewhere in the update logic
currentVersion := cfg.AgentVersion
targetVersion := params["version"]
cmp := compareVersions(targetVersion, currentVersion)
if cmp < 0 {
    // Downgrade: reject
}
if cmp == 0 {
    // Same version: log and proceed as a reinstall
}
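Plain string comparison is not safe for dotted versions ("0.1.9" sorts after "0.1.10" lexicographically), so `compareVersions` needs to compare numeric components. The following is one possible sketch, not the project's implementation: it assumes purely numeric dot-separated parts, treats missing trailing parts as 0, and (as a simplification) treats unparsable parts as 0 as well, which a production version would probably reject instead.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions returns -1, 0, or 1 when a is older than, equal to,
// or newer than b. Components are compared numerically, left to right.
func compareVersions(a, b string) int {
	pa, pb := strings.Split(a, "."), strings.Split(b, ".")
	n := len(pa)
	if len(pb) > n {
		n = len(pb)
	}
	for i := 0; i < n; i++ {
		x, y := component(pa, i), component(pb, i)
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

// component returns the i-th part as an integer. Missing or non-numeric
// parts count as 0 (a simplifying assumption for this sketch).
func component(parts []string, i int) int {
	if i >= len(parts) {
		return 0
	}
	n, err := strconv.Atoi(parts[i])
	if err != nil {
		return 0
	}
	return n
}

func main() {
	fmt.Println(compareVersions("0.1.23.5", "0.1.23.4"))
	fmt.Println(compareVersions("0.1.23.4", "0.1.23.4"))
	fmt.Println(compareVersions("0.1.9", "0.1.10"))
}
```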
Server-Side:
File: aggregator-server/internal/api/handlers/agent_build.go
Check version constraints before sending update command.
Recommendation
Option A - Allow same-version installations
Reasons:
- Reinstall is a valid use case
- No security implications
- Easiest to implement and document
- User expectation: the "Update" button should work even if the agent is already on that version
Tasks
- Define version comparison logic
- Add check in agent update handler (allow ==, block <)
- Add logging for same-version reinstalls
- Update UI to show appropriate messages
Priority: Low (edge case) Risk: Very Low (no security impact)
Phase 2: Middleware Version Upgrade Fix
Current Status
- Phase 1 (Build Orchestrator): 90% complete
- Phase 2 (Middleware): Starting
Known Issues
- Version Upgrade Catch-22: Middleware blocks updates due to version check
- Update-Aware Middleware: Need to detect upgrading agents and relax constraints
- Command Processing: Need complete implementation
Implementation Plan
1. Update-Aware Middleware
- Detect when agent is in update process
- Relax machine ID binding during upgrade
- Restore binding after completion
2. Same-Version Logic
- Implement decision from Issue #3 above
- Update agent and server validation
3. End-to-End Testing
- Test flow: 0.1.23.4 → 0.1.23.5
- Verify signature verification
- Validate subsystem persistence
- Confirm agent continues operations post-update
Tasks
- Implement middleware version upgrade detection
- Add nonce validation for replay protection
- Implement same-version installation logic
- Test complete update cycle
- Verify signature verification
Priority: High (blocks Phase 2 completion) Risk: Medium (need to ensure security not compromised)
Build Orchestrator Status (Phase 1 - 90% Complete)
Completed ✅
- Signed binary generation (build_orchestrator.go)
- Ed25519 signing integration (SignFile())
- Generic binary signing (Option 2 approach)
- Download handler serves signed binaries
- Config separation (config.json not embedded)
Remaining ⏳
- Agent update flow testing (0.1.23.4 → 0.1.23.5)
- End-to-end verification
- Signature verification on agent side (placeholder in place)
Ready for Cleanup
The following dead code should be removed:
- TLSConfig struct in config.go (lines 23-29)
- Docker artifact generation in agent_builder.go
- Old config fields: CertFile, KeyFile, CAFile
Phase 3: Security Hardening
Tasks
- Remove JWT secret logging (debug mode only)
- Implement per-server JWT secrets (not shared)
- Clean dead code (TLSConfig, Docker fields)
- Consider kernel keyring config protection
Token Security Decision
Status: Sliding window refresh tokens are adequate
- Machine ID binding prevents cross-machine token reuse
- Token theft requires filesystem access (already compromised)
- True rotation deferred to v0.3.0
Priority: Medium Risk: Low (current implementation adequate)
Testing Checklist
Agent Update Flow Test
- Bump version to 0.1.23.5
- Build signed binary for 0.1.23.5
- Test update from 0.1.23.4 → 0.1.23.5
- Verify signature verification works
- Confirm agent restarts successfully
- Validate subsystems still enabled post-update
- Verify metrics still reporting correctly
- Check update_events table for corruption
UI Display Test
- Version shows on agent card
- Version shows on agent details page
- Version updates after agent update
Subsystem Tests
- Storage scan reports to metrics table
- System scan reports to metrics table
- APT scan reports to update_events table
- Docker scan reports to update_events table
Database Queries for Investigation
Check for Misclassified Data
-- Query 1: Check for storage/system data in update_events
SELECT package_type, COUNT(*) as count
FROM update_events
WHERE package_type IN ('storage', 'system', 'disk', 'boot')
GROUP BY package_type;
-- Query 2: Check metrics table for package update data
SELECT package_type, COUNT(*) as count
FROM metrics
WHERE package_type IN ('apt', 'dnf', 'docker', 'windows', 'winget')
GROUP BY package_type;
-- Query 3: Check agent_subsystems configuration
SELECT name, enabled, auto_run
FROM agent_subsystems
WHERE name IN ('storage', 'system', 'updates');
Cleanup Queries (If Needed)
-- Move or delete misclassified data
-- BACKUP FIRST!
-- Check how many records would be affected
SELECT COUNT(*) FROM update_events
WHERE package_type IN ('storage', 'system')
AND created_at < NOW() - INTERVAL '7 days';
-- Delete (or move to metrics table)
DELETE FROM update_events
WHERE package_type IN ('storage', 'system')
AND created_at < NOW() - INTERVAL '7 days';
Code Locations Reference
Agent-Side
- aggregator-agent/cmd/agent/main.go - Command routing (lines 864-882)
- aggregator-agent/cmd/agent/subsystem_handlers.go - Scan handlers
- aggregator-agent/cmd/agent/main.go:985 - OLD handleScanUpdates (delete)
- aggregator-agent/internal/service/windows.go:32 - Old version constant (update)
API Handlers
- aggregator-server/internal/api/handlers/updates.go:67 - ReportUpdates
- aggregator-server/internal/api/handlers/metrics.go:31 - ReportMetrics
- aggregator-server/internal/api/handlers/agent_build.go - Update logic
WebUI
- aggregator-web/src/pages/Dashboard.tsx - Agent card and details
- aggregator-web/src/pages/settings/AgentManagement.tsx - Version display
Database Tables
- update_events - Package updates (apt, dnf, docker, etc.)
- metrics - System metrics (storage, system, cpu, memory)
- agent_subsystems - Subsystem configuration
Recommended Implementation Order
Week 1 (Critical Fixes)
- Database Investigation - Run queries to check for misclassified data
- UI Version Display - Add version to agent cards and details (easy win)
- Same-Version Logic Decision - Make decision and implement
- Test Update Flow - 0.1.23.4 → 0.1.23.5
Week 2 (Phase 2 Completion)
- Middleware Version Upgrade - Implement detection logic
- Security Hardening - JWT logging, per-server secrets
- Code Cleanup - Remove old handleScanUpdates function
- Documentation - Update all docs for v0.2.0
Week 3 (Polish)
- Token Rotation (Nice-to-have) - Implement true rotation
- Enhanced UI - Improve metrics display
- Testing - Full integration test suite
Risk Assessment
| Issue | Priority | Risk | Effort |
|---|---|---|---|
| Scan Updates Quirk | Medium | Low | 2 hours |
| UI Version Display | Low | Very Low | 1 hour |
| Same-Version Logic | Low | Very Low | 1 hour |
| Middleware Upgrade | High | Medium | 4 hours |
| Agent Update Test | High | Medium | 3 hours |
| Security Hardening | Medium | Low | 4 hours |
Decision Log
Decision 1: Same-Version Installations
Status: Pending
Options: Allow / Block / Warn
Recommendation: Allow (supports reinstall use case)
Decision 2: Token Rotation Priority
Status: Defer to v0.3.0
Rationale: Machine ID binding provides adequate security
Decision: Defer - sliding window sufficient
Decision 3: UI Version Display Location
Status: Pending
Options: Card only / Details only / Both
Recommendation: Both for maximum visibility
Decision 4: Scan Updates Fix Approach
Status: Pending
Options: Database cleanup / Code cleanup
Recommendation: Both - cleanup old data AND remove dead code
Next Steps
Immediate (Today)
- ☐ Check database for misclassified data using queries above
- ☐ Make decisions on Same-Version logic (Allow/Block)
- ☐ Confirm token rotation deferral to v0.3.0 (see Decision 2)
- ☐ Run test update flow
This Week
- ☐ Implement UI version display
- ☐ Implement same-version installation logic
- ☐ Complete middleware version upgrade
- ☐ Remove JWT secret logging
Next Week
- ☐ Full integration testing
- ☐ Update documentation
- ☐ Prepare v0.2.0 release
Notes
Build Orchestrator Misalignment - RESOLVED ✅
- Originally generating Docker configs, installer expecting native binaries
- Fixed: Now generates signed native binaries per version/platform
- Signed packages stored in database
- Download endpoint serves correct binaries
Version Upgrade Catch-22 - IN PROGRESS ⚠️
- Middleware blocks updates due to machine ID binding
- Need update-aware middleware to detect upgrading agents
- Nonce validation needed for replay protection
Token Security - ADEQUATE ✅
- Sliding window refresh tokens sufficient
- Machine ID binding prevents cross-machine token reuse
- True rotation nice-to-have but not critical for v0.2.0
Document Version: 1.0
Last Updated: 2025-11-10
Next Review: After critical fixes completed
Owner: @Fimeg
Collaborator: Kimi-k2 (Infrastructure Analysis)