# RedFlag UI and Critical Fixes - Implementation Plan

**Date:** 2025-11-10
**Version:** v0.1.23.4 → v0.1.23.5
**Status:** Investigation Complete, Implementation Ready

---

## Executive Summary

Three critical issues were investigated. This plan describes what is happening in each case and what needs to be fixed.

---

## Issue #1: Scan Updates Quirk - INVESTIGATION COMPLETE ✅

### Symptoms
- Disk/boot metrics (44% used) appearing as "approve/reject" updates in UI
- Old monolithic logic intercepting new subsystem scanners

### Investigation Results

**Agent-Side**: ✅ CORRECT
- Orchestrator scanners correctly call the right endpoints:
  - **Storage Scanner** → `ReportMetrics()` (✅ Correct)
  - **System Scanner** → `ReportMetrics()` (✅ Correct)
  - **Update Scanners** (APT, DNF, Docker, etc.) → `ReportUpdates()` (✅ Correct)

**Server-Side Handlers**: ✅ CORRECT
- `ReportUpdates` handler (updates.go:67) stores in `update_events` table
- `ReportMetrics` handler (metrics.go:31) stores in `metrics` table
- Both handlers properly separated and functioning

**Root Cause Identified**:
The old monolithic `handleScanUpdates` function (main.go:985-1153) still exists in the codebase. It is not registered in the command switch statement, which correctly uses `handleScanUpdatesV2`, so it is not on the live code path. That leaves two possible sources for the misclassified entries:

1. **Old data** in the database from before the subsystem refactor
2. **Windows service code** (service/windows.go) still uses an old version constant (0.1.16) and may contain different logic

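Whichever of the two explanations holds, a defensive check in the update path would keep metric-style items out of the approve/reject queue. The following is a minimal sketch, assuming the item type is carried as a plain string like the `package_type` column. `isMetricType` and `partitionItems` are hypothetical helpers for illustration, not existing RedFlag functions, and the list of metric types is taken from the investigation queries in this plan.

```go
package main

import "fmt"

// metricTypes lists item types that belong in the metrics table.
// Assumption: these names mirror the investigation queries in this plan;
// the real set may differ.
var metricTypes = map[string]bool{
	"storage": true,
	"system":  true,
	"disk":    true,
	"boot":    true,
}

// isMetricType reports whether an item type describes a system metric
// rather than an installable package update.
func isMetricType(itemType string) bool {
	return metricTypes[itemType]
}

// partitionItems splits reported item types into updates and metrics,
// preserving input order within each group.
func partitionItems(itemTypes []string) (updates, metrics []string) {
	for _, t := range itemTypes {
		if isMetricType(t) {
			metrics = append(metrics, t)
		} else {
			updates = append(updates, t)
		}
	}
	return updates, metrics
}

func main() {
	updates, metrics := partitionItems([]string{"apt", "storage", "docker", "system"})
	fmt.Println("updates:", updates)
	fmt.Println("metrics:", metrics)
}
```

A guard of this kind could reject or reroute metric-type items before they reach `update_events`, which would also protect against agents that still run older logic.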
### Fix Required

**Option A - Database Cleanup (Quick Fix)**:
```sql
-- Check for misclassified data
SELECT package_type, COUNT(*) as count
FROM update_events
WHERE package_type IN ('storage', 'system')
GROUP BY package_type;

-- If found, move to metrics table or delete old data
```

**Option B - Code Cleanup (Recommended)**:
1. Delete the old `handleScanUpdates` function (lines 985-1153 in main.go)
2. Update the Windows service version constant (currently 0.1.16) to match the agent version (0.1.23)
3. Verify that no other references to the old function remain

**Priority**: Medium (data issue, not functional bug)
**Risk**: Low (cleanup operation)

---

## Issue #2: UI Version Display Missing

### Current State
The WebUI shows only the three-part version (0.1.23), not the full version including the final octet (0.1.23.4).

### Implementation Needed

**File**: `aggregator-web/src/pages/Dashboard.tsx`

**Agent Card View** - Add version display:
```tsx
// Add to agent card display
<AgentCard>
  {/* ...existing card content... */}
  <div className="agent-version">
    <span className="label">Version:</span>
    <span className="value">{agent.current_version || 'Unknown'}</span>
  </div>
</AgentCard>
```

**Agent Details View** - Add full version string:
```tsx
// Add to details panel
<AgentDetails>
  {/* ...existing details content... */}
  <DetailRow>
    <Label>Agent Version</Label>
    <Value>{agent.current_version || agent.config_version || 'Unknown'}</Value>
  </DetailRow>
</AgentDetails>
```

**API Data Available**:
- The backend already populates the `current_version` field in the API response
- May need to ensure the full version string (with octet) is stored and returned

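To check whether the backend really returns the full version string, a small validation helper can be run against the `current_version` values it serves. This is a sketch under the assumption that a full version consists of exactly four numeric, dot-separated components (as in 0.1.23.4), optionally prefixed with `v`. `isFullVersion` and `isDigits` are hypothetical names, not part of the existing codebase.

```go
package main

import (
	"fmt"
	"strings"
)

// isDigits reports whether s is non-empty and consists only of ASCII digits.
func isDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, r := range s {
		if r < '0' || r > '9' {
			return false
		}
	}
	return true
}

// isFullVersion reports whether v has exactly four numeric, dot-separated
// components (for example "0.1.23.4"). An optional leading "v" is ignored.
func isFullVersion(v string) bool {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 4 {
		return false
	}
	for _, p := range parts {
		if !isDigits(p) {
			return false
		}
	}
	return true
}

func main() {
	for _, v := range []string{"0.1.23.4", "0.1.23", "v0.1.23.5", ""} {
		fmt.Printf("%q full=%v\n", v, isFullVersion(v))
	}
}
```

If a check like this fails for values coming from the API, the octet is being dropped before it reaches the UI, and the fix belongs in the backend rather than in `Dashboard.tsx`.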
### Tasks
1. Verify backend returns full version string with octet
2. Update Agent Card to display version
3. Update Agent Details page to display version prominently
4. Consider adding version to agent list table view

**Priority**: Low (cosmetic, but important for debugging)
**Risk**: Very Low (UI only)

---

## Issue #3: Same-Version Installation Logic

### Current Logic
```go
// In update handler (simplified sketch). version and current stand for
// parsed, comparable versions; comparing raw strings with < would sort
// "0.1.9" after "0.1.23".
if version < current {
	return errors.New("downgrade not allowed")
}
// What about version == current? ❓
```

### Use Cases

**Scenario A: Agent Reinstall**
- Agent needs to reinstall same version (config corruption, binary issues)
- Should allow: `version == current`

**Scenario B: Accidental Update Click**
- User clicks update but agent is already on that version
- Should we allow, block, or warn?

### Decision Options

**Option A: Allow Same-Version (Recommended)**
- Supports reinstall scenario
- No security risk (same version)
- Simple implementation: keep the downgrade check strict (`version < current`) so that `version == current` passes; changing it to `version <= current` would block reinstalls
- Prevents unnecessary support tickets

**Option B: Block Same-Version**
- Prevents no-op updates
- May frustrate users trying to reinstall
- Requires workaround documentation

**Option C: Warning + Allow**
```go
if version == current {
	log.Printf("Warning: Agent %s already on version %s, proceeding with reinstall", agentID, version)
}
if version < current {
	return errors.New("downgrade not allowed")
}
```

### Implementation Location

**Agent-Side**:
File: `aggregator-agent/cmd/agent/subsystem_handlers.go`
Function: `handleUpdateAgent()` (lines 346-536)

Proposed version check (not yet implemented):
```go
// To be added to the update logic
currentVersion := cfg.AgentVersion
targetVersion := params["version"]

if compareVersions(targetVersion, currentVersion) <= 0 {
	// Same version (== 0) or downgrade (< 0):
	// block the downgrade, handle same-version per the chosen option
}
```

**Server-Side**:
File: `aggregator-server/internal/api/handlers/agent_build.go`

Check version constraints before sending update command.

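The version check described in this section depends on a comparison helper that does not exist yet. The sketch below shows one possible shape, assuming versions are dot-separated integers such as 0.1.23.4. `compareDottedVersions` and `checkUpdateAllowed` are hypothetical names, and unlike the `compareVersions` call shown earlier, the comparison here also returns an error for malformed input. The policy it applies is "block downgrades, allow and log same-version reinstalls".

```go
package main

import (
	"errors"
	"fmt"
	"log"
	"strconv"
	"strings"
)

// compareDottedVersions compares dot-separated numeric versions such as
// "0.1.23.4". It returns -1, 0, or 1 when a is lower than, equal to, or
// higher than b. Missing components count as 0, so "0.1.23" equals
// "0.1.23.0". A non-numeric component yields an error.
func compareDottedVersions(a, b string) (int, error) {
	as := strings.Split(a, ".")
	bs := strings.Split(b, ".")
	n := len(as)
	if len(bs) > n {
		n = len(bs)
	}
	for i := 0; i < n; i++ {
		x, err := component(as, i)
		if err != nil {
			return 0, err
		}
		y, err := component(bs, i)
		if err != nil {
			return 0, err
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

// component parses the i-th component, treating a missing one as 0.
func component(parts []string, i int) (int, error) {
	if i >= len(parts) {
		return 0, nil
	}
	return strconv.Atoi(parts[i])
}

// checkUpdateAllowed blocks downgrades and allows upgrades and
// same-version reinstalls, logging the latter.
func checkUpdateAllowed(agentID, current, target string) error {
	cmp, err := compareDottedVersions(target, current)
	if err != nil {
		return fmt.Errorf("invalid version: %w", err)
	}
	switch {
	case cmp < 0:
		return errors.New("downgrade not allowed")
	case cmp == 0:
		log.Printf("agent %s already on version %s, proceeding with reinstall", agentID, target)
	}
	return nil
}

func main() {
	fmt.Println(checkUpdateAllowed("agent-1", "0.1.23.4", "0.1.23.5")) // upgrade
	fmt.Println(checkUpdateAllowed("agent-1", "0.1.23.4", "0.1.23.4")) // reinstall
	fmt.Println(checkUpdateAllowed("agent-1", "0.1.23.4", "0.1.23.3")) // downgrade
}
```

Comparing numerically per component matters here: a plain string comparison would rank "0.1.9" above "0.1.23".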
### Recommendation
**Option A - Allow same-version installations**

Reasons:
1. Reinstall is a valid use case
2. No security implications
3. Easiest to implement and document
4. User expectation: "Update" button should work even if the agent is already on that version

### Tasks
1. Define version comparison logic
2. Add check in agent update handler (allow ==, block <)
3. Add logging for same-version reinstalls
4. Update UI to show appropriate messages

**Priority**: Low (edge case)
**Risk**: Very Low (no security impact)

---

## Phase 2: Middleware Version Upgrade Fix

### Current Status
- Phase 1 (Build Orchestrator): 90% complete
- Phase 2 (Middleware): Starting

### Known Issues
1. **Version Upgrade Catch-22**: Middleware blocks updates due to version check
2. **Update-Aware Middleware**: Need to detect upgrading agents and relax constraints
3. **Command Processing**: Need complete implementation

### Implementation Plan

**1. Update-Aware Middleware**
- Detect when agent is in update process
- Relax machine ID binding during upgrade
- Restore binding after completion

**2. Same-Version Logic**
- Implement decision from Issue #3 above
- Update agent and server validation

**3. End-to-End Testing**
- Test flow: 0.1.23.4 → 0.1.23.5
- Verify signature verification
- Validate subsystem persistence
- Confirm agent continues operations post-update

### Tasks
1. Implement middleware version upgrade detection
2. Add nonce validation for replay protection
3. Implement same-version installation logic
4. Test complete update cycle
5. Verify signature verification


**Priority**: High (blocks Phase 2 completion)
**Risk**: Medium (need to ensure security not compromised)

---

## Build Orchestrator Status (Phase 1 - 90% Complete)

### Completed ✅
1. Signed binary generation (build_orchestrator.go)
2. Ed25519 signing integration (SignFile())
3. Generic binary signing (Option 2 approach)
4. Download handler serves signed binaries
5. Config separation (config.json not embedded)

### Remaining ⏳
1. Agent update flow testing (0.1.23.4 → 0.1.23.5)
2. End-to-end verification
3. Signature verification on agent side (placeholder in place)

### Ready for Cleanup
The following dead code should be removed:
- `TLSConfig` struct in config.go (lines 23-29)
- Docker artifact generation in agent_builder.go
- Old config fields: `CertFile`, `KeyFile`, `CAFile`

---

## Phase 3: Security Hardening

### Tasks
1. Remove JWT secret logging (debug mode only)
2. Implement per-server JWT secrets (not shared)
3. Clean dead code (TLSConfig, Docker fields)
4. Consider kernel keyring config protection

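Tasks 1 and 2 can be illustrated together: generate a distinct random secret per server, and log only a redacted placeholder instead of the secret itself. This is a minimal sketch using the standard library. `newJWTSecret` and `redact` are hypothetical helpers, and the 32-byte length is an assumption rather than a project requirement.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newJWTSecret returns a fresh secret built from n cryptographically
// random bytes, hex-encoded (so the string has 2*n characters).
func newJWTSecret(n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// redact returns a loggable placeholder that reveals only the length
// of the secret, never its contents.
func redact(secret string) string {
	if secret == "" {
		return "<unset>"
	}
	return fmt.Sprintf("<redacted, %d chars>", len(secret))
}

func main() {
	secret, err := newJWTSecret(32) // 32 bytes is an assumed size
	if err != nil {
		panic(err)
	}
	// Log the placeholder, not the secret.
	fmt.Println("JWT secret configured:", redact(secret))
}
```

Generating the secret at install time on each server, rather than shipping a shared default, means a leaked secret from one deployment cannot be used to forge tokens for another.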
### Token Security Decision
**Status**: Sliding window refresh tokens are adequate
- Machine ID binding prevents cross-machine token reuse
- Token theft requires filesystem access (already compromised)
- True rotation deferred to v0.3.0


**Priority**: Medium
**Risk**: Low (current implementation adequate)

---

## Testing Checklist

### Agent Update Flow Test
- [ ] Bump version to 0.1.23.5
- [ ] Build signed binary for 0.1.23.5
- [ ] Test update from 0.1.23.4 → 0.1.23.5
- [ ] Verify signature verification works
- [ ] Confirm agent restarts successfully
- [ ] Validate subsystems still enabled post-update
- [ ] Verify metrics still reporting correctly
- [ ] Check update_events table for corruption

### UI Display Test
- [ ] Version shows on agent card
- [ ] Version shows on agent details page
- [ ] Version updates after agent update

### Subsystem Tests
- [ ] Storage scan reports to metrics table
- [ ] System scan reports to metrics table
- [ ] APT scan reports to update_events table
- [ ] Docker scan reports to update_events table

---

## Database Queries for Investigation

### Check for Misclassified Data
```sql
-- Query 1: Check for storage/system data in update_events
SELECT package_type, COUNT(*) as count
FROM update_events
WHERE package_type IN ('storage', 'system', 'disk', 'boot')
GROUP BY package_type;

-- Query 2: Check metrics table for package update data
SELECT package_type, COUNT(*) as count
FROM metrics
WHERE package_type IN ('apt', 'dnf', 'docker', 'windows', 'winget')
GROUP BY package_type;

-- Query 3: Check agent_subsystems configuration
SELECT name, enabled, auto_run
FROM agent_subsystems
WHERE name IN ('storage', 'system', 'updates');
```

### Cleanup Queries (If Needed)
```sql
-- Move or delete misclassified data
-- BACKUP FIRST!

-- Check how many records would be affected (same filter as the DELETE)
SELECT COUNT(*) FROM update_events
WHERE package_type IN ('storage', 'system')
  AND created_at < NOW() - INTERVAL '7 days';

-- Delete (or move to metrics table)
DELETE FROM update_events
WHERE package_type IN ('storage', 'system')
  AND created_at < NOW() - INTERVAL '7 days';
```

---

## Code Locations Reference

### Agent-Side
- `aggregator-agent/cmd/agent/main.go` - Command routing (lines 864-882)
- `aggregator-agent/cmd/agent/subsystem_handlers.go` - Scan handlers
- `aggregator-agent/cmd/agent/main.go:985` - OLD `handleScanUpdates` (delete)
- `aggregator-agent/internal/service/windows.go:32` - Old version constant (update)

### API Handlers
- `aggregator-server/internal/api/handlers/updates.go:67` - ReportUpdates
- `aggregator-server/internal/api/handlers/metrics.go:31` - ReportMetrics
- `aggregator-server/internal/api/handlers/agent_build.go` - Update logic

### WebUI
- `aggregator-web/src/pages/Dashboard.tsx` - Agent card and details
- `aggregator-web/src/pages/settings/AgentManagement.tsx` - Version display

### Database Tables
- `update_events` - Package updates (apt, dnf, docker, etc.)
- `metrics` - System metrics (storage, system, cpu, memory)
- `agent_subsystems` - Subsystem configuration

---

## Recommended Implementation Order

### Week 1 (Critical Fixes)
1. **Database Investigation** - Run queries to check for misclassified data
2. **UI Version Display** - Add version to agent cards and details (easy win)
3. **Same-Version Logic Decision** - Make decision and implement
4. **Test Update Flow** - 0.1.23.4 → 0.1.23.5

### Week 2 (Phase 2 Completion)
5. **Middleware Version Upgrade** - Implement detection logic
6. **Security Hardening** - JWT logging, per-server secrets
7. **Code Cleanup** - Remove old handleScanUpdates function
8. **Documentation** - Update all docs for v0.2.0

### Week 3 (Polish)
9. **Token Rotation** (Nice-to-have) - Implement true rotation
10. **Enhanced UI** - Improve metrics display
11. **Testing** - Full integration test suite

---

## Risk Assessment

| Issue | Priority | Risk | Effort |
|-------|----------|------|--------|
| Scan Updates Quirk | Medium | Low | 2 hours |
| UI Version Display | Low | Very Low | 1 hour |
| Same-Version Logic | Low | Very Low | 1 hour |
| Middleware Upgrade | High | Medium | 4 hours |
| Agent Update Test | High | Medium | 3 hours |
| Security Hardening | Medium | Low | 4 hours |

---

## Decision Log

### Decision 1: Same-Version Installations
**Status**: Pending
**Options**: Allow / Block / Warn
**Recommendation**: **Allow** (supports reinstall use case)

### Decision 2: Token Rotation Priority
**Status**: Defer to v0.3.0
**Rationale**: Machine ID binding provides adequate security
**Decision**: **Defer** - sliding window sufficient

### Decision 3: UI Version Display Location
**Status**: Pending
**Options**: Card only / Details only / Both
**Recommendation**: **Both** for maximum visibility

### Decision 4: Scan Updates Fix Approach
**Status**: Pending
**Options**: Database cleanup / Code cleanup
**Recommendation**: **Both** - cleanup old data AND remove dead code

---

## Next Steps

### Immediate (Today)
1. ☐ Check database for misclassified data using queries above
2. ☐ Make decision on Same-Version logic (Allow/Block/Warn)
3. ☐ Confirm token rotation deferral to v0.3.0 (see Decision 2)
4. ☐ Run test update flow

### This Week
5. ☐ Implement UI version display
6. ☐ Implement same-version installation logic
7. ☐ Complete middleware version upgrade
8. ☐ Remove JWT secret logging

### Next Week
9. ☐ Full integration testing
10. ☐ Update documentation
11. ☐ Prepare v0.2.0 release

---

## Notes

**Build Orchestrator Misalignment - RESOLVED** ✅
- Originally generating Docker configs, installer expecting native binaries
- Fixed: Now generates signed native binaries per version/platform
- Signed packages stored in database
- Download endpoint serves correct binaries

**Version Upgrade Catch-22 - IN PROGRESS** ⚠️
- Middleware blocks updates due to machine ID binding
- Need update-aware middleware to detect upgrading agents
- Nonce validation needed for replay protection

**Token Security - ADEQUATE** ✅
- Sliding window refresh tokens sufficient
- Machine ID binding prevents cross-machine token reuse
- True rotation nice-to-have but not critical for v0.2.0

---

**Document Version**: 1.0
**Last Updated**: 2025-11-10
**Next Review**: After critical fixes completed
**Owner**: @Fimeg
**Collaborator**: Kimi-k2 (Infrastructure Analysis)