Add docs and project files - force for Culurien

Fimeg
2026-03-28 20:46:24 -04:00
parent dc61797423
commit 484a7f77ce
343 changed files with 119530 additions and 0 deletions


@@ -0,0 +1,468 @@
# RedFlag UI and Critical Fixes - Implementation Plan
**Date:** 2025-11-10
**Version:** v0.1.23.4 → v0.1.23.5
**Status:** Investigation Complete, Implementation Ready
---
## Executive Summary
This plan summarizes the investigation of the three critical issues identified, explains what is happening in each case, and lists what needs to be fixed.
---
## Issue #1: Scan Updates Quirk - INVESTIGATION COMPLETE ✅
### Symptoms
- Disk/boot metrics (44% used) appearing as "approve/reject" updates in UI
- Old monolithic logic intercepting new subsystem scanners
### Investigation Results
**Agent-Side**: ✅ CORRECT
- Orchestrator scanners correctly call the right endpoints:
  - **Storage Scanner** → `ReportMetrics()` (✅ Correct)
  - **System Scanner** → `ReportMetrics()` (✅ Correct)
  - **Update Scanners** (APT, DNF, Docker, etc.) → `ReportUpdates()` (✅ Correct)
**Server-Side Handlers**: ✅ CORRECT
- `ReportUpdates` handler (updates.go:67) stores in `update_events` table
- `ReportMetrics` handler (metrics.go:31) stores in `metrics` table
- Both handlers properly separated and functioning
**Root Cause Identified**:
The old monolithic `handleScanUpdates` function (main.go:985-1153) still exists in the codebase. It is not registered in the command switch statement, which correctly uses `handleScanUpdatesV2`, so it should not be handling new scans. That leaves two possible sources for the misclassified entries seen in the UI:
1. **Old data** in the database from before the subsystem refactor
2. **Windows service code** (service/windows.go) uses old version constant (0.1.16) and may have different logic
### Fix Required
**Option A - Database Cleanup (Quick Fix)**:
```sql
-- Check for misclassified data
SELECT package_type, COUNT(*) as count
FROM update_events
WHERE package_type IN ('storage', 'system')
GROUP BY package_type;
-- If found, move to metrics table or delete old data
```
**Option B - Code Cleanup (Recommended)**:
1. Delete the old `handleScanUpdates` function (lines 985-1153 in main.go)
2. Update Windows service version constant to match (0.1.23)
3. Verify no other references to old function
**Priority**: Medium (data issue, not functional bug)
**Risk**: Low (cleanup operation)
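**Optional guard (sketch)**: Beyond cleanup, a small classification check could keep metric-type items out of `update_events` if an old code path ever sends them again. The following is a minimal sketch, not the server's actual code: `metricTypes`, `isMetricType`, and `splitMisclassified` are hypothetical names, and the set of values is simply taken from the investigation queries in this plan.

```go
package main

import "strings"

// metricTypes lists package_type values that describe system metrics rather
// than installable updates. The values mirror the investigation queries in
// this plan; they are an assumption, not the server's actual schema.
var metricTypes = map[string]bool{
	"storage": true,
	"system":  true,
	"disk":    true,
	"boot":    true,
}

// isMetricType reports whether a package_type belongs in the metrics table.
// Matching ignores case and surrounding whitespace.
func isMetricType(packageType string) bool {
	return metricTypes[strings.ToLower(strings.TrimSpace(packageType))]
}

// splitMisclassified separates incoming package types into those that may be
// stored as update events and those that should be rejected or rerouted to
// the metrics path. Input order is preserved in both results.
func splitMisclassified(packageTypes []string) (updates, metrics []string) {
	for _, pt := range packageTypes {
		if isMetricType(pt) {
			metrics = append(metrics, pt)
		} else {
			updates = append(updates, pt)
		}
	}
	return updates, metrics
}
```

A handler could call `splitMisclassified` before writing to `update_events` and log anything that lands in the `metrics` slice, which would also show whether the Windows service is the source.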
---
## Issue #2: UI Version Display Missing
### Current State
The WebUI shows only the first three version components (0.1.23), not the full four-part version string with the trailing octet (0.1.23.4).
### Implementation Needed
**File**: `aggregator-web/src/pages/Dashboard.tsx`
**Agent Card View** - Add version display:
```typescript
// Add to agent card display
<AgentCard>
  {/* ...existing card content... */}
  <div className="agent-version">
    <span className="label">Version:</span>
    <span className="value">{agent.current_version || 'Unknown'}</span>
  </div>
</AgentCard>
```
**Agent Details View** - Add full version string:
```typescript
// Add to details panel
<AgentDetails>
  {/* ...existing detail rows... */}
  <DetailRow>
    <Label>Agent Version</Label>
    <Value>{agent.current_version || agent.config_version || 'Unknown'}</Value>
  </DetailRow>
</AgentDetails>
```
**API Data Available**:
- The backend already populates `current_version` field in API response
- May need to ensure full version string (with octet) is stored and returned
### Tasks
1. Verify backend returns full version string with octet
2. Update Agent Card to display version
3. Update Agent Details page to display version prominently
4. Consider adding version to agent list table view
**Priority**: Low (cosmetic, but important for debugging)
**Risk**: Very Low (UI only)
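**Backend check (sketch)**: Task 1 could be covered by a small validation helper on the server, so a truncated version such as `0.1.23` is caught before it reaches the UI. This is a minimal sketch under the assumption that a full version always has exactly four numeric components; `isFullVersion` is a hypothetical name, not an existing function.

```go
package main

import "strings"

// isFullVersion reports whether v has exactly four numeric components,
// for example "0.1.23.4". A leading "v" and surrounding whitespace are
// tolerated; signs, empty components, and non-digits are rejected.
func isFullVersion(v string) bool {
	parts := strings.Split(strings.TrimPrefix(strings.TrimSpace(v), "v"), ".")
	if len(parts) != 4 {
		return false
	}
	for _, p := range parts {
		if p == "" {
			return false
		}
		for _, r := range p {
			if r < '0' || r > '9' {
				return false
			}
		}
	}
	return true
}
```

If the check fails for stored `current_version` values, the fix belongs in the backend rather than in the UI components.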
---
## Issue #3: Same-Version Installation Logic
### Current Logic
```go
// In update handler (pseudo-code; `<` stands for a version comparison)
if version < current {
	return errors.New("downgrade not allowed")
}
// What about version == current? ❓
```
### Use Cases
**Scenario A: Agent Reinstall**
- Agent needs to reinstall same version (config corruption, binary issues)
- Should allow: `version == current`
**Scenario B: Accidental Update Click**
- User clicks update but agent already on that version
- Should we allow, block, or warn?
### Decision Options
**Option A: Allow Same-Version (Recommended)**
- Supports reinstall scenario
- No security risk (same version)
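- Simple implementation: keep the blocking check strict (`version < current`) and do not widen it to `version <= current`, so `version == current` passes through as a reinstall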
- Prevents unnecessary support tickets
**Option B: Block Same-Version**
- Prevents no-op updates
- May frustrate users trying to reinstall
- Requires workaround documentation
**Option C: Warning + Allow**
```go
if version == current {
	log.Printf("Warning: Agent %s already on version %s, proceeding with reinstall", agentID, version)
}
if version < current {
	return errors.New("downgrade not allowed")
}
```
### Implementation Location
**Agent-Side**:
File: `aggregator-agent/cmd/agent/subsystem_handlers.go`
Function: `handleUpdateAgent()` (lines 346-536)
Current version check:
```go
// Somewhere in the update logic (needs to be added)
currentVersion := cfg.AgentVersion
targetVersion := params["version"]
if compareVersions(targetVersion, currentVersion) <= 0 {
	// Result == 0: same version (reinstall); result < 0: downgrade.
	// Handle each case according to the decision below.
}
```
**Server-Side**:
File: `aggregator-server/internal/api/handlers/agent_build.go`
Check version constraints before sending update command.
### Recommendation
**Option A - Allow same-version installations**
Reasons:
1. Reinstall is a valid use case
2. No security implications
3. Easiest to implement and document
4. User expectation: the "Update" button should work even if the agent is already on that version
### Tasks
1. Define version comparison logic
2. Add check in agent update handler (allow ==, block <)
3. Add logging for same-version reinstalls
4. Update UI to show appropriate messages
**Priority**: Low (edge case)
**Risk**: Very Low (no security impact)
---
## Phase 2: Middleware Version Upgrade Fix
### Current Status
- Phase 1 (Build Orchestrator): 90% complete
- Phase 2 (Middleware): Starting
### Known Issues
1. **Version Upgrade Catch-22**: Middleware blocks updates due to version check
2. **Update-Aware Middleware**: Need to detect upgrading agents and relax constraints
3. **Command Processing**: Need complete implementation
### Implementation Plan
**1. Update-Aware Middleware**
- Detect when agent is in update process
- Relax machine ID binding during upgrade
- Restore binding after completion
**2. Same-Version Logic**
- Implement decision from Issue #3 above
- Update agent and server validation
**3. End-to-End Testing**
- Test flow: 0.1.23.4 → 0.1.23.5
- Verify signature verification
- Validate subsystem persistence
- Confirm agent continues operations post-update
### Tasks
1. Implement middleware version upgrade detection
2. Add nonce validation for replay protection
3. Implement same-version installation logic
4. Test complete update cycle
5. Verify signature verification
**Priority**: High (blocks Phase 2 completion)
**Risk**: Medium (need to ensure security not compromised)
---
## Build Orchestrator Status (Phase 1 - 90% Complete)
### Completed ✅
1. Signed binary generation (build_orchestrator.go)
2. Ed25519 signing integration (SignFile())
3. Generic binary signing (Option 2 approach)
4. Download handler serves signed binaries
5. Config separation (config.json not embedded)
### Remaining ⏳
1. Agent update flow testing (0.1.23.4 → 0.1.23.5)
2. End-to-end verification
3. Signature verification on agent side (placeholder in place)
### Ready for Cleanup
The following dead code should be removed:
- `TLSConfig` struct in config.go (lines 23-29)
- Docker artifact generation in agent_builder.go
- Old config fields: `CertFile`, `KeyFile`, `CAFile`
---
## Phase 3: Security Hardening
### Tasks
1. Remove JWT secret logging (debug mode only)
2. Implement per-server JWT secrets (not shared)
3. Clean dead code (TLSConfig, Docker fields)
4. Consider kernel keyring config protection
### Token Security Decision
**Status**: Sliding window refresh tokens are adequate
- Machine ID binding prevents cross-machine token reuse
- Token theft requires filesystem access (already compromised)
- True rotation deferred to v0.3.0
**Priority**: Medium
**Risk**: Low (current implementation adequate)
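**Logging without the secret (sketch)**: For task 1, one option is to log a short fingerprint instead of the JWT secret, which still shows whether two servers share a secret. This is a minimal sketch and an assumption about how the logging could change, not a description of current behavior; `secretFingerprint` is a hypothetical helper.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// secretFingerprint returns a short identifier derived from a secret so
// logs can distinguish secrets without revealing them. Only the first
// 4 bytes of the SHA-256 digest are kept.
func secretFingerprint(secret string) string {
	if secret == "" {
		return "<unset>"
	}
	sum := sha256.Sum256([]byte(secret))
	return "sha256:" + hex.EncodeToString(sum[:4])
}
```

A fingerprint offers little protection for a weak, guessable secret, so it is best kept to debug-level logs and combined with per-server random secrets (task 2).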
---
## Testing Checklist
### Agent Update Flow Test
- [ ] Bump version to 0.1.23.5
- [ ] Build signed binary for 0.1.23.5
- [ ] Test update from 0.1.23.4 → 0.1.23.5
- [ ] Verify signature verification works
- [ ] Confirm agent restarts successfully
- [ ] Validate subsystems still enabled post-update
- [ ] Verify metrics still reporting correctly
- [ ] Check update_events table for corruption
### UI Display Test
- [ ] Version shows on agent card
- [ ] Version shows on agent details page
- [ ] Version updates after agent update
### Subsystem Tests
- [ ] Storage scan reports to metrics table
- [ ] System scan reports to metrics table
- [ ] APT scan reports to update_events table
- [ ] Docker scan reports to update_events table
---
## Database Queries for Investigation
### Check for Misclassified Data
```sql
-- Query 1: Check for storage/system data in update_events
SELECT package_type, COUNT(*) as count
FROM update_events
WHERE package_type IN ('storage', 'system', 'disk', 'boot')
GROUP BY package_type;
-- Query 2: Check metrics table for package update data
SELECT package_type, COUNT(*) as count
FROM metrics
WHERE package_type IN ('apt', 'dnf', 'docker', 'windows', 'winget')
GROUP BY package_type;
-- Query 3: Check agent_subsystems configuration
SELECT name, enabled, auto_run
FROM agent_subsystems
WHERE name IN ('storage', 'system', 'updates');
```
### Cleanup Queries (If Needed)
```sql
-- Move or delete misclassified data
-- BACKUP FIRST!
-- Check how many records the DELETE below will affect
SELECT COUNT(*) FROM update_events
WHERE package_type IN ('storage', 'system')
AND created_at < NOW() - INTERVAL '7 days';
-- Delete (or move to metrics table)
DELETE FROM update_events
WHERE package_type IN ('storage', 'system')
AND created_at < NOW() - INTERVAL '7 days';
```
---
## Code Locations Reference
### Agent-Side
- `aggregator-agent/cmd/agent/main.go` - Command routing (lines 864-882)
- `aggregator-agent/cmd/agent/subsystem_handlers.go` - Scan handlers
- `aggregator-agent/cmd/agent/main.go:985` - OLD `handleScanUpdates` (delete)
- `aggregator-agent/internal/service/windows.go:32` - Old version constant (update)
### API Handlers
- `aggregator-server/internal/api/handlers/updates.go:67` - ReportUpdates
- `aggregator-server/internal/api/handlers/metrics.go:31` - ReportMetrics
- `aggregator-server/internal/api/handlers/agent_build.go` - Update logic
### WebUI
- `aggregator-web/src/pages/Dashboard.tsx` - Agent card and details
- `aggregator-web/src/pages/settings/AgentManagement.tsx` - Version display
### Database Tables
- `update_events` - Package updates (apt, dnf, docker, etc.)
- `metrics` - System metrics (storage, system, cpu, memory)
- `agent_subsystems` - Subsystem configuration
---
## Recommended Implementation Order
### Week 1 (Critical Fixes)
1. **Database Investigation** - Run queries to check for misclassified data
2. **UI Version Display** - Add version to agent cards and details (easy win)
3. **Same-Version Logic Decision** - Make decision and implement
4. **Test Update Flow** - 0.1.23.4 → 0.1.23.5
### Week 2 (Phase 2 Completion)
5. **Middleware Version Upgrade** - Implement detection logic
6. **Security Hardening** - JWT logging, per-server secrets
7. **Code Cleanup** - Remove old handleScanUpdates function
8. **Documentation** - Update all docs for v0.2.0
### Week 3 (Polish)
9. **Token Rotation** (Nice-to-have) - Implement true rotation
10. **Enhanced UI** - Improve metrics display
11. **Testing** - Full integration test suite
---
## Risk Assessment
| Issue | Priority | Risk | Effort |
|-------|----------|------|--------|
| Scan Updates Quirk | Medium | Low | 2 hours |
| UI Version Display | Low | Very Low | 1 hour |
| Same-Version Logic | Low | Very Low | 1 hour |
| Middleware Upgrade | High | Medium | 4 hours |
| Agent Update Test | High | Medium | 3 hours |
| Security Hardening | Medium | Low | 4 hours |
---
## Decision Log
### Decision 1: Same-Version Installations
**Status**: Pending
**Options**: Allow / Block / Warn
**Recommendation**: **Allow** (supports reinstall use case)
### Decision 2: Token Rotation Priority
**Status**: Defer to v0.3.0
**Rationale**: Machine ID binding provides adequate security
**Decision**: **Defer** - sliding window sufficient
### Decision 3: UI Version Display Location
**Status**: Pending
**Options**: Card only / Details only / Both
**Recommendation**: **Both** for maximum visibility
### Decision 4: Scan Updates Fix Approach
**Status**: Pending
**Options**: Database cleanup / Code cleanup
**Recommendation**: **Both** - cleanup old data AND remove dead code
---
## Next Steps
### Immediate (Today)
1. ☐ Check database for misclassified data using queries above
2. ☐ Make decision on Same-Version logic (Allow/Block/Warn)
3. ☐ Decide on token rotation (now vs defer)
4. ☐ Run test update flow
### This Week
5. ☐ Implement UI version display
6. ☐ Implement same-version installation logic
7. ☐ Complete middleware version upgrade
8. ☐ Remove JWT secret logging
### Next Week
9. ☐ Full integration testing
10. ☐ Update documentation
11. ☐ Prepare v0.2.0 release
---
## Notes
**Build Orchestrator Misalignment - RESOLVED**
- Originally generating Docker configs, installer expecting native binaries
- Fixed: Now generates signed native binaries per version/platform
- Signed packages stored in database
- Download endpoint serves correct binaries
**Version Upgrade Catch-22 - IN PROGRESS** ⚠️
- Middleware blocks updates due to machine ID binding
- Need update-aware middleware to detect upgrading agents
- Nonce validation needed for replay protection
**Token Security - ADEQUATE**
- Sliding window refresh tokens sufficient
- Machine ID binding prevents cross-machine token reuse
- True rotation nice-to-have but not critical for v0.2.0
---
**Document Version**: 1.0
**Last Updated**: 2025-11-10
**Next Review**: After critical fixes completed
**Owner**: @Fimeg
**Collaborator**: Kimi-k2 (Infrastructure Analysis)